Mentions only work from discussions to issues, not the other way around. Or maybe it is a directory thing. In any case this one is relevant as well https://github.com/hestiaAI/Argonodes/discussions/58
> Mentions only work from discussions to issues, not the other way around. Or maybe it is a directory thing.

I totally didn't understand these sentences.

> I totally didn't understand these sentences.

Talking about GitHub functionalities. Basically, "related to https://github.com/hestiaAI/Argonodes/discussions/58".
Okay so there's a bit of context and discussion to refocus here.
From a discussion on Signal.
Long discussion with @emmanuel-hestia.

Sum-up:

Goals:

* create and update a Twitter model:
  * a) a subset of a model for Twitter, it is available here.
  * b) a worked example for creating such a model, it is available here.
* Benchmark/stresstest/buzzword the tool, please use the following discussion: https://github.com/hestiaAI/Argonodes/discussions/74.
* See if the tool is suited for creating a "real life" model, please open issues or suggestions where needed.
* Fix #767, it is a work in progress within the data-catalog.

Looking forward to catching up about this.
One thing I want to understand: what discussions have you had so far about ensuring the model is abstract / generalised? I.e. suitable for data originating from other platforms too, not just Twitter-specific. I imagine we have some information model design work to do there? I am not sure if you have already considered this, but I think this is where I can be helpful...
> One thing I want to understand: what discussions have you had so far about ensuring the model is abstract / generalised? I.e. suitable for data originating from other platforms too, not just Twitter-specific.
I believe that this issue is about Twitter, but the tool, methodology, and knowledge can be applied to other platforms as well - by design.
Did I answer your question correctly?
By design this is a technical discussion, distinct from the discussion with Alex on what it is we are doing as a broad goal. Distinct because it involves different (overlapping) actors and a different cost-utility calculus.
Hmm, now I am no longer sure we are all on the same page. I thought the design of the schema for the Twitter data (i.e. this ticket) was to be informed by a top down / abstracting commonalities approach, not just bottom up / tightly mirroring the data. I feel like it's not two separate discussions but one; we need to have a general design that works in practice / at the specific level.
And to address Hugo's point, I think the goal would be that we would have one model that can support multiple platforms, not just apply the same approach elsewhere.
I may be wrong! Will need to discuss with Paul.
It's complicated, Alex. My approach is always dual, bottom-up and top-down (and this is why it can be confusing to everyone else), because it is what produces the most long-term value for Hestia.ai.
So you are not wrong in what you are saying on the overall strategy, but the way you say it is too imprecise to be helpful for aligning different teams well together, in my view, at this stage. You do identify a friction point, and that is helpful. I see the problem more as one of prioritisation: solving an actual problem bottom-up, localised at a "site" with evidentiary value (here: Twitter), versus a top-down approach (especially given your absence for a few days, and the lack of responses on your top-down document). I also see the big value of doing it well once, so that it can be discussed more concretely and improved the next time around.
This being said, keeping with the "pressing issue on Twitter" team, I do have issues with the fact that @Amustache and @emmanuel-hestia seem not to have started at the same point as I would have. It's a recurring issue with @Amustache's tool in my view: it provides you with a table that you naturally feel you have to fill (like Thomas did for Google Location History), while in fact I think the main value is elsewhere. Let me try to explain it in a programmatic way:
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."
There are different levels for reading this sentence, with X changing from more concrete to more abstract, so let me break it down into distinct GitHub comments, so you can ask follow-up questions and at least react with emoji separately.
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."
with X = SQL tables
Indeed, through Argonodes we could quickly identify a normalized structure for the SQL tables. This would be done by tagging relative accessors as corresponding to SQL foreign keys and, going further and as needs arise (i.e. later), defining import functions from more basic JSON to SQL tables.
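To make this concrete, here is a minimal sketch of the idea in Python (illustration only: the accessor paths, table names and tagging format are made up and are not the Argonodes API). Tagging each accessor with the table, column and foreign-key role it corresponds to is enough to read off a normalized layout.

```python
# Illustration only -- not the Argonodes API. Accessor paths, table names
# and the tagging format are made up for the example.
ACCESSOR_TAGS = {
    "$.tweet.id_str":           {"table": "tweets", "column": "id", "role": "primary_key"},
    "$.tweet.full_text":        {"table": "tweets", "column": "text"},
    "$.tweet.user.id_str":      {"table": "tweets", "column": "user_id",
                                 "role": "foreign_key", "references": "users.id"},
    "$.tweet.user.screen_name": {"table": "users", "column": "screen_name"},
}

def normalized_schema(tags):
    """Group the tagged accessors into per-table column lists."""
    tables = {}
    for tag in tags.values():
        tables.setdefault(tag["table"], []).append(tag["column"])
    return tables

print(normalized_schema(ACCESSOR_TAGS))
# {'tweets': ['id', 'text', 'user_id'], 'users': ['screen_name']}
```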
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."
with X = experiences
Indeed, building on the previous comment, we could quickly identify a less normalized structure, more appropriate for experiences. This could be done by tagging relative accessors as corresponding to SQL foreign keys and as being kept, or not, within a single table. As before, we could define more complex import functions as needs arise.
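A similarly hedged sketch of the "less normalized" variant (again not the Argonodes API; the flag name is invented): the same kind of tag can also say whether a value is kept inline in one wide, experience-oriented table or split out behind a foreign key.

```python
# Illustration only -- the "inline" flag is an invented tag, not Argonodes output.
EXPERIENCE_TAGS = {
    "$.tweet.id_str":           {"column": "tweet_id",    "inline": True},
    "$.tweet.full_text":        {"column": "text",        "inline": True},
    "$.tweet.user.screen_name": {"column": "screen_name", "inline": True},   # duplicated on purpose
    "$.tweet.entities.urls":    {"column": "urls",        "inline": False},  # kept in a side table
}

wide_table  = [t["column"] for t in EXPERIENCE_TAGS.values() if t["inline"]]
side_tables = [t["column"] for t in EXPERIENCE_TAGS.values() if not t["inline"]]
print(wide_table)   # ['tweet_id', 'text', 'screen_name']
print(side_tables)  # ['urls']
```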
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."
with X = high-level relational semantics (how concepts interrelate).
These high-level semantics structure everything else, and do not cover all the semantics. Some lower-level semantics ("this is a date") can be covered further down the toolchain, in duplicate or triplicate (around filters, for instance). But once we have done this high-level relational semantics, it should be much clearer how to do the lower-level semantics in a beneficial and agile way.
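A small sketch of the distinction, with made-up relations and annotations (not a defined vocabulary): the high-level layer only records how concepts interrelate; the low-level annotations can be attached later, further down the toolchain.

```python
# Illustration only: relations and annotations are made up for the example.
HIGH_LEVEL_RELATIONS = {
    ("Tweet", "posted_by", "User"),
    ("Tweet", "located_at", "Place"),
    ("User", "follows", "User"),
}

LOW_LEVEL_ANNOTATIONS = {
    "$.tweet.created_at": "this is a date",    # can be added later, even in duplicate near filters
    "$.tweet.full_text":  "this is free text",
}

# Once the relational skeleton is fixed, it is clear which low-level
# annotations matter for a given concept (here: everything hanging off Tweet).
print(sorted(rel for (subj, rel, obj) in HIGH_LEVEL_RELATIONS if subj == "Tweet"))
# ['located_at', 'posted_by']
```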
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."
with X = SQL tables
Indeed, through Argonodes we could quickly identify a normalized structure for the SQL tables. This would be done by tagging relative accessors as corresponding to SQL foreign keys, and, going further, and as needs arise (i.e. later) defining import functions from more basic JSON to SQL tables.
I will try to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):
Is this a fair understanding of your point, @pdehaye ?
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."
with X = experiences
Indeed, building on the previous comment, we could quickly identify a less normalized structure more appropriate for experiments. This could be done by tagging relative accessors as corresponding to SQL foreign keys and being or not kept within a single table. As before we could as needs arise define more complex import functions.
As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):
If I understand correctly, the attention that @Amustache and myself have to supporting various versions of the data format of specific actors (for instance, the ever-changing data format of Facebook) would be functionally equivalent at least to more restrictive version of this idea.
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."
with X = high-level relational semantics (how concepts interrelate).
These high level semantics are structuring of everything else, and do not cover all the semantics. Some lower level semantics ("this is a date") can be covered further down the toolchain, in duplicate or triplicate (around filters for instance). But once we have done this high-level relational semantics, it should be much clearer how to do the lower level semantics beneficially and agile.
As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording), and I am not sure I understand but here goes:
Is that a generally correct understanding of your idea, @pdehaye ?
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream." with X = SQL tables Indeed, through Argonodes we could quickly identify a normalized structure for the SQL tables. This would be done by tagging relative accessors as corresponding to SQL foreign keys, and, going further, and as needs arise (i.e. later) defining import functions from more basic JSON to SQL tables.
I will try to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):
* by processing a data sample from a Download Form (such as Twitter archives, Google takeouts, etc., which are typically in JSON), Argonodes can identify the column names of the SQL database that could hold the same information as the JSON
* as a second stage, Argonodes can also generate accessors, i.e. paths to the JSON entities corresponding to cells of the SQL database, which we can then use to actually fill the database
* this is the focus of how @Amustache and @emmanuel-hestia (myself) have considered Argonodes, but Argonodes holds even more potential than that.
Is this a fair understanding of your point, @pdehaye ?
Yes, although the names of SQL columns need not be directly specified from the words that appear in the JSON as keys/accessors (which sometimes change language!).
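A minimal sketch of that second stage (illustration only; the column names, paths and helper are invented): the SQL column names are chosen by us and stay stable, while the accessors point at whatever keys the export uses, so renamed or translated JSON keys only change the mapping, not the schema.

```python
# Illustration only -- the mapping format and helper are invented for the example.
import json

COLUMN_TO_ACCESSOR = {
    "tweet_id": ["tweet", "id_str"],
    "text":     ["tweet", "full_text"],
    "author":   ["tweet", "user", "screen_name"],
}

def extract_row(record, mapping):
    """Follow each accessor path into the JSON record to build one SQL row."""
    row = {}
    for column, path in mapping.items():
        value = record
        for key in path:
            value = value.get(key) if isinstance(value, dict) else None
        row[column] = value
    return row

sample = json.loads(
    '{"tweet": {"id_str": "42", "full_text": "hello", "user": {"screen_name": "alice"}}}'
)
print(extract_row(sample, COLUMN_TO_ACCESSOR))
# {'tweet_id': '42', 'text': 'hello', 'author': 'alice'}
```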
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream." with X = experiences Indeed, building on the previous comment, we could quickly identify a less normalized structure more appropriate for experiments. This could be done by tagging relative accessors as corresponding to SQL foreign keys and being or not kept within a single table. As before we could as needs arise define more complex import functions.
As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):
* by processing data samples not from one single actor (e.g. Twitter OR Google, but not both), but from several at the same time (e.g. both samples from Twitter and samples from Google), we could find structure common to both sample families (e.g. user e-mail, or list of user positions across time) and make the correspondence between them, either by putting the data in one single SQL database table, or in several distinct tables (one for each sample family) with a correspondence between table columns (e.g. the column names could be identical if they describe the same concept, or more sophisticated means such as Linked Data could be used if required)
If I understand correctly, the attention that @Amustache and myself have to supporting various versions of the data format of specific actors (for instance, the ever-changing data format of Facebook) would be functionally equivalent at least to more restrictive version of this idea.
That is not what I meant. I had in mind the situation of Thomas with Google Location History, where, because of what we have in place now, he had to have a specific CSV in place in order to produce some visualizations. That in turn required data that could be extracted quickly with one query, which required him to populate the database in ways that were not normalized (or maybe Francois had already done that?). That in turn would require mapping where the data is normalized in SQL and where it isn't. I was pointing out that this information could be concentrated on a few of the line outputs of Argonodes.
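One way to picture that last point (purely illustrative; this is not an actual Argonodes output format): each per-accessor output line could carry, alongside the target column, a note saying whether the value lives in the normalized tables or was also copied into a denormalized, visualisation-ready table.

```python
# Illustration only -- not an actual Argonodes output format.
LINE_OUTPUTS = [
    ("$.locations[*].timestampMs", "locations.ts",                "normalized"),
    ("$.locations[*].latitudeE7",  "locations.lat_e7",            "normalized"),
    ("$.locations[*].latitudeE7",  "viz_location_history.lat_e7", "denormalized copy"),
]

for accessor, target, status in LINE_OUTPUTS:
    print(f"{accessor:30} -> {target:30} [{status}]")
```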
"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream." with X = high-level relational semantics (how concepts interrelate). These high level semantics are structuring of everything else, and do not cover all the semantics. Some lower level semantics ("this is a date") can be covered further down the toolchain, in duplicate or triplicate (around filters for instance). But once we have done this high-level relational semantics, it should be much clearer how to do the lower level semantics beneficially and agile.
As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording), and I am not sure I understand but here goes:
* Processing a wide range of samples from various sources, we could uncover a hierarchy of concepts that are fundamental to the very exercise of building a social network or similar applications. Such concepts would thus always be present, under one form or another, in the datasets of any actor we could encounter.
* The actual data we see is a manifestation of this underlying structure. The differences between dataset families reflect differences in implementing the structure; Argonodes can be made to recognise these various implementations, automatically describe them (i.e. build accessors), and identify them as variants of known underlying concepts (like adding a new leaf on the appropriate branch of a concept tree).
* The product of this inference is a concept tree that can then be used to parse datasets: filling databases that we can easily explore even if the information comes from datasets with widely varying formats.
Is that a generally correct understanding of your idea, @pdehaye ?
Yes, Argonodes could do this (concrete example: parsing FAIRTIQ data to find the raw Google Location History data essentially untouched for each trip, but with each trip of course indexed at FAIRTIQ-specific accessors).
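A hedged sketch of that concrete example (the FAIRTIQ accessor is made up; the whole structure is an illustration, not Argonodes output): the same underlying concept gets one node in the concept tree, and each newly recognised implementation just adds a source-specific accessor as a leaf.

```python
# Illustration only -- the FAIRTIQ accessor is invented; the structure is a sketch.
CONCEPT_TREE = {
    "LocationFix": {
        "google_location_history": "$.locations[*]",
        "fairtiq_trip":            "$.trips[*].rawLocationHistory.locations[*]",  # made-up path
    }
}

def implementations(concept, tree):
    """All known source-specific accessors for one abstract concept."""
    return tree.get(concept, {})

print(implementations("LocationFix", CONCEPT_TREE))
```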
BUT the reality is that we shouldn't be doing it in Argonodes now, for two reasons:
I'm taking the time to respond now, based on my current understanding of things - feel free to correct me.
> this collective Twitter thing is a way of introducing more semantics as well, around a clear business need across data sources.
If I understand correctly, the idea is to have a semantic model that allows for the correct description of at least Twitter data, but also data from other sources.
My answer is that it's actually "by design". Example: If we have a semantic for "this is a location", this semantic will be identical for Twitter, Facebook, Google, ...
Nevertheless, the task assigned to @emmanuel-hestia and me is to focus on a Twitter description, to start with (this ticket).
> I imagine we have some information model design work to do there?
Yes. Ideally, I would like to populate the https://github.com/hestiaAI/Argonodes repository with semantic descriptions (e.g., https://github.com/hestiaAI/Argonodes/wiki/Argonodes%3AfoundType), when these do not exist elsewhere (e.g., https://schema.org/name).
These descriptions can then be reused in other models from other data sources.
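A small sketch of what that reuse could look like (illustrative only; the model format and accessors are invented, while the description URLs are the ones mentioned above): two per-source models pointing at the same global descriptions.

```python
# Illustration only -- the model format and accessors are invented; the
# description URLs are the shared, global ones (schema.org where available,
# Argonodes wiki pages otherwise).
SEMANTICS = {
    "name":      "https://schema.org/name",
    "location":  "https://schema.org/location",
    "foundType": "https://github.com/hestiaAI/Argonodes/wiki/Argonodes%3AfoundType",
}

TWITTER_MODEL = {
    "$.tweet.user.name": SEMANTICS["name"],
    "$.tweet.geo":       SEMANTICS["location"],
}
GOOGLE_MODEL = {
    "$.profile.displayName": SEMANTICS["name"],
    "$.locations[*]":        SEMANTICS["location"],
}

# Different accessors per source, but the same shared description:
# a location is still a location, regardless of its name in a model.
assert TWITTER_MODEL["$.tweet.geo"] == GOOGLE_MODEL["$.locations[*]"]
```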
> I thought the design of the schema for the Twitter data (i.e. this ticket) was to be informed by a top down / abstracting commonalities approach, not just bottom up / tightly mirroring the data.

> I think the goal would be that we would have one model that can support multiple platforms
For me, the nuance comes here: the models are per-source, but the (semantic) descriptions used in the models are global (e.g., a location is still a location, regardless of its name in a model).
At that point I think we should convert/move the present discussion to, well, a discussion in the Argonodes repo, because we are moving away from "an issue" (= technical, goal). I'm waiting for @alexbfree's and @pdehaye's endorsement (thumbs up 👍 on this post), and then I'll do it.
tl;dr:

```mermaid
flowchart LR
    A[Data source] --> B(Raw data)
    B --> |Potential Parser| C(JSON)
    C --> D(Model)
    D --> E[SQL instructions/tables]
```
This has already been discussed with @fquellec, and we will be working closely together to reach that goal.
tl;dr:

```mermaid
flowchart LR
    A[Data sourceS] --> B(ModelS)
    B --> C[Concept trees]
```
Argonodes still has the wrong architecture, I suspect.
Yes, as already discussed with @fquellec, the "final" output structure will/should evolve further.
Please feel free to a) look at the examples and b) describe a correct architecture, before we start implementing too much.
The Twitter model, along with its semantic descriptions (which will later be reused for other models from other sources), is on its way.
Dropping this here so that links can be made: https://github.com/hestiaAI/hestialabs-experiences/discussions/578
I think this has been superseded by hestiaAI/clients/issues/35 and also #1036 - closing this, bizdev please reopen if you disagree.
It is not quite superseded, because here we discussed the nested structure in more detail. But OK to close.
As part of #764, we want to seize the opportunity to start moving towards building visualisations of abstracted information rather than of source data files. Alex, Hugo and Emmanuel to work together to design a schema for this.