hestiaAI / hestialabs-experiences

HestiaLabs Data Experiences & Digipower Academy
https://digipower.academy

Create an abstracted semantic schema for Twitter experiences for Reflets.info #767

Closed alexbfree closed 2 years ago

alexbfree commented 2 years ago

As part of #764, we want to seize the opportunity to start moving towards building visualisations of abstracted information rather than of source data files. Alex, Hugo and Emmanuel are to work together to design a schema for this.

pdehaye commented 2 years ago

Mentions only work from discussions to issues, not the other way around. Or maybe it is a directory thing. In any case, this one is relevant as well: https://github.com/hestiaAI/Argonodes/discussions/58

alexbfree commented 2 years ago

Mentions only work from discussions to issues, not the other way around. Or maybe it is a directory thing.

I totally didn't understand these sentences.

Amustache commented 2 years ago

I totally didn't understand these sentences.

Talking about GitHub functionalities. Basically, "related to https://github.com/hestiaAI/Argonodes/discussions/58".

Amustache commented 2 years ago

Okay so there's a bit of context and discussion to refocus here.

2022.06.16

From a discussion on Signal.

Long discussion with @emmanuel-hestia.

Sum-up:

Goals:

2022.06.20 (today!)

What's next?

alexbfree commented 2 years ago

Looking forward to catching up about this.

One thing I want to understand is what discussions you have had so far about ensuring the model is abstract/generalised, i.e. suitable for data originating from other platforms too, not just Twitter-specific. I imagine we have some information model design work to do there? I am not sure if you have already considered this, but I think this is where I can be helpful...

Amustache commented 2 years ago

One thing I want to understand is what discussions you have had so far about ensuring the model is abstract/generalised, i.e. suitable for data originating from other platforms too, not just Twitter-specific.

I believe that this issue is about Twitter, but the tool, methodology, and knowledge can be applied to other platforms as well - by design.

Did I answer your question correctly?

pdehaye commented 2 years ago

By design, this is a technical discussion distinct from the discussion with Alex on what it is we are doing as a broad goal. Distinct because it involves different (overlapping) actors and a different cost-utility calculus.

alexbfree commented 2 years ago

Hmm, now I am no longer sure we are all on the same page. I thought the design of the schema for the Twitter data (i.e. this ticket) was to be informed by a top-down, abstracting-commonalities approach, not just a bottom-up one tightly mirroring the data. I feel like it's not two separate discussions but one; we need a general design that works in practice, at the specific level.

And to address Hugo's point, I think the goal would be to have one model that can support multiple platforms, not just to apply the same approach elsewhere.

I may be wrong! Will need to discuss with Paul.

pdehaye commented 2 years ago

It's complicated, Alex. My approach is always dual, bottom-up and top-down (and this is why it can be confusing to everyone else), because it is the most productive of long-term value for Hestia.ai.

So you are not wrong in what you are saying about the overall strategy, but the way you say it is too imprecise to be helpful in aligning the different teams well, in my view, at this stage. You do identify a friction point, and that is helpful. I see the problem more in the prioritisation of solving an actual problem bottom-up, localised at a "site" with evidentiary value (here: Twitter), compared to a top-down approach (especially given your absence for a few days, and the lack of responses on your top-down document). I also see the big value of doing it well once, so it can be discussed more concretely and improved the next time around.

This being said, keeping with the "pressing issue on Twitter" team, I do have issues with the fact that @Amustache and @emmanuel-hestia seem not to have started at the same point as I would have. It's a recurring issue with @Amustache's tool, in my view: it provides you with a table that you naturally feel you have to fill (like Thomas did for Google Location History), while in fact I think the main value is elsewhere. Let me try to explain it in a programmatic way:

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

There are different levels for reading this sentence, with X changing from more concrete to more abstract, so let me break it down into distinct GitHub comments so you can ask follow-up questions and at least react with emoji separately.

pdehaye commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

with X = SQL tables

Indeed, through Argonodes we could quickly identify a normalized structure for the SQL tables. This would be done by tagging relative accessors as corresponding to SQL foreign keys and, going further, as needs arise (i.e. later), defining import functions from more basic JSON to SQL tables.
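
To make this concrete, here is a minimal sketch (assuming SQLite and a JSONPath-like accessor notation; the table names, accessors and `import_tweet` helper are invented for illustration and are not the Argonodes API) of accessors tagged with SQL targets, plus a hand-written import function of the kind that could later be derived from the tags:

    # Hypothetical sketch, not the Argonodes API: relative JSON accessors
    # tagged with the SQL table/column they normalize into.
    import sqlite3

    # The nesting of the hashtag accessor under the tweet accessor is what
    # marks hashtags.tweet_id as a foreign key towards tweets.id.
    ACCESSOR_TAGS = {
        "$.tweet.id": ("tweets", "id"),
        "$.tweet.full_text": ("tweets", "full_text"),
        "$.tweet.entities.hashtags[*].text": ("hashtags", "text"),
    }

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tweets (id TEXT PRIMARY KEY, full_text TEXT)")
    con.execute(
        "CREATE TABLE hashtags (tweet_id TEXT REFERENCES tweets(id), text TEXT)"
    )

    def import_tweet(obj):
        # Import function for this one structure; under the tagged-accessor
        # idea, functions like this would be derived from ACCESSOR_TAGS as
        # needs arise (i.e. later).
        t = obj["tweet"]
        con.execute("INSERT INTO tweets VALUES (?, ?)", (t["id"], t["full_text"]))
        for h in t["entities"]["hashtags"]:
            con.execute("INSERT INTO hashtags VALUES (?, ?)", (t["id"], h["text"]))

    import_tweet({"tweet": {"id": "1", "full_text": "hello #world",
                            "entities": {"hashtags": [{"text": "world"}]}}})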

pdehaye commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

with X = experiences

Indeed, building on the previous comment, we could quickly identify a less normalized structure more appropriate for experiences. This could be done by tagging relative accessors as corresponding to SQL foreign keys and marking whether or not they are kept within a single table. As before, we could, as needs arise, define more complex import functions.
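
As a sketch of what "less normalized" could mean in practice (again hypothetical, reusing the invented tables from the sketch in the previous comment): accessors tagged as belonging together collapse into one wide, denormalized view that an experience can query with a single fast SELECT:

    # Hypothetical sketch: a denormalized, experience-friendly view over
    # the normalized tables of the previous sketch.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE tweets (id TEXT PRIMARY KEY, full_text TEXT)")
    con.execute(
        "CREATE TABLE hashtags (tweet_id TEXT REFERENCES tweets(id), text TEXT)"
    )

    # Accessors marked as "kept within a single table" end up side by side.
    con.execute("""
        CREATE VIEW tweet_hashtags AS
        SELECT t.id, t.full_text, h.text AS hashtag
        FROM tweets AS t
        LEFT JOIN hashtags AS h ON h.tweet_id = t.id
    """)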

pdehaye commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

with X = high-level relational semantics (how concepts interrelate).

These high-level semantics structure everything else, and do not cover all the semantics. Some lower-level semantics ("this is a date") can be covered further down the toolchain, in duplicate or triplicate (around filters, for instance). But once we have done these high-level relational semantics, it should be much clearer how to do the lower-level semantics beneficially and in an agile way.
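
As a rough illustration (the concept names are invented), the high-level relational semantics could be as small as a tree saying how concepts interrelate, with the lower-level semantics attaching to its leaves later:

    # Illustrative sketch only: a tiny concept tree. The high-level
    # relational semantics say how concepts interrelate (an Account posts
    # Messages, a Message mentions Accounts); lower-level semantics such
    # as "created_at is a date" attach further down the toolchain.
    CONCEPT_TREE = {
        "Account": {
            "posts": "Message",    # one Account relates to many Messages
            "follows": "Account",
        },
        "Message": {
            "mentions": "Account",
            "created_at": "Date",  # lower-level semantics, handled later
        },
    }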

emmanuel-hestia commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

with X = SQL tables

Indeed, through Argonodes we could quickly identify a normalized structure for the SQL tables. This would be done by tagging relative accessors as corresponding to SQL foreign keys and, going further, as needs arise (i.e. later), defining import functions from more basic JSON to SQL tables.

I will try to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):

* by processing a data sample from a Download Form (such as Twitter archives, Google takeouts etc., which are typically in JSON), Argonodes can identify the column names of the SQL database that could hold the same information as the JSON

* as a second stage, Argonodes can also generate accessors, i.e. paths to the JSON entities corresponding to cells of the SQL database, which we can then use to actually fill the database

* this is the focus of how @Amustache and @emmanuel-hestia (myself) have considered Argonodes, but Argonodes holds even more potential than that.

Is this a fair understanding of your point, @pdehaye ?

emmanuel-hestia commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

with X = experiences

Indeed, building on the previous comment, we could quickly identify a less normalized structure more appropriate for experiences. This could be done by tagging relative accessors as corresponding to SQL foreign keys and marking whether or not they are kept within a single table. As before, we could, as needs arise, define more complex import functions.

As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):

* by processing data samples not from one single actor (e.g. Twitter OR Google but not both), but several at the same time (e.g. both samples from Twitter and samples from Google), we could find structure common to both sample families (e.g.: user e-mail, or list of user positions across time) and make the correspondence between them, either by putting the data in one single SQL database table, or in several distinct tables (one for each sample family) with correspondence between table columns (e.g. the column names could be identical if they describe the same concept, or more sophisticated means such as Linked Data if required)

If I understand correctly, the attention that @Amustache and myself have paid to supporting various versions of the data format of specific actors (for instance, the ever-changing data format of Facebook) would be functionally equivalent to at least a more restrictive version of this idea.

emmanuel-hestia commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

with X = high-level relational semantics (how concepts interrelate).

These high-level semantics structure everything else, and do not cover all the semantics. Some lower-level semantics ("this is a date") can be covered further down the toolchain, in duplicate or triplicate (around filters, for instance). But once we have done these high-level relational semantics, it should be much clearer how to do the lower-level semantics beneficially and in an agile way.

As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording), and I am not sure I understand, but here goes:

* Processing a wide range of samples from various sources, we could uncover a hierarchy of concepts that are fundamental to the very exercise of building a social network or similar applications. Such concepts would thus always be present, under one form or another, in the datasets of any actor we could encounter.

* The actual data we see is a manifestation of this underlying structure. The differences between dataset families reflect differences in implementing the structure; Argonodes can be made to recognise these various implementations, automatically describe them (i.e. build accessors), and identify them as variants of known underlying concepts (like adding a new leaf on the appropriate branch of a concept tree).

* The product of this inference is a concept tree that can then be used to parse datasets: filling databases that we can easily explore even if the information comes from datasets with widely varying formats.

Is that a generally correct understanding of your idea, @pdehaye ?

pdehaye commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream." with X = SQL tables Indeed, through Argonodes we could quickly identify a normalized structure for the SQL tables. This would be done by tagging relative accessors as corresponding to SQL foreign keys, and, going further, and as needs arise (i.e. later) defining import functions from more basic JSON to SQL tables.

I will try to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):

* by processing a data sample from a Download Form (such as Twitter archives, Google takeouts etc., which are typically in JSON), Argonodes can identify the column names of the SQL database that could hold the same information as the JSON

* as a second stage, Argonodes can also generate accessors, i.e. paths to the JSON entities corresponding to cells of the SQL database, which we can then use to actually fill the database

* this is the focus of how @Amustache and @emmanuel-hestia (myself) have considered Argonodes, but Argonodes holds even more potential than that.

Is this a fair understanding of your point, @pdehaye ?

Yes, although the names of SQL columns need not be taken directly from the words that appear in the JSON as keys/accessors (which sometimes change language!).
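
As a small illustration of that caveat (the keys are invented): the same concept can hide behind JSON keys in different languages, so the canonical column name is chosen at the semantic level rather than copied from the JSON:

    # Invented keys, for illustration: two exports of the same data with
    # keys in different languages, both mapped to one canonical column.
    CANONICAL_COLUMNS = {
        "$.tweet.created_at": "creation_date",        # English export
        "$.tweet.date_de_creation": "creation_date",  # hypothetical French export
    }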

pdehaye commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream." with X = experiences Indeed, building on the previous comment, we could quickly identify a less normalized structure more appropriate for experiments. This could be done by tagging relative accessors as corresponding to SQL foreign keys and being or not kept within a single table. As before we could as needs arise define more complex import functions.

As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording):

* by processing data samples not from one single actor (e.g. Twitter OR Google but not both), but several at the same time (e.g. both samples from Twitter and samples from Google), we could find structure common to both sample families (e.g.: user e-mail, or list of user positions across time) and make the correspondence between them, either by putting the data in one single SQL database table, or in several distinct tables (one for each sample family) with correspondence between table columns (e.g. the column names could be identical if they describe the same concept, or more sophisticated means such as Linked Data if required)

If I understand correctly, the attention that @Amustache and myself have paid to supporting various versions of the data format of specific actors (for instance, the ever-changing data format of Facebook) would be functionally equivalent to at least a more restrictive version of this idea.

That is not what I meant. I had in mind the situation of Thomas with Google Location History, where, to produce some visualizations with what we have in place now, he had to have specific CSVs in place. That in turn required having data that could be extracted quickly with one query, which required him to populate the database in ways that were not normalized (or maybe Francois had already done that?). That in turn would require mapping where the data is normalized in SQL and where it isn't. I was pointing out that this information could be concentrated in a few of the line outputs of Argonodes.

pdehaye commented 2 years ago

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream." with X = high-level relational semantics (how concepts interrelate). These high level semantics are structuring of everything else, and do not cover all the semantics. Some lower level semantics ("this is a date") can be covered further down the toolchain, in duplicate or triplicate (around filters for instance). But once we have done this high-level relational semantics, it should be much clearer how to do the lower level semantics beneficially and agile.

As before, I am trying to rephrase to make sure I understand correctly (and, if that is the case, possibly help others understand by providing another wording), and I am not sure I understand but here goes:

* Processing a wide range of samples from various sources, we could uncover a hierarchy of concepts that are fundamental to the very exercise of building a social network or similar applications. Such concepts would thus always be present, under one form or another, in the datasets of any actor we could encounter.

* The actual data we see is a manifestation of this underlying structure. The differences between dataset families reflect differences in implementing the structure; Argonodes can be made to recognise these various implementations, automatically describe them (i.e. build accessors), and identify them as variants of known underlying concepts (like adding a new leaf on the appropriate branch of a concept tree).

* The product of this inference is a concept tree that can then be used to parse datasets: filling databases that we can easily explore even if the information comes from datasets with widely varying formats.

Is that a generally correct understanding of your idea, @pdehaye ?

Yes, Argonodes could do this (concrete example: parsing FAIRTIQ data to find the raw Google Location History data essentially untouched for each trip, though of course with each trip indexed at FAIRTIQ-specific accessors).
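
A minimal sketch of that FAIRTIQ example (all accessor names invented): the raw Location History record survives essentially untouched, just re-indexed per trip under a FAIRTIQ-specific accessor, so one relative traversal recovers it:

    # Invented accessors, for illustration only: the Google Location
    # History payload is unchanged, merely nested per trip.
    fairtiq_export = {
        "trips": [
            {"trip_id": "t-42",
             "location_history": {  # raw GLH-style record, untouched
                 "latitudeE7": 465198770,
                 "longitudeE7": 66322730,
                 "timestamp": "2022-06-16T09:00:00Z"}},
        ]
    }

    # One relative traversal recovers the familiar structure per trip.
    glh_records = [t["location_history"] for t in fairtiq_export["trips"]]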

BUT the reality is that we shouldn't be doing it in Argonodes now, for two reasons:

Amustache commented 2 years ago

I'm taking the time to respond now, based on my current understanding of things - feel free to correct me.

@alexbfree

this collective Twitter thing is a way of introducing more semantics as well, around a clear business need across data sources.

If I understand correctly, the idea is to have a semantic model that allows for the correct description of at least Twitter data, but also data from other sources.

My answer is that it's actually "by design". Example: If we have a semantic for "this is a location", this semantic will be identical for Twitter, Facebook, Google, ...

Nevertheless, the task assigned to @emmanuel-hestia and me is to focus on a Twitter description, to start with (this ticket).

I imagine we have some information model design work to do there?

Yes. The idea is that I would like to populate the https://github.com/hestiaAI/Argonodes repository with semantic descriptions (e.g., https://github.com/hestiaAI/Argonodes/wiki/Argonodes%3AfoundType), when these do not exist elsewhere (e.g., https://schema.org/name).

These descriptions can then be reused in other models from other data sources.

I thought the design of the schema for the Twitter data (ie this ticket) was to be informed by a top down / abstracting commonalities approach not just bottom up / tightly mirroring the data.

I think the goal would be that we would have one model that can support multiple platforms

For me, the nuance comes here: the models are per-source, but the (semantic) descriptions used in the models are global (e.g., a location is still a location, regardless of its name in a model).
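
A sketch of that nuance (the model entries are invented; the schema.org URL is real): two per-source models pointing at one global description:

    # Invented model entries: models are per-source, but the semantic
    # description they point to is global and defined once.
    LOCATION = "https://schema.org/Place"  # global description of "location"

    TWITTER_MODEL = {"$.tweet.geo": LOCATION}
    FACEBOOK_MODEL = {"$.post.place": LOCATION}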

@pdehaye x @emmanuel-hestia

At that point I think we should convert/move the present discussion to, well, a discussion in the Argonodes repo, because we are moving away from "an issue" (= technical, with a goal). I'm waiting for @alexbfree's and @pdehaye's endorsement (thumbs up 👍 on this post), and then I'll do it.

"By identifying relative positions of data with relative traversals, it helps more quickly identify modularity/compositionality in how X should be constructed downstream."

"with X = SQL tables"

tl;dr:

    flowchart LR
        A[Data source] --> B(Raw data)
        B --> |Potential Parser| C(JSON)
        C --> D(Model)
        D --> E[SQL instructions/tables]

This has already been discussed with @fquellec, and we will be working closely together to reach that goal.

"with X = experiences"

This has already been discussed with @fquellec, and we will be working closely together to reach that goal.

"with X = high-level relational semantics (how concepts interrelate)"

tl;dr:

    flowchart LR
        A[Data sourceS] --> B(ModelS)
        B --> C[Concept trees]

Argonodes still has the wrong architecture, I suspect.

Yes, this has already been discussed with @fquellec: the fact that the "final" output structure will/should evolve further.

Please feel free to a) look at the examples and b) describe a correct architecture, before we start implementing too much.

Anyhow

The Twitter model, along with the semantic descriptions that will later be reused in other models for other sources, is on its way.

Amustache commented 2 years ago

Dropping that here because links can be made: https://github.com/hestiaAI/hestialabs-experiences/discussions/578

alexbfree commented 2 years ago

I think this has been superseded by hestiaAI/clients/issues/35 and also #1036 - closing this; bizdev, please reopen if you disagree.

pdehaye commented 2 years ago

It is not quite superseded, because here we discussed the nested structure in more depth. But OK to close.