Use accessors to define SQLite Databases

We have changed the way we can create an experience from a manifest several times, first with Sparql, then with custom pipelines and now with SQLite. These changes have resulted in a lot of logic being left out of the manifest, we need a way to standardize this and give the power back to a single JSON configuration file in order to facilitate the creation of experiences.

A standard tool we use in the project to navigate through JSON files are "Accessors". They allow us to easily specify parts of data points in a way that many people can understand. They are defined as follow:

{
      accessor: {
        filePath: 'comments_and_reactions/comments.json',
        jsonPath: '$.comments_v2[*]'
      },
      columns: [
        {
          name: 'number',
          path: 'timestamp'

        },
        {
          name: 'First comment',
          path: 'comment'
        }
      ]
}

So this issue aims to use this type of data structure to create a SQL database in the manifest, which we could then query to generate graphs. Then once done, convert all the experiences to use this new version of the configuration (this could be in another issue). Here is an example of what each manifest should look like:

{
  "title": "Twitter",
  "icon": "twittr.png",
  "dataPortal": "https://help.twitter.com/en/managing-your-account/how-to-download-your-twitter-archive",
  "data": ["twitter.zip", "twitter-sample.zip"],
  "hideSummary": true,
  "hideFileExplorer": true,
  "preprocessors": {
    "**/ad-impressions.js": "twitter",
    "**/ad-engagements.js": "twitter",
    "**/personalization.js": "twitter"
  },
  "databaseTables": [
    {
      "name": "engagements",
      "accessors": [
        {
          "accessor": {
            "filePath": "comments_and_reactions/comments.json",
              "jsonPath": "$.comments_v2[*]"
          },
          "columns": [
            {
              "path": "path.to.column1",
              "name": "Column1"
            },
            {
              "path": "path.to.column2",
              "name": "Column2"
            },
            {
              "path": "path.to.column3",
              "name": "Column3"
            }
          ]
        }
      ]
    },
    {
      "name": "impressions",
      "accessors": [
        {
          "accessor": {
            "filePath": "comments_and_reactions/comments.json",
              "jsonPath": "$.comments_v2[*]"
          },
          "columns": [
            {
              "path": "path.to.column1",
              "name": "Column1"
            },
            {
              "path": "path.to.column2",
              "name": "Column2"
            },
            {
              "path": "path.to.column3",
              "name": "Column3"
            }
          ]
        }
      ]
    }
  ],
  "defaultView": [
    {
      "key": "overview",
      "sql": "overview",
      "files": ["impressions", "engagements"],
      "visualization": "ChartViewOverviewTwitter.vue",
      "showTable": false,
      "title": "Advertising overview",
      "text": "Understand how much, who and why you are being targeted by Twitter ads in this interactive visualization. Click on any graph to filter the results."
    },
    {
      "key": "ads-per-advertiser",
      "sql": "ads-per-advertiser",
      "files": ["impressions", "engagements"],
      "visualization": "ChartViewTopRow.vue",
      "vizProps": {
        "xLabel": "ads"
      },
      "showTable": false,
      "title": "Top advertisers",
      "text": "See the ranking of advertisers that have shown you the most ads."
    },
    {
      "key": "targeting-criteria-by-advertiser",
      "sql": "targeting-criteria-by-advertiser",
      "files": ["impressions", "engagements"],
      "postprocessor": "sunburstTargeting",
      "visualization": "ChartViewSunburst.vue",
      "showTable": false,
      "title": "Targeting criteria (Top 10 advertisers)",
      "text": "Understand why you have been targeted by these advertisers. The graph shows the 10 advertisers who use the most targeting criteria on you. In the first ring you will find the different advertisers, in the second ring the types of targeting used (location, language, ...). And finally, if you click on a ring, the graph will enlarge to show you the targeting values (Swiss, French, ...) in the outer ring."
    },
    {
      "key": "targeting-criteria-all-advertisers",
      "sql": "targeting-criteria-all-advertisers",
      "files": ["impressions", "engagements"],
      "visualization": "wordcloud-targeting-criteria",
      "showTable": false,
      "title": "Targeting criteria (All advertisers)",
      "text": "Get more details on how you are targeted by these advertisers. This graph shows the number of times you have been targeted by different advertisers using a specific criterion. You can change the type of criterion and get information about the meaning of each criterion's type at the bottom of the graph."
    },
    {
      "key": "all-criteria-all-advertisers",
      "sql": "all-criteria-all-advertisers",
      "files": ["impressions", "engagements"],
      "postprocessor": "sunburstTargeting",
      "showTable": false,
      "title": "Targeting criteria (1 advertiser)",
      "visualization": "ChartViewSearchSunburst.vue",
      "text": "Visualize the targeting criteria used by one specific advertiser. You can look up the advertiser in the table below and click on it to visualise them."
    },
    {
      "key": "genericDateViewer",
      "customPipeline": "genericDateViewer",
      "visualization": "ChartViewGenericDateViewer.vue",
      "showTable": false,
      "title": "Timeline",
      "text": "See all the dated events in your files, corresponding to data that has been collected on you at or concerning a specific date."
    }
  ]
}

In the current Twitter database we have twitterAds and twitterCriteria tables. I suggest we keep with the SQL convention of using singular and CamelCased names for tables: TwitterAd and TwitterCriterion

Valentin: Ok. I'll do that for Twitter and show you on Hestia-dev for confirmation just to make sure I'm on the right track

Andreas: ok I haven't looked at it closely but I see some complications in the twitter pipeline, there's an if statement, and this line:

    Math.abs(d3.timeSecond.count(new Date(ad.time), new Date(time))) < 120

It would be nice if you find a simple way to make this configurable, but I suspect that things like these would remain in javascript. Maybe the name of the javascript function could be put in the manifest? I'm just thinking out loud, don't know the best solution.

Valentin: yeah it's a 4 line if statement. I thought these should be in JS since it is part of the code to feed the data into the tables

Andreas: And the targetingCriteria table is created from nested data in the impressions. It's not trivial

Valentin: so if I understand, you also want this part to be configured instead of hardcoded?

Andreas: If it makes sense, make everything configurable.

Valentin:

And the targetingCriteria table is created from nested data in the impressions. It's not trivial

you mean the twitterCriteria table?

Andreas: yes It's possible that the twitter example is too complicated to make entirely configurable in json The goal is to have something that we can use for other experiences. I guess there will be a point where we say that things get so complicated we need to do everything in javascript It would be nice to be able to create a first version of an experience just with json, so it can be done by someone who is not a coder. Then if it gets too complicated, we involve a programmer

Valentin:

The goal is to have something that we can use for other experiences. I guess there will be a point where we say that things get so complicated we need to do everything in javascript

You mean going back to doing everything in javascript? Then what's the point of this work? Or you just mean doing part of it or only for some experiences in javascript?

Andreas: I hope we're not going to reach this point for every experience.

Valentin: question: does the current implementation support primary/foreign keys?

Andreas: I don't know. Florian wrote a layer above sqlite, I don't know if he accepts foreign keys. If he does, I don't know if it's used somewhere ok, looking at sql.js, we don't support foreign keys

Valentin: Hmm ok is there a way to configure autoincremented ids?

Andreas: we do use joins in sql queries for twitter. I guess they're fast enough without foreign keys

is there a way to configure autoincremented ids?

Doesn't look like there is

Valentin: Ok. The data feeding of the Twitter database is complicated. It is not straightforward because there are some relations like, we try to link engagements with impressions

Andreas: I wonder if this could be simplified by making two tables with a foreign key instead of that

Valentin: so perhaps, as a first step, I could just focus on table definitions and find out how to pass the db as an argument to the databaseBuilder instead of creating the db in the databaseBuilder

I wonder if this could be simplified by making two tables with a foreign key instead of that

yes, it could be, but that is not supported in sql.js

Andreas: We can change sql.js, it's pretty barebones right now. I wonder if we even need it as a layer above sqlite. I would have tried to just create functions that generate sql instead of hiding it, but I didn't have the opportunity to ask Florian why he ended up doing things this way

so perhaps, as a first step, I could just focus on table definitions and find out how to pass the db as an argument to the databaseBuilder

Sounds like a good first step, yes try to keep the configuration as short as possible

In the following segment, we process all the ads and set engagement to 1 if the engagement matches the ad and it happens within 120 seconds of the impression. I have a couple of questions: Why are we not counting the engagements? Why do we need to process all the ads, would it not be enough to find the matching ad with Array.find() (since I expect only at most 1 ad will match)?

  engagements.forEach(v => {
    const tweetId = v.promotedTweetInfo?.tweetId ?? null
    const advertiserName = v.advertiserInfo?.advertiserName ?? null
    const displayLocation = v.displayLocation ?? null
    const time = v.impressionTime ?? null
    adsItems.forEach(ad => {
      if (
        ad.tweetId === tweetId &&
        ad.advertiserName === advertiserName &&
        ad.displayLocation === displayLocation &&
        Math.abs(d3.timeSecond.count(new Date(ad.time), new Date(time))) < 120
      ) {
        ad.engagement = 1
      }
    })
  })

Looking at the code, I think you are right about both, we would like to count the engagements and use Array.find(). What concerns me here is also the "trick" we use to match engagements and impressions, I'll see if there is a better way...

What concerns me here is also the "trick" we use to match engagements and impressions, I'll see if there is a better way...

@andreaskundig told me sql.js doesn't support foreign keys, so storing engagements in a separate table doesn't seem to be a viable option unless we use a different engine. One thing I see is that we could potentially exploit jsonpath-plus by concatenating the two files (impressions and engagments) into a single json that we can manipulate with JSONPath. Then we can possibly connect engagements and impressions with a complex JSONPath. In that case though, we would probably not be able to do the 120 sec check.

@andreaskundig told me sql.js doesn't support foreign keys

I'm not sure that's correct, see for instance, https://github.com/sql-js/sql.js/issues/221 It makes sense since sql.js is a JS implementation of SQLite which indeed supports foreign keys.

hestiaAI / hestialabs-experiences

Use accessors to define SQLite Databases #602