hestiaAI / hestialabs-experiences

HestiaLabs Data Experiences & Digipower Academy
https://digipower.academy

slow Twitter aggregate experience (no immediate solution yet) #1011

Open alexbfree opened 2 years ago

alexbfree commented 2 years ago

Aggregation test this morning

This morning we tested brown's Twitter aggregation with five participants (Emmanuel, Yann, Jessica, Marie-Pierre, and me; I was using the big Twitter sample data).

Processing the files on my computer in Firefox took about 1 minute; creating the visualization was slow too, but a matter of seconds rather than minutes. Several times Firefox asked me whether it should kill the JavaScript process. Someone with a faster computer ran it in about 20 seconds.

We tried again, sharing only the data needed for the aggregation (Advertising Overview, Targeting Criteria (1 advertiser)). This reduced the file size by about a quarter, but the processing time was only a few seconds less than before.

Later I tried this again with the 3 files that happened to remain after the meeting. Processing took 40 seconds in Firefox, but 20 seconds in Chrome. The advertising overview took 30 seconds in Firefox and about 10 in Chrome.

What takes time

Clicking the "Explore the data" button kicks off the whole loading-and-processing pipeline.
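Pieced together from the profiling breakdown later in this thread, it looks roughly like this (a sketch only; all function names are hypothetical, not the actual API):

```js
// Hypothetical sketch of the aggregation pipeline, inferred from the
// profiling numbers reported later in this thread. Names are made up.
async function exploreTheData(sharedFiles) {
  const encrypted = await fetchFiles(sharedFiles)    // download the shared archives
  const archives = await decryptFiles(encrypted)     // decrypt each archive
  const fileManager = await initFileManager(archives)
  const db = createDatabase()                        // in-memory sql.js database
  await populateDatabase(db, fileManager)            // row-by-row INSERTs (the slow part)
  return runVisualizationQueries(db)                 // SQL behind each graph
}
```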

This all runs faster with fewer and smaller files.

In the Twitter experience, the Advertising Overview uses far more data than the Targeting Criteria (1 advertiser).

Recommendations for using the current solution

Possible development steps

light version of the advertising overview

Next to the full version of the ad overview, participants would have a similar tab that visualizes less data. We could, for example, change the SQL query to load only the last month. (Apparently Twitter export data covers the last three months.)
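As a sketch of what that could look like (the table and column names here are invented, not the actual schema), the change would mostly be a date filter in the query:

```js
// Hypothetical sketch, not the real schema: scope the advertising
// overview to the last month with a WHERE clause on the impression date.
const LIGHT_OVERVIEW_SQL = `
  SELECT advertiserName, COUNT(*) AS impressions
  FROM adImpressions
  WHERE date(impressionTime) >= date('now', '-1 month')
  GROUP BY advertiserName
  ORDER BY impressions DESC;
`
// const results = db.exec(LIGHT_OVERVIEW_SQL)
```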

Removing some graphs from the overview might also help, but that's a bit more complicated.

further investigation by profiling

Once we know what takes the most time, we can try to optimize it. Maybe there is an encryption algorithm that's faster to decrypt? Maybe there is a faster way to dump participant databases, and to load all of them quickly into the aggregation database. Maybe we can process files in workers.
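A minimal way to get that breakdown would be to time each step, something like this (profiledRun and the step functions are illustrative, not our API):

```js
// Minimal profiling sketch: time each pipeline step to see where
// the time actually goes. profiledRun and the step names are made up.
async function profiledRun(label, fn) {
  const start = performance.now()
  const result = await fn()
  console.log(`${label}: ${((performance.now() - start) / 1000).toFixed(2)} s`)
  return result
}

// e.g.
// const archives = await profiledRun('decrypt', () => decryptFiles(encrypted))
// await profiledRun('populate DB', () => populateDatabase(db, fileManager))
```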

This will probably not give us a solution we can implement this week.

Shalf commented 2 years ago

Cf. https://kdrive.infomaniak.com/app/drive/193995/redirect/192440

alexbfree commented 2 years ago

@pdehaye @fquellec can you share some sample data so we can reproduce and investigate the problem?

alexbfree commented 2 years ago

standup: @andreaskundig is going to add some notes with suggestions

Shalf commented 2 years ago

Video of our troubleshooting this morning, in case it helps avoid doing it twice: https://kdrive.infomaniak.com/app/drive/193995/redirect/192488

Shalf commented 2 years ago

Biz-facing milestones:

andreaskundig commented 2 years ago

I made a branch with the quick fix I proposed. No one reacted to my proposal, but I had the impression the issue was super important, so I went ahead.

pdehaye commented 2 years ago

Thank you Andreas. It did work for our workshop. I'm not sure how much this will actually help, though, since, as I understand it, the Twitter data is time-limited.

andreaskundig commented 2 years ago

If profiling reveals that decryption eats up a lot of time, @Caochy says it might be possible to speed it up by weakening the encryption.

valentinoli commented 2 years ago

~ Twitter archives profiling feedback ~

Participant bubble

[screenshot: participant-side profiling breakdown]

Aggregator bubble

[screenshot: aggregator-side profiling breakdown]

valentinoli commented 2 years ago

From @alexbfree

ok, so for the participant side:

- 0.3 s to init the file manager
- 0.06 s to create the DB
- 15 s to populate the DB

and for the agg side:

- 0.2 s to fetch the file
- 0.8 s to decrypt
- 0.6 s to init the file manager
- 0.02 s to create the DB
- 22 s to populate the DB

In both cases, populating the DB is clearly the bottleneck.

valentinoli commented 2 years ago

Here is my (failed) attempt to implement transactions:

/**
 * Insert all the given items in the database.
 * IMPORTANT: the items should NOT contain any 'undefined' value. Use null instead.
 * @param {String} tableName the table in which we insert the values.
 * @param {Array<Object>} items an array of rows to insert. Each row should contain the same attributes.
 */
insert(tableName, items) {
  this.#checkState()
  if (!this.#tables.has(tableName)) {
    throw new Error('This table does not exist.')
  }
  const table = this.#tables.get(tableName)
  const columns = table.getColumnNames().filter(
    // disregard auto-incremented columns in the INSERT statement
    name => !table.datatype(name).toLowerCase().includes('autoincrement')
  )
  const columnsJoined = columns.join(', ')
  const statement = `INSERT INTO ${tableName}(${columnsJoined}) VALUES (${columns.map(c => `:${c}`).join(', ')});`

  const statements = items.map((item) => {
    // BUG: prepare() returns a Statement object, not the resolved SQL
    // string we need here, so joining these below yields
    // '[object Object]' lines instead of valid SQL
    return this.#db.prepare(statement, item)
  })
  const BATCH_SIZE = 5000
  while (statements.length) {
    const statementsBulk = statements.splice(0, BATCH_SIZE)
    const transaction = `
      BEGIN TRANSACTION;
      ${statementsBulk.join('\n')}
      COMMIT TRANSACTION;
    `
    this.#db.run(transaction)
  }
}

Edit: I suspect sql.js only supports executing one statement at a time... we need to verify
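For reference, a minimal sketch of an alternative that avoids building one big multi-statement string: issue BEGIN/COMMIT as separate run() calls and reuse a single prepared statement for every row. Untested, and written against sql.js's documented Statement API rather than our wrapper:

```js
// Untested sketch: explicit transaction around one reused prepared
// statement, instead of concatenating thousands of statements.
insertWithTransaction(tableName, items, columns) {
  const placeholders = columns.map(c => `:${c}`).join(', ')
  const stmt = this.#db.prepare(
    `INSERT INTO ${tableName}(${columns.join(', ')}) VALUES (${placeholders});`
  )
  this.#db.run('BEGIN TRANSACTION;')
  try {
    for (const item of items) {
      // sql.js expects named-parameter keys to include the ':' prefix
      const bound = Object.fromEntries(columns.map(c => [`:${c}`, item[c]]))
      stmt.run(bound) // bind + step + reset for this row
    }
    this.#db.run('COMMIT;')
  } catch (e) {
    this.#db.run('ROLLBACK;')
    throw e
  } finally {
    stmt.free()
  }
}
```

In native SQLite, one transaction per batch is the standard bulk-insert speedup, since it avoids per-row journaling; whether it helps as much in sql.js's in-memory WASM build is exactly what we'd need to measure.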

valentinoli commented 2 years ago

We need to investigate this further

HaixinShi commented 2 years ago

Ready for biz-dev testing. @andreaskundig has built an overview of the last month to reduce the amount of data.

Shalf commented 2 years ago

Advertising overview (1 month) works and is pretty cool. I only had 2 datasets at hand and a decent amount of RAM, so I didn't see much of a difference in computing time, but I'm sure it'll help TH on Monday.

alexbfree commented 2 years ago

Based on the positive test result, I would suggest we put this back to New for discussion between dev and bizdev, to see whether there is more to do here on general performance improvement or whether we just close this for now.

pdehaye commented 2 years ago

Transactions still seem like a worthwhile idea, no?

For the month-old overview, thank you, that seems to do the trick. However note, and this is just a remark, that it introduces new complexities: people now have two options for sharing the same data, just scoped differently, and they can contribute it to different pools (one month, all time). I think it would make more sense to pool everything into one, and then give the aggregator two views to select from.

alexbfree commented 2 years ago

Moving to 'specs needed' for dev to explore and test more.

It would be useful if @pdehaye could comment on whether this is less critical/urgent now; it seems to be.

pdehaye commented 2 years ago

It sounds like the particular Twitter-related problem has been (momentarily) resolved, at the expense of creating a lot of new tabs that are somewhat incompatible with each other for no particularly good reason. Hence it is less urgent.

It also sounds to me like the more generic underlying problem has to do with batch-importing data into the database.

Finally, this all needs to be rebuilt on top of better abstractions for how to build experiences. Twitter aggregate is the cutting edge of what we want to do; it feels more correct to push for the better abstractions elsewhere, and progressively re-implement Twitter single/aggregate and the linking.