Open alexbfree opened 2 years ago
@pdehaye @fquellec can you share some sample data so we can reproduce and investigate the problem?
standup: @andreaskundig is going to add some notes with suggestions
Video of our troubleshooting this morning, in case it helps avoid doing it twice: https://kdrive.infomaniak.com/app/drive/193995/redirect/192488
Biz-facing milestones:
I made a branch with the quick fix I proposed. Although no one reacted to my proposal, I had the impression the issue was super important.
Thank you Andreas. It did work for our workshop. I'm not sure how much this will actually help, though, since as I understand it the Twitter data is time-limited.
If profiling reveals that decryption eats up a lot of time, @Caochy says there might be a possibility to speed it up by weakening the encryption.
~ Twitter archives profiling feedback ~
```
Participant bubble -> Aggregator bubble

UnitDownload.fetchFiles -> UnitDownload.fetchFiles
    UnitDownload.fetchFiles.getFile...
    UnitDownload.fetchFiles.decryptBlob
TheDataExperience.onUnitFilesUpdate -> TheDataExperience.onUnitFilesUpdate
    TheDataExperience.onUnitFilesUpdate.initFileManager
    TheDataExperience.onUnitFilesUpdate.createDB
    TheDataExperience.onUnitFilesUpdate.generateRecords
    TheDataExperience.onUnitFilesUpdate.insertRecords
```
From @alexbfree
ok so for the participant side:
- 0.3 secs to init the file manager
- 0.06 secs to create the DB
- 15 secs to populate the DB

and for the agg side:
- 0.2 secs to fetch the file
- 0.8 secs to decrypt
- 0.6 secs to init the file manager
- 0.02 secs to create the DB
- 22 secs to populate the DB
Here is my (failed) attempt to implement transactions
```js
/**
 * Insert all the given items in the database.
 * IMPORTANT: the items should NOT contain any 'undefined' value. Use null instead.
 * @param {String} tableName the table in which we insert the values.
 * @param {Array<Object>} items an array of rows to insert. Each row should contain the same attributes.
 */
insert(tableName, items) {
  this.#checkState()
  if (!this.#tables.has(tableName)) {
    throw new Error('This table does not exist.')
  }
  const table = this.#tables.get(tableName)
  const columns = table.getColumnNames().filter(
    // disregard auto-incremented columns in the INSERT statement
    name => !table.datatype(name).toLowerCase().includes('autoincrement')
  )
  const columnsJoined = columns.join(', ')
  const statement = `INSERT INTO ${tableName}(${columnsJoined}) VALUES (${columns.map(c => `:${c}`).join(', ')});`
  const statements = items.map((item) => {
    // prepare() returns a Statement object,
    // but here we would need the resolved statement as a string
    return this.#db.prepare(statement, item)
  })
  const BATCH_SIZE = 5000
  while (statements.length) {
    const statementsBulk = statements.splice(0, BATCH_SIZE)
    const transaction = `
      BEGIN TRANSACTION;
      ${statementsBulk.join('\n')}
      COMMIT TRANSACTION;
    `
    this.#db.run(transaction)
  }
}
```
Edit: I suspect sql.js only supports executing one statement at a time... we need to verify
We need to investigate this further
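
If sql.js indeed only executes one prepared statement at a time, a possible alternative worth testing is to keep a single prepared statement and re-run it per row inside one transaction, which is where the usual bulk-insert speedup comes from. A sketch, untested against our schema (the `bulkInsert` name and the positional-binding approach are my own, not existing code):

```js
// Sketch of a transaction-based bulk insert for sql.js.
// Assumes `db` is an sql.js Database and `columns` holds the
// non-autoincrement column names, as in the insert() method above.
function bulkInsert(db, tableName, columns, items) {
  const placeholders = columns.map(() => '?').join(', ')
  const stmt = db.prepare(
    `INSERT INTO ${tableName} (${columns.join(', ')}) VALUES (${placeholders});`
  )
  db.exec('BEGIN TRANSACTION;')
  try {
    for (const item of items) {
      // positional binding; map undefined to null, per the note in the docstring above
      stmt.run(columns.map(c => item[c] ?? null))
    }
    db.exec('COMMIT;')
  } catch (err) {
    db.exec('ROLLBACK;')
    throw err
  } finally {
    stmt.free()
  }
}
```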
Ready for biz-dev testing. @andreaskundig has built an overview of the last month to reduce the amount of data.
The Advertising overview (1 month) works and is pretty cool. I only had 2 datasets at hand, and with decent RAM I didn't see such a visible difference in computing time, but I'm sure it'll help TH on Monday.
Based on the positive test result I would suggest we put this back to New, for discussion between dev+bizdev, to see if there is more to do here on general performance improvement or whether we just close for now.
Transactions still seem like a worthwhile idea, no?
For the month-old overview, thank you, that seems to do the trick. Note, however, and this is just a remark, that it introduces new complexities: people now have two options for sharing the same data, just scoped differently, and they can pool it into different pools (month-old, forever). I think it would make more sense to pool into one, and then the aggregator would again have two views to select from.
Moving to 'specs needed' for dev to explore and test more.
It would be useful if @pdehaye could comment on whether this is less critical/urgent now; it seems to be so.
It sounds like the particular Twitter-related problem has been (momentarily) resolved, at the expense of creating a lot of new tabs that are somewhat incompatible with each other for no good reason necessarily. Hence it would be less urgent.
It also sounds to me like the more generic underlying problem has to do with batch-importing data into the database.
Finally, this is all to be re-built on top of better abstractions of how to build experiences. The Twitter aggregate is the cutting edge of what we want to do, so it feels more correct to push for the better abstractions elsewhere, and progressively re-implement the Twitter single/aggregate experiences and the linking.
Aggregation test this morning
This morning we tested brown's twitter aggregation with five participants (Emmanuel, Yann, Jessica, Marie-Pierre, and myself; I was using the big twitter sample data).
Processing the files on my computer in Firefox took 1 minute; creating the visualization was slow, but rather a matter of seconds. Several times Firefox asked me if it should kill the javascript process. Someone with a faster computer had it run in about 20 seconds.
We tried again by sharing only the data needed for the aggregation (Advertising Overview, Targeting Criteria (1 advertiser)). This reduced the file size by about 1/4, but the processing time was just a few seconds less than before.
Later I tried this again with the 3 files that happened to remain after the meeting. Processing took 40 seconds in Firefox, but 20 seconds in Chrome. The advertising overview took 30 seconds in Firefox, and about 10 in Chrome.
What takes time
Clicking the button "Explore the data" starts the following process:
This all runs faster with fewer and smaller files.
In the twitter experience, the data for the Advertising Overview takes up a lot more space than the data for the Targeting Criteria (1 advertiser).
Recommendations for using the current solution
Possible development steps
light version of the advertising overview
Next to the full version of the ad overview, participants would have a similar tab that visualizes less data. We could, for example, change the SQL query to load only the last month. (Apparently twitter export data covers the last three months.)
Removing some graphs from the overview might also help, but that's a bit complicated.
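
To illustrate the "last month only" idea, something along these lines could work, assuming the records carry a timestamp column. The table and column names here (`adImpressions`, `impressionTime`, `advertiserName`) are made up for the sketch and are not the experience's real schema:

```js
// Hypothetical query for a lighter Advertising Overview:
// aggregate only the impressions from the last month.
const LIGHT_OVERVIEW_SQL = `
  SELECT advertiserName, COUNT(*) AS impressions
  FROM adImpressions
  WHERE impressionTime >= date('now', '-1 month')
  GROUP BY advertiserName
  ORDER BY impressions DESC;
`
const result = db.exec(LIGHT_OVERVIEW_SQL) // db: the sql.js Database used by the experience
```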
further investigation by profiling
Once we know what takes the most time we could try to optimize it. Maybe there is an encryption algorithm that's faster to decrypt? Maybe there is a faster way to dump participant databases and a way to load all of them quickly into the aggregation database. Maybe we can process files in workers (see the sketch below).
This will probably not give us a solution we can implement this week.
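
On the worker idea, here is a rough sketch of what moving the file processing off the main thread could look like, so the browser stops flagging the page as unresponsive. The file name `fileProcessor.worker.js`, the `generateRecords` helper, and the message shape are assumptions, not existing code:

```js
// main thread (sketch): hand the uploaded files to a dedicated worker so the
// heavy parsing and DB population no longer block the UI.
function processInWorker(files) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('fileProcessor.worker.js') // hypothetical worker file
    worker.onmessage = ({ data }) => { resolve(data.records); worker.terminate() }
    worker.onerror = (err) => { reject(err); worker.terminate() }
    worker.postMessage({ files }) // File/Blob objects are structured-cloneable
  })
}

// fileProcessor.worker.js (sketch):
// self.onmessage = async ({ data }) => {
//   const records = await generateRecords(data.files) // hypothetical helper
//   self.postMessage({ records })
// }
```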