green-coding-solutions / green-metrics-tool

Measure energy and carbon consumption of software
https://metrics.green-coding.io
GNU Affero General Public License v3.0

Make the carbonDB_add deduplication function faster #753

Closed ribalba closed 3 weeks ago

ribalba commented 5 months ago

The problem is that the function currently takes 11 seconds to complete, which triggers a timeout. We need to discuss whether we should either 1) go back to row inserts or 2) make the deduplication async.

Initially we did a check on every insert, but then we switched to a bulk insert followed by a check for duplicates after the insert. This was faster in the benchmarks, where we added loads of keys at once, but in real life we only ever add a key or two, so I propose going back to the on-insert check.

See discussion here: https://github.com/green-coding-solutions/green-metrics-tool/pull/676#discussion_r1492409659
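
For illustration, a minimal sketch of what a check on every insert could look like. It uses the standard-library `sqlite3` module as a stand-in for PostgreSQL so that it runs on its own; the table `measurements`, its columns, and the helper `add_if_new` are hypothetical and not taken from the actual carbonDB code.

```python
import sqlite3

def add_if_new(conn, machine, ts, energy):
    """Insert one row unless a row with the same (machine, ts) key exists.

    Returns True if the row was inserted, False if it was a duplicate.
    Hypothetical schema; not the real carbonDB tables.
    """
    exists = conn.execute(
        "SELECT 1 FROM measurements WHERE machine = ? AND ts = ?",
        (machine, ts),
    ).fetchone()
    if exists:
        return False
    conn.execute(
        "INSERT INTO measurements (machine, ts, energy) VALUES (?, ?, ?)",
        (machine, ts, energy),
    )
    conn.commit()
    return True

# Usage: the second call with the same key is rejected.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (machine TEXT, ts INTEGER, energy REAL)")
first = add_if_new(conn, "m1", 1, 0.5)
second = add_if_new(conn, "m1", 1, 0.5)
```

Note that a select-then-insert check like this is not safe against concurrent writers unless the key also has a unique constraint.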

ArneTR commented 5 months ago

How many data rows are typically in a frame that is sent to the API?

Does it maybe make sense to only send one data row per API request and then even skip the time_stamping data field?

ribalba commented 5 months ago

It depends on the configuration. Let's assume 5-second sampling and an upload every 5 minutes, which is 60 values. I don't think individual API requests make sense for this case, but individual inserts might. I will need to do some timing checks.
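
The arithmetic behind the 60 values, as a small sketch. The variable names are only illustrative, and it assumes one value per sample.

```python
# Assumed configuration from the example above (not read from any real config file).
sampling_interval_s = 5        # one sample every 5 seconds
upload_interval_s = 5 * 60     # one upload every 5 minutes

# Number of values collected between two uploads.
values_per_upload = upload_interval_s // sampling_interval_s
print(values_per_upload)
```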

ArneTR commented 5 months ago

I see.

The way to go seems to be to copy the rows into a temporary table and then do the conflict checking inside the DB.

In my experience, single inserts are painfully slow with PostgreSQL, so I would not recommend them. Over the network they become even more unbearable, which might happen at a later stage when DB and API are no longer colocated.

Source for temp table code: https://stackoverflow.com/questions/73200153/how-to-ignore-duplicate-keys-using-the-psycopg2-copy-from-command-copying-csv-f
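
A rough sketch of the temp-table idea. To keep it self-contained it uses the standard-library `sqlite3` module as a stand-in: the staging load is done with `executemany` where PostgreSQL would use `COPY`, and `INSERT OR IGNORE ... SELECT` takes the place of PostgreSQL's `INSERT ... SELECT ... ON CONFLICT DO NOTHING`. The tables `measurements` and `staging`, their columns, and the helper `insert_deduplicated` are hypothetical.

```python
import sqlite3

def insert_deduplicated(conn, rows):
    """Bulk-load rows into a staging table, then let the DB skip duplicates.

    Returns the number of rows that were actually new.
    Hypothetical schema; not the real carbonDB tables.
    """
    cur = conn.cursor()
    before = cur.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]

    # 1) Load the whole batch into an empty staging table without any checks.
    cur.execute(
        "CREATE TEMP TABLE IF NOT EXISTS staging "
        "(machine TEXT, ts INTEGER, energy REAL)"
    )
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)

    # 2) Move the batch over in one statement; rows that violate the
    #    UNIQUE (machine, ts) constraint on the target table are skipped.
    cur.execute(
        "INSERT OR IGNORE INTO measurements (machine, ts, energy) "
        "SELECT machine, ts, energy FROM staging"
    )
    conn.commit()

    after = cur.execute("SELECT COUNT(*) FROM measurements").fetchone()[0]
    return after - before

# Usage: the second batch overlaps the first and repeats a row within itself.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements "
    "(machine TEXT, ts INTEGER, energy REAL, UNIQUE (machine, ts))"
)
new_first = insert_deduplicated(conn, [("m1", 1, 0.5), ("m1", 2, 0.6)])
new_second = insert_deduplicated(
    conn, [("m1", 2, 0.6), ("m1", 3, 0.7), ("m1", 3, 0.7)]
)
```

The point of this design is that the application issues a constant number of statements per batch, and the duplicate check happens inside the DB against the unique constraint rather than row by row from the API.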

ArneTR commented 3 weeks ago

This is done now I think?

ribalba commented 3 weeks ago

Here: https://github.com/green-coding-solutions/green-metrics-tool/commits/main/tools/carbondb_remove_duplicates.py