ooni / data

OONI Data CLI and Pipeline v5
https://docs.ooni.org/data

Improve performance of the DELETE operation on observation generation #87

Closed hellais closed 2 days ago

hellais commented 1 week ago

Currently the performance of the DELETE operation is a significant bottleneck in the processing pipeline. The cause of this is that, in order to perform the DELETE, every data part containing rows that match the WHERE clause has to be rewritten in full without the deleted rows.
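For reference, the kind of statement this is about is a ClickHouse mutation-style delete; the table and column names below are illustrative, not necessarily our exact schema:

```sql
-- A mutation-style delete: every data part containing matching rows is
-- rewritten in full, so it needs scratch disk space roughly proportional
-- to the size of the affected parts.
ALTER TABLE obs_web DELETE WHERE bucket_date = '2024-01-01';
```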

We need to do this DELETE operation so that we can clear older observations out of the relevant database tables before regenerating them.

Since the WHERE filter can potentially match a very large number of rows, and it might be touching a lot of quite big partitions, this means we have to have a lot of extra headroom in terms of storage requirements.

I was doing some research into this, and one approach could be to define a custom partition key and then either delete whole partitions at a time (which is probably not feasible, as we would have to create way more partitions than necessary) or in any case at least narrow the modifications to a single partition.
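A rough sketch of both options (the table name, columns, and partition expression here are assumptions for illustration, not our current schema):

```sql
-- Hypothetical observation table partitioned by processing day.
CREATE TABLE obs_web_sketch
(
    bucket_date Date,
    measurement_uid String,
    observation_idx UInt16
)
ENGINE = MergeTree
PARTITION BY bucket_date
ORDER BY (measurement_uid, observation_idx);

-- Option A: drop the whole partition. This is a metadata operation:
-- no parts are rewritten and no extra disk headroom is needed.
ALTER TABLE obs_web_sketch DROP PARTITION '2024-01-01';

-- Option B: keep the delete, but pin it to a single partition so the
-- rewrite cannot touch anything else.
ALTER TABLE obs_web_sketch DELETE IN PARTITION '2024-01-01'
    WHERE measurement_uid LIKE '2024010100%';
```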

As part of this change I am planning to also make some important schema changes to the observation tables, to set us up to be able to process data in hourly batches instead of daily ones.

My suggestion is therefore to do the following:

Before doing these changes, it's probably worth spending some time to better understand whether this will indeed lead to better performance.
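One concrete way to ground that evaluation is to measure how many parts and how much data the current deletes actually touch, e.g. via `system.parts` (assuming `obs_web` as the table name):

```sql
-- Per-partition part count and on-disk size for the observation table.
SELECT
    partition,
    count() AS part_count,
    formatReadableSize(sum(bytes_on_disk)) AS size_on_disk
FROM system.parts
WHERE table = 'obs_web' AND active
GROUP BY partition
ORDER BY partition;
```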

Another approach could be to look into whether we can make use of one of the other MergeTree table engines (e.g. CollapsingMergeTree), or structure our data in such a way that it becomes eventually consistent. This might however be tricky, as it would mean significantly changing the downstream data-consuming queries to be aware of this eventually consistent pattern (e.g. looking at the sign column in any aggregation).
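For concreteness, a minimal sketch of the sign-column pattern using CollapsingMergeTree (table and columns are hypothetical):

```sql
-- "Deleting" a row means inserting a copy of it with sign = -1;
-- matching +1/-1 pairs collapse asynchronously at merge time.
CREATE TABLE obs_web_collapsing
(
    measurement_uid String,
    observation_idx UInt16,
    anomaly UInt8,
    sign Int8
)
ENGINE = CollapsingMergeTree(sign)
ORDER BY (measurement_uid, observation_idx);

-- Every downstream aggregation must be sign-aware, because rows that
-- have not been merged yet are still present with both signs.
SELECT sum(anomaly * sign) AS anomalous_observations
FROM obs_web_collapsing;
```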

hellais commented 1 week ago

Here is a sample ClickHouse error that we get when we run into this issue:

Cannot reserve 8.43 MiB, not enough space
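That error is ClickHouse failing to reserve scratch space for the part rewrite. A quick way to see how much headroom is left is the standard `system.disks` table:

```sql
-- Free vs. total space on each disk known to the ClickHouse server.
SELECT
    name,
    path,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;
```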