shaigue / pmi_masking

This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.

Try to optimize `aggregate_ngram_counts` #17

Open shaigue opened 1 year ago

shaigue commented 1 year ago

Currently, this is the main bottleneck of the program and takes more than 80% of the total runtime. We expect that as the data grows, this share will be even greater. Think a bit about how to optimize it; if a simple improvement comes up, it could be worth implementing.

shaigue commented 1 year ago

When running the code with 120 CPUs, this is the main bottleneck, so I want to figure out how to speed it up. Maybe increasing the batch size or using a different query could help. On bookcorpus, this step takes ~4 hours, while all the other operations together take ~30 minutes.
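For reference, the accumulation pattern being replaced is an insert-or-update per shard, roughly like the hypothetical sketch below. The table name, column names `(ngram, cnt)`, and file paths are illustrative assumptions, not the repository's actual schema:

```python
# Hypothetical sketch of the insert-or-update (upsert) pattern.
# Assumes one pre-aggregated Parquet file per worker, each with
# columns (ngram, cnt); names here are illustrative only.
import duckdb

con = duckdb.connect("ngram_counts.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS counts (ngram TEXT PRIMARY KEY, cnt BIGINT)")

shards = ["shards/counts_0.parquet", "shards/counts_1.parquet"]
for shard in shards:
    # One upsert statement per shard: each row either inserts a new
    # ngram or bumps the running total for an existing one.
    con.execute(f"""
        INSERT INTO counts
        SELECT ngram, cnt FROM read_parquet('{shard}')
        ON CONFLICT (ngram) DO UPDATE SET cnt = counts.cnt + excluded.cnt
    """)
```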

**UPDATE 28.06.2023:** The solution I want to try is to use DuckDB's ability to load multiple Parquet files into a single table, and to replace the insert-or-update approach with an aggregate-groupby approach, resulting in a single query per ngram size.

I want to test that and see if we get any improvements.
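A minimal sketch of the aggregate-groupby idea, assuming per-worker shard files named `shards/ngram_{n}_*.parquet` with columns `(ngram, cnt)` and a maximum ngram size of 5 — all assumptions, not the repo's actual layout:

```python
# Minimal sketch of the aggregate-groupby approach: DuckDB scans all the
# per-worker Parquet shards for one ngram size as a single virtual table
# and produces the final counts with one GROUP BY query.
import duckdb

MAX_NGRAM_SIZE = 5  # assumed; the real value comes from the experiment config

con = duckdb.connect("ngram_counts.duckdb")
for n in range(1, MAX_NGRAM_SIZE + 1):
    # read_parquet() accepts a glob, so every shard for this ngram size
    # is aggregated by a single query instead of one upsert per shard.
    con.execute(f"""
        CREATE TABLE counts_{n} AS
        SELECT ngram, SUM(cnt) AS cnt
        FROM read_parquet('shards/ngram_{n}_*.parquet')
        GROUP BY ngram
    """)
```

A single hash aggregate lets DuckDB parallelize the scan and avoids per-row index maintenance, which is likely where the upsert approach spends most of its time.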

shaigue commented 1 year ago

OK, so the fix works! I'm just left with copying the timing results from the logs into the README.md file.