shaigue opened 1 year ago
When running the code with 120 CPUs, this is the main bottleneck, so I want to figure out how to speed it up. Maybe increasing the batch size, or using a different query, could help. On bookcorpus this step takes ~4 hours, while all other operations together take ~30 minutes.
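For context, the insert-or-update approach mentioned in the update below is presumably something like a per-batch upsert. A minimal sketch, assuming a DuckDB table keyed by the ngram string (the database file, table name, schema, and `add_batch` helper are all hypothetical, not the project's actual code):

```python
import duckdb

con = duckdb.connect("ngrams.duckdb")  # hypothetical database file
con.execute(
    "CREATE TABLE IF NOT EXISTS ngram_counts (ngram TEXT PRIMARY KEY, count BIGINT)"
)

def add_batch(batch: list[tuple[str, int]]) -> None:
    # One upsert round-trip per batch: every row pays the cost of a
    # primary-key lookup, which is what makes this path slow at scale.
    con.executemany(
        "INSERT INTO ngram_counts VALUES (?, ?) "
        "ON CONFLICT (ngram) DO UPDATE SET count = count + excluded.count",
        batch,
    )
```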
**UPDATE 28.06.2023:** The solution I want to try is to use DuckDB's ability to load multiple Parquet files into a single table, and to replace the insert-or-update approach with an aggregate/group-by approach, so that each ngram size is handled by a single query (sketched below).
I want to test that and see if we get any improvement.
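A minimal sketch of what those single queries could look like, assuming each worker writes its partial counts to Parquet files such as `counts_{n}gram_*.parquet` (the file-name pattern, table names, and ngram size range are assumptions, not the project's actual layout):

```python
import duckdb

con = duckdb.connect("ngrams.duckdb")  # hypothetical database file

for n in range(1, 6):  # assumed ngram sizes 1..5
    # read_parquet() with a glob scans all shard files as one table,
    # and a single GROUP BY aggregates them, replacing per-batch upserts.
    con.execute(f"""
        CREATE OR REPLACE TABLE ngram_{n}_counts AS
        SELECT ngram, SUM(count) AS count
        FROM read_parquet('counts_{n}gram_*.parquet')
        GROUP BY ngram
    """)
```

The design win here is that no index has to be maintained during ingestion; DuckDB builds its hash aggregate once over the full scan.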
OK, so the fix works! I'm just left with copying the timing results from the logs into the README.md file.
Currently this is the main bottleneck of the program, taking more than 80% of the total runtime, and we expect that share to grow with the size of the data. Try to think a bit about how to optimize this; if a simple improvement comes up, it could be worth implementing.