shaigue / pmi_masking

This repository contains code that takes a text corpus and creates a PMI masking vocabulary for it.

Try to optimize `aggregate_ngram_counts` #17

Open shaigue opened 1 year ago

shaigue commented 1 year ago

Currently, this is the main bottleneck of the program and takes more than 80% of the total runtime. We expect that as the data grows, this share will be even greater. Think a bit about how to optimize it; if a simple improvement comes up, it could be worth implementing.

shaigue commented 1 year ago

When running the code with 120 CPUs, this is the main bottleneck, so I want to figure out how to speed it up. Maybe increasing the batch size or using a different query could help. On bookcorpus, this step takes ~4 hours, while all the other operations together take ~30 minutes.
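For reference, the accumulation pattern being replaced is an insert-or-update per shard, roughly like the hypothetical sketch below. The table name, column names `(ngram, cnt)`, and file paths are illustrative assumptions, not the repository's actual schema:

```python
# Hypothetical sketch of the insert-or-update (upsert) pattern.
# Assumes one pre-aggregated Parquet file per worker, each with
# columns (ngram, cnt); names here are illustrative only.
import duckdb

con = duckdb.connect("ngram_counts.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS counts (ngram TEXT PRIMARY KEY, cnt BIGINT)")

shards = ["shards/counts_0.parquet", "shards/counts_1.parquet"]
for shard in shards:
    # One upsert statement per shard: each row either inserts a new
    # ngram or bumps the running total for an existing one.
    con.execute(f"""
        INSERT INTO counts
        SELECT ngram, cnt FROM read_parquet('{shard}')
        ON CONFLICT (ngram) DO UPDATE SET cnt = counts.cnt + excluded.cnt
    """)
```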

**UPDATE 28.06.2023:** The solution I want to try is to use DuckDB's ability to load multiple Parquet files into a single table, and to replace the insert-or-update approach with an aggregate-groupby approach, resulting in a single query per ngram size.

I want to test that and see if we get any improvements.
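A minimal sketch of the aggregate-groupby idea, assuming per-worker shard files named `shards/ngram_{n}_*.parquet` with columns `(ngram, cnt)` and a maximum ngram size of 5 — all assumptions, not the repo's actual layout:

```python
# Minimal sketch of the aggregate-groupby approach: DuckDB scans all the
# per-worker Parquet shards for one ngram size as a single virtual table
# and produces the final counts with one GROUP BY query.
import duckdb

MAX_NGRAM_SIZE = 5  # assumed; the real value comes from the experiment config

con = duckdb.connect("ngram_counts.duckdb")
for n in range(1, MAX_NGRAM_SIZE + 1):
    # read_parquet() accepts a glob, so every shard for this ngram size
    # is aggregated by a single query instead of one upsert per shard.
    con.execute(f"""
        CREATE TABLE counts_{n} AS
        SELECT ngram, SUM(cnt) AS cnt
        FROM read_parquet('shards/ngram_{n}_*.parquet')
        GROUP BY ngram
    """)
```

A single hash aggregate lets DuckDB parallelize the scan and avoids per-row index maintenance, which is likely where the upsert approach spends most of its time.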

shaigue commented 1 year ago

OK, so the fix works! I'm just left with copying the timing results from the logs into the README.md file.