modernmt / DataCollection

Data collection, alignment and TAUS repository
Apache License 2.0

langstat2candidates.py requires large amounts of RAM #8

Open achimr opened 7 years ago

achimr commented 7 years ago

langstat2candidates.py uses large amounts of RAM, particularly when run with the -candidates parameter (32-64 GB for large language pairs). The cause is that it reads the entire candidates file into memory, as a dictionary with the URLs as keys and the complete candidates file lines as values. Retaining all this data seems unnecessary. The high memory footprint reduces parallelizability and leads to crashes.
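One possible way to cut the footprint, sketched here rather than taken from the script: keep only a byte offset per URL in the dictionary and seek back into the candidates file when a full line is actually needed. The function names `build_offset_index` and `fetch_line` are hypothetical, and the sketch assumes a tab-separated candidates file with the URL in the first field, which may not match the real format.

```python
def build_offset_index(path, url_field=0, sep=b"\t"):
    """Map each URL to the byte offset of its line, not to the line itself.

    Assumption: one record per line, fields separated by `sep`,
    URL in column `url_field`. Adjust to the real candidates format.
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()      # position of the line about to be read
            line = handle.readline()
            if not line:                # empty bytes object means end of file
                break
            url = line.rstrip(b"\n").split(sep)[url_field]
            index[url.decode("utf-8")] = offset
    return index


def fetch_line(handle, offset):
    """Read one full line back from an open binary handle at `offset`."""
    handle.seek(offset)
    return handle.readline().decode("utf-8").rstrip("\n")
```

With this layout the dictionary values are small integers instead of whole lines. The URL keys still live in memory, so the saving depends on how long the rest of each line is. A matching URL would be resolved with something like `fetch_line(open(path, "rb"), index[url])`.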

achimr commented 7 years ago

Matching candidates from some languages into English with recent CommonCrawl data (2016_50) requires 60+ GB of RAM.