Closed lzamparo closed 7 years ago
So the above list needs to be amended. This issue is already more complex than one single issue should be, but I'll keep it as long as I can still use it to organize development.
A few updates:
One clean way of parallelizing the Unigram dictionary build that does not blow up the memory on the full data set is to expose logic in `DatasetReader.generate_token_worker` to process the list of tokens in each file into a `collections.Counter` with unique k-mers as keys and counts as values. I'd also have to expose new logic for updating the `UnigramDictionary` to use `add_count` instead of `add`.
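A minimal sketch of that idea. `DatasetReader.generate_token_worker` and `UnigramDictionary` are the project's own names; the k-mer tokenization below and the exact `add_count` signature are assumptions for illustration:

```python
from collections import Counter


def count_kmers(lines, k=3):
    """Per-file worker logic (sketch): fold each file's sequences
    into a Counter of unique k-mer -> count, instead of emitting
    every token individually."""
    counts = Counter()
    for line in lines:
        seq = line.strip()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


class UnigramDictionary:
    """Minimal stand-in for the real class, showing the proposed
    bulk-update path alongside the existing per-token path."""

    def __init__(self):
        self.counts = Counter()

    def add(self, token):
        # Existing path: one call per token occurrence.
        self.counts[token] += 1

    def add_count(self, token, n):
        # Proposed path: one call per unique token, with its count.
        self.counts[token] += n
```

The win is that each worker returns one small Counter per file rather than the full token stream, so the parent process only merges aggregated counts.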
Dictionary prep now uses `concurrent.futures`, and the failure of `DatasetReader` has been rightly judged to be a queue-overloading bug. Closing this issue.
Whatever embedding model is used, I need to change the `DatasetReader` to be able to scale to larger data sets.