lzamparo / embedding

Learning semantic embeddings for TF binding preferences directly from sequence

Fix DatasetReader to truly act as a generator for SELEX data #5

Closed lzamparo closed 7 years ago

lzamparo commented 7 years ago

Whatever embedding model is used, I need to change DatasetReader so that it scales to larger data sets.

lzamparo commented 7 years ago

So the above list needs to be amended. This issue is already more complex than one single issue should be, but I'll keep it as long as I can still use it to organize development.

lzamparo commented 7 years ago

A few updates:

lzamparo commented 7 years ago

One clean way to parallelize building the Unigram dictionary without blowing up memory on the full data set is to expose logic in DatasetReader.generate_token_worker that collapses the list of tokens in each file into a collections.Counter mapping each unique k-mer (key) to its count (value). I'd also have to expose new logic so that updating the UnigramDictionary calls add_count instead of add.
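A rough sketch of that idea, to make the add vs. add_count difference concrete. The UnigramDictionary class here is a hypothetical stand-in rather than the real one, and the add_count(token, count) signature, count_kmers, and update_dictionary are assumptions for illustration:

```python
from collections import Counter


def count_kmers(tokens):
    """Collapse one file's token list into a {k-mer: count} Counter."""
    return Counter(tokens)


class UnigramDictionary:
    """Hypothetical stand-in: only add and add_count are sketched here."""

    def __init__(self):
        self.counts = Counter()

    def add(self, token):
        # One call per token occurrence.
        self.counts[token] += 1

    def add_count(self, token, count):
        # One call per unique k-mer, carrying its precomputed count.
        self.counts[token] += count


def update_dictionary(dictionary, per_file_counts):
    """Merge per-file Counters into the dictionary via add_count."""
    for counts in per_file_counts:
        for kmer, count in counts.items():
            dictionary.add_count(kmer, count)
    return dictionary


# Toy token lists standing in for two tokenized files.
file_a = ["ACGT", "CGTA", "ACGT"]
file_b = ["ACGT", "TTTT"]
d = update_dictionary(
    UnigramDictionary(), [count_kmers(file_a), count_kmers(file_b)]
)
print(d.counts["ACGT"])  # 3
```

The point is that each worker hands back one entry per unique k-mer instead of one entry per token, so what gets passed around scales with the vocabulary rather than with the size of the file.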

lzamparo commented 7 years ago

Dictionary prep now uses concurrent.futures, and the DatasetReader failure turned out to be a queue-overloading bug. Closing this issue.
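For the record, the concurrent.futures version looks roughly like the sketch below. This is not the code in the repo: kmers_in_file, count_all, the one-sequence-per-line file layout, and the default k are all assumptions made for illustration.

```python
import concurrent.futures
from collections import Counter


def kmers_in_file(path, k=8):
    """Count overlapping k-mers in one file (assumes one sequence per line)."""
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            seq = line.strip()
            # Sequences shorter than k yield an empty range and are skipped.
            counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def count_all(paths, k=8,
              executor_cls=concurrent.futures.ProcessPoolExecutor,
              max_workers=None):
    """Count k-mers per file in worker tasks, then merge in the parent."""
    total = Counter()
    with executor_cls(max_workers=max_workers) as pool:
        # Each task returns one Counter per file, never a raw token list.
        for counts in pool.map(kmers_in_file, paths, [k] * len(paths)):
            total.update(counts)
    return total
```

Because each worker returns a single Counter per file, nothing has to be pushed token by token through a shared queue. When using ProcessPoolExecutor from a script, the call to count_all should sit under an `if __name__ == "__main__":` guard on platforms that spawn worker processes.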