Closed lzamparo closed 7 years ago
So the above list needs to be amended. This issue is already more complex than one single issue should be, but I'll keep it as long as I can still use it to organize development.
A few updates:
One clean way of parallelizing the Unigram dictionary build that does not blow up the memory on the full data set is to expose logic in `DatasetReader.generate_token_worker` to process the list of tokens in each file into a `collections.Counter` with unique k-mers as keys and counts as values. I'd also have to expose new logic for updating the `UnigramDictionary` to use `add_count` instead of `add`.
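A minimal sketch of that idea. `DatasetReader.generate_token_worker` and `UnigramDictionary` are the project's own names; the k-mer tokenization below and the exact `add_count` signature are assumptions for illustration:

```python
from collections import Counter


def count_kmers(lines, k=3):
    """Per-file worker logic (sketch): fold each file's sequences
    into a Counter of unique k-mer -> count, instead of emitting
    every token individually."""
    counts = Counter()
    for line in lines:
        seq = line.strip()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


class UnigramDictionary:
    """Minimal stand-in for the real class, showing the proposed
    bulk-update path alongside the existing per-token path."""

    def __init__(self):
        self.counts = Counter()

    def add(self, token):
        # Existing path: one call per token occurrence.
        self.counts[token] += 1

    def add_count(self, token, n):
        # Proposed path: one call per unique token, with its count.
        self.counts[token] += n
```

The win is that each worker returns one small Counter per file rather than the full token stream, so the parent process only merges aggregated counts.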
Dictionary prep now uses `concurrent.futures`, and the failure of `DatasetReader` has been rightly judged to be a queue-overloading bug. Closing this issue.
Whatever embedding model is used, I need to change the `DatasetReader` to be able to scale to larger data sets.