Closed lzamparo closed 7 years ago
Have done a lot of work on this issue for #5, but still no solution. The problem is probably that the producers fill the queue much faster than it can be consumed, overflowing a buffer whose size the Queue never bounds. I'm consuming ~10x fewer macrobatches than are being generated:
mski1743:kmer_models zamparol$ cat full_data_k_6_s_1.txt | grep -c "receiving macrobatch from child process"
3281
mski1743:kmer_models zamparol$ cat full_data_k_6_s_1.txt | grep -c "sending macrobatch to parent process"
35626
So, for now I should cut back on the number of processes, have each worker wait a bit longer between batches, and also make the macrobatches larger (so each worker takes longer to push to the consumer queue). Longer term, I should forget about this finicky multiprocessing baloney and use Dask.
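For reference, a minimal sketch of what the eventual Dask rewrite might look like, letting Dask's scheduler manage the worker pool instead of hand-rolled multiprocessing. The `shard_paths` list and `load_macrobatch` helper are hypothetical stand-ins, not the real `DatasetReader` internals:

```python
import dask.bag as db

def load_macrobatch(path):
    # Hypothetical loader: in the real code this would parse one shard of
    # k-mer examples from disk and return a macrobatch.
    return path.upper()

shard_paths = ["shard_0.txt", "shard_1.txt", "shard_2.txt"]

# Build a bag over the shards and let Dask fan the work out over processes.
macrobatches = (
    db.from_sequence(shard_paths)
      .map(load_macrobatch)
      .compute(scheduler="processes")
)
print(macrobatches)
```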
For completeness' sake, here's a list of things to try before giving up on the current code base and rewriting DatasetReader.generate_dataset_parallel in Dask:
- fewer processes, larger macrobatches, longer waits with each batch. This was sufficient to keep training running smoothly.
- cap the consumer queue size, so that producer procs block when the queue is full (see the sketch after this list). This should also work.
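A minimal sketch of that second idea, not the actual generate_dataset_parallel code: give the consumer queue a `maxsize` so that producers block on `put()` instead of overflowing the buffer. The worker counts, batch sizes, and queue limit below are placeholder values.

```python
import multiprocessing as mp

N_PRODUCERS = 4
N_MACROBATCHES = 100

def producer(queue, n_macrobatches):
    for i in range(n_macrobatches):
        macrobatch = [i] * 4       # stand-in for a real macrobatch
        queue.put(macrobatch)      # blocks while the queue is full

def main():
    queue = mp.Queue(maxsize=8)    # cap the number of outstanding macrobatches
    procs = [mp.Process(target=producer, args=(queue, N_MACROBATCHES))
             for _ in range(N_PRODUCERS)]
    for p in procs:
        p.start()

    consumed = 0
    while consumed < N_PRODUCERS * N_MACROBATCHES:
        batch = queue.get()        # consumer drains at its own pace
        consumed += 1

    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
```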
It still takes far too long to train even a single epoch (36h and counting, and the first epoch still hasn't finished). Going to close this issue, and #5, for now.
Tried running the model on device with different values for k and stride. No output was produced, and eventually the jobs timed out. I need to run the same jobs on CPU and with a longer time limit, to see whether this is a problem induced by the transfer to device, a memory-bound problem that isn't being reported properly, or something else.