Closed lzamparo closed 7 years ago
Have done a lot of work on this issue for #5, but still no solution. The problem is probably that the producers fill the queue much faster than it can be consumed, overflowing a buffer whose size the Queue never bounds. I'm consuming ~10x fewer macrobatches than are being generated:
mski1743:kmer_models zamparol$ cat full_data_k_6_s_1.txt | grep -c "receiving macrobatch from child process"
3281
mski1743:kmer_models zamparol$ cat full_data_k_6_s_1.txt | grep -c "sending macrobatch to parent process"
35626
So, for now I should cut back on the number of processes, have each worker wait a bit longer between batches, and also make the macrobatches larger (so each worker takes longer to push to the consumer queue). Longer term, I should forget about this finicky multiprocessing baloney and use Dask.
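For reference, a minimal sketch of what the eventual Dask rewrite might look like, letting Dask's scheduler manage the worker pool instead of hand-rolled multiprocessing. The `shard_paths` list and `load_macrobatch` helper are hypothetical stand-ins, not the real `DatasetReader` internals:

```python
import dask.bag as db

def load_macrobatch(path):
    # Hypothetical loader: in the real code this would parse one shard of
    # k-mer examples from disk and return a macrobatch.
    return path.upper()

shard_paths = ["shard_0.txt", "shard_1.txt", "shard_2.txt"]

# Build a bag over the shards and let Dask fan the work out over processes.
macrobatches = (
    db.from_sequence(shard_paths)
      .map(load_macrobatch)
      .compute(scheduler="processes")
)
print(macrobatches)
```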
For completeness' sake, here's a list of things to try before giving up on the current code base and rewriting DatasetReader.generate_dataset_parallel in Dask:
- fewer processes, larger macrobatches, longer waits with each batch. This was sufficient to keep training running smoothly.
- cap the consumer queue size, so that producer procs block when the queue is full (see the sketch after this list). This should also work.
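A minimal sketch of that second idea, not the actual generate_dataset_parallel code: give the consumer queue a `maxsize` so that producers block on `put()` instead of overflowing the buffer. The worker counts, batch sizes, and queue limit below are placeholder values.

```python
import multiprocessing as mp

N_PRODUCERS = 4
N_MACROBATCHES = 100

def producer(queue, n_macrobatches):
    for i in range(n_macrobatches):
        macrobatch = [i] * 4       # stand-in for a real macrobatch
        queue.put(macrobatch)      # blocks while the queue is full

def main():
    queue = mp.Queue(maxsize=8)    # cap the number of outstanding macrobatches
    procs = [mp.Process(target=producer, args=(queue, N_MACROBATCHES))
             for _ in range(N_PRODUCERS)]
    for p in procs:
        p.start()

    consumed = 0
    while consumed < N_PRODUCERS * N_MACROBATCHES:
        batch = queue.get()        # consumer drains at its own pace
        consumed += 1

    for p in procs:
        p.join()

if __name__ == "__main__":
    main()
```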
It still takes far too long to train even a single epoch (36h and counting, and the first epoch still hasn't finished). Going to close this issue, and #5, for now.
Tried running the model on device with different values for k and stride. No output was produced, and eventually the jobs timed out. I need to run the same jobs on CPU and with a longer time limit, to see whether this is a problem induced by the transfer to device, a memory-bound problem that isn't being reported properly, or something else.