Closed corinsexton closed 3 years ago
Hi!
10,000 bp is still a very small region to train on. 2% of the hg38 genome is roughly 60 million bp, which, at a 10,000 bp maximum window length, gives at least 6,000 windows to train over for every EM iteration per instance (unless I'm misunderstanding your exact training setup). Regardless, you can verify how many windows you are training on by checking the line count of the window.bed file in your train directory; I suspect it's a very large number. You are very likely bottlenecked by job submission times, or even just by processes spinning up and down.
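As a rough sanity check, the back-of-the-envelope arithmetic above and the window.bed line count can be reproduced with a short script (the `traindir/window.bed` path below is a placeholder; point it at the window.bed in your actual train directory):

```python
from pathlib import Path

# Back-of-the-envelope estimate: 2% of hg38 split into 10 kb windows.
HG38_SIZE = 3_099_000_000          # approximate hg38 length in bp
train_bp = int(HG38_SIZE * 0.02)   # ~62 million bp of training regions
max_window = 10_000                # maximum window length in this setup
min_windows = train_bp // max_window
print(f"At least {min_windows:,} windows per EM iteration per instance")

# Count the windows actually chosen (placeholder path).
window_bed = Path("traindir/window.bed")
if window_bed.exists():
    n_windows = sum(1 for _ in window_bed.open())
    print(f"window.bed contains {n_windows:,} windows")
```

Since windows can be shorter than the maximum length, the line count of window.bed will typically be even larger than this lower bound.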
I would highly recommend a few things: use the `--max-train-rounds` option and cap it lower than the default of 100. Training will likely never converge with minibatch, since the training regions chosen are random. You can get a very good idea of what counts as good enough convergence by plotting the likelihood.tab file from your previous run. It should converge very quickly and not change much after a significant number of rounds. We've had decent success with around 30 rounds or so, but even fewer is not unheard of. Let me know if this helps!
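To pick a cap from a previous run, a minimal sketch like the following can parse the likelihood values and report the first round where the per-round improvement becomes negligible. This assumes likelihood.tab is tab-separated with one log-likelihood value per training round; check your own file and adjust the column index if needed:

```python
import csv

def rounds_to_converge(likelihoods, rel_tol=1e-4):
    """Return the first round where the relative change in
    log-likelihood drops below rel_tol, or None if it never does."""
    for i in range(1, len(likelihoods)):
        prev, curr = likelihoods[i - 1], likelihoods[i]
        if abs(curr - prev) <= rel_tol * abs(prev):
            return i
    return None

def load_likelihoods(path, column=0):
    """Parse one log-likelihood per row from a tab-separated file.
    The single-column layout is an assumption; verify against your
    likelihood.tab before relying on it."""
    with open(path) as fh:
        return [float(row[column]) for row in csv.reader(fh, delimiter="\t")]

# Example with made-up values: large early gains, then a plateau.
ll = [-9.0e6, -7.5e6, -7.0e6, -6.9e6, -6.899e6, -6.8989e6]
print(rounds_to_converge(ll))  # prints 5
```

Plotting the same values (e.g. with matplotlib) gives the visual version of this check: the curve flattens out well before 100 rounds.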
This was very helpful! I hadn't appreciated the immense number of windows I was asking it to train over. Increasing the window size was definitely the answer. Thank you!
Hi there,
I'm running a segmentation on hg38 using 15 tracks. I originally specified the `minibatch-fraction` at 5% of the genome, and training took around 115 hours given 16 cores and 5 training instances. This seemed like an appropriate amount of time, but I then wanted to train on specified regions using the `include-coords` parameter. The regions I included totaled around 2% of the genome, with a maximum length of 10,000 bp. This run has taken an enormously long time: there was still no likelihood.0.tab file created in the traindir/log/ directory after 90 hours.

Am I missing something that could make this runtime much speedier? Why would using `include_coords` tank my runtime performance so badly?

Thanks for any help!