hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

Extremely long runtime #152

Closed corinsexton closed 3 years ago

corinsexton commented 3 years ago

Hi there,

I’m running a segmentation on hg38 using 15 tracks. I originally set the minibatch-fraction option to 5% of the genome, and training took around 115 hours with 16 cores and 5 training instances.

This seemed like an appropriate amount of time, but I then wanted to train on specific regions using the include-coords parameter. The regions I included totaled around 2% of the genome, with a maximum length of 10,000 bp. This run has taken an enormously long time: after 90 hours, no likelihood.0.tab file had been created in the traindir/log/ directory.

Am I missing something that could make this run much faster? Why would using include-coords tank my runtime so badly?

Thanks for any help!

EricR86 commented 3 years ago

Hi!

10,000 bp is still a very small region to train on. 2% of the hg38 genome is roughly 60 million bp, which at a 10,000 bp maximum window length gives at least 6,000 windows to train over for every EM iteration per instance (unless I'm misunderstanding your exact training setup). Regardless, you can verify how many windows you are training over by counting the lines in the window.bed file in your train directory; I suspect it's a very large number. You are very likely bottlenecked by job submission times, or even just by processes spinning up and down.
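The back-of-envelope above can be checked directly; a minimal sketch, assuming an approximate hg38 size of ~3.1 Gbp (the exact figure doesn't change the conclusion):

```python
# Rough lower bound on the number of training windows Segway creates
# when restricted to 2% of hg38 with a 10,000 bp maximum region length.
# The count is a lower bound because regions shorter than the maximum
# only produce more, smaller windows.

HG38_BP = 3.1e9          # approximate hg38 genome size (assumption)
FRACTION = 0.02          # portion covered by the include-coords regions
MAX_WINDOW_BP = 10_000   # maximum region length reported in the issue

included_bp = HG38_BP * FRACTION
min_windows = included_bp / MAX_WINDOW_BP
print(f"{included_bp:.0f} bp included -> at least "
      f"{min_windows:.0f} windows per EM iteration per instance")
```

Each of those thousands of windows becomes work that must be scheduled every iteration, which is why per-job overhead dominates.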

I would highly recommend a few things:

  1. Increase the window size if possible to cover larger regions to train over at a time.
  2. Use the --max-train-rounds option and cap it lower than the default of 100. Training will likely never converge with minibatch, since the training regions chosen are random. You can get a very good idea of what counts as good-enough convergence by plotting the likelihood.tab file from your previous run: it should converge very quickly and not change much after a certain number of rounds. We've had decent success with around 30 rounds, and even fewer is not unheard of.
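The convergence check in point 2 can be sketched as follows; this assumes the likelihood file holds one log-likelihood per line, one line per EM round (the exact file format is an assumption here), and picks the first round where the relative improvement becomes negligible:

```python
# Sketch: decide how many training rounds are "enough" by finding the
# round after which relative log-likelihood improvement stays small.
# Assumes one log-likelihood value per line, one line per EM round.

def rounds_to_converge(log_likelihoods, rel_tol=1e-4):
    """Return the first round index where the relative improvement
    over the previous round drops below rel_tol."""
    for i in range(1, len(log_likelihoods)):
        prev, cur = log_likelihoods[i - 1], log_likelihoods[i]
        if abs(cur - prev) / abs(prev) < rel_tol:
            return i
    return len(log_likelihoods)

# Example with synthetic values approaching a plateau:
vals = [-1e6, -9.0e5, -8.6e5, -8.55e5, -8.549e5, -8.5489e5]
print(rounds_to_converge(vals))  # -> 5
```

The round this returns for your previous run is a reasonable value to pass to --max-train-rounds.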

Let me know if this helps!

corinsexton commented 3 years ago

This was very helpful! I didn't understand the immense number of windows I was asking it to train over. Increasing window size was definitely the answer. Thank you!