hoffmangroup / segway

Application for semi-automated genomic annotation.
http://segway.hoffmanlab.org/
GNU General Public License v2.0

parallelization not working? #166

Closed YichaoOU closed 1 year ago

YichaoOU commented 1 year ago

Hello,

I set SEGWAY_NUM_LOCAL_JOBS=32 and SEGWAY_CLUSTER=local, but Segway doesn't seem to run on multiple cores. Am I setting up the parallelization correctly?

Thanks, Yichao

EricR86 commented 1 year ago

When running locally, Segway training can only parallelize across training instances. Each EMT training job requires the output of the previous job to continue, so there's no practical way to parallelize the process for a single instance. On a cluster, jobs can be submitted at will per thread.

In general, we recommend using more than one instance. We've used 10 in practice.
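As a rough illustration of the setup described above, here is a minimal Python sketch that assembles the environment and command line for a local multi-instance run. The helper name `build_segway_train` and the file names are hypothetical; the environment variables and `--num-instances` come from this thread, and the `segway train GENOMEDATA TRAINDIR` argument order is assumed from the Segway documentation, so check it against your installed version.

```python
import os


def build_segway_train(genomedata, traindir, num_instances=10, local_jobs=32):
    """Return (argv, env) for a local training run with several instances.

    Hypothetical helper: it only assembles the command, it does not run it.
    """
    env = dict(os.environ)
    env["SEGWAY_CLUSTER"] = "local"                 # run jobs on this machine
    env["SEGWAY_NUM_LOCAL_JOBS"] = str(local_jobs)  # cap on concurrent local jobs
    argv = [
        "segway", "train",
        f"--num-instances={num_instances}",  # local parallelism is per instance
        genomedata,                          # placeholder archive name
        traindir,                            # placeholder output directory
    ]
    return argv, env


argv, env = build_segway_train("data.genomedata", "traindir")
print(" ".join(argv))
# To launch for real: subprocess.run(argv, env=env, check=True)
```

With a single instance, raising SEGWAY_NUM_LOCAL_JOBS alone would not be expected to help, since the jobs within one instance depend on each other.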

Hope that addresses your issue.

YichaoOU commented 1 year ago

Cool, now I set --num-instances=20, and from the htop output it does seem to run in parallel. But the CPU usage for each instance is not 100%. One reason I can think of is that each job finishes very fast. Is that right? I also used --include-coords with ATAC-seq peaks, which are 200~500 bp per region across 100k+ regions. How can I fully parallelize them? Or is this the best I can get? If so, that's OK.

Thanks! Yichao

YichaoOU commented 1 year ago

By now, judging from the terminal output, the run seems to have slowed down (previously, the "running locally" lines appeared very fast), and the process has used ~80 GB of memory.

running locally 30895: emt18.0.45224.traindir3.cbfe4704c9b811ed949a9440c9386226 ()
running locally 30896: emt2.0.60456.traindir3.cbfe4704c9b811ed949a9440c9386226 ()
running locally 30897: emt9.0.86667.traindir3.cbfe4704c9b811ed949a9440c9386226 ()
running locally 30898: emt1.0.111421.traindir3.cbfe4704c9b811ed949a9440c9386226 ()
running locally 30899: emt16.0.60456.traindir3.cbfe4704c9b811ed949a9440c9386226 ()
running locally 30900: emt14.0.111421.traindir3.cbfe4704c9b811ed949a9440c9386226 ()

EricR86 commented 1 year ago

Unfortunately, it doesn't look like there's a specific software issue here to troubleshoot. It looks like a single thread is having an issue: your first instance (0) is having trouble completing those 6 jobs. Not sure why. If you ran multiple instances, it is certainly odd that all the other threads seem to have completed.

It's also worth noting that you seem to have an exceptionally large number of training windows. Segway can train over undefined data if necessary, so you can reduce the total number of jobs by decreasing the number of regions that EMT needs to cover to finish. You can look at window.bed in your train directory to get an idea of the number of regions you have.
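To get a quick feel for how many windows there are and how small they are, something like the following Python sketch could be used. It assumes window.bed is a plain BED file with chromosome, start, and end in the first three whitespace-separated columns; the function name `summarize_windows` and the example coordinates are made up for illustration.

```python
import statistics


def summarize_windows(lines):
    """Count BED windows and summarize their lengths in base pairs."""
    lengths = []
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and track/browser header lines.
        if len(fields) < 3 or fields[0].startswith("#") or fields[0] in ("track", "browser"):
            continue
        lengths.append(int(fields[2]) - int(fields[1]))
    if not lengths:
        return {"windows": 0, "total_bp": 0, "median_bp": 0}
    return {
        "windows": len(lengths),
        "total_bp": sum(lengths),
        "median_bp": statistics.median(lengths),
    }


# Illustrative input; for a real run, pass an open file instead, e.g.
#   with open("traindir/window.bed") as handle: summarize_windows(handle)
example = ["chr1\t100\t400", "chr1\t1000\t1200", "chr2\t50\t550"]
summary = summarize_windows(example)
print(summary)
```

Many windows with a small median length would match the situation described here: lots of very short jobs.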

YichaoOU commented 1 year ago

window.bed is the same as the file I gave to --include-coords: ATAC-seq peaks, 200~500 bp per region, 100k+ regions.

Should I decrease the number of regions, possibly by merging nearby peaks within some range, like 2 Mb?

YichaoOU commented 1 year ago

For a small test, I only used the first 1k regions for --include-coords, so window.bed also has 1000 lines. But why do I have more than 1000 runs?

running locally 1670: emt9.0.426.traindir4.2f7f507cca6111ed98489440c9386226 ()
running locally 1671: emt11.0.426.traindir4.2f7f507cca6111ed98489440c9386226 ()
running locally 1672: emt8.0.426.traindir4.2f7f507cca6111ed98489440c9386226 ()
running locally 1673: emt13.0.426.traindir4.2f7f507cca6111ed98489440c9386226 ()

Thanks, Yichao

EricR86 commented 1 year ago

In local mode, the run counter just counts the total number of gmtkEMTrain processes that need to run across all instances, so seeing more runs than windows is not unusual. Decreasing the number of regions and merging nearby peaks would certainly also speed up training by reducing the overhead of spinning up so many processes.
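One way to merge nearby peaks before passing them to --include-coords is sketched below in plain Python. This is not part of Segway; `merge_nearby` and the `max_gap` threshold are hypothetical names, and the example coordinates are invented. A tool such as bedtools merge with a distance option does the same job on real BED files.

```python
def merge_nearby(intervals, max_gap):
    """Merge (chrom, start, end) intervals that overlap or lie within max_gap bp.

    Intervals are sorted by chromosome and start, then scanned once.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= max_gap:
            # Close enough to the previous region: extend it.
            last[2] = max(last[2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged]


peaks = [
    ("chr1", 100, 400),
    ("chr1", 900, 1200),      # 500 bp after the first peak, so it is merged
    ("chr1", 50000, 50300),   # too far away, stays separate
    ("chr2", 100, 300),
]
regions = merge_nearby(peaks, max_gap=1000)
print(regions)
```

Note that merged regions also cover the sequence between peaks, so training would then include signal from outside the original peaks; whether that is acceptable depends on the analysis.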

YichaoOU commented 1 year ago

Thanks! For the above "small" test (I thought it was small, but maybe not) with 1k regions, it took more than a day to finish and ended up with 200k+ runs. I will definitely try giving fewer regions, each of a larger size.
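A back-of-the-envelope estimate shows how a 1k-region test can reach run counts of this size. The sketch below assumes one gmtkEMTrain process per window, per instance, per training round, which is an assumption consistent with the explanation above but not confirmed in this thread; the instance and round counts in the example are hypothetical.

```python
def estimate_emt_runs(num_windows, num_instances, num_rounds):
    """Rough count of gmtkEMTrain processes under the stated assumption."""
    return num_windows * num_instances * num_rounds


# 1000 windows, with hypothetical values of 20 instances and 10 rounds.
before = estimate_emt_runs(1000, 20, 10)
# The same settings after merging down to a hypothetical 100 larger windows.
after = estimate_emt_runs(100, 20, 10)
print(before, after)
```

Under this assumption the run count scales linearly with the number of windows, which is why fewer, larger regions should cut the process start-up overhead.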