vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/

weird vg sim usage in jenkins/vgci.py::VGCITest::test_sim_mhc_snp1kg_mpmap #1615

Open ekg opened 6 years ago

ekg commented 6 years ago

I am trying to debug why test_sim_mhc_snp1kg_mpmap and friends have been breaking in https://github.com/vgteam/vg/pull/1612. Today I spent a lot of time running:

jenkins/jenkins.sh -l -r -i -k -s -t jenkins/vgci.py::VGCITest::test_sim_mhc_snp1kg_mpmap

I notice that it seems to be running a number of small sim jobs, each producing ~1500 reads. Is this to be expected?

2018-04-09 17:18:02,460 - toil-rt - INFO - Run: vg sim -x index.xg -n 1562 -d 0.01 -p 1000 -v 75.0 -S 5 -I -s 11 -F error_template.fastq -a | tee sim_0_3.gam | vg annotate -p -x annot_index.xg -a - | tee sim_0_3_annot.gam | vg view -aj -

This seems rather inefficient, and it's taking a while (which could be the bug I'm chasing). In any case, I found the design odd.

ekg commented 6 years ago

Specifically, what's odd about the pattern is that we have to read through the 551MB error_template.fastq on every call, so it would be more efficient to make a single pass over all the haplotypes we are going to simulate from. As it stands, reads are simulated at about 45/s on my system:

-> % vg sim -x index.xg -n 1562 -d 0.01 -p 1000 -v 75.0 -S 5 -I -s 14 -F error_template.fastq | pv -l >/dev/null
1.56k 0:00:33 [46.5 /s] [   <=>

adamnovak commented 6 years ago

It's helpful to break up sim into multiple chunks when building really big GAMs. We might be using too many chunks in this case.

I think it's controlled by a toil-vg config parameter, so we could update the test to use fewer chunks and be faster.
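
As a rough illustration of the chunking being discussed, the sketch below splits a total read count across a number of chunks and builds one `vg sim` command per chunk, each with its own seed. The flags are copied from the logged command above. The function names, the seed scheme, and the read and chunk counts are hypothetical, and this is not how toil-vg actually does it.

```python
def split_reads(total_reads, chunks):
    """Split total_reads into `chunks` near-equal parts that sum to total_reads."""
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    base, extra = divmod(total_reads, chunks)
    # The first `extra` chunks each take one additional read.
    return [base + (1 if i < extra else 0) for i in range(chunks)]


def sim_commands(total_reads, chunks, first_seed=0):
    """Build one `vg sim` command line per chunk (hypothetical helper).

    File names and flag values mirror the logged command in this thread;
    giving each chunk a distinct seed is an assumption of this sketch.
    """
    commands = []
    for i, n in enumerate(split_reads(total_reads, chunks)):
        args = [
            "vg", "sim", "-x", "index.xg", "-n", str(n),
            "-d", "0.01", "-p", "1000", "-v", "75.0", "-S", "5", "-I",
            "-s", str(first_seed + i), "-F", "error_template.fastq", "-a",
        ]
        commands.append(" ".join(args))
    return commands


if __name__ == "__main__":
    # Hypothetical totals, chosen only to show the shape of the output.
    for cmd in sim_commands(total_reads=10000, chunks=2, first_seed=11):
        print(cmd)
```

Lowering `chunks` for the small tests means fewer separate `vg sim` invocations, and so fewer passes over error_template.fastq.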

glennhickey commented 6 years ago

Yeah, it's using the same chunking for all the tests. I never noticed it being a bottleneck on the smaller ones, but it should be easy to parameterize.
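
To get a feel for how chunk count affects run time, here is a back-of-the-envelope model, assuming the chunks run one after another and each call pays a fixed startup cost (for example, reading error_template.fastq). The 46.5 reads/s default is the rate from the pv output above. The startup cost is a hypothetical parameter that was not measured in this thread, and the function name is made up for illustration.

```python
def estimate_sim_seconds(total_reads, chunks, reads_per_sec=46.5, startup_sec=0.0):
    """Estimate serial wall-clock seconds to simulate total_reads in `chunks` calls.

    Model: every call pays startup_sec once, then reads are produced at
    reads_per_sec. Both the model and startup_sec are assumptions.
    """
    if chunks < 1:
        raise ValueError("chunks must be at least 1")
    if reads_per_sec <= 0:
        raise ValueError("reads_per_sec must be positive")
    return chunks * startup_sec + total_reads / reads_per_sec


if __name__ == "__main__":
    # Hypothetical workload: the total and the 5 s startup cost are placeholders.
    for chunks in (1, 4, 16, 64):
        secs = estimate_sim_seconds(10000, chunks, startup_sec=5.0)
        print(f"{chunks:3d} chunks: {secs:8.1f} s")
```

Under this model the simulation term is the same for any chunk count, so the only cost of extra chunks is the repeated startup. Running chunks in parallel would change the picture.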
