vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.1k stars 194 forks source link

explore vg parameters for PacBio #770

Open shilpagarg opened 7 years ago

shilpagarg commented 7 years ago

@ekg Hi Erik,

As you suggested, I tried different parameter values for PacBio data, something similar to BWA mem.

When I used

vg map -k 17 -W 40 -r 10 -q 2 -z 1 -o 1 -y 1 -x yeast.illumina.SK1_Y12.covall.chrI.freebayes.X.xg -g yeast.illumina.SK1_Y12.covall.chrI.freebayes.X.gcsa -f ../reads/yeast.pacbio.SK1_Y12.cov30.chrI.fastq

I did not get the results in one day. By using these values:

-k 17 -W 40 -r 10 -q 2 -z 1 -o 1 -y 1
Error:
error:[gssw] Stuck in main matrix!
vg: src/gssw.c:2653: gssw_alignment_trace_back_word: Assertion `0' failed.
ekg commented 7 years ago

This is a strange error and I wonder if it might be caused by an out of memory condition.

On Sun, Apr 16, 2017, 1:11 PM shilpagarg notifications@github.com wrote:

@ekg https://github.com/ekg Hi Erik,

As you suggested, I tried different parameter values for PacBio data, something similar to BWA mem.

When I used

vg map -k 17 -W 40 -r 10 -q 2 -z 1 -o 1 -y 1 -x yeast.illumina.SK1_Y12.covall.chrI.freebayes.X.xg -g yeast.illumina.SK1_Y12.covall.chrI.freebayes.X.gcsa -f ../reads/yeast.pacbio.SK1_Y12.cov30.chrI.fastq

I did not get the results in one day. By using these values:

-k 17 -W 40 -r 10 -q 2 -z 1 -o 1 -y 1 Error: error:[gssw] Stuck in main matrix! vg: src/gssw.c:2653: gssw_alignment_trace_back_word: Assertion `0' failed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/770, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EbJ_ONkk7DHtpm5jqwDjKGtLJDytks5rwfdsgaJpZM4M-mgB .

edawson commented 7 years ago

I've run into this message before with my nanopore reads (and others).

  1. Do you have lowercase letters in either your reference or your reads?
  2. Is the banded mapper (-B <bandwidth>) still a thing? I had better success with that, since my reads had indels.
  3. Are there boatloads of Ns in your reads?

If you run the first couple of correction steps of most any PacBio pipeline it should help things a bit.

shilpagarg commented 7 years ago

Good to know. So what -B <Bandwidth> value you use for long reads? The main issue I see in the PacBio alignments is that some reads that are aligned right using default -B value gets wrong when I use -B 10000000. Moreover, these read alignments were all in either forward or backward orientation using default B value. I would expect the similar results even if we change B value. Is it some sort of bug?

ekg commented 7 years ago

It isn't -B but -w now.

On Tue, Apr 18, 2017, 9:26 AM shilpagarg notifications@github.com wrote:

Good to know. So what -B value you use for long reads? The main issue I see in the PacBio alignments is that some reads that are aligned right using default -B value gets wrong when I use -B 10000000. Moreover, these read alignments were all in either forward or backward orientation using default B value. I would expect the similar results even if we change B value. Is it some sort of bug?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/770#issuecomment-294712328, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EUxPtWwjbEeoGNBNipTRQ-28Stlbks5rxGW2gaJpZM4M-mgB .

ekg commented 7 years ago

Also, have you followed Eric' suggestion?

If you run the first couple of correction steps of most any PacBio pipeline it should help things a bit.

On Tue, Apr 18, 2017, 10:25 AM Erik Garrison erik.garrison@gmail.com wrote:

It isn't -B but -w now.

On Tue, Apr 18, 2017, 9:26 AM shilpagarg notifications@github.com wrote:

Good to know. So what -B value you use for long reads? The main issue I see in the PacBio alignments is that some reads that are aligned right using default -B value gets wrong when I use -B 10000000. Moreover, these read alignments were all in either forward or backward orientation using default B value. I would expect the similar results even if we change B value. Is it some sort of bug?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/770#issuecomment-294712328, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EUxPtWwjbEeoGNBNipTRQ-28Stlbks5rxGW2gaJpZM4M-mgB .

shilpagarg commented 7 years ago

Yes, I am using -w, just to be consistant with Eric, I mentioned '-B'.

I don't have either of case 1 and 3 mentioned by Eric.

edawson commented 7 years ago

-B 10000000 is way too large. My understanding of the banding is that it's the size of non-overlapping read segments taken for alignment, not the max number of INDELs allowed (as in "Banded alignment"), but I am not sure if that's correct.

Regardless, I had much better luck with bands on the order of -B 32, -B 64, or -B 128 on my long reads. Anything smaller was slow to align and anything larger than about 512 would throw the stuck in matrix error. My intuition is that this is because the long reads accumulate too many errors over a small number of basepairs to align cleanly and that the read bands begin to contain pore barf.

As to Dazzler: the first steps of the pipeline are overlap60%->scrub->correct (dazzler blog is here). I think if you ran just these three steps you could place the output fastqs into MSGA and get a rough, though improved, assembly.

You might want to continue down the pipeline through the remaining steps (overlap95%->scrub->correct->assemble), as I'd bet there's a way to extract a GFA2.0 graph from Dazzler. You could then convert this back to GFA1.0 and use it as input to vg (Richard has a tool for this I believe).

shilpagarg commented 7 years ago

Thanks for the details. If you use -B 32or other smaller values, then do u able to align the reads to all the nodes in forward orientation? Or a single read is mix of forward and backward orientations?

ekg commented 7 years ago

There is no way to align the reads to only the forward orientation of the graph. You can align to both and then filter out the reverse. But that would usually just throw away half of your data.

On Tue, Apr 18, 2017, 5:48 PM shilpagarg notifications@github.com wrote:

Thanks for the details. If you use -B 32or other smaller values, then do u able to align the reads to all the nodes in forward orientation? Or a single read is mix of forward and backward orientations?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/770#issuecomment-294888064, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4Ec4KlRMD7KKVhF3hcyQFAJz2xxsOks5rxNtegaJpZM4M-mgB .

ekg commented 7 years ago

Here is a script to test the alignment performance for simulated data with 15% error (10% SNP and 5% indel, 10kb reads).

vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 32 -p >z.vg
vg index -k 16 -x z.xg -g z.gcsa -p z.vg
vg sim -x z.xg -n 1000 -l 10000 -e 0.1 -i 0.05 -a >simpb.gam
vg map -d z -G simpb.gam -M 8 -w 2048 -W 64 --compare # outputs table

The second column (the pairwise match rate between the simulated true read and the realigned version) can be used to understand the accuracy.

It looks like these settings: -M 8 -w 2048 -W 64 work reasonably well for the pacbio reads. At least, they are better than the defaults. Probably we should provide an easy way to set this behavior on the command line.

ekg commented 7 years ago

This is the histogram of performance for the previous test.

simpb compare hist

     ident       
 Min.   :0.0956  
 1st Qu.:0.3814  
 Median :0.4757  
 Mean   :0.4584  
 3rd Qu.:0.5283  
 Max.   :0.7234