vgteam / vg

tools for working with genome variation graphs
https://biostars.org/tag/vg/
Other
1.13k stars 194 forks source link

Re-aligning gams is cumbersomes #1327

Open glennhickey opened 6 years ago

glennhickey commented 6 years ago

@ekg @edawson Any ideas on ways to re-align a GAM that was made with vg map? It's come up a few times, and I've ended up doing something pretty hacky to make sure it's properly paired to be aligned with -i:

 vg view -a my.gam | | jq -s -c 'sort_by(.name)[]' | filter.py | vg view -JaG > my_properly_paired.gam

where filter.py is a tiny python script that removes unpaired reads from the json. On a 2G chromosome 21 gam, the whole process takes several hours, with the jq sort requiring nearly 300G of RAM.

This could be improved by, say, adding a name sort to rocksdb and a pair filter to vg filter, but wanted to check if there's anything in place I don't know about.

There seems to be some related code in gamsort.cpp, but it doesn't look finished and I can't get anything out of the command line.

ekg commented 6 years ago

This is really not well supported. It would almost be easier to push it back to fastq using vg view -X and then sort and realign that. How can we make this easier?

On Fri, Dec 15, 2017, 16:05 Glenn Hickey notifications@github.com wrote:

@ekg https://github.com/ekg @edawson https://github.com/edawson Any ideas on ways to re-align a GAM that was made with vg map? It's come up a few times, and I've ended up doing something pretty hacky to make sure it's properly paired to be aligned with -i:

vg view -a my.gam | | jq -s -c 'sort_by(.name)[]' | filter.py | vg view -JaG > my_properly_paired.gam

where filter.py is a tiny python script that removes unpaired reads from the json. On a 2G chromosome 21 gam, the whole process takes several hours, with the jq sort requiring nearly 300G of RAM.

This could be improved by, say, adding a name sort to rocksdb and a pair filter to vg filter, but wanted to check if there's anything in place I don't know about.

There seems to be some related code in gamsort.cpp, but it doesn't look finished and I can't get anything out of the command line.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EXoJWR-N-GHZ2JZDhjNGeDu4TpNTks5tAoqhgaJpZM4RDmaR .

glennhickey commented 6 years ago

Going through fastq would be much faster, but would require assuming some kind of naming convention to link pairs via read names. Not sure how to get around without a (rocksdb backed?) sort by name.

On Fri, Dec 15, 2017 at 11:27 AM, Erik Garrison notifications@github.com wrote:

This is really not well supported. It would almost be easier to push it back to fastq using vg view -X and then sort and realign that. How can we make this easier?

On Fri, Dec 15, 2017, 16:05 Glenn Hickey notifications@github.com wrote:

@ekg https://github.com/ekg @edawson https://github.com/edawson Any ideas on ways to re-align a GAM that was made with vg map? It's come up a few times, and I've ended up doing something pretty hacky to make sure it's properly paired to be aligned with -i:

vg view -a my.gam | | jq -s -c 'sort_by(.name)[]' | filter.py | vg view -JaG > my_properly_paired.gam

where filter.py is a tiny python script that removes unpaired reads from the json. On a 2G chromosome 21 gam, the whole process takes several hours, with the jq sort requiring nearly 300G of RAM.

This could be improved by, say, adding a name sort to rocksdb and a pair filter to vg filter, but wanted to check if there's anything in place I don't know about.

There seems to be some related code in gamsort.cpp, but it doesn't look finished and I can't get anything out of the command line.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EXoJWR-N- GHZ2JZDhjNGeDu4TpNTks5tAoqhgaJpZM4RDmaR .

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327#issuecomment-352048834, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7qQVuGsXhafiHII8OzLlPuHrTqGBks5tAp3hgaJpZM4RDmaR .

ekg commented 6 years ago

For what it's worth, it's not any easier in BAM. You have to go back to fastq, afaik.

glennhickey commented 6 years ago

It's still easier in BAM imho, because we can sort by name. I can go through fastq with two simple commands:

samtools sort -n my.bam -O BAM > name_sort.bam samtools fastq name_sort.bam -s singles.fq -1 reads1.fq -2 reads2.fq

or I could also strip the singles out of name_sort.bam directly with a python script.

Parsing speed is another factor that makes big GAMs much more unwieldy to work with than BAMs. 'vg view -X' is only 2-3X slower than 'samtools fastq' but I'm getting 10-15X slower for 'vg view -a' vs 'samtools view'.

On Tue, Dec 19, 2017 at 12:49 PM, Erik Garrison notifications@github.com wrote:

For what it's worth, it's not any easier in BAM. You have to go back to fastq, afaik.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327#issuecomment-352834880, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7poJTkxF8z2InJQ4SCiLJp58ORADks5tB_cfgaJpZM4RDmaR .