Open glennhickey opened 6 years ago
This is really not well supported. It would almost be easier to push it back to fastq using vg view -X and then sort and realign that. How can we make this easier?
On Fri, Dec 15, 2017, 16:05 Glenn Hickey notifications@github.com wrote:
@ekg https://github.com/ekg @edawson https://github.com/edawson Any ideas on ways to re-align a GAM that was made with vg map? It's come up a few times, and I've ended up doing something pretty hacky to make sure it's properly paired to be aligned with -i:
vg view -a my.gam | | jq -s -c 'sort_by(.name)[]' | filter.py | vg view -JaG > my_properly_paired.gam
where filter.py is a tiny python script that removes unpaired reads from the json. On a 2G chromosome 21 gam, the whole process takes several hours, with the jq sort requiring nearly 300G of RAM.
This could be improved by, say, adding a name sort to rocksdb and a pair filter to vg filter, but wanted to check if there's anything in place I don't know about.
There seems to be some related code in gamsort.cpp, but it doesn't look finished and I can't get anything out of the command line.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EXoJWR-N-GHZ2JZDhjNGeDu4TpNTks5tAoqhgaJpZM4RDmaR .
Going through fastq would be much faster, but would require assuming some kind of naming convention to link pairs via read names. Not sure how to get around without a (rocksdb backed?) sort by name.
On Fri, Dec 15, 2017 at 11:27 AM, Erik Garrison notifications@github.com wrote:
This is really not well supported. It would almost be easier to push it back to fastq using vg view -X and then sort and realign that. How can we make this easier?
On Fri, Dec 15, 2017, 16:05 Glenn Hickey notifications@github.com wrote:
@ekg https://github.com/ekg @edawson https://github.com/edawson Any ideas on ways to re-align a GAM that was made with vg map? It's come up a few times, and I've ended up doing something pretty hacky to make sure it's properly paired to be aligned with -i:
vg view -a my.gam | | jq -s -c 'sort_by(.name)[]' | filter.py | vg view -JaG > my_properly_paired.gam
where filter.py is a tiny python script that removes unpaired reads from the json. On a 2G chromosome 21 gam, the whole process takes several hours, with the jq sort requiring nearly 300G of RAM.
This could be improved by, say, adding a name sort to rocksdb and a pair filter to vg filter, but wanted to check if there's anything in place I don't know about.
There seems to be some related code in gamsort.cpp, but it doesn't look finished and I can't get anything out of the command line.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327, or mute the thread https://github.com/notifications/unsubscribe-auth/AAI4EXoJWR-N- GHZ2JZDhjNGeDu4TpNTks5tAoqhgaJpZM4RDmaR .
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327#issuecomment-352048834, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7qQVuGsXhafiHII8OzLlPuHrTqGBks5tAp3hgaJpZM4RDmaR .
For what it's worth, it's not any easier in BAM. You have to go back to fastq, afaik.
It's still easier in BAM imho, because we can sort by name. I can go through fastq with two simple commands:
samtools sort -n my.bam -O BAM > name_sort.bam samtools fastq name_sort.bam -s singles.fq -1 reads1.fq -2 reads2.fq
or I could also strip the singles out of name_sort.bam directly with a python script.
Parsing speed is another factor that makes big GAMs much more unwieldy to work with than BAMs. 'vg view -X' is only 2-3X slower than 'samtools fastq' but I'm getting 10-15X slower for 'vg view -a' vs 'samtools view'.
On Tue, Dec 19, 2017 at 12:49 PM, Erik Garrison notifications@github.com wrote:
For what it's worth, it's not any easier in BAM. You have to go back to fastq, afaik.
— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/vgteam/vg/issues/1327#issuecomment-352834880, or mute the thread https://github.com/notifications/unsubscribe-auth/AA2_7poJTkxF8z2InJQ4SCiLJp58ORADks5tB_cfgaJpZM4RDmaR .
@ekg @edawson Any ideas on ways to re-align a GAM that was made with vg map? It's come up a few times, and I've ended up doing something pretty hacky to make sure it's properly paired to be aligned with -i:
where filter.py is a tiny python script that removes unpaired reads from the json. On a 2G chromosome 21 gam, the whole process takes several hours, with the jq sort requiring nearly 300G of RAM.
This could be improved by, say, adding a name sort to rocksdb and a pair filter to vg filter, but wanted to check if there's anything in place I don't know about.
There seems to be some related code in gamsort.cpp, but it doesn't look finished and I can't get anything out of the command line.