walaj / VariantBam

Filtering and profiling of next-generational sequencing data using region-specific rules
Other
74 stars 10 forks source link

Slow reading of CRAM platinum genome #12

Closed hannespetur closed 7 years ago

hannespetur commented 7 years ago

Hello. Thank you for creating this nice tool.

I was testing your tool using on a Platinum Genomes CRAM file to see how it performs on CRAM files, but reading seems to be incredibly slow.

$ ls ~/technical/data/NA12878.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram*
~/technical/data/NA12878.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram
~/technical/data/NA12878.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram.crai
$ time samtools view -h ~/technical/data/NA12878.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram -T $GENOME "chr5:5000000-5001000" > /dev/null
real    0m0.057s
user    0m0.039s
sys 0m0.017s
$ time variant ~/technical/data/NA12878.alt_bwamem_GRCh38DH.20150706.CEU.illumina_platinum_ped.cram -T $GENOME --region "chr5:5000000-5001000" > /dev/null
....(after 20 minutes, it has still not returned)....

where $GENOME points to a copy of the 1000G GRCh38 reference genome.

Do you know why this is? I suspect VariantBam is reading through the entire CRAM file instead of using the index.

Best regards, Hannes

walaj commented 7 years ago

Hi Hannes,

You're right, VariantBam does not use the index. This was by design, since for linked-regions where we want to grab linked reads from a pair, they could be aligned anywhere in the genome. I figured that if one wanted to restrict VariantBam to just run in just some region, a pipe out from Samtools is the more parsimonious way to go, rather than have to concepts of "region" (e.g. where to even look, and then where a rule should be applied).

The region flag in VariantBam then is to specify a region that a rule should be applied to. In your example, since no rules are supplied, the region will just the default rule of accepting all reads that intersect that area.

Hope this helps, Jeremiah

hannespetur commented 7 years ago

Thank you Jeremiah for your response. It cleared things up for me.

I really like the idea behind linked regions, but I would want to use it without restricting myself to a small region first. And you are right, the read mate could be aligned anywhere on the genome. However, we know exactly where it is (the RNEXT and PNEXT fields in the SAM specs). I think it would be really cool to extend VariantBam so it would fetch distant mates using the index and significantly speeding up the process.

I will close this issue because you have already answered my question, but I'd love to hear your thoughts on this.

Best regards, Hannes

walaj commented 7 years ago

Hi Hannes,

I see what you're saying, and this is a good idea. I've thought of this before, but haven't gotten around to implementing. I'll put this on my to-do list and will comment here when I get a solution.

Best, Jeremiah