sstadick / perbase

Per-base per-nucleotide depth analysis
MIT License
115 stars 13 forks source link

Possible to calculate only say INDEL column values? #35

Closed jelber2 closed 3 years ago

jelber2 commented 3 years ago

Hi, perbase is a great tool! I was wondering if there might be a simple solution to more quickly calculate just the INS and DEL columns? If not, then just close this issue? I ask because for say a 24GB BAM file using about 50 cores it takes roughly six hours for the following perbase base-depth rnaseq.bam | cut -f 9-10 | bgzip -@75 > test.gz

sstadick commented 3 years ago

The short is not "as is". The slow part is perbase walking every cigar to find indels. In the longer term I do intend to essentially create a very raw variant calling mode that would only output snps and indels. But that is a ways down the road.

What I'd recommend trying is just pre-filter your bam to only reads with INDELS in them.

samtools view -h rnaseq.bam | awk '$1 ~ "^@" || $6 ~ "I|D"' | samtools view -u -o rnaseq_indels.bam
samtools index rnaseq_indels.bam
# Note also the reduced bgzip threads
perbase base-depth rnaseq_indels.bam | cut -f 9-10 | bgzip -@5 > test.gz

If that doesn't work for you let me know. If there is enough of a use case for just an INDEL counter, I wouldn't mind exploring how to do it more efficiently as its own subcommand.

jelber2 commented 3 years ago

Thx! I will try it, but I think you forgot to output the header (samtools view -h) and perbase to run on rnaseq_indels.bam.

sstadick commented 3 years ago

@jelber2 you are correct! I edited for posterity 👍

jelber2 commented 3 years ago

This is great! Takes about < 1hr!

$ samtools view -@75 -h rnaseq.bam | \
$ awk '$1 ~ "^@" || $6 ~ "I|D"' | \
$ samtools view -@75 -u -o rnaseq_indels.bam

real    29m4.159s
user    63m30.416s
sys     10m27.048s

$ samtools index rnaseq_indels.bam

real    0m19.281s
user    0m16.648s
sys     0m2.632s

$ perbase base-depth rnaseq_indels.bam | \
$ cut -f 9-10 |  \
$ bgzip -@5 > test3.gz

real    21m5.484s
user    40m14.956s
sys     4m27.652s
sstadick commented 3 years ago

Nice! I'll close this for now but let me know if you run into any bumps in the road!