base-depth become extremely slow when calculating highly covered site.

sstadick / perbase

Per-base per-nucleotide depth analysis

MIT License

117 stars 14 forks source link

base-depth become extremely slow when calculating highly covered site. #59

Open y9c opened 1 year ago

y9c commented 1 year ago

Position 21:8215302 on GRCh38 (Ensembl ref) is used for debugging. When calculating this single site, whose depth is ~10k ,the command took more than 10min and still not finish.

perbase base-depth -m -t 16 -r Homo_sapiens.GRCh38.genome.fa debug.bam

debug.tar.gz

sstadick commented 1 year ago

Thanks for making an issue! I'll check it out in the next few days. Could you please share the version of perbase that you are using? (I think it's in the output of perbse base-depth --help).

It's been a while since I dug into the perbase code, but it's not impossible that that a pileup of that depth would be slow. A good comparison would be to do a pileup with samtools as a sanity check, since perbase is using htslib under the hood.

y9c commented 1 year ago

perbase-base-depth 0.8.5

y9c commented 1 year ago

I did test samtools mpileup and my customized pipeup script using rust-htslib. The speed is not fast but acceptable. It took 0.5s for samtools mpileup.

samtools mpileup -f  debug.bam -r 21:8215302-8215302 > /dev/null  0.52s user 0.05s system 99% cpu 0.573 total

y9c commented 1 year ago

I waited more than 30 mins, but perbase is still running. I use all the threads, and there is no IO or memory bound.

sstadick commented 1 year ago

So, I think the issue is that you need to specify the region for perbase. It's spinning away doing nothing looking every possible based from the BAM header. If you give it a region it will be faster (seconds):

perbase base-depth --mate-fix --threads 16 -r /data/misc/Homo_sapiens.GRCh38.dna.primary_assembly.fa -b <(printf "21\t8215302\t8215303\n") ~/Downloads/debug.bam

I do think this is unintuitive and still counts as a bug. Work could also be done to avoid having perbase do the setup / teardown for every TID when it should be able to avoid that from the start. So I'll leave this open for any future work.

y9c commented 1 year ago

Thanks. I remember I also test with bed file input, and also wait for more than 20min. I will post my test results later.

y9c commented 1 year ago

Seem that there is some problem with my previous test. I might forget to specify the bed file. I try to run the analysis with a bed file just now, and it finish in ~1min, it is quicker than before but is still too slow. I try to parse this single site with rust-hislib, and drop mate overlap by read name. It took <1s. Is it because some filtering step in perbase slow down this analysis? BTW, bed file is 0-based, so the record should be 21\t8215301\t8215302, correct?

sstadick commented 1 year ago

I'm pretty sure it's slow because it checks every contig in the bam header, loads up that that contigs reference, spins up resources to process it, then discovers no reads for that contig.

Basically, it's not at all optimized for single query work. I'm totally open to PRs on this! And I'll leave this open in the event of future work on Perbase.