iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
110 stars · 14 forks

Logging performance of de novo variant discovery with racon #303

Open leoisl opened 2 years ago

leoisl commented 2 years ago

To be done by @mbhall88 . Could you please use this commit?

leoisl commented 2 years ago

@mbhall88 could you also please report the max RAM usage? I am quite concerned about this part of the de novo racon code: https://github.com/rmcolq/pandora/blob/12a08c5483c19fc12411e174970d31c86e842a2d/src/denovo_discovery/discover_main.cpp#L205-L206

This is a dictionary from locus names to the subreads that map to each locus, inferred by pandora map. This structure could potentially get very large, as we basically store a substring of every read that maps to each locus (it is only the region of the read that maps to that specific locus, but still...). There are potentially many better ways to store this info, but I also want to avoid premature optimisation, so I will only work on this if RAM is indeed an issue.

mbhall88 commented 2 years ago

The nanopore runtime and memory usage of drprg with the previous version of de novo discovery

https://github.com/mbhall88/drprg/issues/15#issuecomment-1275609531

and the Illumina

https://github.com/mbhall88/drprg/issues/15#issuecomment-1276908991

Pandora accounts for the vast majority of runtime and memory in drprg, so these serve as good benchmark figures, especially given that the only thing that changes between runs is the pandora version.

leoisl commented 2 years ago

I just finished running pandora discover at this commit on the paper ONT data. Comparison with paper run:

Paper run

commit 02f9ec

I think runtime is totally fine; we got almost exactly the same runtime.

RAM usage is much higher though, probably related to this: https://github.com/rmcolq/pandora/issues/303#issuecomment-1297228115 . It will be the RAM bottleneck of the pipeline (pandora compare on ONT data takes 15.7 GB; this new denovo implementation takes 1 GB more).

Of course, this is data-dependent, so let's see how @mbhall88's benchmarks compare.

iqbal-lab commented 2 years ago

This RAM use is, IMO, acceptable for the moment; I wouldn't postpone the merge to reduce RAM.

mbhall88 commented 2 years ago

Okay, so here are the updated benchmark figures

Runtime

Illumina

[Illumina runtime benchmark plot]

Median is 91s, down from 98s. So it is faster now on Illumina!

Nanopore

[Nanopore runtime benchmark plot]

Median is now 171s, up from 163s.

Memory

Illumina

[Illumina memory benchmark plot]

Median is 46MB, down from 54MB, so memory is marginally down too.

Nanopore

[Nanopore memory benchmark plot]

Median is now 270MB, up from 240MB.


All in all, I think this is acceptable.