iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
110 stars · 14 forks

Logging performance of de novo variant discovery with racon #303

Open leoisl opened 2 years ago

leoisl commented 2 years ago

To be done by @mbhall88 . Could you please use this commit?

leoisl commented 2 years ago

@mbhall88 could you also please report the max RAM usage? I am quite concerned about this part of the de novo racon code: https://github.com/rmcolq/pandora/blob/12a08c5483c19fc12411e174970d31c86e842a2d/src/denovo_discovery/discover_main.cpp#L205-L206

This is a dictionary from locus names to the subreads that map to each locus, inferred by pandora map. This structure could potentially get very large, as we basically store a substring of every read that maps to each locus (it is only the region of the read that maps to that specific locus, but still...). There are potentially many better ways to store this info, but I also want to avoid premature optimisation, so I will only work on this if RAM is indeed an issue.

mbhall88 commented 2 years ago

The nanopore runtime and memory usage of drprg with the previous version of de novo discovery

https://github.com/mbhall88/drprg/issues/15#issuecomment-1275609531

and the Illumina

https://github.com/mbhall88/drprg/issues/15#issuecomment-1276908991

Pandora accounts for the vast majority of runtime and memory in drprg, so these serve as good benchmark figures, especially given that the only thing that changes between runs is the pandora version.

leoisl commented 2 years ago

I just finished running pandora discover at this commit on the paper ONT data. Comparison with paper run:

Paper run

commit 02f9ec

I think runtime is totally fine; we got almost exactly the same runtime.

RAM usage is much higher though, probably related to this: https://github.com/rmcolq/pandora/issues/303#issuecomment-1297228115 . It will be the RAM bottleneck of the pipeline (pandora compare on ONT data takes 15.7 GB; this new denovo implementation takes 1 GB more).

Of course, this is data-dependent, so let's see how @mbhall88's benchmarks compare.

iqbal-lab commented 2 years ago

This RAM use is, IMO, acceptable for the moment; I wouldn't postpone the merge to reduce RAM.

mbhall88 commented 2 years ago

Okay, so here are the updated benchmark figures

Runtime

Illumina

[Illumina runtime benchmark plot]

Median is 91s, down from 98s. So it is faster now on Illumina!

Nanopore

[Nanopore runtime benchmark plot]

Median is now 171s, up from 163s.

Memory

Illumina

[Illumina memory benchmark plot]

Median is 46MB, down from 54MB, so memory is marginally down too.

Nanopore

[Nanopore memory benchmark plot]

Median is now 270MB, up from 240MB.


All in all, I think this is acceptable.