iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
109 stars 14 forks source link

Clean up unused modules and remove read information bookkeeping #337

Closed leoisl closed 1 year ago

leoisl commented 1 year ago

This is a clean up PR. It removes two unused modules:

  1. The gene-DBG module;
  2. The associated noise-filtering module (e.g. removing likely FP genes from the gene-DBG, etc);

The reasoning for removing these modules are:

  1. AFAIK, there is a good amount of code in them, but they are not fully implemented yet;
  2. We never managed to resume Rachel's work and finish these up;
  3. @Danderson123 is developing amira, which is a downstream tool to pandora, and include a gene-DBG module and a noise-fitlering module;

This module also removes read-information bookkeeping code, which improves RAM usage by ~6x with even minor improvements to results (see https://github.com/rmcolq/pandora/issues/330#issue-1752440505).

There is a small follow-up PR to this one, fixing some small issues and CLI, but I wanted to keep separated from this one, which is the large cleanup.

leoisl commented 1 year ago

Looks straightforward to me. I was a bit worried where you pass in the sample id fixed to zero in add_clusters_to_pangraph, but decided this must be because you only ever map one sample.

Yeah, it is exactly this. I had the same worry when I first read the pandora code, but then later it made sense too!

iqbal-lab commented 1 year ago

@Danderson123 i think it is very hard for you to review this; when I am cavalier and skip over things I am relaxed about it, but you might be overwhelmed by all this code. If this is not merged by tomorrow, maybe we can talk you through how we look through this kind of PR and what we look for.