iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Add parameter allowing pandora to filter out genes that are not proportionally covered enough by the reads #348

Closed (leoisl closed this 6 months ago)

leoisl commented 8 months ago

@Danderson123 needs this for amira. This is the full description of this parameter:

  --min-gene-coverage-proportion FLOAT
                              Minimum gene coverage proportion to keep a gene. Given the coverage on the kmers
                              of the maximum likelihood path of a gene, we compute the number of bases that have
                              at least one read covering it. If the proportion of such bases is larger than the
                              value in this parameter, the gene is kept. Otherwise, the gene is filtered out.
                              [default: 0.8]
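
To make the rule in the help text concrete, here is a minimal Python sketch of the decision, not pandora's actual implementation. It assumes the per-base counts along the maximum likelihood path have already been derived from the kmer coverages; the function names `covered_proportion` and `keep_gene` are hypothetical.

```python
def covered_proportion(base_counts):
    """Fraction of positions covered by at least one read."""
    if not base_counts:
        return 0.0
    covered = sum(1 for count in base_counts if count >= 1)
    return covered / len(base_counts)


def keep_gene(base_counts, min_gene_coverage_proportion=0.8):
    """Keep the gene only if the covered proportion is strictly larger
    than the threshold, following the wording of the help text."""
    return covered_proportion(base_counts) > min_gene_coverage_proportion


# Hypothetical per-base counts for two genes of length 10.
print(keep_gene([0, 5, 5, 5, 5, 5, 5, 5, 5, 5]))  # 9/10 covered
print(keep_gene([0, 0, 5, 5, 5, 5, 5, 5, 5, 5]))  # 8/10 covered
```

With the default of 0.8, the first gene is kept and the second is filtered out, because a proportion of exactly 0.8 is not larger than the threshold.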

One important thing to note about this parameter is that the first and last bases of the maximum likelihood (ML) path of a PRG in pandora usually have a count of 0. That happens when the first minimizer is not the lowest in the first window (and similarly for the last minimizer and the last window). By default, the window length is 14, so in 13/14 (about 93%) of cases there is a stretch of 0 counts before the real counts begin. In general, the counts along the ML path of a gene look like this:

0 0 0 0 0 0 0 0 20 20 20 20 20 20 20 20 20 50 50 50 50 50 ... 50 50 50 50 50 20 20 20 20 20 20 20 20 20 0 0 0 0 0

This is important for this parameter. The smaller the gene, the more impact these first and last stretches of artificial null coverage have.
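
A rough way to see the size of the effect, as a sketch under stated assumptions: suppose every position is truly covered except for a worst-case null stretch of 13 positions on each flank (window length 14, minus one). The helper name `max_covered_proportion` and the flank length of 13 are assumptions made for illustration; the real stretch length varies from gene to gene.

```python
def max_covered_proportion(gene_length, null_flank=13):
    """Best achievable covered proportion when both flanks carry an
    artificial stretch of zero counts of length null_flank."""
    null_positions = min(gene_length, 2 * null_flank)
    return (gene_length - null_positions) / gene_length


for length in (100, 300, 1000):
    print(length, round(max_covered_proportion(length), 3))
# 100 0.74
# 300 0.913
# 1000 0.974
```

Under these assumptions a fully covered gene of 100 positions tops out at 0.74 and would be filtered out at the default threshold of 0.8, while a gene of 1000 positions is barely affected.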

There are some ways around this:

  1. Do nothing, count the 0s as real coverage (what we do right now);
  2. Ignore the first and last stretch of null coverages in the ML path, up to the window size, when doing coverage calculations;
  3. Propagate the first non-zero coverage value leftwards over the null coverages that precede it, and likewise propagate the last non-zero value rightwards over the right flank.
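
The three options can be compared on a toy profile. This is a minimal Python sketch, not code from pandora: the function names are hypothetical, the counts are treated as one value per position, and "up to the window size" is read as trimming or filling at most `window` zero positions on each flank.

```python
def proportion(counts):
    """Fraction of positions with a non-zero count."""
    return sum(1 for c in counts if c > 0) / len(counts) if counts else 0.0


def flank_zeros(counts, limit):
    """Number of leading zeros, capped at limit."""
    n = 0
    while n < len(counts) and n < limit and counts[n] == 0:
        n += 1
    return n


def option1(counts, window=14):
    # Do nothing: the flanking zeros count as real (missing) coverage.
    return proportion(counts)


def option2(counts, window=14):
    # Ignore the null stretch on each flank, up to the window size.
    left = flank_zeros(counts, window)
    right = flank_zeros(counts[::-1], window)
    return proportion(counts[left:len(counts) - right])


def option3(counts, window=14):
    # Propagate the nearest inner value over each flanking null stretch.
    filled = list(counts)
    left = flank_zeros(filled, window)
    right = flank_zeros(filled[::-1], window)
    if left + right >= len(filled):
        return proportion(filled)  # nothing inside to propagate from
    filled[:left] = [filled[left]] * left
    filled[len(filled) - right:] = [filled[len(filled) - right - 1]] * right
    return proportion(filled)


# Shortened, hypothetical version of the profile shown above.
profile = [0] * 8 + [20] * 9 + [50] * 10 + [20] * 9 + [0] * 5
for option in (option1, option2, option3):
    print(option.__name__, round(option(profile), 3))
```

On this profile option 1 gives 28/41 (about 0.683), below the default threshold, while options 2 and 3 both give 1.0. The two differ only in the denominator: option 2 shortens the path, whereas option 3 keeps its full length, so a gene with internal gaps scores slightly higher under option 3.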

Please let me know if we should proceed with implementing one of these 3 options.

leoisl commented 8 months ago

CLI parameter implemented in https://github.com/leoisl/pandora/tree/min_gene_coverage_proportion

iqbal-lab commented 8 months ago

Great! Option 2 seems sensible?

leoisl commented 8 months ago

Yep, will also wait for @Danderson123's re-evaluation to see what his data says!

leoisl commented 6 months ago

Closed via https://github.com/rmcolq/pandora/pull/351