iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License
107 stars 14 forks source link

Fix to finding the longest path through PRGs #343

Closed leoisl closed 11 months ago

leoisl commented 11 months ago

Summary

@Danderson123 found a strange issue that was happening in pandora. Basically he had a long MSA, created a PRG from it, and when indexing, pandora was saying the longest path through that gene had 7 minimising kmers (well the story is not exactly this, but after some debugging, we found out this was the issue). The longest path through that gene actually has 241 minimising kmers. There was a big bug when calculating the maximum path through a gene, mostly affecting cases when reads are longer than genes, i.e. ONT reads. This PR is a tentative fix to this, but I'd like to run this on some real data to measure the effect of this fix.

Preliminary

KmerGraph::min_path_length() had a semantic and an implementation bug. The semantic bug is that this method actually retrieves the length of the max path in the kmer DAG instead of the min path. This is fixed in this PR, and from this point on I will refer to this method as KmerGraph::max_path_length(). The implementation bug is that when computing the length of the max path, we use a length vector that is indexed in two different ways, e.g. one is using the index of the kmers sorted by their PRG paths, e.g.:

len[j - 1]

and the other is using the Kmer node id, which is the index of the kmers sorted topologically, e.g.:

len[sorted_nodes[j - 1]->out_nodes[i].lock()->id]

Although these indexes somehow overlap, i.e. the order of the kmer nodes sorted by PRG paths and topologically are similar, they are not identical, and in non-trivial PRGs there can be significant differences. Just the fact of accessing a data structure with two different set of indexes is already problematic.

Solution

We are now just using a single way to access this length vector (the index of the kmers sorted topologically)

Misc

This PR also contains some refactors to KmerGraph::max_path_length().

TODO

Need to test the effect of this fix. It definitely affects our ONT results, but I am not sure if it is a small or large effect. I will be testing on the 4-way. It would be really helpful if @mbhall88 could test this on the DrPRG evaluation and @Danderson123 on the Amira evaluation. Please note that this version of pandora has changes from the several previous PRs incorporated, and you might need to tweak some parameters in order to not have some genes filtered out. Notably, when running map/discover/compare --genome-size need to be tweaked to the expected genome size of your samples. Also, by default, all genes with mean kmer coverage < 3 will be removed (change this with --min-abs-gene-coverage), all genes with mean kmer coverage < 5% of the sample coverage will be removed (change this with --min-rel-gene-coverage) and all genes with mean kmer coverage > 100x of the sample coverage will be removed (change this with --max-rel-gene-coverage). This is a notable change from previous versions of pandora, especially for small datasets, where these values were hard-coded, different, and were just triggered if the sample coverage was >= 20x (i.e. if you had a toy dataset, these filters wouldn't trigger whereas now they do). Also please do use the --debugging-files flag so that we can debug the run later if required.

I am assuming you will run this in codon, so I will soon give you a path to an executable in codon that you can use

Danderson123 commented 11 months ago

Great that you have been able to get a fix so quickly. Can I just check that now is it the max length (and no longer the min length) of the PRG that will be compared to the read length to decide when x proportion of k-mers need to be mapped to call a gene?

leoisl commented 11 months ago

Great that you have been able to get a fix so quickly. Can I just check that now is it the max length (and no longer the min length) of the PRG that will be compared to the read length to decide when x proportion of k-mers need to be mapped to call a gene?

Yes. There was a semantic bug in the code saying it was the min length.

leoisl commented 11 months ago

@Danderson123 @mbhall88 please find an executable in codon here: /hps/nobackup/iqbal/leandro/pandora_versions/pandora_70f7f0_release

mbhall88 commented 11 months ago

Great work team!
Hmmm it will be quite a bit of work for me to test this through drprg as it still uses a version of pandora prior to the new index structure. I'd have to update a decent amount of code and tests to account for this. Might take me some time to get around to doing this.

leoisl commented 11 months ago

Evaluation

I did an evaluation using the 4-way data and no filters applied on nanopore data only:

  1. Without denovo, this PR (blue curve) improves precision WRT tip of dev (orange curve), but we keep the same precision as the latest pre-release (green curve):

image

  1. With denovo, this PR (blue curve) improves precision significantly (error rate is 0.17%) compared to the tip of dev (green curve, error rate is 0.31%) or the latest pre-release (orange curve, error rate is 0.27%) at the expense of less recall. Recall for this PR is 82%, lower than tip of dev (84.6%) and latest pre-release (84.1%).

image

Summary

I think these are positive results. The loss of recall might be related with the strict default value of --min-rel-gene-coverage (all genes with mean kmer coverage < 5% of the sample coverage will be removed).

Dan also reported increase of precision on pandora gene calling on his amira dataset from 91.7% to 94.7%.

Overall, I think these changes should be merged.