Closed leoisl closed 11 months ago
Great that you have been able to get a fix so quickly. Can I just check that now is it the max length (and no longer the min length) of the PRG that will be compared to the read length to decide when x proportion of k-mers need to be mapped to call a gene?
Great that you have been able to get a fix so quickly. Can I just check that now is it the max length (and no longer the min length) of the PRG that will be compared to the read length to decide when x proportion of k-mers need to be mapped to call a gene?
Yes. There was a semantic bug in the code saying it was the min length.
@Danderson123 @mbhall88 please find an executable in codon here: /hps/nobackup/iqbal/leandro/pandora_versions/pandora_70f7f0_release
Great work team!
Hmmm it will be quite a bit of work for me to test this through drprg as it still uses a version of pandora prior to the new index structure. I'd have to update a decent amount of code and tests to account for this. Might take me some time to get around to doing this.
I did an evaluation using the 4-way data and no filters applied on nanopore data only:
I think these are positive results. The loss of recall might be related with the strict default value of --min-rel-gene-coverage
(all genes with mean kmer coverage < 5% of the sample coverage will be removed).
Dan also reported increase of precision on pandora gene calling on his amira dataset from 91.7% to 94.7%.
Overall, I think these changes should be merged.
Summary
@Danderson123 found a strange issue that was happening in
pandora
. Basically he had a long MSA, created a PRG from it, and when indexing,pandora
was saying the longest path through that gene had 7 minimising kmers (well the story is not exactly this, but after some debugging, we found out this was the issue). The longest path through that gene actually has 241 minimising kmers. There was a big bug when calculating the maximum path through a gene, mostly affecting cases when reads are longer than genes, i.e. ONT reads. This PR is a tentative fix to this, but I'd like to run this on some real data to measure the effect of this fix.Preliminary
KmerGraph::min_path_length()
had a semantic and an implementation bug. The semantic bug is that this method actually retrieves the length of the max path in the kmer DAG instead of the min path. This is fixed in this PR, and from this point on I will refer to this method asKmerGraph::max_path_length()
. The implementation bug is that when computing the length of the max path, we use a length vector that is indexed in two different ways, e.g. one is using the index of the kmers sorted by their PRG paths, e.g.:and the other is using the Kmer node id, which is the index of the kmers sorted topologically, e.g.:
Although these indexes somehow overlap, i.e. the order of the kmer nodes sorted by PRG paths and topologically are similar, they are not identical, and in non-trivial PRGs there can be significant differences. Just the fact of accessing a data structure with two different set of indexes is already problematic.
Solution
We are now just using a single way to access this length vector (the index of the kmers sorted topologically)
Misc
This PR also contains some refactors to
KmerGraph::max_path_length()
.TODO
Need to test the effect of this fix. It definitely affects our ONT results, but I am not sure if it is a small or large effect. I will be testing on the 4-way. It would be really helpful if @mbhall88 could test this on the DrPRG evaluation and @Danderson123 on the Amira evaluation. Please note that this version of
pandora
has changes from the several previous PRs incorporated, and you might need to tweak some parameters in order to not have some genes filtered out. Notably, when running map/discover/compare--genome-size
need to be tweaked to the expected genome size of your samples. Also, by default, all genes with mean kmer coverage < 3 will be removed (change this with--min-abs-gene-coverage
), all genes with mean kmer coverage < 5% of the sample coverage will be removed (change this with--min-rel-gene-coverage
) and all genes with mean kmer coverage > 100x of the sample coverage will be removed (change this with--max-rel-gene-coverage
). This is a notable change from previous versions ofpandora
, especially for small datasets, where these values were hard-coded, different, and were just triggered if the sample coverage was >= 20x (i.e. if you had a toy dataset, these filters wouldn't trigger whereas now they do). Also please do use the--debugging-files
flag so that we can debug the run later if required.I am assuming you will run this in codon, so I will soon give you a path to an executable in codon that you can use