iqbal-lab-org / pandora

Pan-genome inference and genotyping with long noisy or short accurate reads
MIT License

Improved mapping #344

Closed — leoisl closed this 7 months ago

leoisl commented 11 months ago

Summary

@Danderson123 has been having some issues using pandora on his artificial amira dataset. This artificial dataset is simply composed of several PRGs, each one representing an AMR gene and built from its MSA, and he tries to map the first allele of each MSA back to its respective PRG with pandora. This should be a trivial task, and we should be getting 99.9% accuracy. However, his use case is not that simple, because many genes are contained inside others (he is using panaroo to cluster the alleles, there are thresholds to separate genes, and there are always edge cases, e.g. if two alleles are otherwise identical but the shorter one has a deletion spanning 30% of the original allele, do we classify it as the same gene or as a new gene?). So, although it is a simple artificial dataset, we should expect some shared regions between unrelated graphs due to the way the alleles are clustered. This way of clustering alleles is standard, so in general I think pandora should be aware of this and behave well in this situation.

At first he was getting ~91% accuracy. PR https://github.com/rmcolq/pandora/pull/343 increased that to ~94% (I reported 94.7% on that PR, but that is imprecise, as it was measured on a subset of the input). This PR increases the accuracy to 99.6%, and was produced by Dan and me iteratively exchanging code, ideas, and benchmarks. I'd be up for pushing this accuracy to 99.9%, but that can be done later.

Details

This PR improves pandora mapping through better detection and selection of conflicting clusters.

  1. Conflicting clusters detection

Conflicting clusters happen in pandora when we have two good clusters mapping to two different genes but stemming from the same region of the read, so we have to choose one. Prior to this PR, this was done simply by checking whether one cluster contained the other. This containment can be understood as an envelopment: e.g. if cluster 1 stems from a region of the read spanning the interval [200, 600], then it envelops cluster 2 spanning [400, 600], so these clusters are declared conflicting and we have to choose one. In several of @Danderson123's examples, this envelopment did not happen because the smaller cluster would match one or two more minimisers, spanning a few more bases, e.g. [400, 610]; then the clusters would no longer conflict and we would report both of them, whereas reporting only the first cluster is clearly the correct choice. In one extreme example, a single additional base was mapped and the clusters were inferred as not conflicting! We've now changed this: we compute the overlap between the two clusters, and if that overlap covers >80% of the smaller cluster, we declare a conflict. This 80% is a default value that can be changed with a new parameter, --min-cluster-overlap:

  --min-cluster-overlap FLOAT Minimum proportion of overlap between two clusters of hits to consider them conflicting. Only one of the conflicting clusters will be kept. [default: 0.8]
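The overlap check described above can be sketched as follows. This is an illustrative Python sketch, not pandora's actual C++ implementation; the interval arithmetic and the 0.8 default just follow the description above:

```python
def clusters_conflict(a, b, min_cluster_overlap=0.8):
    """Decide whether two clusters of hits conflict.

    `a` and `b` are (start, end) intervals of the read region each
    cluster stems from. They conflict when their overlap covers more
    than `min_cluster_overlap` of the smaller cluster (the old rule
    required full envelopment).
    """
    overlap = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    smaller = min(a[1] - a[0], b[1] - b[0])
    return smaller > 0 and overlap / smaller > min_cluster_overlap

# The example from the text: [400, 610] is no longer enveloped by
# [200, 600], but its overlap (200/210 ≈ 95%) still exceeds 80%,
# so the clusters are now detected as conflicting.
print(clusters_conflict((200, 600), (400, 610)))  # True
```

Under the old envelopment rule the second cluster would have escaped the conflict by mapping those extra 10 bases; the proportional overlap makes the check robust to that.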
  2. Conflicting clusters selection

Once we have decided we have two conflicting clusters, we have to choose the best one. Analysing data from @Danderson123 results on his artificial dataset, we figured out that we would get better results by:

  1. Selecting the cluster that contains the highest number of unique minimisers, with a tolerance (default is 5%). By unique minimisers we mean that if a minimiser maps to a single place or to 10 places in a PRG, we count it just once, i.e. we no longer multicount minimisers that map to several different regions of a graph;
  2. If both clusters have a similar number of unique minimisers (within the tolerance), then we choose the cluster that best "fits" the graph, i.e. the one with the highest target coverage. The target coverage is the number of unique minimising kmers the cluster goes through divided by the number of minimisers in the max path of the PRG;
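The two selection steps above can be sketched like this (an illustrative sketch, not pandora's actual code; I'm assuming here that the 5% tolerance is applied relative to the larger unique-minimiser count):

```python
def select_cluster(c1, c2, tolerance=0.05):
    """Pick the better of two conflicting clusters.

    Each cluster is a dict with:
      - 'unique_minis': number of unique minimisers (a minimiser hitting
        10 places in the PRG still counts once)
      - 'target_cov': unique minimising kmers the cluster goes through,
        divided by the number of minimisers in the PRG's max path
    """
    m1, m2 = c1["unique_minis"], c2["unique_minis"]
    # Step 1: a clearly higher unique-minimiser count wins outright.
    if abs(m1 - m2) > tolerance * max(m1, m2):
        return c1 if m1 > m2 else c2
    # Step 2: within tolerance, the cluster that best fits its PRG wins.
    return c1 if c1["target_cov"] >= c2["target_cov"] else c2

# 175 vs 61 unique minimisers: step 1 decides.
# 175 vs 174 is within 5%, so target coverage decides (0.994 vs 0.651).
```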

The *.clusters_filter_report files were improved to reflect all this information, so that we can understand and debug the choices pandora makes on real datasets. Here is an example of some lines from this file for the amira artificial data, showing that pandora preferred to map an allele of group_23595 to its own PRG:

```
read    prg nb_of_unique_minimisers target_cov  read_start  read_end    overlap status  favoured_cluster_prg    favoured_cluster_nb_of_unique_minimisers    favoured_cluster_target_cov favoured_cluster_read_start favoured_cluster_read_end
group_23595 group_9877  61  0.663043    501 920 1   filtered_out    group_22365 175 0.650558    501 1762
group_23595 group_22365 175 0.650558    501 1762    1   filtered_out    group_23595 174 0.994286    502 1762
group_23595 group_23595 174 0.994286    502 1762    NA  kept
```

The data above shows that we filtered out the mapping to PRG group_9877 in favour of the mapping to group_22365, because group_22365 has many more unique minimisers (175 vs 61). However, when solving the next conflict, even though the mapping to group_22365 has more unique minimisers than the mapping to group_23595 (175 vs 174), the difference falls within the 5% tolerance, so we then look at the target coverage. The mapping to group_23595 has a much better target coverage than the mapping to group_22365 (99.4% vs 65%), and thus we choose to map group_23595 to its own PRG.
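As a sanity check on the arithmetic in this walkthrough (illustrative only; again assuming the tolerance is taken relative to the larger count):

```python
# Conflict 1: group_9877 (61 minis) vs group_22365 (175 minis).
rel_diff = abs(175 - 61) / 175
print(rel_diff > 0.05)  # True -> minimiser count decides: keep group_22365

# Conflict 2: group_22365 (175 minis, cov 0.650558)
#          vs group_23595 (174 minis, cov 0.994286).
rel_diff = abs(175 - 174) / 175
print(rel_diff > 0.05)       # False -> within the 5% tolerance
print(0.994286 > 0.650558)   # True  -> coverage decides: keep group_23595
```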

This unique-minimisers tolerance for choosing the cluster is 5% by default, but can be configured with a new CLI parameter:

  --cluster-mini-tolerance FLOAT
                              When two clusters of hits are conflicting, the one with highest number of unique minimisers will be kept. However, if the difference between the number of unique minimisers is too small, less than this parameter, then we will prefer the cluster that has higher target coverage. [default: 0.05]

Core code

The core code for this PR is just these two parts; the rest is "supporting" code:

  1. Conflicting clusters detection: excerpt 1 and excerpt 2
  2. Conflicting clusters selection: excerpt 1

Evaluation

  1. Amira results: @Danderson123 ran this version on his amira dataset, and the accuracy of detecting alleles improved from ~94% to 99.6%;

  2. 4-way results

I ran this on the 4-way benchmark (nanopore data only). The no-denovo results clearly improved (blue curve = current release, orange = dev, green = this PR):

[figure: 4-way no-denovo results; blue = current release, orange = dev, green = this PR]

With denovo, however, it is arguable whether this PR improves the results (blue curve = current release, orange = dev, green = this PR). While we do increase recall from 82% to 83.8%, this comes with an increase in error rate from 0.17% to 0.25%, compared to dev. This contradicts the no-denovo results and the amira results, so it is a bit confusing to me...

[figure: 4-way denovo results; blue = current release, orange = dev, green = this PR]

iqbal-lab commented 11 months ago

I'm not going to think about this til next week, but here's an idea for you. Suppose your mapping process is fixed, and we compare how it works on two different PRGs. In prg1, we overcluster a bit, so sometimes two different genes end up in one MSA. In prg2, we undercluster, so sometimes one gene is split in two.

How does your change in mapping interact with these? Is it better with prg1 or prg2?

Dan's prg is constructed differently to the 4-way one.

I wonder how the 4way does with his new e coli prg

leoisl commented 11 months ago

In prg1, we overcluster a bit, so sometimes two different genes end up in one msa

In this case, both the code proposed in this PR and the code prior to this PR will behave similarly. If we overcluster a set of similar genes into a single PRG, we can assume that when we map a region from that set of genes, it will generate a single cluster of hits to that PRG only. Thus we won't have conflicting clusters, and the improvements in this PR, which deal with conflicting-cluster detection and selection, won't have much of an effect. Overclustering creates another type of problem that pandora still does not handle: listing several maximum-likelihood paths in the PRGs when there is evidence of overclustering or mixed samples, instead of always generating a single one... But that is out of the scope of this PR, I guess...

In prg2, we undercluster, so sometimes one gene is split in two.

In this other case, when we map a region of a gene, we might map it to two or more PRGs due to underclustering. Here the improvements described in this PR for conflicting-cluster detection and selection will have a significant effect, as we are able to detect conflicting clusters much better, with fewer restrictions, and to select the PRG that better fits the mapped read (well, at least in our benchmarks). pandora prior to this PR has only limited support for underclustering: it does a good job of assigning a random candidate PRG when a read has several equally good mappings, as evidenced by the improved results in Roundhound once this was added. But when that situation does not hold, e.g. a set of similarly good but not identical mappings, it struggles to detect conflicting clusters and to select the best one. And my guess is that this is also happening in many of our use cases, including Roundhound, since the more restricted situation (identical mappings) is already happening there.

I wonder how the 4way does with his new e coli prg

I can run the 4-way evaluation on his PanRG also.

Closing remarks

We could entertain the argument that it is the user's responsibility to provide a cleaner PanRG, where over- and underclustering are reduced, but in practice this seems complicated. The parameters we set for the tools we use to cluster alleles into genes pre-MSA usually don't work well for all genes, and will always undercluster and overcluster, sometimes for a large fraction of the genes. It is also hard to quantify and fix this. I can speak from my own experience using mmseqs2 to cluster the plasmid genes, and I think Dan is having the same issue with panaroo and his AMR genes. And we are, in principle, devs/advanced users of pandora. There are also predefined sets of clusters (e.g. the panX and piggy clusters we used in the paper) for which we don't control these parameters at all...

What I am trying to say is that it is complicated for the user to build a very-high-quality PanRG, so we should expect and model over- and underclustering in the PanRG in pandora. This PR is a first attempt at modelling underclustering, and it shows significant improvements in Dan's benchmark and good improvements in the 4-way benchmark, so it is a change we should consider merging.

mbhall88 commented 11 months ago

I think it would be good to understand why the denovo results regress compared to dev (orange). Do you have any idea what is causing that?

leoisl commented 10 months ago

I think it would be good to understand why the denovo results regress compared to dev (orange). Do you have any idea what is causing that?

No idea, but I've started investigating/debugging. It will be delayed, though, as the cluster is down for basically the whole week. I will update here when I understand and can explain what is happening.

leoisl commented 7 months ago

Merging, quoting Zam in slack:

I say merge. We'll deal with any future issues