Coarse-grained simulation and evaluation framework to evaluate pandora mapping

leoisl commented 2 years ago

The SAM file pandora is producing allows us to understand exactly how it is mapping regions of reads to PRGs. However, we did not yet evaluate this mapping in details in a coarse-grained way. We can do it with a simulated dataset. This is also related with @Danderson123 project.

1. Main input: PanRG

The main input for our simulation is a PanRG, not a genome. The idea is that from a PanRG we can simulate a genome with possibly unknown regions between the PRGs so that we have a perfect annotation, and know exactly which regions of reads map to which PRG.

2. Simulation algorithm

This is the main algorithm to build a simulated genome G, simulated annotation A, and simulated reads R:

For each PRG P in the PanRG, choose a random path and transform it to a sequence, which will be the allele of P in G (this can be easily done with the pandora random subcommand);
Build a copy number profile for the PanRG. This is basically assigning an integer to each PRG meaning how many copies of that gene is present in G. An integer 0 means the gene is absent from G, 1 means it is present just once, >1 means there are several copies of the gene. The profile is derived from 6 probabilities: CN0, CN1, ..., CN5 which denotes the probability of a gene having CNn copies. These probabilities must sum to 1. For simplicity, in the first iteration, we will just work with CN0 and CN1;
(Step skipped for now): randomly mutates gene copies WRT a mutation profile;
Given which genes are in G and how many times it appears in G, create an order of these genes with strands assigned randomly. This will give us a gene ordering of G, e.g.: gene4+ -> gene2- -> gene1+ -> gene3+ -> gene3-.
Generate genome G using the gene ordering obtained in step 4 by translating the order and the strands to sequences and concatenating them;
Generate an annotation A (gff3) of G using the gene ordering obtained in step 4;
(Optional step): Generate random intergenic regions between two genes in G. This is parameterised by PIR (Probability of Intergenic Region - how probable is two genes have an intergenic region), MinIR (min length of the intergenic region) and MaxIR (max length of the intergenic region). In every junction between two genes, we check if we should add a random intergenic region between them. If we do so, we have to shift the annotation A to the right of the genes that come after this added intergenic region. Bacteria rarely have intergenic regions, but this can also simulate an incomplete annotation;
We simulate long reads from G, with an error rate and coverage profile;

3. Output and evaluation

By the end of all these steps, we will have a set of long reads R from G, where we know exactly where each read come from in G. Using the annotation, we know exactly which region of the read come from each gene. We can then run pandora mapping R to the PanRG, and look at the SAM and see if we were able to map each region of each read to their gene of origin.

4. Possible variations

There are many variations of this pipeline that we can study by changing simulation parameters. We can for example see how pandora mapping accuracy change when we add several genes with copy number 2, 3, 4, etc. We can see the effect of increasing/decreasing mutation rate between copies. We can add frequent and large intergenic regions to simulate a badly annotated genome, etc.

5. Conclusion

I think this is a good way to build a simulated dataset to improve pandora for our current analyses. Note that this simulation and evaluation framework does not care about variations between the genome and the simulated reads, nor try to evaluate if pandora genotypes these variations correctly. This was already extensively evaluated in the pandora paper. This framework concerns with the larger picture. It is all about mapping regions of reads to the correct PRGs, and this is what the current @Danderson123 and @LeahRoberts project is mostly about.

6. Feedback

Could I have any feedback on this from you @iqbal-lab @rmcolq @mbhall88 @Danderson123 @LeahRoberts ? @iqbal-lab and @Danderson123 IDK how this simulation and evaluation framework relates with the current simulations you are working on for @Danderson123 project. There might be several overlaps between this and your simulations. However, the objectives are different I think. Here I just want to debug if pandora is able to map reads to the correct genes, and how copy number, mutation and error rate affects this mapping. I will not evaluate gene ordering here.

I will wait for your approval or rejection of this development direction. I have to say, however, that trying to evaluate this directly on real data is way too complicated for a long list of reasons...

iqbal-lab commented 2 years ago

My understanding is

@danderson123 is building a prototype gene order de bruijn graph based on the Pandora Sam. @martinghunt has given him test cases and simulated perfect reads. His goal is A. to confirm his graph is correctly constructed given perfect input B. to understand how his graph properties vary with read length distribution, given the length distribution of genes. To do this, he might as well have mocked perfect Pandora Sam files.
Leandro wants to know if our mapping is working, partly for Dan, but also for basic Pandora. For that, the proposal above makes sense to me. But I'm not going to think hard til Monday

leoisl commented 2 years ago

Alright, we can discuss it in details on Monday. While I do think this is important, we might have higher-priority tasks...

iqbal-lab commented 2 years ago

OK, just to clarify myself, when I said "the above makes sense", I mean your proposal leandro here

https://github.com/rmcolq/pandora/issues/301#issue-1425844895

mbhall88 commented 2 years ago

That seems like a sane approach to me. I used a similar (slightly more naive) approach when evaluating the very first iteration of denovo discovery.

iqbal-lab-org / pandora