Open leoisl opened 2 years ago
My understanding is
@danderson123 is building a prototype gene order de bruijn graph based on the Pandora Sam. @martinghunt has given him test cases and simulated perfect reads. His goal is A. to confirm his graph is correctly constructed given perfect input B. to understand how his graph properties vary with read length distribution, given the length distribution of genes. To do this, he might as well have mocked perfect Pandora Sam files.
Leandro wants to know if our mapping is working, partly for Dan, but also for basic Pandora. For that, the proposal above makes sense to me. But I'm not going to think hard til Monday
Alright, we can discuss it in details on Monday. While I do think this is important, we might have higher-priority tasks...
OK, just to clarify myself, when I said "the above makes sense", I mean your proposal leandro here
https://github.com/rmcolq/pandora/issues/301#issue-1425844895
That seems like a sane approach to me. I used a similar (slightly more naive) approach when evaluating the very first iteration of denovo discovery.
The SAM file
pandora
is producing allows us to understand exactly how it is mapping regions of reads to PRGs. However, we did not yet evaluate this mapping in details in a coarse-grained way. We can do it with a simulated dataset. This is also related with @Danderson123 project.1. Main input: PanRG
The main input for our simulation is a PanRG, not a genome. The idea is that from a PanRG we can simulate a genome with possibly unknown regions between the PRGs so that we have a perfect annotation, and know exactly which regions of reads map to which PRG.
2. Simulation algorithm
This is the main algorithm to build a simulated genome
G
, simulated annotationA
, and simulated readsR
:P
in the PanRG, choose a random path and transform it to a sequence, which will be the allele ofP
inG
(this can be easily done with thepandora random
subcommand);G
. An integer0
means the gene is absent fromG
,1
means it is present just once,>1
means there are several copies of the gene. The profile is derived from 6 probabilities:CN0
,CN1
, ...,CN5
which denotes the probability of a gene havingCNn
copies. These probabilities must sum to 1. For simplicity, in the first iteration, we will just work withCN0
andCN1
;G
and how many times it appears inG
, create an order of these genes with strands assigned randomly. This will give us a gene ordering ofG
, e.g.:gene4+ -> gene2- -> gene1+ -> gene3+ -> gene3-
.G
using the gene ordering obtained in step 4 by translating the order and the strands to sequences and concatenating them;A
(gff3) ofG
using the gene ordering obtained in step 4;G
. This is parameterised byPIR
(Probability of Intergenic Region - how probable is two genes have an intergenic region),MinIR
(min length of the intergenic region) andMaxIR
(max length of the intergenic region). In every junction between two genes, we check if we should add a random intergenic region between them. If we do so, we have to shift the annotationA
to the right of the genes that come after this added intergenic region. Bacteria rarely have intergenic regions, but this can also simulate an incomplete annotation;G
, with an error rate and coverage profile;3. Output and evaluation
By the end of all these steps, we will have a set of long reads
R
fromG
, where we know exactly where each read come from inG
. Using the annotation, we know exactly which region of the read come from each gene. We can then runpandora
mappingR
to the PanRG, and look at the SAM and see if we were able to map each region of each read to their gene of origin.4. Possible variations
There are many variations of this pipeline that we can study by changing simulation parameters. We can for example see how
pandora
mapping accuracy change when we add several genes with copy number 2, 3, 4, etc. We can see the effect of increasing/decreasing mutation rate between copies. We can add frequent and large intergenic regions to simulate a badly annotated genome, etc.5. Conclusion
I think this is a good way to build a simulated dataset to improve
pandora
for our current analyses. Note that this simulation and evaluation framework does not care about variations between the genome and the simulated reads, nor try to evaluate ifpandora
genotypes these variations correctly. This was already extensively evaluated in thepandora
paper. This framework concerns with the larger picture. It is all about mapping regions of reads to the correct PRGs, and this is what the current @Danderson123 and @LeahRoberts project is mostly about.6. Feedback
Could I have any feedback on this from you @iqbal-lab @rmcolq @mbhall88 @Danderson123 @LeahRoberts ? @iqbal-lab and @Danderson123 IDK how this simulation and evaluation framework relates with the current simulations you are working on for @Danderson123 project. There might be several overlaps between this and your simulations. However, the objectives are different I think. Here I just want to debug if
pandora
is able to map reads to the correct genes, and how copy number, mutation and error rate affects this mapping. I will not evaluate gene ordering here.I will wait for your approval or rejection of this development direction. I have to say, however, that trying to evaluate this directly on real data is way too complicated for a long list of reasons...