iqbal-lab-org / viridian

MIT License
15 stars 5 forks source link

Primers contributing to Clean.Tot.cons #100

Open martinghunt opened 1 year ago

martinghunt commented 1 year ago

We have examples where there is an amplicon with two right-hand primers. Neither primer is excluded at the primer/amplicon identification stage, and both are provided to cylon.

Observed behaviour: in all_stats.tsv, at the "inner" (or left most, whatever you want to call it) right-hand primer position, the reads from the "inner primer" are contributing to Clean.Tot.cons.

Expected behaviour: those reads do not contribute to Clean.Tot.cons.

An example is #99 , in amplicon SARS-CoV-2_76.

This may be confounded by the adjacent amplicon 77 being dropped, I don't know. The behaviour below may also be happening when the adjacent amplicon is not dropped - this needs checking.

For the sample in #99 this is what happens:

  1. no primers for amplicon SARS-CoV-2_76 are excluded (a separate issue to this one)
  2. all the reads arise from the leftmost primer SARS-CoV-2_76_RIGHT (coords 23029-23057).
  3. therefore the only read coverage at 23029-23057 is from reads ending with primer SARS-CoV-2_76_RIGHT . The primer is not (as far as the code knows) at the end of the amplicon - the decision earlier was to also keep SARS-CoV-2_76_RIGHT_alt1 (coords 23121-23141).
  4. all_stats.tsv reports non-zero Clean.Tot.cons values at 23029-23057 - they are fairly close to the values in RawDepth.
jeff-k commented 1 year ago

The primer id stage before cylon is different than what is used for masking the consensus. When an adjacent amplicon is dropped the overhanging primer bases will be considered, so we would expect those reads to contribute to Clean.Tot.cons.

Are there examples where this happens when the adjacent amplicon is not dropped?

martinghunt commented 1 year ago

This issue has nothing to do with cylon. It's about when primer parts of reads are used to contirbute to good or bad coverage for masking, and in the stats.

We have sequence chunks like this in the amplicon from left to right:

L = left chunk (I don't care about left primers here) P1 = primer1. Q = sequence between primer1 and primer2 P2 = primer2.

and the subsequent amplicon is dropped.

I would expect that the P1 part of the reads that were made during PCR from P1 do not count towards the good depth, since there should be reads that arose from P2 during PCR. Those reads from P2 should also include the P1 sequence and can be relied upon for their P1 sequence.

jeff-k commented 1 year ago

I'm sorry I don't quite understand. How does this differ from what we're doing right now?

martinghunt commented 1 year ago

Current behaviour: Reads originating from primer P1 contribute to good coverage in the P1 region of the consensus. Reads originating from primer P2 contribute to good coverage in the P1 region of the consensus consensus.

Expected behaviour: Reads originating from primer P1 do not contribute to coverage in the P1 region of the consensus. Reads originating from primer P2 contribute to good coverage in the P1 region of the consensus consensus.

jeff-k commented 1 year ago

P1 and P2 are alts on the same side of an amplicon, right? Do you have an example?

iqbal-lab commented 1 year ago

this simulation is the example @jeff-k

jeff-k commented 1 year ago

In #99's example we are adjacent to a missing amplicon. This means that primer bases will be considered no matter what. I'm not sure how we can determine that the current behaviour is deviating from the expected behaviour.