iqbal-lab-org / viridian

MIT License
15 stars 5 forks source link

Deciding when to exclude primers #99

Open martinghunt opened 1 year ago

martinghunt commented 1 year ago

We have examples where there is a choice of two primers for the end of a given amplicon. Only one primer of the two primers has support from the reads.

Observed behaviour: Both primers are kept, instead of excluding the primer that isn't supported by reads. Cylon is given amplicon coords as if both primers are present (and presumably everything downstream also thinks both primers there) Expected behaviour: The primer not supported by reads is excluded.

Details are below for one sample.

Path on codon: /nfs/research/zi/mhunt/Viridian/Debug_sims_lily_20220930/Samples/1441/vwf.20220926.ccf5ef4676.ont.frs0.6/. Current working directory was /nfs/research/zi/mhunt/Viridian/Debug_sims_lily_20220930/Samples/1441/ when running it.

Version of viridian workflow: ccf5ef46765ad30cb11e9a3df716efdb2d11a4f9

The primer scheme is ARTIC 4.1. The run is using simulated Nanopore reads.

Amplicon SARS-CoV-2_76 is present at around 380X depth. The next amplicon SARS-CoV-2_77 is not present - it has zero reads.

Amplicon SARS-CoV-2_76 has two right primers:

  1. SARS-CoV-2_76_RIGHT at coords 23029-23057.
  2. SARS-CoV-2_76_RIGHT_alt1 at coords 23121-23141.

Read mapping shows that:

  1. the reads for amplicon SARS-CoV-2_76 all come from the primer SARS-CoV-2_76_RIGHT.
  2. there is zero depth for primer SARS-CoV-2_76_RIGHT_alt1.

Excerpt from samtools depth -aa on the BAM made by viridian (after coordinate sorting it):

MN908947        23024   391
MN908947        23025   389
MN908947        23026   382
MN908947        23027   395
MN908947        23028   392
MN908947        23029   384
MN908947        23030   372
MN908947        23031   386
MN908947        23032   381
MN908947        23033   364
MN908947        23034   376
MN908947        23035   351
MN908947        23036   379
MN908947        23037   386
MN908947        23038   384
MN908947        23039   384
MN908947        23040   388
MN908947        23041   390
MN908947        23042   386
MN908947        23043   389
MN908947        23044   384
MN908947        23045   383
MN908947        23046   388
MN908947        23047   384
MN908947        23048   375
MN908947        23049   386
MN908947        23050   374
MN908947        23051   385
MN908947        23052   376
MN908947        23053   367
MN908947        23054   336
MN908947        23055   325
MN908947        23056   304
MN908947        23057   288
MN908947        23058   2
MN908947        23059   1
MN908947        23060   1
MN908947        23061   1
MN908947        23062   0
MN908947        23063   0
MN908947        23064   0
MN908947        23065   0
MN908947        23066   0
MN908947        23067   0
MN908947        23068   0
MN908947        23069   0

and continues at zero depth until around the start of amplicon 78:

MN908947        23211   0
MN908947        23212   0
MN908947        23213   0
MN908947        23214   0
MN908947        23215   0
MN908947        23216   1
MN908947        23217   1
MN908947        23218   1
MN908947        23219   1
MN908947        23220   302
MN908947        23221   314
MN908947        23222   337
MN908947        23223   343
MN908947        23224   352
MN908947        23225   355
MN908947        23226   360
MN908947        23227   354
MN908947        23228   357

cylon is given this in the file ampicons.json for SARS-CoV-2_76:

    "SARS-CoV-2_76": {
      "start": 22648,
      "end": 23140,
      "left_primer_end": 22773,
      "right_primer_start": 23028
    },

This means that primers SARS-CoV-2_76_RIGHT and SARS-CoV-2_76_RIGHT_alt1 have been counted as present in the reads. Expected behaviour is that only SARS-CoV-2_76_RIGHT is present.

Screenshot attached showing the mapped reads in this region. Amplicon SARS-CoV-2_76 is highlighted (it also has two left primers. The highlighted part is using the "inner" of the two primers at each end of the amplicon).

image (2)
jeff-k commented 1 year ago

The logic for identifying primers before cylon is run is different than the later primer id logic that we use to mask primer positions in the consensus.

Unfortunately, if a fragment ends between two alternative primers and the adapter sequence happens to be similar enough to the reference not to be soft-clipped we have no way to be sure if it's legitimately a fragment that ends in that region, or a fragment with adapter noise that ends there.

Luckily we have a couple advantages over the data here:

In summary, there are going to be annoying sequencing artefacts that we need to catch so we have to use additional polishing and QC filters.

Simulated test cases are very helpful for identifying these kinds of artefacts and testing at the component level. Since there isn't much that we can do in this situation, the best we can do is tweak our QC model to reduce the impact.

To adjust the QC thresholds we should make sure that we look at real world examples of this. Do we have a way to identify this problem in real data?

martinghunt commented 1 year ago

This issue is solely about how reads are allocated to amplicons/primers, and then deciding which primers to reject. It's not about QC.

The reads unambiguously are from SARS-CoV-2_76_RIGHT based on read mapping. The reads are mapped with the end coordinate being 23057. samtools depth:

MN908947        23054   336
MN908947        23055   325
MN908947        23056   304
MN908947        23057   288
MN908947        23058   2
MN908947        23059   1
MN908947        23060   1
MN908947        23061   1
MN908947        23062   0
MN908947        23063   0

The depth of the remainder of the amplicon, and indeed past it, is zero. Therefore SARS-CoV-2_76_RIGHT_alt1 is not present.

The only noise here is the depth of 2, 1, 1, 1 at positions 23058-23061.

The expected behaviour is that primer SARS-CoV-2_76_RIGHT_alt1 is excluded.

jeff-k commented 1 year ago

Before running cylon we detect which primers are present for each amplicon. We do this because we want to include cases where both alternate primers are present in the sequencing data. We construct a histogram of observed primers for each amplicon and use this to decide which ones are actually there.

Unfortunately, it appears that it is possible for adapter sequences to trick this stage.

Luckily, we have downstream QC and tighter primer id to compensate for these cases.

martinghunt commented 1 year ago

Firstly, the code should work on these simulated reads and exclude primer SARS-CoV-2_76_RIGHT_alt1. The read mapping ends exactly at the end of primer SARS-CoV-2_76_RIGHT (except those few erroneous depths of 2,1,1,1, which should not be enough to trick into thinking there's another primer downstream of there).

Secondly, there's a minimal example of simulated nanopore reads attached that reproduces this error. Perfect reads from amplicon 75 and from amplicon 76. All reads span exactly from the start of a left primer to end of a right primer. Reads do not have any adapter and they are literally identical to the reference genome. They include each primer to its end, and have zero extra bases past the primer ends. The reads are at 200X, with 100X from each strand.

Amplicon 75 reads are from primer SARS-CoV-2_75_LEFT to primer SARS-CoV-2_75_RIGHT. Amplicon 76 reads are from primer SARS-CoV-2_76_LEFT to primer SARS-CoV-2_76_RIGHT.

Expected behaviour: the primer SARS-CoV-2_76_RIGHT_alt1 (coords 23121-23141) is excluded.

Observed behaviour: the amplicon coords given to cylon are:

    "SARS-CoV-2_76": {
      "start": 22648,
      "end": 23140,
      "left_primer_end": 22773,
      "right_primer_start": 23028
    },

which means that the primer SARS-CoV-2_76_RIGHT_alt1 has not been excluded.

reads.perfect_ont.fq.gz

jeff-k commented 1 year ago

Thank you for the example. Please note that there is a bit more logic behind identifying the target primers: https://github.com/jeff-k/viridian_workflow/blob/6ea9cdbf7660508b41cf0def51b17c3ae4626054/viridian_workflow/readstore.py#L320

In short, we reason about the "primer region" during this first amplicon id pass, that being the interval spanned by min(p1_start, p2_start) and max(p1_end, p2_end) when there are alternate primers.

When one of the multiple primers is not observed above the primer region threshold (default 100) it is excluded from this region. When neither of the primers are above this threshold (as in this example), we assume the largest extent of the amplicon. This is to safely accommodate data which may be fragmented.

Please note that these preliminary primer bounds are not related to the consensus masking or qc processes. So the primers are not really "kept".

Is this behaviour causing problems for cylon?

iqbal-lab commented 1 year ago

@jeff-k - referring to this comment of yours (my bold)

"When one of the multiple primers is not observed above the primer region threshold (default 100) it is excluded from this region. When neither of the primers are above this threshold (as in this example), we assume the largest extent of the amplicon. "

i'm confused. One of the primers has coverage of 200x and the alt has zero. So, the alt should be excluded. What do you mean neither of the primers is above the threshold of 100?

iqbal-lab commented 1 year ago

No, i'm doubly confused.

  1. I don't know why jeff is saying neither primer has enough depth to count as being present.
  2. I don't know why Martin is saying the right primer isn't being excluded. In https://github.com/iqbal-lab-org/viridian_workflow/issues/99#issuecomment-1274906237 he shows a JSON with "right_primer_start": 23028, which is the start of SARS-CoV-2_76_RIGHT not SARS-CoV-2_76_RIGHT_alt1. So...that json is passing the correct coords to cylon, no?
jeff-k commented 1 year ago

Whoops, I interpreted the attached file as a fastq based on the extension and only counted half the records.

Either way, yes, for point 2 this code appears to be working as described.

martinghunt commented 1 year ago

I don't know if this is agreeing that there is a bug or disagreeing that there is bug and that the code does not need changing: "Either way, yes, for point 2 this code appears to be working as described."

The right primer is not being excluded. As described below.

Amplicon SARS-CoV-2_76 has two right primers:

  1. SARS-CoV-2_76_RIGHT at coords 23029-23057.
  2. SARS-CoV-2_76_RIGHT_alt1 at coords 23121-23141.

Cylon is given this:

"SARS-CoV-2_76": {
      "start": 22648,
      "end": 23140,
      "left_primer_end": 22773,
      "right_primer_start": 23028
    },

The end coordinate given to cylon is the end coordinate of the right primer SARS-CoV-2_76_RIGHT_alt1 (difference off 1 because of 0- or 1-based coordinates).

Expected behaviour is: "end": 23056, ie the end of the primer SARS-CoV-2_76_RIGHT.

Months ago it was agreed that this would be part of the method. When there is:

then all other alt primers there with no support would be excluded from that point onwards.

iqbal-lab commented 1 year ago

OK that answers my question here

https://github.com/iqbal-lab-org/viridian_workflow/issues/99#issuecomment-1275214193

I looked at that json and checked the right primer start coord which seemed to correctly be at 23028. I missed your main point was that the end was at 23140, which is the bug. That end cpord needs to be at the right hand end of 76-right, not 76-right-alt1

iqbal-lab commented 1 year ago

So, I withdraw my question 2, and now I see the bug again, sorry for confusion

jeff-k commented 1 year ago

Ah, I see the confusion. The start and end fields refer to the amplicon coordinates, and the left_primer_start and right_primer_end fields refer to the observed coverage by primers as specified.

Note that it would be somewhat redundant if all four fields represented the same data (+/- primer lengths).

Since the data required by cylon is available in the fields that are specifically labelled as the primer coords I believe that this issue can be closed.

martinghunt commented 1 year ago

It can't be closed because:

  1. the end coordinate is incorrect. Cylon requires the end coordinate to be correct.
  2. the question of why primer SARS-CoV-2_76_RIGHT_alt1 is counted as present, despite zero evidence of its presence, needs investigating. In both the simulated data and in the perfect reads.
jeff-k commented 1 year ago

I have double checked the primer data source and I can confirm that the right-most bound of the amplicon (that is, considering all possible amplicons) is correctly propagated to the end field.

In addition to this the coordinates of the observed primers are correctly identified in the primer start and end fields. In this case that means excluding the coords of alt1.

This has always been the specified behaviour of these fields and it would not be appropriate to relabel the JSON output that we have been generating already.

iqbal-lab commented 1 year ago

ok i'd like to have a conversation about this; i'm not feeling great today, so i suggest the conversation happens on friday. i'd like this to stay open til we have had that conversation.