jaleezyy / covid-19-signal

Files and methodology pertaining to the sequencing and analysis of SARS-CoV-2, causative agent of COVID-19.
MIT License

frameshift reporting #75

Closed agmcarthur closed 4 years ago

agmcarthur commented 4 years ago

We submitted 40 sequences that passed SIGNAL metrics, but 15 were rejected due to frameshifts. We should add this level of reporting/QC to ensure sequences are ready for GISAID.

Your submission Batch '20200610_RISC_CoV_submitted_by_TIBDN.xls' on http://gisaid.org/CoV2020 has been processed.

Currently 25 out of 40 sequences are released. 15 sequences have not made it through the curation check due to frameshift issues. The details are as follows:

HCoV-19/Canada/Toronto/2020/S123 Gap of 2 nucleotide(s) found at refpos 2444 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2458 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2462 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2476 (FRAMESHIFT). Gap of 2 nucleotide(s) found at refpos 2480 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2484 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2487 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2493 (FRAMESHIFT). Gap of 10 nucleotides when compared to the reference sequence.

HCoV-19/Canada/Toronto/2020/S14 Gap of 1 nucleotide(s) found at refpos 19383 (FRAMESHIFT). Gap of 10 nucleotide(s) found at refpos 19496 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19509 (FRAMESHIFT). Gap of 12 nucleotides when compared to the reference sequence. NSP4 is missing. %UniqueMutations 0.43%.

HCoV-19/Canada/Toronto/2020/S162 Gap of 2 nucleotide(s) found at refpos 2633 (FRAMESHIFT). Gap of 22 nucleotide(s) found at refpos 2676 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2704 (FRAMESHIFT). Gap of 4 nucleotide(s) found at refpos 2709 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2730 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2734 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2740 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2745 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2749 (FRAMESHIFT). Gap of 5 nucleotide(s) found at refpos 8656 (FRAMESHIFT). Gap of 2 nucleotide(s) found at refpos 8668 (FRAMESHIFT). Gap of 2 nucleotide(s) found at refpos 8673 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8677 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8681 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8686 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8746 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8755 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8757 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8814 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8820 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8825 (FRAMESHIFT). Gap of 2 nucleotide(s) found at refpos 8829 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8833 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8840 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 8853 (FRAMESHIFT). Gap of 2 nucleotide(s) found at refpos 8860 (FRAMESHIFT). Gap of 59 nucleotides when compared to the reference sequence. NS7a is missing. NSP4 is missing. %UniqueMutations 0.10%.

HCoV-19/Canada/Toronto/2020/S20 Gap of 1 nucleotide(s) found at refpos 9272 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9373 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9377 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9381 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9387 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9401 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9409 (FRAMESHIFT). Gap of 2 nucleotide(s) found at refpos 9415 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9461 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 9473 (FRAMESHIFT). Gap of 2 nucleotide(s) found at refpos 9478 (FRAMESHIFT). Gap of 13 nucleotides when compared to the reference sequence. NSP16 is missing. %UniqueMutations 0.40%.

HCoV-19/Canada/Toronto/2020/S201 Gap of 1 nucleotide(s) found at refpos 19515 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 27571 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 27583 (FRAMESHIFT). Gap of 5 nucleotide(s) found at refpos 27597 (FRAMESHIFT). Gap of 8 nucleotides when compared to the reference sequence. NS7a is missing. NSP2 is missing. NSP14 is missing. NS7a is missing. NSP2 is missing. NSP14 is missing.

HCoV-19/Canada/Toronto/2020/S223 Gap of 10 nucleotide(s) found at refpos 2264 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2282 (FRAMESHIFT). Gap of 11 nucleotides when compared to the reference sequence. NSP2 is missing. NSP2 is missing.

HCoV-19/Canada/Toronto/2020/S3 Gap of 1 nucleotide(s) found at refpos 19361 (FRAMESHIFT). Gap of 1 nucleotides when compared to the reference sequence. NSP14 is missing. NSP14 is missing.

HCoV-19/Canada/Toronto/2020/S309 Gap of 10 nucleotide(s) found at refpos 19383 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19443 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19471 (FRAMESHIFT). Gap of 13 nucleotide(s) found at refpos 19538 (FRAMESHIFT). Gap of 25 nucleotides when compared to the reference sequence. %UniqueMutations 0.31%.

HCoV-19/Canada/Toronto/2020/S330 Gap of 2 nucleotide(s) found at refpos 19478 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19502 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19504 (FRAMESHIFT). Gap of 10 nucleotides when compared to the reference sequence. NSP14 is missing. NSP14 is missing.

HCoV-19/Canada/Toronto/2020/S357 Gap of 5 nucleotide(s) found at refpos 687 (FRAMESHIFT). Gap of 5 nucleotides when compared to the reference sequence. NSP1 is missing. NSP1 is missing.

HCoV-19/Canada/Toronto/2020/S4 Gap of 1 nucleotide(s) found at refpos 2294 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2302 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2314 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2322 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2328 (FRAMESHIFT). Gap of 5 nucleotides when compared to the reference sequence. NS7a is missing. NSP2 is missing. NS7a is missing. NSP2 is missing.

HCoV-19/Canada/Toronto/2020/S46 Gap of 1 nucleotide(s) found at refpos 19319 (FRAMESHIFT). Gap of 1 nucleotides when compared to the reference sequence. NSP14 is missing. NSP14 is missing.

HCoV-19/Canada/Toronto/2020/S48 Gap of 8 nucleotide(s) found at refpos 687 (FRAMESHIFT). Gap of 8 nucleotides when compared to the reference sequence. NSP1 is missing. %UniqueMutations 0.54%.

HCoV-19/Canada/Toronto/2020/S6 Gap of 1 nucleotide(s) found at refpos 19300 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19317 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19528 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19536 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19541 (FRAMESHIFT). Gap of 5 nucleotides when compared to the reference sequence. NSP14 is missing. NSP14 is missing.

HCoV-19/Canada/Toronto/2020/S65 Gap of 7 nucleotide(s) found at refpos 686 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19327 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19334 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19336 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19366 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19376 (FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 19379 (FRAMESHIFT). Gap of 13 nucleotides when compared to the reference sequence. NSP1 is missing. %UniqueMutations 0.28%.
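To tabulate a message like the one above, a minimal Python sketch could pull the per-sample gap positions out of the text. The regular expressions only assume the wording seen in this particular email ("Gap of N nucleotide(s) found at refpos P (FRAMESHIFT)" and strain names starting with "HCoV-19/"); `parse_curation_report` is a hypothetical helper, not part of SIGNAL or any GISAID tool.

```python
import re

# One gap entry, worded as in the curation message quoted above (assumed format).
GAP_RE = re.compile(r"Gap of (\d+) nucleotide\(s\) found at refpos (\d+) \(FRAMESHIFT\)")
# Each record is assumed to start with the strain name, e.g. "HCoV-19/Canada/...".
NAME_RE = re.compile(r"^(HCoV-19/\S+)")


def parse_curation_report(text):
    """Return {strain_name: [(refpos, gap_length), ...]} for each record line."""
    report = {}
    for line in text.splitlines():
        line = line.strip()
        name = NAME_RE.match(line)
        if name is None:
            continue  # skip blank lines and the introductory sentences
        report[name.group(1)] = [
            (int(refpos), int(length)) for length, refpos in GAP_RE.findall(line)
        ]
    return report


example = (
    "HCoV-19/Canada/Toronto/2020/S223 Gap of 10 nucleotide(s) found at refpos 2264 "
    "(FRAMESHIFT). Gap of 1 nucleotide(s) found at refpos 2282 (FRAMESHIFT). "
    "Gap of 11 nucleotides when compared to the reference sequence. NSP2 is missing."
)
for strain, gaps in parse_curation_report(example).items():
    print(strain, gaps)
```

The per-sample total ("Gap of 11 nucleotides when compared to the reference sequence") is deliberately not matched, since it lacks the "(s)" and "refpos" parts of the pattern.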

agmcarthur commented 4 years ago

Showing 38 of 40 genomes submitted to GISAID (2 lack Ct values). All had >90% genome fraction and >90% of positions with at least 100x coverage. QUAST reports mismatch, N, and indel statistics, and I should have filtered on those. More importantly, average coverage was more than 2000x for all but one genome, so why is our pipeline creating indels?

[Images: indels, mismatch, Ns]

agmcarthur commented 4 years ago

ivar_variants.tsv files

fail.tar.gz pass.tar.gz

agmcarthur commented 4 years ago

Possible reasons for this result:

  1. Molecular biology - these are real coverage gaps via ARTIC amplification or Illumina sequencing.

  2. Ambiguity error - there is sequencing evidence for a nucleotide at these positions, but a base call could not be made, so these should have been labeled as N, not as a gap.

  3. Threshold error - a setting in our iVar pipeline is too stringent, we should have been able to call a base at these positions.

These data, as well as our latest 300+ isolate run (with lower MiSeq coverage), should be good data sets to evaluate these options. I'm hoping it's the last one.
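One way to start separating option 1 from option 2 would be to scan a consensus aligned against the reference and report runs of gaps separately from runs of N. The sketch below assumes the two sequences have already been pairwise aligned to equal length with `-` as the gap character; `classify_missing_runs` is a hypothetical helper and not something the pipeline currently provides.

```python
def classify_missing_runs(ref_aln, cons_aln):
    """Report runs of '-' and 'N' in an aligned consensus.

    Both arguments are rows of the same pairwise alignment (equal length).
    Returns a list of (kind, ref_start, length) tuples, where kind is "gap"
    or "N" and ref_start is the 1-based reference position of the run.
    """
    if len(ref_aln) != len(cons_aln):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    current = None  # [kind, ref_start, length] of the run being extended
    ref_pos = 0
    for r, c in zip(ref_aln, cons_aln.upper()):
        if r != "-":
            ref_pos += 1
        if r == "-":
            kind = None  # insertion relative to the reference: not a missing base
        elif c == "-":
            kind = "gap"
        elif c == "N":
            kind = "N"
        else:
            kind = None
        if kind is not None and current is not None and current[0] == kind:
            current[2] += 1
        else:
            if current is not None:
                runs.append(tuple(current))
            current = [kind, ref_pos, 1] if kind is not None else None
    if current is not None:
        runs.append(tuple(current))
    return runs


print(classify_missing_runs("ACGTACGTAC", "AC-TNNGTAC"))
```

Gap runs whose length is not a multiple of 3 are the ones that would shift the reading frame if they fall inside an ORF; N runs leave the frame intact.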

jaleezyy commented 4 years ago

Update (examining S330):

Indels are reported in the QUAST report of the original run (dated May 8, 2020, using an older version of the pipeline): [image]

However, re-running it with the updated pipeline (commit 550ad9b) shows no indels, possibly due to changes in pre-processing that improved consensus generation: [image]

fmaguire commented 4 years ago

The differences between the pipeline as of ~8th of May and commit 550ad9b, in order of my guess at the most likely culprit for this indel issue:

  1. the use of hisat2 for read mapping instead of bwa-mem
  2. using cutadapt to remove amplicon primer sequences instead of ivar trim with a .bed file
  3. trimmomatic to trim reads and remove adapters instead of trim_galore

jaleezyy commented 4 years ago

I suspected as much. The remaining samples above are going to be re-run, so we'll see if this finding is consistent (ultimately re-run all of them to refresh the data).

raphenya commented 4 years ago

I re-ran all samples using commit 550ad9b with ivar version 1.2.2, and 3 samples had indels, as follows:

Sample S4
Sample S357 and S65

agmcarthur commented 4 years ago

My request:

To the report text file add:

Please also add all of these to the report html and summary html.

We should probably add a QC warning for N's per 100 kbp, but I'm not sure what value to use. Does GISAID give any guidance?
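For reference, the N's per 100 kbp figure is just the count of uncalled bases scaled to 100,000 bases of sequence. A minimal sketch of that arithmetic follows; `ns_per_100kbp` is a hypothetical helper, and no warning threshold is set here because none has been agreed on in this thread.

```python
def ns_per_100kbp(sequence):
    """Number of N characters per 100,000 bases of the given sequence."""
    if not sequence:
        raise ValueError("sequence is empty")
    n_count = sequence.upper().count("N")
    return n_count * 100_000 / len(sequence)


# Toy example: 5 uncalled bases in a 1,000-base sequence.
print(ns_per_100kbp("A" * 995 + "N" * 5))
```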

agmcarthur commented 4 years ago

Oops, for item 2 I meant "Frameshifts in SARS-CoV-2 open reading frames"
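A frameshift check along those lines could be sketched as below: flag any indel whose length is not a multiple of 3 and whose reference position falls inside an ORF. The ORF coordinates are an assumption taken from the MN908947.3 annotation (1-based, inclusive, a subset of the annotated ORFs) and should be verified against whatever reference the pipeline uses; `frameshift_indels` is a hypothetical helper.

```python
# Assumed ORF coordinates (1-based, inclusive) from the MN908947.3 annotation.
# Verify against the reference actually used before relying on these.
ORFS = {
    "ORF1ab": (266, 21555),
    "S": (21563, 25384),
    "ORF3a": (25393, 26220),
    "E": (26245, 26472),
    "M": (26523, 27191),
    "ORF6": (27202, 27387),
    "ORF7a": (27394, 27759),
    "ORF8": (27894, 28259),
    "N": (28274, 29533),
    "ORF10": (29558, 29674),
}


def frameshift_indels(indels, orfs=ORFS):
    """Return (orf_name, refpos, length) for indels likely to shift the frame.

    `indels` is an iterable of (refpos, length) pairs. Each indel is judged
    on its own; nearby indels that together restore the frame are not merged.
    """
    hits = []
    for refpos, length in indels:
        if length % 3 == 0:
            continue  # in-frame insertion or deletion
        for name, (start, end) in orfs.items():
            if start <= refpos <= end:
                hits.append((name, refpos, length))
    return hits


print(frameshift_indels([(687, 5), (2264, 9), (27597, 5), (100, 1)]))
```

Because each indel is evaluated independently, this is stricter than a net-frame check; a report could additionally sum indel lengths per ORF if compensating indels should be tolerated.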