Closed agmcarthur closed 4 years ago
Showing 38 of 40 genomes submitted to GISIAD (2 lack Ct values). All had >90% genome fraction, >90% positions with at least 100x coverage. QUAST gives mismatch, Ns, indels statistics and I should have filtered based on those. More importantly, average coverage was more than 2000x for all but one genome, so why is our pipeline creating indels?
ivar_variants.tsv files
Possible reasons for this result:
Molecular biology - these are real coverage gaps via ARTIC amplification or Illumina sequencing.
Ambiguity error - there is sequencing evidence for a nucleotide at these positions, but a base call could not be made, so these should have been labeled as N, not as a gap.
Threshold error - a setting in our iVar pipeline is too stringent, we should have been able to call a base at these positions.
These data as well as our latest 300+ isolate run (with lower MiSeq coverage) should be good data sets to evaluation these options. I'm hoping its the last one.
Update (examining S330):
Report of indels can be seen in QUAST report of the original run (dated May 8, 2020, using an older version of the pipeline).
However re-running it with updated pipeline (commit 550ad9b
) shows no indels. Possibly due to alterations in pre-processing leading to improved consensus generation:
The differences between the pipeline ~8th of May and 550ad9b
in order of my guess at the most likely culprit for this indel issue:
hisat2
for read mapping instead of bwa-mem
cutadapt
to remove amplicon primer sequences instead of ivar trim
with a .bed
filetrimmomatic
to trim reads and remove adapters instead of trim_galore
I suspected as much. The remaining samples above are going to be re-run, so we'll see if this finding is consistent (ultimately re-run all of them to refresh the data).
I re-ran all samples using commit 550ad9b
, ivar version 1.2.2 and 3 samples had indels as follows:
My request:
To the report text file add:
Please also add all of these to the report html and summary html.
We should probably add a QC warning for N's per 100 kbp, but I'm not sure what value to use. Does GISAID give any guidance?
Oops, for item 2 I meant "Frameshifts in SARS-CoV-2 open reading frames"
We submitted 40 sequences that passed SIGNAL metrics, but 15 were rejected due to frameshifts. We should add this level of reporting/QC to ensure sequences are ready for GISAID.