Why is WH02 so bad? Not bad sequence with samtools consensus

niemasd / PRJCA008874-Analysis

Analysis of SARS-CoV-2 FASTQs from PRJCA008874

GNU General Public License v3.0

4 stars 0 forks source link

Why is WH02 so bad? Not bad sequence with samtools consensus #1

Closed babarlelephant closed 2 years ago

babarlelephant commented 2 years ago

Hi, this is what I obtained by downloading your merged bams and doing

samtools consensus -m bayesian --use-qual --call-fract 0.75 -a --min-depth 10 bamfile.bam -o seq.fasta --show-del yes --show-ins no

Any guess on why it gives such a different result for WH02? Your consensus sequence has 27512 N

The fasta I got: https://wtools.io/paste-code/bC6N

mutations wrt Wuhan-Hu-1

WH01
  N :60
  - :4
WH02
  A1381G G3135C A5202T G7500A T7826A A9121T T10061C C11006A T11503A T14878G G15907C G18506A G18523C C19374T C22109G A24339T G24932C C24937T
  N :949
  - :62
WH03
  N :21
  - :1
WH04
  C8782T T28144C
  N :17
  - :0
WH19001
  A12T G20670A G20679A A29638T T29655C G29705T C29743T
  N :1120
  - :0
WH19004
  N :96
  - :0
WH19008
  A12T G20670A G20679A A24325G
  N :702
  - :0
YS8011
  N :41
  - :0

babarlelephant commented 2 years ago

For the coverage plot that I got quite different to yours

niemasd commented 2 years ago

Pulling up the base counts for that sample and visualizing them in SamBamViz, that region is extremely low-coverage (ranging between 2 to 4 reads per site):

visualization

Note that the SamBamViz visualization uses 0-based indexing, not 1-based

niemasd commented 2 years ago

Regarding the differences in coverage plots:

(1) Are you using the trimmed BAM rather than the raw BAM? Your Pastebin says "niemasd merged BAM then samtools consensus", which makes me think that you're using the untrimmed merged BAMs rather than the trimmed BAMs. You should be using the trimmed BAMs.

(2) I'm using a minimum base quality score of 20 when I call SamBamViz to generate the plot, but you're not filtering by quality score as far as I can tell. In general, low-quality base calls need to be ignored in all steps of the analysis (I use a threshold of 20 as per the iVar recommendations)

babarlelephant commented 2 years ago

Yes I'm using the merged untrimmed bams, which gives a reasonable coverage and sequence, you are perfectly right that filtering for quality needs to be done, but your filtering step may be too stringent at least for WH02.

Thank you!

niemasd commented 2 years ago

but your filtering step may be too stringent at least for WH02

I'm not sure what you mean by this. A base quality score of 20 means 99% accuracy, and trimming/ignoring base calls with quality <20 is standard not only in viral genomics, but in most (all?) sequencing contexts:

https://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/quality-control/tutorial.html

If the coverage drops below a reasonable level after using a base call quality threshold of 20, "This might be too strict" is not the correct interpretation: the correct interpretation is "This sample has too few high-quality base calls covering these positions" (which is precisely what N in the consensus sequence denotes).