This repository contains a (re-)analysis of SARS-CoV-2 FASTQs from PRJCA008874 (Lu et al., The Lancet 2020) using the SARS-CoV-2 analysis pipeline developed for UC San Diego's Return to Learn program (C-VIEW).

0: Input Data

The FASTQ files from PRJCA008874 can be found here:

HTTP: https://download.cncb.ac.cn/gsa/CRA006587
FTP: ftp://download.big.ac.cn/gsa/CRA006587
Metadata: CRA006587.xlsx (Markdown)

Here's a mapping of BioSample to various relevant IDs for convenience:

Table: BioSamples

BioSample	Description	Experiment Accession	Run Accession(s)
SAMC703641	WH19001 and WH19005	CRX398523	CRR456568
SAMC703642	WH19002	CRX398524	CRR456569
SAMC703643	WH19004	CRX398525	CRR456570
SAMC703644	WH19008	CRX398526	CRR456571
SAMC703645	YS8011	CRX398527	CRR456572, CRR456573
SAMC703646	WH01	CRX398528	CRR456574, CRR456575, CRR456576, CRR456577, CRR456578, CRR456579, CRR456580, CRR456581, CRR456582, CRR456583, CRR456584, CRR456585, CRR456586, CRR456587
SAMC703647	WH02	CRX398529	CRR456588, CRR456589, CRR456590, CRR456591, CRR456592, CRR456593, CRR456594, CRR456595, CRR456596, CRR456597
SAMC703648	WH03	CRX398530	CRR456598, CRR456599, CRR456600, CRR456601
SAMC703649	WH04	CRX398531	CRR456602, CRR456603, CRR456604, CRR456605

1: Mapping Reads + Sorting BAM

I'm using Minimap2 v2.24 to map reads against the NC_045512.2 reference genome, piped to samtools v1.14 to sort and output as BAM.

1.1: Index Reference Genome

minimap2 -t 2 -d NC_045512.2.fas.mmi NC_045512.2.fas > NC_045512.2.fas.mmi.log 2>&1

1.2: Map Reads

Some of the FASTQ files are quite huge, so rather than downloading them locally, I will map reads on-the-fly while downloading. For convenience, I have created a CSV file containing the FASTQ FTP paths (fastq_list.csv). The individual Minimap2 command is as follows:

minimap2 -t THREADS -a -x sr REF.MMI READ1.FASTQ.GZ READ2.FASTQ.GZ | samtools sort --threads THREADS -o SORTED.BAM

-a means "output in the SAM format"
-x sr means "use Minimap2's genomic short-read mapping preset"

However, I'll use named pipes for the FASTQ files to download + map on-the-fly, and I'll throw out unmapped reads (via Minimap2's --sam-hit-only flag):

minimap2 --sam-hit-only -t THREADS -a -x sr REF.MMI <(wget -qO- READ1.FASTQ.GZ.URL) <(wget -qO- READ2.FASTQ.GZ.URL) | samtools sort --threads THREADS -o SORTED.BAM

Putting everything together, here's the looped command for doing everything:

for f in $(cat fastq/fastq_list.csv) ; do crr=$(echo $f | cut -d'/' -f6) && url1=$(echo $f | cut -d',' -f1) && url2=$(echo $f | cut -d',' -f2) && minimap2 --sam-hit-only -t 1 -a -x sr ref/NC_045512.2.fas.mmi <(wget -qO- "$url1") <(wget -qO- "$url2") 2> "bam/$crr.01.sorted.untrimmed.bam.log" | samtools sort --threads 1 -o "bam/$crr.01.sorted.untrimmed.bam" ; done

The resulting BAM files can be found in the data/bam folder.

1.3: Merging BAMs

As can be seen in Table: BioSamples, some BioSamples have multiple FASTQ pairs (and thus BAMs). However, for any downstream analyses, I wanted to merge all of the BAMs corresponding to a single BioSample into a single BAM file. The samtools merge command merges multiple sorted BAMs into a single sorted output BAM.

samtools merge -o OUTPUT.BAM INPUT1.BAM INPUT2.BAM ...

The resulting merged BAM files can be found in the data/merged_bam folder.

Note: A preprint was released in June 2022 in which the authors called consensus sequences (and performed downstream analyses on the consensus sequences, e.g. phylogenetic inference, tree dating) using the untrimmed BAMs from this repo. This is a technical Bioinformatics mistake: reads need to be trimmed in order to remove low-quality bases (which we will do in Step 2) prior to any downstream analyses. Trimming is an important Quality Control (QC) step in any sequence data analysis, something I explained to the authors. If you want to learn more about read trimming and why QC is so important in sequence data analysis, this Galaxy tutorial is an excellent resource. This Data Carpentry tutorial about assessing read quality as well as this subsequent tutorial about trimming and filtering are also excellent resources.

2: Trimming BAMs + Sorting Trimmed BAMs

I'm using iVar v1.3.1 to quality trim the mapped reads, and I'm using samtools v1.14 to sort the trimmed BAMs.

ivar trim -e -i SORTED_UNTRIMMED.BAM -p UNSORTED_TRIMMED_PREFIX
samtools sort --threads 1 -o SORTED_TRIMMED.BAM UNSORTED_TRIMMED.BAM

-e means "include reads without primers" (we're not trimming primers, as there were no primers to trim)

The resulting trimmed BAM files can be found in the data/trimmed_bam folder.

3: Calculating Coverage + Stats

I'm using SamBamViz v0.0.4 to compute various statistics and produce various visualizations from the trimmed BAMs. I'm using a minimum base quality score of 20 (-q 20) to match standard consensus calling approaches such as iVar Consensus.

SamBamViz.py -i TRIMMED_SORTED.BAM -o OUT_DIR -q 20 --ylog

The resulting SamBamViz output files can be found in the data/sambamviz folder.

4: Creating Pile-Ups

I'm using samtools v1.14 to create pile-up files from the trimmed BAMs. Because iVar Variants and Consensus read the pile-up from standard input anyways, I gzip the pile-up files to save space.

samtools mpileup -B -A -aa -d 0 -Q 0 --reference REF_GENOME.FASTA TRIMMED_SORTED.BAM | gzip -9 > PILEUP.TXT.GZ

-B means "disable BAQ (per-base alignment quality)"
-A means "count orphans (do not discount anomalous read pairs)"
-aa means "output absolutely all positions, including unused reference sequences"
-d 0 means "don't cap the per-file depth"
-Q 0 means "skip bases with less than 0 base quality (i.e., don't skip any bases based on base quality)"

The resulting pile-up files can be found in the data/pileup folder.

5: Calling Variants

I'm using iVar v1.3.1 to call variants.

zcat PILEUP.TXT.GZ | ivar variants -m 10 -r REF_GENOME.FASTA -g REF_GENOME.GFF -p VARIANTS.TSV

-m 10 means "minimum read depth to call a variant = 10 reads"

The resulting variant TSV files can be found in the data/variants folder.

6: Calling Consensus Sequences

I'm using iVar v1.3.1 to call consensus sequences.

zcat PILEUP.TXT.GZ | ivar consensus -m 10 -n N -t 0.5 -p OUT_PREFIX

-m 10 means "minimum read depth to call a base in the consensus sequence = 10 reads"
-t 0.5 means "a single base must show up in at least 50% of the reads covering a given position to show up in the consensus"
-n N means "use the letter N to denote ambiguous positions (i.e., positions in which we were unable to call a base)"
- To clarify, a position in the consensus sequence can be N if (1) it had a coverage of less than 10, or (2) it had a coverage of greater than 10 but none of the 4 possible bases had a frequency greater than or equal to 0.5

The resulting consensus sequences can be found in the data/consensus folder.

Miscellaneous Comments

Low Coverage After Trimming

As mentioned at the beginning of this document, this analysis was conducted using the C-VIEW pipeline developed for UC San Diego's Return to Learn program, which uses iVar Trim to perform trimming. If you suspect that iVar Trim might be trimming the reads too aggressively, rather than skipping quality trimming entirely (which is objectively a bad idea), I would recommend trying a different read trimmer.

One popular option is fastp. Note that fastp (and many other read trimmers) trim FASTQ files before read mapping (whereas iVar Trim trims BAM files after read mapping). As such, if you want to try out a read trimmer that trims FASTQ files (rather than BAM files), you will need to convert the merged untrimmed BAM files back into FASTQ files, e.g. using bamtofastq. The FASTQ files you will get from using bamtofastq on the merged untrimmed BAM files will be much smaller than the original FASTQ files because the converted FASTQ files will only contain mapped reads (because the BAM files only contain mapped reads).

After you have converted the untrimmed BAM files into FASTQ files and then trimmed the FASTQ files using fastp (or whichever read trimmer you choose), you will need to remap the trimmed FASTQ files to the reference genome (as described in Step 1; this should be quite fast), which will result in trimmed BAM files. You will then be able to continue the rest of the analysis starting at Step 3.

Supplement

I have added some extra supplementary exploratory files for samples of interest in the supplement folder.

niemasd / PRJCA008874-Analysis

readme