niemasd / PRJCA008874-Analysis

Analysis of SARS-CoV-2 FASTQs from PRJCA008874
GNU General Public License v3.0
4 stars 0 forks source link

This repository contains a (re-)analysis of SARS-CoV-2 FASTQs from PRJCA008874 (Lu et al., The Lancet 2020) using the SARS-CoV-2 analysis pipeline developed for UC San Diego's Return to Learn program (C-VIEW).

0: Input Data

The FASTQ files from PRJCA008874 can be found here:

Here's a mapping of BioSample to various relevant IDs for convenience:

Table: BioSamples

BioSample Description Experiment Accession Run Accession(s)
SAMC703641 WH19001 and WH19005 CRX398523 CRR456568
SAMC703642 WH19002 CRX398524 CRR456569
SAMC703643 WH19004 CRX398525 CRR456570
SAMC703644 WH19008 CRX398526 CRR456571
SAMC703645 YS8011 CRX398527 CRR456572, CRR456573
SAMC703646 WH01 CRX398528 CRR456574, CRR456575, CRR456576, CRR456577, CRR456578, CRR456579, CRR456580, CRR456581, CRR456582, CRR456583, CRR456584, CRR456585, CRR456586, CRR456587
SAMC703647 WH02 CRX398529 CRR456588, CRR456589, CRR456590, CRR456591, CRR456592, CRR456593, CRR456594, CRR456595, CRR456596, CRR456597
SAMC703648 WH03 CRX398530 CRR456598, CRR456599, CRR456600, CRR456601
SAMC703649 WH04 CRX398531 CRR456602, CRR456603, CRR456604, CRR456605

1: Mapping Reads + Sorting BAM

I'm using Minimap2 v2.24 to map reads against the NC_045512.2 reference genome, piped to samtools v1.14 to sort and output as BAM.

1.1: Index Reference Genome

minimap2 -t 2 -d NC_045512.2.fas.mmi NC_045512.2.fas > NC_045512.2.fas.mmi.log 2>&1

1.2: Map Reads

Some of the FASTQ files are quite huge, so rather than downloading them locally, I will map reads on-the-fly while downloading. For convenience, I have created a CSV file containing the FASTQ FTP paths (fastq_list.csv). The individual Minimap2 command is as follows:

minimap2 -t THREADS -a -x sr REF.MMI READ1.FASTQ.GZ READ2.FASTQ.GZ | samtools sort --threads THREADS -o SORTED.BAM

However, I'll use named pipes for the FASTQ files to download + map on-the-fly, and I'll throw out unmapped reads (via Minimap2's --sam-hit-only flag):

minimap2 --sam-hit-only -t THREADS -a -x sr REF.MMI <(wget -qO- READ1.FASTQ.GZ.URL) <(wget -qO- READ2.FASTQ.GZ.URL) | samtools sort --threads THREADS -o SORTED.BAM

Putting everything together, here's the looped command for doing everything:

for f in $(cat fastq/fastq_list.csv) ; do crr=$(echo $f | cut -d'/' -f6) && url1=$(echo $f | cut -d',' -f1) && url2=$(echo $f | cut -d',' -f2) && minimap2 --sam-hit-only -t 1 -a -x sr ref/NC_045512.2.fas.mmi <(wget -qO- "$url1") <(wget -qO- "$url2") 2> "bam/$crr.01.sorted.untrimmed.bam.log" | samtools sort --threads 1 -o "bam/$crr.01.sorted.untrimmed.bam" ; done

The resulting BAM files can be found in the data/bam folder.

1.3: Merging BAMs

As can be seen in Table: BioSamples, some BioSamples have multiple FASTQ pairs (and thus BAMs). However, for any downstream analyses, I wanted to merge all of the BAMs corresponding to a single BioSample into a single BAM file. The samtools merge command merges multiple sorted BAMs into a single sorted output BAM.

samtools merge -o OUTPUT.BAM INPUT1.BAM INPUT2.BAM ...

The resulting merged BAM files can be found in the data/merged_bam folder.

Note: A preprint was released in June 2022 in which the authors called consensus sequences (and performed downstream analyses on the consensus sequences, e.g. phylogenetic inference, tree dating) using the untrimmed BAMs from this repo. This is a technical Bioinformatics mistake: reads need to be trimmed in order to remove low-quality bases (which we will do in Step 2) prior to any downstream analyses. Trimming is an important Quality Control (QC) step in any sequence data analysis, something I explained to the authors. If you want to learn more about read trimming and why QC is so important in sequence data analysis, this Galaxy tutorial is an excellent resource. This Data Carpentry tutorial about assessing read quality as well as this subsequent tutorial about trimming and filtering are also excellent resources.

2: Trimming BAMs + Sorting Trimmed BAMs

I'm using iVar v1.3.1 to quality trim the mapped reads, and I'm using samtools v1.14 to sort the trimmed BAMs.

ivar trim -e -i SORTED_UNTRIMMED.BAM -p UNSORTED_TRIMMED_PREFIX
samtools sort --threads 1 -o SORTED_TRIMMED.BAM UNSORTED_TRIMMED.BAM

The resulting trimmed BAM files can be found in the data/trimmed_bam folder.

3: Calculating Coverage + Stats

I'm using SamBamViz v0.0.4 to compute various statistics and produce various visualizations from the trimmed BAMs. I'm using a minimum base quality score of 20 (-q 20) to match standard consensus calling approaches such as iVar Consensus.

SamBamViz.py -i TRIMMED_SORTED.BAM -o OUT_DIR -q 20 --ylog

The resulting SamBamViz output files can be found in the data/sambamviz folder.

4: Creating Pile-Ups

I'm using samtools v1.14 to create pile-up files from the trimmed BAMs. Because iVar Variants and Consensus read the pile-up from standard input anyways, I gzip the pile-up files to save space.

samtools mpileup -B -A -aa -d 0 -Q 0 --reference REF_GENOME.FASTA TRIMMED_SORTED.BAM | gzip -9 > PILEUP.TXT.GZ

The resulting pile-up files can be found in the data/pileup folder.

5: Calling Variants

I'm using iVar v1.3.1 to call variants.

zcat PILEUP.TXT.GZ | ivar variants -m 10 -r REF_GENOME.FASTA -g REF_GENOME.GFF -p VARIANTS.TSV

The resulting variant TSV files can be found in the data/variants folder.

6: Calling Consensus Sequences

I'm using iVar v1.3.1 to call consensus sequences.

zcat PILEUP.TXT.GZ | ivar consensus -m 10 -n N -t 0.5 -p OUT_PREFIX

The resulting consensus sequences can be found in the data/consensus folder.

Miscellaneous Comments

Low Coverage After Trimming

As mentioned at the beginning of this document, this analysis was conducted using the C-VIEW pipeline developed for UC San Diego's Return to Learn program, which uses iVar Trim to perform trimming. If you suspect that iVar Trim might be trimming the reads too aggressively, rather than skipping quality trimming entirely (which is objectively a bad idea), I would recommend trying a different read trimmer.

One popular option is fastp. Note that fastp (and many other read trimmers) trim FASTQ files before read mapping (whereas iVar Trim trims BAM files after read mapping). As such, if you want to try out a read trimmer that trims FASTQ files (rather than BAM files), you will need to convert the merged untrimmed BAM files back into FASTQ files, e.g. using bamtofastq. The FASTQ files you will get from using bamtofastq on the merged untrimmed BAM files will be much smaller than the original FASTQ files because the converted FASTQ files will only contain mapped reads (because the BAM files only contain mapped reads).

After you have converted the untrimmed BAM files into FASTQ files and then trimmed the FASTQ files using fastp (or whichever read trimmer you choose), you will need to remap the trimmed FASTQ files to the reference genome (as described in Step 1; this should be quite fast), which will result in trimmed BAM files. You will then be able to continue the rest of the analysis starting at Step 3.

Supplement

I have added some extra supplementary exploratory files for samples of interest in the supplement folder.