ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)
0 stars 0 forks source link

Burrows-Wheeler-Aligner (BWA): Host Decontamination (MERGER ISSUE RESOLVED) #29

Open ndreey opened 6 months ago

ndreey commented 6 months ago

BWA Short Reads

Bowtie2 is fast and has similar results to BWA but is generally regarded to have higher accuracy.

BWA modes

From this guide Depending on read length, BWA has different modes optimized for different sequence lengths:

Hence, we will be using BWA-MEM

Filtering the reference genome

First we will remove the contigs with less than 50Kb using bbmap's script reformat.sh. This is the stats for the unfiltered masked reference Tconura_ref.fasta

stats.sh data/Tconura_reference_genome/Tconura_ref.fasta

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3198  0.1802  0.1803  0.3196  0.0134  0.0000  0.0000  0.3606  0.0295

Main genome scaffold total:             2563
Main genome contig total:               640009
Main genome scaffold sequence total:    1946.195 MB
Main genome contig sequence total:      1920.204 MB     1.335% gap
Main genome scaffold N/L50:             319/1.742 MB
Main genome contig N/L50:               89532/6.455 KB
Main genome scaffold N/L90:             1189/398.914 KB
Main genome contig N/L90:               302349/1.761 KB
Max scaffold length:                    8.968 MB
Max contig length:                      221.479 KB
Number of scaffolds > 50 KB:            2231
% main genome in scaffolds > 50 KB:     99.41%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                  2,563         640,009   1,946,194,716   1,920,204,323    98.66%
   5 KB                  2,563         640,009   1,946,194,716   1,920,204,323    98.66%
  10 KB                  2,562         640,006   1,946,187,472   1,920,197,421    98.66%
  25 KB                  2,515         639,525   1,945,239,894   1,919,272,300    98.67%
  50 KB                  2,231         628,381   1,934,714,323   1,909,285,595    98.69%
 100 KB                  1,920         612,641   1,912,895,667   1,888,172,579    98.71%
 250 KB                  1,452         572,562   1,835,768,641   1,812,795,694    98.75%
 500 KB                  1,051         514,160   1,690,623,403   1,670,101,324    98.79%
   1 MB                    624         411,623   1,378,771,933   1,362,408,122    98.81%
 2.5 MB                    179         200,851     683,668,414     675,737,218    98.84%
   5 MB                     31          57,855     193,639,010     191,346,369    98.82%

Now after calling these commands:

# Load the module
module load bbmap/39.06

# Path to folder
ref_dir="data/Tconura_reference_genome"

# Filter on contig length.
reformat.sh in=${ref_dir}/Tconura_ref.fasta out=${ref_dir}/Tconura_ref-filtered.fasta fastaminlen=50000
stats.sh data/Tconura_reference_genome/Tconura_ref-filtered.fasta
A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3198  0.1803  0.1804  0.3195  0.0131  0.0000  0.0000  0.3607  0.0240

Main genome scaffold total:             2231
Main genome contig total:               628381
Main genome scaffold sequence total:    1934.714 MB
Main genome contig sequence total:      1909.286 MB     1.314% gap
Main genome scaffold N/L50:             316/1.747 MB
Main genome contig N/L50:               88847/6.474 KB
Main genome scaffold N/L90:             1164/409.843 KB
Main genome contig N/L90:               299451/1.775 KB
Max scaffold length:                    8.968 MB
Max contig length:                      221.479 KB
Number of scaffolds > 50 KB:            2231
% main genome in scaffolds > 50 KB:     100.00%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                  2,231         628,381   1,934,714,323   1,909,285,595    98.69%
  50 KB                  2,231         628,381   1,934,714,323   1,909,285,595    98.69%
 100 KB                  1,920         612,641   1,912,895,667   1,888,172,579    98.71%
 250 KB                  1,452         572,562   1,835,768,641   1,812,795,694    98.75%
 500 KB                  1,051         514,160   1,690,623,403   1,670,101,324    98.79%
   1 MB                    624         411,623   1,378,771,933   1,362,408,122    98.81%
 2.5 MB                    179         200,851     683,668,414     675,737,218    98.84%
   5 MB                     31          57,855     193,639,010     191,346,369    98.82%

We now only have scaffolds with a min length of 50Kb! It took less than 5 seconds with 8 cores.

Indexing the reference genome

The job took 46min and used 10GB of ram.

bwa-index-genome.sh

#!/bin/bash

#SBATCH --job-name bwa-index
#SBATCH -A naiss2023-22-412
#SBATCH -p core -n 8
#SBATCH -t 08:35:00
#SBATCH --output=SLURM-%j-bwa-index.out
#SBATCH --error=SLURM-%j-bwa-index.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load bwa/0.7.17

cd data/Tconura_reference_genome/

# Index genome
bwa index -p Tconura_ref-filtered Tconura_ref-filtered.fasta

# End time and date
echo "$(date)       [End]"

Aligning and decontaminating.

We will continue using bwa-mem for aligning and then use samtools to filter out the unmapped pair-end reads and finally use BEDTools function bamtofastq to generate the fastq files.

The script below will start 304 jobs running maximum of 40 in parallel by utilizing trimmed_fastq.txt which is a simple two column with the R1 and R2 trimmed files. It will align and generate three .bam and two .fastq.gz files.

#!/bin/bash

#SBATCH --job-name bwa-decontamination
#SBATCH -A naiss2023-22-412
#SBATCH --array=1-304%40
#SBATCH -p core -n 8
#SBATCH -t 04:30:00
#SBATCH --output=slurm-logs/decontamination/SLURM-%j-bwa-decon-%a.out
#SBATCH --error=slurm-logs/decontamination/SLURM-%j-bwa-decon-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load bwa/0.7.17
module load samtools/1.19
module load BEDTools/2.31.1

# Path to trimmed reads and reference database of Tconura.
TRIM_DIR="02-TRIM"
REF="data/Tconura_reference_genome/Tconura_ref-filtered"

# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}

# Read the fastq files this task will work with.
FASTQ_FILES=$(sed -n "${JOBID}p" doc/trimmed_fastq.txt)

# Get the sample id, R1, R2 and lane id.
SAMPLE="$(echo "$FASTQ_FILES" | cut -d "_" -f 1,2)"
R1="$(echo "$FASTQ_FILES" | cut -f 1)"
R2="$(echo "$FASTQ_FILES" | cut -f 2)"
LANE_ID="${R1%%_R1*}"

# Get the population the sample belongs to
while IFS="," read SAMPLE_COMP POP HP REGION RANGE; do
    if [ "$SAMPLE" == "$SAMPLE_COMP" ]; then
        OUT_POP="$POP"
    fi
done < "doc/metadata-no-hybrids.csv"

# Info text
echo "$(date)   Processing: $LANE_ID"
echo "$(date)   Population: $OUT_POP"

# Create directory for sample specific bam files
BAM_DIR="03-HOST-ALIGNMENT-BAM/${OUT_POP}/${LANE_ID}"
if [ ! -d "${BAM_DIR}" ]; then
    mkdir -p "${BAM_DIR}"
fi

# Run bwa-mem algorithm
bwa mem -t 8 $REF ${TRIM_DIR}/${R1} ${TRIM_DIR}/${R2} > ${BAM_DIR}/${LANE_ID}.sam
echo "$(date)   BWA Alignment finished"

# Convert .sam to .bam
samtools view -b ${BAM_DIR}/${LANE_ID}.sam > ${BAM_DIR}/${LANE_ID}.bam

# Generate a bam with all unmapped reads.
samtools view -@ 8 -b -f 4 ${BAM_DIR}/${LANE_ID}.bam \
    > ${BAM_DIR}/${LANE_ID}-unmapped-reads.bam

# Subset the pairs that did not align to reference
samtools view -@ 8 -b -f 12 ${BAM_DIR}/${LANE_ID}.bam \
    > ${BAM_DIR}/${LANE_ID}-unmapped-pairs.bam

# Sort by name so BEDtools can convert to fastq.
samtools sort -n -@ 8 \
    -o ${BAM_DIR}/${LANE_ID}-sorted-unmapped-pairs.bam \
    ${BAM_DIR}/${LANE_ID}-unmapped-pairs.bam
echo "$(date)   SAMtools complete"

# Create directory for decontaminated reads in population
CLEAN_DIR="04-CLEAN-FASTQ/${OUT_POP}"
if [ ! -d "$CLEAN_DIR" ]; then
    mkdir -p "$CLEAN_DIR"
fi

# Generate R1 and R2 clean fastq names.
R1_OUT=${R1%%_001*}
R2_OUT=${R2%%_001*}

# Convert the bam file to R1 and R2 fastq files
bedtools bamtofastq \
    -i ${BAM_DIR}/${LANE_ID}-sorted-unmapped-pairs.bam \
    -fq ${CLEAN_DIR}/${R1_OUT}-clean.fastq \
    -fq2 ${CLEAN_DIR}/${R2_OUT}-clean.fastq
echo "$(date)   bamtofastq complete"

# Compress the fastq files
gzip ${CLEAN_DIR}/${R1_OUT}-clean.fastq
gzip ${CLEAN_DIR}/${R2_OUT}-clean.fastq
echo "$(date)   clean-fastq compressed"

# End time and date
echo "$(date)       [End]"
ndreey commented 6 months ago

minimap2 Long Reads

minimap2 is a fast long read aligner.

minimap2 index

We first run this script to index (minimize) the Tconura_ref-filtered.fasta to generate a Tconura_ref-filtered.mmi file. By indexing first we will save computation time when we align. Importantly that we use the -x map-hifi flag to specify the alignment presets for pac-bio hifi reads.

#!/bin/bash

#SBATCH --job-name minimap2-index
#SBATCH -A naiss2023-22-412
#SBATCH -p core -n 4 
#SBATCH -t 01:35:00
#SBATCH --output=slurm-logs/decontamination/SLURM-%j-minimap2-index.out
#SBATCH --error=slurm-logs/decontamination/SLURM-%j-minimap2-index.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load minimap2/2.26-r1175

# Move to reference directory
cd data/Tconura_reference_genome/

# Index genome
minimap2 -t 4 -x map-hifi -d Tconura_ref-filtered.mmi Tconura_ref-filtered.fasta

# End time and date
echo "$(date)       [End]"

Aligning the pacbio-hifi reads

This script will start three jobs in parallel aligning the long reads to the filtered reference genome and generate:

#!/bin/bash

#SBATCH --job-name minimap2-decontamination
#SBATCH -A naiss2023-22-412
#SBATCH --array=1-3
#SBATCH -p node -n 1
#SBATCH -t 06:00:00
#SBATCH --output=slurm-logs/decontamination/SLURM-%j-minimap2-decon-hifi.out
#SBATCH --error=slurm-logs/decontamination/SLURM-%j-minimap2-decon-hifi.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load samtools/1.19
module load minimap2/2.26-r1175
module load BEDTools/2.31.1

# Path to the hifi pacbio raw data
RAW="/crex/proj/snic2020-6-222/Projects/Tconura/data/reference/hifiasm_Assemb2020_pt_042/pt_042/ccsreads/pt_042_001"

# Path to trimmed reads and reference database of Tconura.
REF="data/Tconura_reference_genome/Tconura_ref-filtered"

# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}

# Read the fastq files this task will work with.
READ=$(sed -n "${JOBID}p" doc/pt_042_hifi-pacbio.txt)

# Get the sample id
SAMPLE="${READ%%.*}"

# Info text
echo "$(date)   Processing: $SAMPLE"

# Create directory for sample specific bam files
BAM_DIR="03-HOST-ALIGNMENT-BAM/hifi-pacbio/${SAMPLE}"
if [ ! -d "${BAM_DIR}" ]; then
    mkdir -p "${BAM_DIR}"
fi

# Run minimap2 alignment.
minimap2 -a -x map-hifi -t 16 Tconura_ref-filtered.mmi $READ | \
    samtools sort - -@ 16 -o ${BAM_DIR}/${SAMPLE}.bam
echo "$(date)   minimap2 alignment complete"

# Generate a bam with all unmapped reads.
samtools view -@ 16 -b -f 4 ${BAM_DIR}/${SAMPLE}.bam \
    > ${BAM_DIR}/${SAMPLE}-unmapped-reads.bam
echo "$(date)   samtools filtering complete"

# Create directory for decontaminated clean fastq files
CLEAN_DIR="04-CLEAN-FASTQ/hifi-pacbio"
if [ ! -d "$CLEAN_DIR" ]; then
    mkdir -p "$CLEAN_DIR"
fi

# Convert the bam file to fastq
bedtools bamtofastq \
    -i  ${BAM_DIR}/${SAMPLE}-unmapped-reads.bam\
    -fq ${CLEAN_DIR}/${SAMPLE}-clean.fastq 
echo "$(date)   bamtofastq complete"

# Compress the fastq files
gzip ${CLEAN_DIR}/${SAMPLE}-clean.fastq
echo "$(date)   clean-fastq compressed"

# End time and date
echo "$(date)       [End]"