Open ndreey opened 6 months ago
minimap2 is a fast long-read aligner.
We first run this script to index (minimize) Tconura_ref-filtered.fasta and generate a Tconura_ref-filtered.mmi file. Indexing first saves computation time when we align. Importantly, we use the -x map-hifi flag to specify the alignment presets for PacBio HiFi reads.
#!/bin/bash
#SBATCH --job-name minimap2-index
#SBATCH -A naiss2023-22-412
#SBATCH -p core -n 4
#SBATCH -t 01:35:00
#SBATCH --output=slurm-logs/decontamination/SLURM-%j-minimap2-index.out
#SBATCH --error=slurm-logs/decontamination/SLURM-%j-minimap2-index.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load minimap2/2.26-r1175
# Move to reference directory
cd data/Tconura_reference_genome/
# Index genome
minimap2 -t 4 -x map-hifi -d Tconura_ref-filtered.mmi Tconura_ref-filtered.fasta
# End time and date
echo "$(date) [End]"
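Since the index only needs to be built once, a wrapper can skip the indexing job when an up-to-date .mmi already exists. The following is a minimal sketch and not part of the original scripts: the helper name index_is_stale is hypothetical, and it only compares file timestamps, so it will not notice an index built with different presets.

```shell
#!/bin/bash
# Hypothetical helper (not in the original scripts): returns 0 when the
# index is missing, empty, or older than the FASTA, i.e. when it should
# be rebuilt.
index_is_stale() {
    local fasta="$1" index="$2"
    [ ! -s "$index" ] || [ "$fasta" -nt "$index" ]
}

# Example usage with the file names from the indexing script above.
if index_is_stale Tconura_ref-filtered.fasta Tconura_ref-filtered.mmi; then
    echo "index missing or stale: run the minimap2 -d step"
else
    echo "index is up to date: skip indexing"
fi
```

The check is cheap enough to run at the top of the alignment job as a sanity test before any reads are processed.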
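This script starts three array tasks in parallel, each aligning one long-read file to the filtered reference genome and generating a sorted BAM, a BAM of the unmapped reads, and a gzipped FASTQ of those unmapped (decontaminated) reads: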
#!/bin/bash
#SBATCH --job-name minimap2-decontamination
#SBATCH -A naiss2023-22-412
#SBATCH --array=1-3
#SBATCH -p node -n 1
#SBATCH -t 06:00:00
#SBATCH --output=slurm-logs/decontamination/SLURM-%j-minimap2-decon-hifi.out
#SBATCH --error=slurm-logs/decontamination/SLURM-%j-minimap2-decon-hifi.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load samtools/1.19
module load minimap2/2.26-r1175
module load BEDTools/2.31.1
# Path to the hifi pacbio raw data
RAW="/crex/proj/snic2020-6-222/Projects/Tconura/data/reference/hifiasm_Assemb2020_pt_042/pt_042/ccsreads/pt_042_001"
# Path prefix (without extension) of the filtered Tconura reference and its minimap2 index.
REF="data/Tconura_reference_genome/Tconura_ref-filtered"
# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}
# Read the fastq files this task will work with.
READ=$(sed -n "${JOBID}p" doc/pt_042_hifi-pacbio.txt)
# Get the sample id
SAMPLE="${READ%%.*}"
# Info text
echo "$(date) Processing: $SAMPLE"
# Create directory for sample specific bam files
BAM_DIR="03-HOST-ALIGNMENT-BAM/hifi-pacbio/${SAMPLE}"
if [ ! -d "${BAM_DIR}" ]; then
    mkdir -p "${BAM_DIR}"
fi
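# Run minimap2 alignment against the prebuilt index.
# Assumes doc/pt_042_hifi-pacbio.txt lists file names relative to ${RAW}.
minimap2 -a -x map-hifi -t 16 "${REF}.mmi" "${RAW}/${READ}" | \
    samtools sort -@ 16 -o "${BAM_DIR}/${SAMPLE}.bam" -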
echo "$(date) minimap2 alignment complete"
# Generate a bam with all unmapped reads.
samtools view -@ 16 -b -f 4 ${BAM_DIR}/${SAMPLE}.bam \
> ${BAM_DIR}/${SAMPLE}-unmapped-reads.bam
echo "$(date) samtools filtering complete"
# Create directory for decontaminated clean fastq files
CLEAN_DIR="04-CLEAN-FASTQ/hifi-pacbio"
if [ ! -d "$CLEAN_DIR" ]; then
    mkdir -p "$CLEAN_DIR"
fi
# Convert the bam file to fastq
bedtools bamtofastq \
    -i "${BAM_DIR}/${SAMPLE}-unmapped-reads.bam" \
    -fq "${CLEAN_DIR}/${SAMPLE}-clean.fastq"
echo "$(date) bamtofastq complete"
# Compress the fastq files
gzip ${CLEAN_DIR}/${SAMPLE}-clean.fastq
echo "$(date) clean-fastq compressed"
# End time and date
echo "$(date) [End]"
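The way each array task picks its input can be tried outside SLURM. The sketch below reproduces the sed and ${READ%%.*} steps from the script above on a temporary list. The file names are hypothetical (the real doc/pt_042_hifi-pacbio.txt is not shown), and the function name sample_for_task is only for illustration.

```shell
#!/bin/bash
# Hypothetical read list standing in for doc/pt_042_hifi-pacbio.txt.
LIST=$(mktemp)
printf '%s\n' \
    "sampleA.ccsreads.fastq.gz" \
    "sampleB.ccsreads.fastq.gz" \
    "sampleC.ccsreads.fastq.gz" > "$LIST"

# Print the sample id for a given array task id (1-based line number).
sample_for_task() {
    local task_id="$1" list="$2" read_file
    read_file=$(sed -n "${task_id}p" "$list")
    # Fail if the task id points past the end of the list.
    [ -n "$read_file" ] || return 1
    # Strip everything from the first dot onward, as the script does.
    printf '%s\n' "${read_file%%.*}"
}

# SLURM sets SLURM_ARRAY_TASK_ID; default to 2 so the sketch runs anywhere.
sample_for_task "${SLURM_ARRAY_TASK_ID:-2}" "$LIST"
```

Note that ${READ%%.*} cuts at the first dot, so a file name with a dot inside the sample id would be truncated earlier than intended.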
BWA Short Reads
Bowtie2 is fast and gives results similar to BWA, but BWA is generally regarded as the more accurate of the two.
BWA modes
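From this guide: depending on read length, BWA has different modes optimized for different sequence lengths. Per the BWA manual, BWA-backtrack is designed for Illumina reads up to 100 bp, while BWA-SW and BWA-MEM target longer sequences from 70 bp to 1 Mbp, with BWA-MEM generally recommended for high-quality queries.
Hence, we will be using BWA-MEM.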
Filtering the reference genome
First we will remove the contigs shorter than 50 kb using bbmap's script reformat.sh. These are the stats for the unfiltered masked reference Tconura_ref.fasta:
Now after calling these commands:
We now only have scaffolds with a minimum length of 50 kb. It took less than 5 seconds with 8 cores.
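The length filter itself can be illustrated without bbmap. This is a plain awk sketch of the same idea, not reformat.sh and not the commands that were actually run (those are not preserved here). The function name filter_min_length and the demo FASTA are hypothetical. With BBTools, the equivalent is presumably reformat.sh with its minlength parameter set to 50000.

```shell
#!/bin/bash
# Keep only FASTA records whose sequence is at least min_len bases long.
# Multi-line records are handled by accumulating the sequence per header;
# kept sequences are written unwrapped on a single line.
filter_min_length() {
    local min_len="$1" fasta="$2"
    awk -v min="$min_len" '
        function emit() {
            if (header != "" && length(seq) >= min + 0) print header "\n" seq
        }
        /^>/ { emit(); header = $0; seq = ""; next }
        { seq = seq $0 }
        END { emit() }
    ' "$fasta"
}

# Tiny hypothetical FASTA: one 3 bp record and one 8 bp record.
DEMO=$(mktemp)
printf '>short\nACG\n>long\nACGT\nACGT\n' > "$DEMO"

# Keep records of at least 5 bp; for the real genome the cutoff would be 50000.
filter_min_length 5 "$DEMO"
```

A dedicated tool is still the better choice for a full genome, since it also reports statistics and handles compressed input.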
Indexing the reference genome
The job took 46 min and used 10 GB of RAM.
bwa-index-genome.sh
Aligning and decontaminating
We will continue using bwa-mem for aligning, then use samtools to extract the unmapped paired-end reads, and finally use the BEDTools function bamtofastq to generate the FASTQ files.
The script below will start 304 jobs, running a maximum of 40 in parallel, by utilizing trimmed_fastq.txt, which is a simple two-column file with the R1 and R2 trimmed files. It will align and generate three .bam and two .fastq.gz files.
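How a two-column list can drive a throttled array job is sketched below. The %40 suffix on --array is SLURM's way of capping the number of simultaneously running tasks. The list contents and the function name pair_for_task are hypothetical, and whitespace-separated columns are assumed since the real trimmed_fastq.txt is not shown.

```shell
#!/bin/bash
# 304 array tasks, at most 40 running at once.
#SBATCH --array=1-304%40

# Hypothetical stand-in for trimmed_fastq.txt (two columns: R1 and R2).
LIST=$(mktemp)
printf '%s\t%s\n' \
    "S1_R1-trim.fastq.gz" "S1_R2-trim.fastq.gz" \
    "S2_R1-trim.fastq.gz" "S2_R2-trim.fastq.gz" > "$LIST"

# Print "R1 R2" for the given task id (1-based line number).
# Returns non-zero if the line is absent or does not have exactly two columns.
pair_for_task() {
    local task_id="$1" list="$2"
    awk -v n="$task_id" '
        NR == n && NF == 2 { print $1, $2; found = 1 }
        END { exit !found }
    ' "$list"
}

# SLURM sets SLURM_ARRAY_TASK_ID; default to 1 so the sketch runs anywhere.
pair=$(pair_for_task "${SLURM_ARRAY_TASK_ID:-1}" "$LIST") || exit 1
R1=${pair% *}
R2=${pair#* }
echo "R1=$R1 R2=$R2"
```

Failing early on a malformed line keeps a task from aligning a read file against a missing mate.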