ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)

Quality Control: Generating FastQC and MultiQC reports #9

Open ndreey opened 2 months ago

ndreey commented 2 months ago

Quality Control

I generate doc/raw_samples.txt, which has sample_name, R1, and R2 as columns.

# Get R1 and R2
ls 00-RAW/ | tr " " "\n" | paste - - > doc/tmp.r1_r2.txt

# Get sample names from the R1 column
cut -f 1 doc/tmp.r1_r2.txt | cut -d "_" -f 1,2 > doc/tmp.sample_names.txt

# Paste them together
paste doc/tmp.sample_names.txt doc/tmp.r1_r2.txt > doc/raw_samples.txt

# Remove the tmp files
rm doc/tmp.r1_r2.txt doc/tmp.sample_names.txt 
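
For reference, doc/raw_samples.txt should then look something like this (a hypothetical excerpt; the R2 file names are assumed to mirror the R1 files used later in this thread):

>head -n 2 doc/raw_samples.txt
P12002_101	P12002_101_R1.fastq.gz	P12002_101_R2.fastq.gz
P12002_102	P12002_102_R1.fastq.gz	P12002_102_R2.fastq.gz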

I then prepare for the quality control results by setting up 01-QC with subdirectories (using brace expansion).

# Generates results folders for fastqc and multiqc for forward and reverse reads + for each sequence project.
mkdir -p 01-QC/{fastqc,multiqc}_raw_{R1,R2}_{P12002,P14052}

>tree -d 01-QC/
01-QC/
├── fastqc_raw_R1_P12002
├── fastqc_raw_R1_P14052
├── fastqc_raw_R2_P12002
├── fastqc_raw_R2_P14052
├── multiqc_raw_R1_P12002
├── multiqc_raw_R1_P14052
├── multiqc_raw_R2_P12002
└── multiqc_raw_R2_P14052

Now, to first try out SLURM on UPPMAX, we do a test run with this simple script.

#!/bin/bash

#SBATCH --job-name test_FastQC
#SBATCH -A naiss2024-5-1
#SBATCH -p core -n 4
#SBATCH -t 01:00:00
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

# Start time and date
echo "$(date)       Start"

# Load in modules
module load bioinfo-tools
module load FastQC/0.11.9

# P12002 forward reads
fastqc 00-RAW/P12002_101_R1.fastq.gz 00-RAW/P12002_102_R1.fastq.gz 00-RAW/P12002_103_R1.fastq.gz \
    -t 4 \
    --outdir 01-QC/fastqc_raw_R1_P12002

# End time and date
echo "$(date)       End"
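
To submit the test and keep an eye on it, the standard SLURM commands should do (the script path below is just a placeholder):

# Submit the test script and note the job ID
sbatch scripts/test_fastqc.sh

# Check my jobs in the queue
squeue -u $USER

# Follow the log once the job starts (job ID from sbatch/squeue)
tail -f slurm-<jobid>.out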
ndreey commented 2 months ago

Three read files totaling 8.9 GB took 8 min to process with 4 cores (6.4 GB RAM/core), giving us 1.11 GB/min, or rather 0.90 min/GB. Let's see if it goes quicker with a total of 8 cores.

andbou@rackham4: (qc) CONURA_WGS: du -h 00-RAW/P12002_{101,102,103}_R1.fastq.gz 
3.8G    00-RAW/P12002_101_R1.fastq.gz
2.7G    00-RAW/P12002_102_R1.fastq.gz
2.4G    00-RAW/P12002_103_R1.fastq.gz
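
Spelling out the arithmetic behind those rates:

$$\frac{3.8 + 2.7 + 2.4\ \text{GB}}{8\ \text{min}} = \frac{8.9\ \text{GB}}{8\ \text{min}} \approx 1.11\ \text{GB/min}, \qquad \frac{8\ \text{min}}{8.9\ \text{GB}} \approx 0.90\ \text{min/GB}$$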

$$\text{expected run time (h)} = \frac{\text{total GB} \times 0.90}{60}$$

| to_run | GB | expected run time (h) |
| --- | --- | --- |
| P12002_R1 | 344.1 | 5.2 |
| P12002_R2 | 366.9 | 5.5 |
| P14052_R1 | 74.2 | 1.1 |
| P14052_R2 | 75.3 | 1.1 |

>du -h 00-RAW/P12002*R1* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
344.1
>du -h 00-RAW/P12002*R2* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
366.9
>du -h 00-RAW/P14052*R1* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
74.2
> du -h 00-RAW/P14052*R2* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
75.3
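
As a small sketch, the sums above can be plugged straight into the formula, e.g. for P12002_R1 (assumes every du size is reported with a G suffix, as above):

# Sum the file sizes (GB) and convert to expected hours at 0.90 min/GB
du -h 00-RAW/P12002*R1* | cut -f1 | sed 's/G//g' \
    | awk '{sum += $1} END {printf "%.1f h\n", sum * 0.90 / 60}'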

Let's create four separate scripts, one for each FastQC run, and add two extra hours to each expected run time for good measure, as well as generate MultiQC reports.

This is the script I sent to SLURM, where for each run I altered the job name, the SLURM log names, and the $proj and $R variables.

#!/bin/bash

#SBATCH --job-name P12002_R1_QC_raw
#SBATCH -A naiss2024-5-1
#SBATCH -p core -n 8
#SBATCH -t 07:15:00
#SBATCH --output=SLURM-%j-P12002_R1_QC_RAW.out
#SBATCH --error=SLURM-%j-P12002_R1_QC_RAW.err

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load FastQC/0.11.9 MultiQC/1.12

# Variables for R1/R2 and sequence project
proj="P12002"
R="R1"

# FastQC run with 8 cores
fastqc 00-RAW/${proj}*${R}* \
    -t 8 \
    --outdir 01-QC/fastqc_raw_${R}_${proj}

# FastQC end timestamp
echo "$(date)       [FastQC Complete]"

# We add --profile-runtime to see the runtime.
multiqc 01-QC/fastqc_raw_${R}_${proj} \
    --outdir 01-QC/multiqc_raw_${R}_${proj} \
    --profile-runtime

# End time and date
echo "$(date)       [End]"

They are running now (2024-04-06 10:44).
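
As a side note, a small wrapper loop could submit all four runs without editing the script each time (a sketch; it assumes the hard-coded proj= and R= lines are replaced with proj=$1 and R=$2, and scripts/qc_raw.slurm is a hypothetical path):

# Submit one job per sequence project and read direction
for proj in P12002 P14052; do
    for R in R1 R2; do
        sbatch --job-name ${proj}_${R}_QC_raw \
               --output SLURM-%j-${proj}_${R}_QC_RAW.out \
               --error SLURM-%j-${proj}_${R}_QC_RAW.err \
               scripts/qc_raw.slurm "$proj" "$R"
    done
done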

ndreey commented 2 months ago

P14052

The P14052 run finished in 33 min with 8 cores. Thus, it handled around 2.2 GB/min, or rather, it took 0.45 min/GB. Hence, doubling the number of cores roughly doubled the throughput, and in hindsight, four cores means FastQC handles four files at a time. We only tested with three files, which was not optimal.
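
For reference, the arithmetic (assuming the 33 min refers to the R1 job, using the 74.2 GB total from the table above):

$$\frac{74.2\ \text{GB}}{33\ \text{min}} \approx 2.2\ \text{GB/min}, \qquad \frac{33\ \text{min}}{74.2\ \text{GB}} \approx 0.45\ \text{min/GB}$$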

The estimate was 1.1 h and it ran in ~0.5 h, so I booked the cores a little too long, but overall good.

ndreey commented 2 months ago

P12002

The P12002 run finished in ~1 h 40 min with 8 cores. Thus it handled around ~3.6 GB/min, or 0.28 min/GB. Much quicker than we expected: we expected ~5.2 h, booked 7 h, and going by the P14052 rate it should have taken ~2.5 h. Oh well, good that it is done!