ndreey opened 7 months ago
Three reads totalling 8.9 GB took 8 min to process with 4 cores (6.4 GB RAM/core), giving us 1.11 GB/min, or 0.90 min/GB. Let's see if it goes quicker with 8 cores in total.
andbou@rackham4: (qc) CONURA_WGS: du -h 00-RAW/P12002_{101,102,103}_R1.fastq.gz
3.8G 00-RAW/P12002_101_R1.fastq.gz
2.7G 00-RAW/P12002_102_R1.fastq.gz
2.4G 00-RAW/P12002_103_R1.fastq.gz
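As a quick sanity check, the three sizes above do sum to the 8.9 GB quoted; a minimal awk sum (sizes hard-coded for illustration, same pipeline as the du commands further down):

```shell
# Sum the three R1 file sizes listed above (hard-coded for illustration)
total=$(printf '3.8\n2.7\n2.4\n' | awk '{sum += $1} END {print sum}')
echo "$total"   # 8.9
```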
$$\text{expected run time (h)} = \frac{\text{total GB} \times 0.90\ \text{min/GB}}{60}$$
to_run | GB | expected run time (h)
---|---|---
P12002_R1 | 344.1 | 5.2 |
P12002_R2 | 366.9 | 5.5 |
P14052_R1 | 74.2 | 1.1 |
P14052_R2 | 75.3 | 1.1 |
>du -h 00-RAW/P12002*R1* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
344.1
>du -h 00-RAW/P12002*R2* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
366.9
>du -h 00-RAW/P14052*R1* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
74.2
> du -h 00-RAW/P14052*R2* | cut -f1 | sed 's/G//g' | awk '{sum += $1} END {print sum}'
75.3
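The formula above can be wrapped in a small helper to reproduce the table; a sketch (`expected_hours` is a hypothetical name, and the 0.90 min/GB rate is taken from the 4-core benchmark above):

```shell
# Apply the 0.90 min/GB estimate, then convert minutes to hours
expected_hours() {
  awk -v gb="$1" 'BEGIN { printf "%.1f\n", gb * 0.90 / 60 }'
}
expected_hours 344.1   # 5.2
expected_hours 74.2    # 1.1
```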
Let's create a separate script for each of the four FastQC runs, add two extra hours to each expected run time for good measure, and have each script generate a MultiQC report as well.
This is the script I sent to SLURM; between runs I altered the job name, the SLURM log names, and the $proj and $R variables.
#!/bin/bash
#SBATCH --job-name P12002_R1_QC_raw
#SBATCH -A naiss2024-5-1
#SBATCH -p core -n 8
#SBATCH -t 07:15:00
#SBATCH --output=SLURM-%j-P12002_R1_QC_RAW.out
#SBATCH --error=SLURM-%j-P12002_R1_QC_RAW.err
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load FastQC/0.11.9 MultiQC/1.12
# Variables for R1/R2 and sequence project
proj="P12002"
R="R1"
# FastQC run with 8 cores (FastQC does not create the output directory itself)
mkdir -p 01-QC/fastqc_raw_${R}_${proj}
fastqc 00-RAW/${proj}*${R}* \
-t 8 \
--outdir 01-QC/fastqc_raw_${R}_${proj}
# FastQC end timestamp
echo "$(date) [FastQC Complete]"
# We add --profile-runtime to see the runtime.
multiqc 01-QC/fastqc_raw_${R}_${proj} \
--outdir 01-QC/multiqc_raw_${R}_${proj} \
--profile-runtime
# End time and date
echo "$(date) [End]"
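Instead of maintaining four hand-edited copies, the script could read $proj and $R from the environment and be submitted in a loop. A sketch, assuming the hard-coded proj=/R= assignments are removed and the script is saved as `qc_raw.sh` (hypothetical name); `DRYRUN=echo` just prints the commands instead of submitting:

```shell
# Submit all four proj/R combinations; sbatch --export passes proj and R to the job
DRYRUN=echo   # set to "" to actually submit
for proj in P12002 P14052; do
  for R in R1 R2; do
    $DRYRUN sbatch --job-name="${proj}_${R}_QC_raw" \
      --output="SLURM-%j-${proj}_${R}_QC_RAW.out" \
      --error="SLURM-%j-${proj}_${R}_QC_RAW.err" \
      --export=ALL,proj="$proj",R="$R" \
      qc_raw.sh
  done
done
```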
They are running now (2024-04-06 10:44).
The P14052 runs finished in 33 min with 8 cores, i.e. around 2.2 GB/min, or 0.45 min/GB. So doubling the core count halved the per-GB time. In hindsight, with four cores FastQC handles four files at a time, and we only tested with three files, which was not optimal.
We estimated 1.1 h and it ran in ~0.5 h, so we booked the cores a little too long, but overall good.
The P12002 runs finished in ~1 h 40 min with 8 cores, i.e. around 3.6 GB/min, or 0.28 min/GB. Much quicker than we expected: we estimated 5.2 h and booked 7 h, while the P14052 rate suggests it should have taken about 2.5 h. Oh well, good that it is done!
Quality Control
I generate raw-samples.txt, which has sample_name, R1, and R2 as columns.
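One way to build that file from the file names (a sketch, assuming the `<sample>_R1.fastq.gz` / `<sample>_R2.fastq.gz` naming seen above; the `-e` guard skips the loop when the glob matches nothing):

```shell
# Tab-separated table: sample_name, R1 path, R2 path
printf 'sample_name\tR1\tR2\n' > raw-samples.txt
for r1 in 00-RAW/*_R1.fastq.gz; do
  [ -e "$r1" ] || continue
  sample=$(basename "$r1" _R1.fastq.gz)
  printf '%s\t%s\t%s\n' "$sample" "$r1" "${r1%_R1.fastq.gz}_R2.fastq.gz"
done >> raw-samples.txt
```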
I then prepare for the quality-control results by setting up 01-QC with subdirectories (using brace expansion).
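With brace expansion, all eight result directories can be created in one command (directory names assumed from the script above):

```shell
# 2 tools x 2 reads x 2 projects = 8 directories in one mkdir
mkdir -p 01-QC/{fastqc,multiqc}_raw_{R1,R2}_{P12002,P14052}
ls -d 01-QC/* | wc -l   # 8
```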
Now, to first try out SLURM on UPPMAX, we will do a test run with this simple script.