As mentioned in the "first date" (#1), each sample is split into multiple .fastq.gz files. We will keep the fastq files separated by lane and maybe merge them further downstream if required (probably not).
To our aid we have:
metadata-no-hybrids.csv: All samples to analyze (no hybrids).
<sample_name>.lst: the fastq file paths for that sample.
Example of a .lst file. Notice that it is not sorted in a useful order.
The last line is always the <sample_name>.md5 file.
So, given a project directory (P12002 or P14052), we need to read each .lst, extract the R1 and R2 paths, trim, and write the trimmed reads to CONURA_WGS/02-TRIM. The paths in the .lst are relative to this absolute path: /crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata. We also need to skip the last line, which holds the .md5 path, and sort the fastq entries.
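The sort-and-pair step can be sketched like this (the file names are hypothetical, only the shape of the data matters):

```sh
# Hypothetical .lst contents: unsorted lane fastq entries plus the trailing .md5 line.
cat <<'EOF' > sample.lst
lane2_R2.fastq.gz
lane1_R1.fastq.gz
lane2_R1.fastq.gz
lane1_R2.fastq.gz
sample.md5
EOF
# Drop the .md5 line, sort so R1 precedes R2 within each lane,
# then let `paste - -` fold consecutive lines into two tab-separated columns.
grep -v '\.md5$' sample.lst | sort | paste - -
```

Each output line now holds an R1/R2 pair, which is exactly the shape the trimming loop below reads with `while read R1 R2`.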
Setting up for the array
lane-paths-by-samples.sh uses metadata-no-hybrids.csv (NOTE: I removed its header first) to emit the paths in a more iteration-friendly way. It generates a folder of per-sample .txt files with an R1/R2 two-column structure and writes a list of those files to doc/.
#!/bin/bash
# Start time and date
echo "$(date) [Start]"
while IFS=, read -r SAMPLE POP HP REGION RANGE; do
echo "Processing: $SAMPLE"
# Get the project name.
# %%_* removes everything from the first underscore onward
PROJ="${SAMPLE%%_*}"
# Base path that the .lst entries are relative to
RAWDIR="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${PROJ}"
# Path to .lst file
LSTFILE="${RAWDIR}/${SAMPLE}.lst"
# Folder for the sorted, paired .lst output (mkdir -p is a no-op if it exists)
TMPLST="00-RAW-LANES-LST"
mkdir -p "$TMPLST"
# Drop the trailing .md5 line, sort so R1 precedes R2, and pair them per lane
grep -v '\.md5$' "$LSTFILE" | sort | paste - - > "${TMPLST}/${SAMPLE}.txt"
done < doc/metadata-no-hybrids.csv
# Store the per-sample file names so the trimming script can iterate over them.
ls -1 "$TMPLST" > doc/sample_lst_lane_paths.txt
# End time and date
echo "$(date) [End]"
Trimming
With the sample_lst_lane_paths.txt file I can generate an array where each task trims one sample's fastq files in parallel. Trimming the P12002 reads took about 7 h and P14052 took ca. 3 h. Using the array and parallelization, the script trimmed all 114 tasks' fastq files in less than 1 h. In fact, each task took about 5 min, so to make queue time even quicker (it was already quick), set -t 00:30:00 in the future.
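The 30-minute estimate follows from the array limits; a quick back-of-the-envelope check:

```sh
# --array=1-114%20 caps the array at 20 concurrent tasks, so 114 tasks run in
# ceil(114/20) = 6 waves. At roughly 5 min per task that is about 30 min of
# wall time, which motivates -t 00:30:00.
TASKS=114; CONCURRENT=20; MIN_PER_TASK=5
WAVES=$(( (TASKS + CONCURRENT - 1) / CONCURRENT ))
echo "${WAVES} waves, ~$(( WAVES * MIN_PER_TASK )) minutes"
```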
fastp-trim-per-lane.sh
#!/bin/bash
#SBATCH --job-name fastp
#SBATCH --array=1-114%20
#SBATCH -A naiss2023-22-412
#SBATCH -p core -n 16
#SBATCH -t 05:35:00
#SBATCH --output=SLURM-%j-fastp_task-%a.out
#SBATCH --error=SLURM-%j-fastp_task-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load fastp/0.23.4
# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}
# Path to folder with .lst files
LST_DIR="00-RAW-LANES-LST"
# Get the .lst file for this task
SAMPLE_LST=$(sed -n "${JOBID}p" doc/sample_lst_lane_paths.txt)
# Get the project name and complete sample_id.
# %%_* removes everything starting from the first underscore
PROJ="${SAMPLE_LST%%_*}"
SAMPLE="${SAMPLE_LST%%.*}"
# Base path that the .lst entries are relative to
RAWDIR="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${PROJ}"
# Create directory for fastp reports
FASTP_DIR="01-QC/fastp"
mkdir -p "$FASTP_DIR"
# Create directory for trimmed reads
TRIM_DIR="02-TRIM"
mkdir -p "$TRIM_DIR"
# Loop through the raw files and trim.
while read R1 R2; do
# Input for fastp
R1_IN="${RAWDIR}/${R1}"
R2_IN="${RAWDIR}/${R2}"
# Basename for the lane fastq files
R1_TRIMD=$(basename "$R1" .fastq.gz)
R2_TRIMD=$(basename "$R2" .fastq.gz)
# Output name for R1 and R2
R1_OUT="${TRIM_DIR}/${R1_TRIMD}-trim.fastq.gz"
R2_OUT="${TRIM_DIR}/${R2_TRIMD}-trim.fastq.gz"
fastp \
--in1 "$R1_IN" --in2 "$R2_IN" \
--out1 "$R1_OUT" --out2 "$R2_OUT" \
--html "${FASTP_DIR}/${R2_TRIMD}-fastp.html" \
--json "${FASTP_DIR}/${R2_TRIMD}-fastp.json" \
--average_qual 20 \
--length_required 36 \
--detect_adapter_for_pe \
--trim_poly_g \
--dedup \
--thread 16 \
--verbose
done < "${LST_DIR}/${SAMPLE_LST}"
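The `sed -n "${JOBID}p"` line is what maps an array index to its .lst file: SLURM sets SLURM_ARRAY_TASK_ID per task, and sed prints exactly that line of the list. A minimal self-contained illustration (mock file names):

```sh
# Mock file list; in the real script this is doc/sample_lst_lane_paths.txt.
printf '%s\n' alpha.txt beta.txt gamma.txt > task_index.txt
# Set by hand here; SLURM sets this per array task.
SLURM_ARRAY_TASK_ID=2
sed -n "${SLURM_ARRAY_TASK_ID}p" task_index.txt   # prints: beta.txt
```

Note that the index is 1-based, matching --array=1-114.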