ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)

RE-RUNNING QUALITY CONTROL: No merger, handle lane per lane. #32

Open ndreey opened 2 months ago

ndreey commented 2 months ago

As mentioned in the "first date" (#1), each sample is split into multiple .fastq.gz files. We will keep the fastq files separated by lane and MAYBE merge further downstream if required (I don't think we will need to).

To our aid we have:

Example of a .lst file. Notice that it is not sorted in a useful order:

$ cat ../../data/WGS/rawdata/P12002/P12002_102.lst

P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L006_R2_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L003_R1_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L006_R1_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L003_R2_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0276_BHWL33CCXY/P12002_102_S1_L001_R1_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0276_BHWL33CCXY/P12002_102_S1_L001_R2_001.fastq.gz
P12002_102.md5

The last line is always the <sample_name>.md5 file.

So, given a project directory (P12002 or P14052), we need to read each .lst and get the paths for R1 and R2 so we can trim the reads and write the trimmed output to CONURA_WGS/02-TRIM. The paths in the .lst are relative to this absolute path: /crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata. We also need to skip the last line, which holds the .md5 path, and sort the fastq files.
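The skip-md5, sort, and pair steps can be sketched on a miniature version of the listing above (a demo file under /tmp, basenames only for readability):

```shell
# Hypothetical demo .lst: four lane files plus the trailing .md5 line.
cat > /tmp/example.lst <<'EOF'
P12002_102_S46_L006_R2_001.fastq.gz
P12002_102_S46_L003_R1_001.fastq.gz
P12002_102_S46_L006_R1_001.fastq.gz
P12002_102_S46_L003_R2_001.fastq.gz
P12002_102.md5
EOF

# grep -v drops the .md5 line, sort puts R1 directly before R2 for each
# lane, and paste - - joins consecutive lines into R1<TAB>R2 pairs.
grep -v '\.md5$' /tmp/example.lst | sort | paste - -
# → one line per lane: L003_R1 <TAB> L003_R2, then L006_R1 <TAB> L006_R2
```

This works because within a lane the filenames differ only in R1/R2, so lexicographic sort always lands R1 immediately above R2.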

Setting up for the array

lane-paths-by-samples.sh uses metadata-no-hybrids.csv to pipe the paths into a more iteration-friendly layout. NOTE: I removed the header from metadata-no-hybrids.csv. The script generates a folder with all the .lst files reorganized into one R1,R2 pair per line, and writes a list of the sample .txt files to doc/.

#!/bin/bash

# Start time and date
echo "$(date)       [Start]"

while IFS="," read -r SAMPLE POP HP REGION RANGE; do

    echo "Processing: $SAMPLE"
    # Get the project name.
    # %%_* removes everything from the first underscore onward
    PROJ="${SAMPLE%%_*}"

    # Directory that the paths inside the .lst files are relative to
    RAWDIR="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${PROJ}"

    # Path to .lst file
    LSTFILE="${RAWDIR}/${SAMPLE}.lst"

    # Create a temporary folder for sorted .lst files
    TMPLST="00-RAW-LANES-LST"
    if [ ! -d "$TMPLST" ]; then
        mkdir -p "$TMPLST"
    fi

    # Drop the trailing .md5 line, sort the lanes, and pair R1/R2 per line
    grep -v "\.md5$" "$LSTFILE" | sort | paste - - > "${TMPLST}/${SAMPLE}.txt"

done < doc/metadata-no-hybrids.csv

# Store the sample.lst's into a text file so we can iterate in trimming script.
ls -1 "$TMPLST" > doc/sample_lst_lane_paths.txt

# Start time and date
echo "$(date)       [End]"

Trimming

With the sample_lst_lane_paths.txt file I can generate an array that trims the fastq files in parallel. Trimming the P12002 reads serially took about 7 h and P14052 about 3 h. Using the array and parallelization, the script trimmed all 114 fastq files in under 1 h. Each task took about 5 min, so to make queue time even shorter (it was already quick), set -t 00:30:00 in the future.
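The array range has to match the line count of sample_lst_lane_paths.txt. A sketch of deriving it at submit time instead of hard-coding 1-114 (using a 114-line demo list under /tmp; on the cluster, point it at doc/sample_lst_lane_paths.txt and run the printed command):

```shell
# Demo list standing in for doc/sample_lst_lane_paths.txt.
seq 114 > /tmp/sample_lst_lane_paths.txt

# One array task per line; %20 caps concurrent tasks at 20.
N=$(wc -l < /tmp/sample_lst_lane_paths.txt)
echo "sbatch --array=1-${N}%20 fastp-trim-per-lane.sh"
# → sbatch --array=1-114%20 fastp-trim-per-lane.sh
```

Options passed to sbatch on the command line override the matching #SBATCH lines in the script, so the hard-coded --array=1-114%20 below still works as a default.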

fastp-trim-per-lane.sh

#!/bin/bash

#SBATCH --job-name fastp
#SBATCH --array=1-114%20
#SBATCH -A naiss2023-22-412
#SBATCH -p core -n 16
#SBATCH -t 05:35:00
#SBATCH --output=SLURM-%j-fastp_task-%a.out
#SBATCH --error=SLURM-%j-fastp_task-%a.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load fastp/0.23.4 

# SLURM array jobid
JOBID=${SLURM_ARRAY_TASK_ID}

# Path to folder with .lst files
LST_DIR="00-RAW-LANES-LST"

# Get the .lst file for this task
SAMPLE_LST=$(sed -n "${JOBID}p" doc/sample_lst_lane_paths.txt)

# Get the project name and complete sample id.
# %%_* removes everything from the first underscore onward;
# %%.* removes everything from the first dot (the extension) onward
PROJ="${SAMPLE_LST%%_*}"
SAMPLE="${SAMPLE_LST%%.*}"

# Directory that the paths inside the .lst are relative to
RAWDIR="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${PROJ}"

# Create directory for fastp reports
FASTP_DIR="01-QC/fastp"
if [ ! -d "$FASTP_DIR" ]; then
    mkdir -p "$FASTP_DIR"
fi

# Create directory for trimmed reads
TRIM_DIR="02-TRIM"
if [ ! -d "$TRIM_DIR" ]; then
    mkdir -p "$TRIM_DIR"
fi

# Loop through the raw files and trim.
while read -r R1 R2; do

    # Input for fastp
    R1_IN="${RAWDIR}/${R1}"
    R2_IN="${RAWDIR}/${R2}"

    # Basename for the lane fastq files (strip the .fastq.gz suffix)
    R1_TRIMD=$(basename "$R1" .fastq.gz)
    R2_TRIMD=$(basename "$R2" .fastq.gz)

    # Output names for R1 and R2
    R1_OUT="${TRIM_DIR}/${R1_TRIMD}-trim.fastq.gz"
    R2_OUT="${TRIM_DIR}/${R2_TRIMD}-trim.fastq.gz"

    fastp \
        --in1 "$R1_IN" --in2 "$R2_IN" \
        --out1 "$R1_OUT" --out2 "$R2_OUT" \
        --html "${FASTP_DIR}/${R2_TRIMD}-fastp.html" \
        --json "${FASTP_DIR}/${R2_TRIMD}-fastp.json" \
        --average_qual 20 \
        --length_required 36 \
        --detect_adapter_for_pe \
        --trim_poly_g \
        --dedup \
        --thread 16 \
        --verbose

done < ${LST_DIR}/${SAMPLE_LST}
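Once the array finishes, a quick sanity check (a sketch, assuming the directory layout above) is to confirm that every R1/R2 pair listed in 00-RAW-LANES-LST produced two trimmed files:

```shell
# Each line in the .txt files is one R1/R2 pair, so the number of trimmed
# fastq.gz files in 02-TRIM should be exactly twice the line count.
EXPECTED=$(cat 00-RAW-LANES-LST/*.txt 2>/dev/null | wc -l)
TRIMMED=$(ls -1 02-TRIM/*-trim.fastq.gz 2>/dev/null | wc -l)
echo "pairs listed: ${EXPECTED}, trimmed files: ${TRIMMED} (expect 2x pairs)"
```

Any mismatch points at a failed array task; the SLURM-%j-fastp_task-%a.err files identify which one.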