ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)
0 stars 0 forks source link

merge_samples_across_lanes.sh #3

Open ndreey opened 3 months ago

ndreey commented 3 months ago

Script to merge each sample

As mentioned in the "first date" (#1), each sample is split in multiple .fastq.gz files. We need to merge each sample.

To our aid:

cat ../../data/WGS/rawdata/P12002/P12002_102.lst
P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L006_R2_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L003_R1_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L006_R1_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_102_S46_L003_R2_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0276_BHWL33CCXY/P12002_102_S1_L001_R1_001.fastq.gz
P12002_102/02-FASTQ/190111_ST-E00214_0276_BHWL33CCXY/P12002_102_S1_L001_R2_001.fastq.gz
P12002_102.md5

The last line is always the <sample_name>.md5 file.

Of all the .lst files, there are 100 that are associated with P12002 samples, one is miscellaneous.lst which we will need to ignore. Furthermore, it is interesting that there are 97 in total according from all.pops.metadata.tsv but we find 100 just from P12002.

 ls ../../../data/WGS/rawdata/P12002/*.lst | tr " " "\n" | wc -l
101

So given a directory (P12002/P14052) we need to reach each .lst and concatenate the R1 and R2 separately and skipping the .md5 file (last line). The .lst gives the path from this absolute path: /crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata

Here is the first version of _merge_samples_across_lanes.sh"

#!/bin/bash

# Each sample in the sequence project ($dir) are found in the .lst file.
# This script reads through each .lst and generates a concatenated fastq 
# at user specified destination ($out).
# OBS, each run over runs previous runs.

# Directories to find the project .lst files (dir) and where we will generate the
# new fastq.gz files (out).
dir="P14052"
out="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/00-RAW"

# Path from where .lst files path from.
data="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata"

# Finds the .lst files in dir that follows <sample_id>.lst
find "$data/$dir" -name "${dir}_*.lst" | while read lst; do

    # Get the sample name from $lst which has the whole path.
    sample=$(basename $lst .lst)
    echo "Processing files for sample: $sample"

    # Out files for merged R1 and R2
    R1_out="${out}/${sample}_R1.fastq.gz"
    R2_out="${out}/${sample}_R2.fastq.gz"

    # Generate empty fastq.gz files for R1 and R2 to avoid duplication
    touch "$R1_out" "$R2_out"

    # Lets loop through the files in the .lst
    while read -r line; do
        # Only handle lines that end with .fastq.gz using pattern (=~)
        if [[ "$line" =~ ".fastq.gz" ]]; then
            fq_path="${data}/${line}"

            # Determines if its R1 or R2 fastq file
            if [[ "$line" =~ "_R1_" ]]; then
                zcat "$fq_path" >> "$R1_out"

            elif [[ "$line" =~ "_R2_" ]]; then
                zcat "$fq_path" >> "$R2_out"
            fi
        fi
    done < "$lst"
done
ndreey commented 3 months ago

Update 2

#!/bin/bash

# Each sample in the sequence project ($dir) are found in the .lst file.
# This script reads through each .lst and generates a concatenated fastq 
# at user specified destination ($out).
# OBS, each run over runs previous runs.

# Directories to find the project .lst files (dir) and where we will generate the
# new fastq.gz files (out).
dir="P14052"
out="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/00-RAW"

# Path from where .lst files path from.
data="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${dir}"

# Finds the .lst files in dir that follows <sample_id>.lst
find "$data" -name "${dir}_*.lst" | while read lst; do

    sample=$(basename $lst .lst)
    echo "Processing files for sample: $sample"

    # Out files for merged R1 and R2
    R1_out="${out}/${sample}_R1.fastq"
    R2_out="${out}/${sample}_R2.fastq"

    # Generate empty fastq.gz files for R1 and R2 to avoid duplication
    touch "$R1_out" "$R2_out"

    # Lets loop through the files in the .lst
    while read -r line; do
        # Only handle lines that end with .fastq.gz using pattern (=~)
        if [[ "$line" =~ ".fastq.gz" ]]; then
            fq_path="${data}/${line}"

            # Determines if its R1 or R2 fastq file
            if [[ "$line" =~ "_R1_" ]]; then
                zcat "$fq_path" >> "$R1_out"

            elif [[ "$line" =~ "_R2_" ]]; then
                zcat "$fq_path" >> "$R2_out"
            fi        
        fi
    done < "$lst"
done
ndreey commented 3 months ago

This is the final version of script. Double checked with P12002_145_R1 reads.

# Merged reads
zcat 00-RAW/P12002_145_R1.fastq.gz | grep "@" | wc -l
37840146

# From each lane
zcat ../../../data/WGS/rawdata/P12002/P12002_145/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_145_S94_L004_R1_001.fastq.gz | grep "@" | wc -l
13635386

zcat ../../../data/WGS/rawdata/P12002/P12002_145/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_145_S94_L007_R1_001.fastq.gz | grep "@" | wc -l
12866735

zcat ../../../data/WGS/rawdata/P12002/P12002_145/02-FASTQ/190111_ST-E00214_0276_BHWL33CCXY/P12002_145_S49_L002_R1_001.fastq.gz | grep "@" | wc -l
11338025

It checks out, 13635386 + 12866735 +11338025 = 37840146!

More adjustments can be made, but this will do for now for the script :)

#!/bin/bash

# Each sample in the sequence project ($dir) are found in the .lst file.
# This script reads through each .lst and generates a concatenated fastq 
# at user specified destination ($out).
# OBS, each run over runs previous runs.

# Directories to find the project .lst files (dir) and where we will generate the
# new fastq.gz files (out).
dir="P12002"
out="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/00-RAW"

# Path from where .lst files path from.
data="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${dir}"

# Finds the .lst files in dir that follows <sample_id>.lst
find "$data" -name "${dir}_*.lst" | while read lst; do

    sample=$(basename $lst .lst)
    echo "Processing files for sample: $sample"

    # Out files for merged R1 and R2
    R1_out="${out}/${sample}_R1.fastq.gz"
    R2_out="${out}/${sample}_R2.fastq.gz"

    # Generate empty fastq.gz files for R1 and R2 to avoid duplication
    touch "$R1_out" "$R2_out"

    # Lets loop through the files in the .lst
    while read -r line; do
        # Only handle lines that end with .fastq.gz using pattern (=~)
        if [[ "$line" =~ ".fastq.gz" ]]; then
            fq_path="${data}/${line}"

            # Determines if its R1 or R2 fastq file
            if [[ "$line" =~ "_R1_" ]]; then
                cat "$fq_path" >> "$R1_out"

            elif [[ "$line" =~ "_R2_" ]]; then
                cat "$fq_path" >> "$R2_out"
            fi        
        fi
    done < "$lst"
done