Open ndreey opened 3 months ago
Update 2
#!/bin/bash
# Each sample in the sequence project ($dir) are found in the .lst file.
# This script reads through each .lst and generates a concatenated fastq
# at user specified destination ($out).
# OBS, each run over runs previous runs.
# Directories to find the project .lst files (dir) and where we will generate the
# new fastq.gz files (out).
dir="P14052"
out="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/00-RAW"
# Path from where .lst files path from.
data="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${dir}"
# Finds the .lst files in dir that follows <sample_id>.lst
find "$data" -name "${dir}_*.lst" | while read lst; do
sample=$(basename $lst .lst)
echo "Processing files for sample: $sample"
# Out files for merged R1 and R2
R1_out="${out}/${sample}_R1.fastq"
R2_out="${out}/${sample}_R2.fastq"
# Generate empty fastq.gz files for R1 and R2 to avoid duplication
touch "$R1_out" "$R2_out"
# Lets loop through the files in the .lst
while read -r line; do
# Only handle lines that end with .fastq.gz using pattern (=~)
if [[ "$line" =~ ".fastq.gz" ]]; then
fq_path="${data}/${line}"
# Determines if its R1 or R2 fastq file
if [[ "$line" =~ "_R1_" ]]; then
zcat "$fq_path" >> "$R1_out"
elif [[ "$line" =~ "_R2_" ]]; then
zcat "$fq_path" >> "$R2_out"
fi
fi
done < "$lst"
done
This is the final version of script.
Double checked with P12002_145_R1
reads.
# Merged reads
zcat 00-RAW/P12002_145_R1.fastq.gz | grep "@" | wc -l
37840146
# From each lane
zcat ../../../data/WGS/rawdata/P12002/P12002_145/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_145_S94_L004_R1_001.fastq.gz | grep "@" | wc -l
13635386
zcat ../../../data/WGS/rawdata/P12002/P12002_145/02-FASTQ/190111_ST-E00214_0275_AHWLHMCCXY/P12002_145_S94_L007_R1_001.fastq.gz | grep "@" | wc -l
12866735
zcat ../../../data/WGS/rawdata/P12002/P12002_145/02-FASTQ/190111_ST-E00214_0276_BHWL33CCXY/P12002_145_S49_L002_R1_001.fastq.gz | grep "@" | wc -l
11338025
It checks out, 13635386 + 12866735 +11338025 = 37840146!
More adjustments can be made, but this will do for now for the script :)
#!/bin/bash
# Each sample in the sequence project ($dir) are found in the .lst file.
# This script reads through each .lst and generates a concatenated fastq
# at user specified destination ($out).
# OBS, each run over runs previous runs.
# Directories to find the project .lst files (dir) and where we will generate the
# new fastq.gz files (out).
dir="P12002"
out="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/00-RAW"
# Path from where .lst files path from.
data="/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata/${dir}"
# Finds the .lst files in dir that follows <sample_id>.lst
find "$data" -name "${dir}_*.lst" | while read lst; do
sample=$(basename $lst .lst)
echo "Processing files for sample: $sample"
# Out files for merged R1 and R2
R1_out="${out}/${sample}_R1.fastq.gz"
R2_out="${out}/${sample}_R2.fastq.gz"
# Generate empty fastq.gz files for R1 and R2 to avoid duplication
touch "$R1_out" "$R2_out"
# Lets loop through the files in the .lst
while read -r line; do
# Only handle lines that end with .fastq.gz using pattern (=~)
if [[ "$line" =~ ".fastq.gz" ]]; then
fq_path="${data}/${line}"
# Determines if its R1 or R2 fastq file
if [[ "$line" =~ "_R1_" ]]; then
cat "$fq_path" >> "$R1_out"
elif [[ "$line" =~ "_R2_" ]]; then
cat "$fq_path" >> "$R2_out"
fi
fi
done < "$lst"
done
Script to merge each sample
As mentioned in the "first date" (#1), each sample is split in multiple
.fastq.gz
files. We need to merge each sample.To our aid:
all.pops.metadata.tsv
to aid us in getting sample_names<sample_name>.lst
files with pathways.The last line is always the
<sample_name>.md5
file.Of all the .lst files, there are 100 that are associated with P12002 samples, one is
miscellaneous.lst
which we will need to ignore. Furthermore, it is interesting that there are 97 in total according fromall.pops.metadata.tsv
but we find 100 just from P12002.So given a directory (P12002/P14052) we need to reach each
.lst
and concatenate the R1 and R2 separately and skipping the.md5
file (last line). The.lst
gives the path from this absolute path:/crex/proj/snic2020-6-222/Projects/Tconura/data/WGS/rawdata
Here is the first version of _merge_samples_across_lanes.sh"