ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)

Broad approach: All CH and all CO metagenome #45

Open ndreey opened 3 weeks ago

ndreey commented 3 weeks ago

Merging reads per hostplant

As a comparison against the population-based approach, we will assemble one metagenome from all CH samples and one from all CO samples.

# New directory for hostplant approach
mkdir 08-ANVIO-HP-APPROACH

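# Create a .txt file for the samples. The four-letter population pattern keeps
# the merged CH_/CO_ files written to this directory later out of the lists.
ls -1 05-CLEAN-MERGED/CH??_R[12]-clean.fq.gz | paste - - > doc/all-CH-clean-reads.txt
ls -1 05-CLEAN-MERGED/CO??_R[12]-clean.fq.gz | paste - - > doc/all-CO-clean-reads.txt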

# Checking that each row pairs the R1 and R2 file of the same population
cat doc/all-{CH,CO}-clean-reads.txt
05-CLEAN-MERGED/CHES_R1-clean.fq.gz     05-CLEAN-MERGED/CHES_R2-clean.fq.gz
05-CLEAN-MERGED/CHFI_R1-clean.fq.gz     05-CLEAN-MERGED/CHFI_R2-clean.fq.gz
05-CLEAN-MERGED/CHSC_R1-clean.fq.gz     05-CLEAN-MERGED/CHSC_R2-clean.fq.gz
05-CLEAN-MERGED/CHSK_R1-clean.fq.gz     05-CLEAN-MERGED/CHSK_R2-clean.fq.gz
05-CLEAN-MERGED/CHST_R1-clean.fq.gz     05-CLEAN-MERGED/CHST_R2-clean.fq.gz
05-CLEAN-MERGED/COES_R1-clean.fq.gz     05-CLEAN-MERGED/COES_R2-clean.fq.gz
05-CLEAN-MERGED/COGE_R1-clean.fq.gz     05-CLEAN-MERGED/COGE_R2-clean.fq.gz
05-CLEAN-MERGED/COLI_R1-clean.fq.gz     05-CLEAN-MERGED/COLI_R2-clean.fq.gz
05-CLEAN-MERGED/COSK_R1-clean.fq.gz     05-CLEAN-MERGED/COSK_R2-clean.fq.gz
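
The visual check above can also be scripted. The following is a minimal sketch, assuming the `_R1-`/`_R2-` naming shown in the listing; `check_pairs` is a hypothetical helper, not part of the repository.

```shell
#!/bin/bash
# check_pairs LIST: verify that every row pairs <name>_R1-... with <name>_R2-...
# (hypothetical helper; assumes the file naming shown in the listing above)
check_pairs() {
    local list=$1 r1 r2 status=0
    while read -r r1 r2; do
        # Derive the expected R2 path from R1 by swapping the read tag
        if [ "${r1/_R1-/_R2-}" != "$r2" ]; then
            echo "Mismatched pair: $r1 $r2" >&2
            status=1
        fi
    done < "$list"
    return "$status"
}

# Example usage:
# check_pairs doc/all-CH-clean-reads.txt && echo "CH pairs OK"
```

A row with a missing second column (an odd number of files) is also reported, since the expected R2 path then differs from the empty string.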

_merge_hostplantreads.sh

#!/bin/bash

# Start time and date
echo "$(date)       [Start]"

# Loops through CH and CO to generate one R1 and one R2 .fq.gz file per
# hostplant, each containing the reads of all its populations.
for HOSTP in CH CO; do
    echo "Merging reads for ${HOSTP} hostplant"
    READS_TXT="doc/all-${HOSTP}-clean-reads.txt"
    R1_OUT="05-CLEAN-MERGED/${HOSTP}_R1-clean.fq.gz"
    R2_OUT="05-CLEAN-MERGED/${HOSTP}_R2-clean.fq.gz"

    # Remove output from an earlier run; appending again would duplicate reads
    rm -f "${R1_OUT}" "${R2_OUT}"

    while read -r R1 R2; do
        # Append R1 and R2 to the merged files (concatenated gzip is valid gzip)
        cat "${R1}" >> "${R1_OUT}"
        cat "${R2}" >> "${R2_OUT}"
    done < "${READS_TXT}"
done

# End time and date
echo "$(date)       [End]"
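
The merge relies on the fact that gzip files can be concatenated directly: the result is a multi-member gzip stream that decompresses to the concatenation of the inputs. The toy example below illustrates this on throwaway files; the file names are made up for the demonstration.

```shell
#!/bin/bash
# Build two tiny gzip files, concatenate them, and decompress the result.
tmp=$(mktemp -d)
printf 'first\n'  | gzip > "$tmp/a.gz"
printf 'second\n' | gzip > "$tmp/b.gz"

# A single cat with '>' truncates the target, so a rerun does not double the output
cat "$tmp/a.gz" "$tmp/b.gz" > "$tmp/merged.gz"

gzip -t "$tmp/merged.gz"              # integrity check; non-zero exit on a corrupt stream
merged=$(gzip -dc "$tmp/merged.gz")   # decompresses both members in order
echo "$merged"
```

Running `gzip -t` on the merged read files is a cheap way to catch a truncated copy before starting an assembly.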

Let's confirm that R1 and R2 contain the same number of reads and that the read IDs line up.

# Checking that the number of reads is equal in both R1 and R2.
zcat 05-CLEAN-MERGED/CH_R1-clean.fq.gz | grep "^@" | wc -l; zcat 05-CLEAN-MERGED/CH_R2-clean.fq.gz | grep "^@" | wc -l
8756145
8756145

# Checking that read IDs line up at the start and end of the files.
zcat 05-CLEAN-MERGED/CH_R1-clean.fq.gz | grep "^@" | head -n 2; zcat 05-CLEAN-MERGED/CH_R2-clean.fq.gz | grep "^@" | head -n 2
@ST-E00214:276:HWL33CCXY:1:1101:1184:42007/1
@ST-E00214:276:HWL33CCXY:1:1101:1194:57741/1
@ST-E00214:276:HWL33CCXY:1:1101:1184:42007/2
@ST-E00214:276:HWL33CCXY:1:1101:1194:57741/2

zcat 05-CLEAN-MERGED/CH_R1-clean.fq.gz | grep "^@" | tail -n 2; zcat 05-CLEAN-MERGED/CH_R2-clean.fq.gz | grep "^@" | tail -n 2
@ST-E00214:275:HWLHMCCXY:7:2224:31801:67656/1
@ST-E00214:275:HWLHMCCXY:7:2224:31862:22721/1
@ST-E00214:275:HWLHMCCXY:7:2224:31801:67656/2
@ST-E00214:275:HWLHMCCXY:7:2224:31862:22721/2

# Quick check for CO reads as well.
zcat 05-CLEAN-MERGED/CO_R1-clean.fq.gz | grep "^@" | wc -l; zcat 05-CLEAN-MERGED/CO_R2-clean.fq.gz | grep "^@" | wc -l
7396184
7396184

ALL GOOD
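
One caveat with `grep "^@"`: `@` is also a valid Phred+33 quality character, so a quality line that starts with `@` would be counted as a header. The sketch below is a possible stricter variant, assuming four-line FASTQ records and IDs ending in `/1` or `/2` as in the output above. The helper names are hypothetical. It counts records from the line total and compares the IDs over the whole file instead of only the first and last two.

```shell
#!/bin/bash
# count_reads FILE: number of FASTQ records, assuming 4 lines per record
count_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# read_ids FILE: header line of every record with the trailing /1 or /2 removed
read_ids() {
    gzip -dc "$1" | awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }'
}

# paired_ok R1 R2: succeed only if both files list the same IDs in the same order
# (compares checksums of the ID streams, so nothing is held in memory)
paired_ok() {
    [ "$(read_ids "$1" | cksum)" = "$(read_ids "$2" | cksum)" ]
}

# Example usage:
# count_reads 05-CLEAN-MERGED/CH_R1-clean.fq.gz
# paired_ok 05-CLEAN-MERGED/CH_R1-clean.fq.gz 05-CLEAN-MERGED/CH_R2-clean.fq.gz && echo "paired"
```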

ndreey commented 3 weeks ago

Metagenome Assembly for each hostplant

For CH, we will run a hybrid assembly that adds the PacBio long reads (hybridSPAdes); for CO, we will run a standard short-read metaSPAdes assembly. Long reads could possibly be used for CO as well, but that can be done downstream. Furthermore, once we have high-quality MAGs, perhaps we could use them as references for a reference-based assembly... Food for thought.

Both scripts take one argument, the prefix of the reads to assemble. In this case, $1=CH for the hybrid assembly and $1=CO for the short-read assembly.

hybridspades-assembly.sh

#!/bin/bash

#SBATCH --job-name hybridSPAdes
#SBATCH -A naiss2024-22-580
#SBATCH -p node -n 1
#SBATCH -t 06:15:00
#SBATCH -C mem1TB
#SBATCH --output=slurm-logs/assembly/SLURM-%j-hybridSPAdes-CH.out
#SBATCH --error=slurm-logs/assembly/SLURM-%j-hybridSPAdes-CH.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load spades/3.15.5

# Set variables
POP=$1
SR_DIR="05-CLEAN-MERGED"
LR_DIR="04-CLEAN-FASTQ/hifi-pacbio"

# Create the assembly output directory if it does not exist
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/06-ASSEMBLY/${POP}"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Assembling the metagenome
spades.py \
    --meta \
    --only-assembler \
    -k auto \
    --threads 20 \
    --memory 1000 \
    -1 ${SR_DIR}/${POP}_R1-clean.fq.gz \
    -2 ${SR_DIR}/${POP}_R2-clean.fq.gz \
    --pacbio ${LR_DIR}/pt_042_001_cell1-clean.fastq.gz \
    --pacbio ${LR_DIR}/pt_042_001_cell2-clean.fastq.gz \
    --pacbio ${LR_DIR}/pt_042_001_cell3-clean.fastq.gz \
    -o $outdir

# Restarting from checkpoint
#spades.py --continue -o $outdir

# End time and date
echo "$(date)       [End]"

metaspades-assembly.sh

#!/bin/bash

#SBATCH --job-name metaSPAdes
#SBATCH -A naiss2024-22-580
#SBATCH -p node -n 1
#SBATCH -t 06:15:00
#SBATCH -C mem1TB
#SBATCH --output=slurm-logs/assembly/SLURM-%j-metaSPAdes-CO.out
#SBATCH --error=slurm-logs/assembly/SLURM-%j-metaSPAdes-CO.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load spades/3.15.5

# Set variables
POP=$1
SR_DIR="05-CLEAN-MERGED"

# Create the assembly output directory if it does not exist
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/06-ASSEMBLY/${POP}"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Assembling the metagenome
spades.py \
    --meta \
    --only-assembler \
    -k auto \
    --threads 20 \
    --memory 1000 \
    -1 ${SR_DIR}/${POP}_R1-clean.fq.gz \
    -2 ${SR_DIR}/${POP}_R2-clean.fq.gz \
    -o $outdir

# Restarting from checkpoint
#spades.py --continue -o $outdir

# End time and date
echo "$(date)       [End]"
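
Since both scripts depend on `$1`, a missing or mistyped prefix would only surface once SPAdes fails to find its input. A pre-flight check along these lines could catch that before the job spends queue time. This is a sketch: `check_inputs` is a hypothetical helper, and it assumes the `05-CLEAN-MERGED/<prefix>_R{1,2}-clean.fq.gz` layout used above.

```shell
#!/bin/bash
# check_inputs PREFIX [DIR]: fail early if the prefix or its read files are missing
# (hypothetical helper; DIR defaults to the short-read directory used in the scripts)
check_inputs() {
    local pop=${1:-} dir=${2:-05-CLEAN-MERGED} f
    if [ -z "$pop" ]; then
        echo "Usage: $0 <prefix>  (e.g. CH or CO)" >&2
        return 2
    fi
    for f in "${dir}/${pop}_R1-clean.fq.gz" "${dir}/${pop}_R2-clean.fq.gz"; do
        # -s: the file exists and is not empty
        if [ ! -s "$f" ]; then
            echo "Missing or empty input: $f" >&2
            return 1
        fi
    done
}

# Example usage near the top of either script:
# check_inputs "$1" || exit 1
#
# Submission, with the prefix passed as the first argument:
# sbatch hybridspades-assembly.sh CH
# sbatch metaspades-assembly.sh CO
```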