ndreey / CONURA_WGS

Metagenomic analysis on whole genome sequencing data from Tephritis conura (IN PROGRESS)

KneadData: Host-decontamination (FAILED) #27

Open ndreey opened 2 months ago

ndreey commented 2 months ago

Using 03_removedandThenMaskedMicrobialContamination_reference_genome.fasta as the reference for Tconura, found at: ../../results/00_refGenome_pt_042/refGenome_microbFiltering/

A       C       G       T       N       IUPAC   Other   GC      GC_stdev
0.3198  0.1802  0.1803  0.3196  0.0134  0.0000  0.0000  0.3606  0.0295

Main genome scaffold total:             2563
Main genome contig total:               640009
Main genome scaffold sequence total:    1946.195 MB
Main genome contig sequence total:      1920.204 MB     1.335% gap
Main genome scaffold N/L50:             319/1.742 MB
Main genome contig N/L50:               89532/6.455 KB
Main genome scaffold N/L90:             1189/398.914 KB
Main genome contig N/L90:               302349/1.761 KB
Max scaffold length:                    8.968 MB
Max contig length:                      221.479 KB
Number of scaffolds > 50 KB:            2231
% main genome in scaffolds > 50 KB:     99.41%

Minimum         Number          Number          Total           Total           Scaffold
Scaffold        of              of              Scaffold        Contig          Contig  
Length          Scaffolds       Contigs         Length          Length          Coverage
--------        --------------  --------------  --------------  --------------  --------
    All                  2,563         640,009   1,946,194,716   1,920,204,323    98.66%
   5 KB                  2,563         640,009   1,946,194,716   1,920,204,323    98.66%
  10 KB                  2,562         640,006   1,946,187,472   1,920,197,421    98.66%
  25 KB                  2,515         639,525   1,945,239,894   1,919,272,300    98.67%
  50 KB                  2,231         628,381   1,934,714,323   1,909,285,595    98.69%
 100 KB                  1,920         612,641   1,912,895,667   1,888,172,579    98.71%
 250 KB                  1,452         572,562   1,835,768,641   1,812,795,694    98.75%
 500 KB                  1,051         514,160   1,690,623,403   1,670,101,324    98.79%
   1 MB                    624         411,623   1,378,771,933   1,362,408,122    98.81%
 2.5 MB                    179         200,851     683,668,414     675,737,218    98.84%
   5 MB                     31          57,855     193,639,010     191,346,369    98.82%

The reference was copied and renamed to data/Tconura_reference_genome/Tconura_ref.fasta in the working directory of CONURA_WGS

The goal is to decontaminate the reads that map to the reference genome using KneadData v0.12.0.

Indexing the reference

KneadData utilizes different programs for the different steps. Here, we will skip KneadData's trimming step and only use its existing scripts to sort the alignments. The first step is to build an index for the reference. This was done in an interactive session.

# Move to reference genome directory
cd data/Tconura_reference_genome/

# Index
bowtie2-build --threads 12 Tconura_ref.fasta Tconura_ref

# Move back to working directory for the analysis
cd ../..

Constructing a new metadata file

Originally we had all.pops.metadata.tsv, and over time we have acquired allScots.Tcon.txt and samples_to_use.txt. Using the sample_id's from samples_to_use.txt, I map and extend the values from the other two files in R. Note, I removed the tabs from the .tsv file (saving it as .csv) and only used the sample id's from samples_to_use.txt. Furthermore, I manually extended allScots.Tcon.txt with column values matching all.pops.metadata.tsv. In the end, I could just provide this metadata file, meaning I can probably edit the merging and trimming scripts to use this file as the "base".

Either way, we now have one file holding info on all the samples that are not hybrids, so we can easily group samples by hostplant or population.

library(tidyverse)

# Load in data
trimmed <- read.table("trimmed_reads_meta.txt", header=TRUE, sep=",")

scot <- read.table("allScots.Tcon.txt", header=TRUE, sep="\t")

all <- read.table("all.pops.metadata.csv", header=TRUE, sep=",")

# Combine scots and all
df_combined <- bind_rows(scot, all)

# Map and join by sample_id.
df_trim <- trimmed %>%
  inner_join(df_combined, by="sample_id")

# Write to csv file
write.csv(df_trim, "metadata-no-hybrids.csv", row.names=FALSE, quote=FALSE)

metadata-no-hybrids.csv

sample_id,pop,hostplant,transect,hostrange
P12002_101,CHSK,CH,West,Sympatric
P12002_102,CHSK,CH,West,Sympatric
P12002_103,CHSK,CH,West,Sympatric
...
P14052_126,CHSC,CH,West,Sympatric
P14052_127,CHSC,CH,West,Sympatric
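As a sanity check on the merge, the sample lists of the two files can be compared (a sketch; `check_samples` is a hypothetical helper, and the file names are those used above):

```shell
# Hypothetical sanity check: every sample in samples_to_use.txt should appear
# exactly once in the merged metadata.
check_samples () {
    # Prints sample ids present in only one of the two files; no output = agreement.
    comm -3 <(sort "$1") <(tail -n +2 "$2" | cut -d, -f1 | sort)
}
# check_samples samples_to_use.txt metadata-no-hybrids.csv
```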

KneadData

KneadData will align the trimmed reads to the reference and flag those that map as "host contaminants", producing what they call clean reads. Overall, the program takes ca 3 hours per sample, so I ran it as an array.

#!/bin/bash

#SBATCH --job-name KneadData
#SBATCH --array=1-9%3
#SBATCH -A naiss2023-22-412
#SBATCH -p core -n 16
#SBATCH -t 03:35:00
#SBATCH --output=SLURM-%j-kneaddata.out
#SBATCH --error=SLURM-%j-kneaddata.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL

# Start time and date
echo "$(date)       [Start]"

# Load in modules
module load bioinfo-tools
module load KneadData/0.12.0

# Create directory for decontaminated reads
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-DECON"
if [ ! -d "$outdir" ]; then
    mkdir -p "$outdir"
fi

# Path to trimmed reads and reference database of Tconura.
trim_dir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/02-TRIM"
DB_REF="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/data/Tconura_reference_genome/Tconura_ref"

# SLURM array jobid
jobid=${SLURM_ARRAY_TASK_ID}

# Read the line corresponding to the SLURM array task ID from CHST_samples.txt
IFS="," read -r sample pop hp geo range < <(sed -n "${jobid}p" doc/CHST_samples.txt)

# R1 and R2 input
R1="${trim_dir}/${sample}_R1-trim.fastq.gz"
R2="${trim_dir}/${sample}_R2-trim.fastq.gz"

# Create a sample-specific output folder.
out="$outdir/${pop}_${sample}"
mkdir -p "$out"

# Host decontamination without trimming or removal of tandem repeats.
kneaddata \
    --verbose \
    --bypass-trim \
    --bypass-trf \
    --threads 16 \
    --input1 "$R1" \
    --input2 "$R2" \
    --output "$out" \
    --reference-db "$DB_REF" \
    --output-prefix "${pop}_${sample}"

# End time and date
echo "$(date)       [End]"

There seems to be an issue though: KneadData cannot handle spaces in the sequence IDs, a header format Illumina apparently introduced not too long ago. This leads KneadData to not utilize its paired-end algorithm to determine contamination.

A trimmed read with the space between ..1608 and 1:N:0..: @ST-E00214:275:HWLHMCCXY:6:1101:6502:1608 1:N:0:NCCGGAGA+NAAGATTA

I will have to edit each header and add a "/" or some other special character. These will be my guides: https://forum.biobakery.org/t/all-paired-end-read-unmatched/2895/8 and https://forum.biobakery.org/t/paired-end-data-results-in-unpaired-output/928

Nevertheless, we can get an idea of the host contamination.

P12002_114: Trimmed has 30169363 reads, of which 29905591 aligned to the genome, meaning 99.1% is host contamination and leaving 263772 reads to work with from read 1. The last two calls show that KneadData clearly can't handle the spaces, as the pair clearly exists. The trailing #0/1 is added by KneadData.
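The host fraction can be recomputed from those two counts (a quick sketch using the numbers reported here):

```shell
# Recompute the host fraction for P12002_114 from the read counts.
total=30169363   # trimmed R1 reads
host=29905591    # reads aligning to the Tconura reference
awk -v t="$total" -v h="$host" \
    'BEGIN { printf "host: %.2f%%  remaining: %d reads\n", 100*h/t, t-h }'
# host: 99.13%  remaining: 263772 reads
```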

# Number of trimmed reads in R1
andbou@r483: (assembly) CONURA_WGS: zcat 02-TRIM/P12002_114_R1-trim.fastq.gz | grep "^@" | wc -l
30169363

# Number of reads aligning to genome
andbou@r483: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_Tconura_ref_bowtie2_unmatched_1_contam.fastq | grep "^@" | wc -l
29905591

# Number of reads not aligning to genome
andbou@r483: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_unmatched_1.fastq  | grep "^@" | wc -l
263772

# Unmatched read from input 1
andbou@rackham3: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_unmatched_1.fastq | grep "^@" | head -n 1
@ST-E00214:275:HWLHMCCXY:6:1101:5842:1801.1:N:0:NCCGGAGA+NAAGATTA#0/1

# Unmatched read from input 2.
andbou@rackham3: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_unmatched_2.fastq | grep "6:1101:5842:1801"
@ST-E00214:275:HWLHMCCXY:6:1101:5842:1801.2:N:0:NCCGGAGA+NAAGATTA#0/2
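One caveat on the grep-based counts above: FASTQ quality strings can also begin with '@' (it encodes Q31), so grep "^@" may overcount. Counting every fourth line is safer (a sketch; `count_fastq` is a hypothetical helper):

```shell
# Count FASTQ records by line count rather than header pattern; quality
# lines starting with '@' would inflate a grep "^@" count.
count_fastq () {
    awk 'END { print NR/4 }' "$1"
}
# count_fastq 03-DECON/CHST_P12002_114/CHST_P12002_114_unmatched_1.fastq
```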

SeqID fix

As described by the forum user, this sed command fixes the issue.

# Test file
zcat 02-TRIM/P12002_114_R1-trim.fastq.gz | head -n 8 > testR1.fastq

# Replace the space so it just ends with /1.
sed 's/ 1.*/\/1/g' < testR1.fastq > new.textR1.fastq
andbou@r483: (assembly) CONURA_WGS: cat testR1.fastq 
@ST-E00214:275:HWLHMCCXY:6:1101:6502:1608 1:N:0:NCCGGAGA+NAAGATTA
NATTAATTTATTTTTTTCAATTATCCGACACCCAATGTACAATGGACATAACATGAAATTGCTTTCAATAACTTTATTTAAAAGTTTAAAATATTGGAACAAGTACTTACATATGTACTAATTCTCTATAGCCCTGCAGAAATTGTTGGCT
+
#-AFF-<<-JJA<FFJJJJJJJJJAJJJJ<FJFJFJFJJ-F7JJAAJ--FJFA-<-FFJJJ-FJJFFJFJJJJ-<-<<--7-FJFJJFFFJJ--AAAF-7J7-F-7--F7FJFFFJ<A--A--F<-FAJ---7<7<FF)AJF<A--7<A)-
@ST-E00214:275:HWLHMCCXY:6:1101:12023:1608 1:N:0:NCCGGAGA+NAAGATTA
NTTGAAATGTTTCAACTGGCGCTAAGTGATACGCAAGTAATTTAAGTTTCAAGCGATAACAACCAATCAGGTGGAGAAAATTGTTAGCGAATTCAAATATAAGATAGGCGGCAGAAAGCTTCTATCTGATGGAGTACTCAAATACTGTATA
+
#AAFFJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJFFJJFJJJFJJFJ<JJJJAFJJJJJJJJJJJJJJAJJJFJJJJFJJJJJJJJJJA-FJFJJAJJJJJJ<JJFJA7FF7JJJJJJAFAJ7A

andbou@r483: (assembly) CONURA_WGS: cat new.textR1.fastq
@ST-E00214:275:HWLHMCCXY:6:1101:6502:1608/1
NATTAATTTATTTTTTTCAATTATCCGACACCCAATGTACAATGGACATAACATGAAATTGCTTTCAATAACTTTATTTAAAAGTTTAAAATATTGGAACAAGTACTTACATATGTACTAATTCTCTATAGCCCTGCAGAAATTGTTGGCT
+
#-AFF-<<-JJA<FFJJJJJJJJJAJJJJ<FJFJFJFJJ-F7JJAAJ--FJFA-<-FFJJJ-FJJFFJFJJJJ-<-<<--7-FJFJJFFFJJ--AAAF-7J7-F-7--F7FJFFFJ<A--A--F<-FAJ---7<7<FF)AJF<A--7<A)-
@ST-E00214:275:HWLHMCCXY:6:1101:12023:1608/1
NTTGAAATGTTTCAACTGGCGCTAAGTGATACGCAAGTAATTTAAGTTTCAAGCGATAACAACCAATCAGGTGGAGAAAATTGTTAGCGAATTCAAATATAAGATAGGCGGCAGAAAGCTTCTATCTGATGGAGTACTCAAATACTGTATA
+
#AAFFJJJJJJJJJJJJJJJJJJJJJFJJFJJJJJFJJJJJFJJJJJJJJJJJJJJJJJJJFFJJFJJJFJJFJ<JJJJAFJJJJJJJJJJJJJJAJJJFJJJJFJJJJJJJJJJA-FJFJJAJJJJJJ<JJFJA7FF7JJJJJJAFAJ7A

I will test it out using P12002_114. If it works, I will redo all samples.
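Applying the fix to every trimmed pair could look something like this (a sketch; the 02-TRIM file naming is taken from the array script above, and the -fix suffix is my own):

```shell
# Sketch: rewrite headers for all trimmed pairs so mates end in /1 and /2.
for R1 in 02-TRIM/*_R1-trim.fastq.gz; do
    [ -e "$R1" ] || continue          # skip if the glob matched nothing
    R2="${R1/_R1-trim/_R2-trim}"
    zcat "$R1" | sed 's/ 1.*/\/1/' | gzip > "${R1%.fastq.gz}-fix.fastq.gz"
    zcat "$R2" | sed 's/ 2.*/\/2/' | gzip > "${R2%.fastq.gz}-fix.fastq.gz"
done
```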

CONCLUSION and NEW APPROACH

Even with fixed seqids it did not work; we get the same results. Furthermore, Rachel says the contamination should be around 90% based on previous studies. Hence, we will use Kraken2 to determine which reads are host-derived.
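A sketch of what the Kraken2 run might look like (untested; the database path and output names are placeholders, and the flags are from the Kraken2 manual):

```shell
# Sketch: classify the trimmed pairs with Kraken2 instead of mapping-based
# decontamination. KRAKEN_DB is a placeholder path.
KRAKEN_DB="/path/to/kraken2_db"
kraken2 \
    --db "$KRAKEN_DB" \
    --threads 16 \
    --paired \
    --gzip-compressed \
    --report "${pop}_${sample}.k2report" \
    --output "${pop}_${sample}.k2out" \
    "$R1" "$R2"
```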