Using 03_removedandThenMaskedMicrobialContamination_reference_genome.fasta as reference for Tconura which was found at: ../../results/00_refGenome_pt_042/refGenome_microbFiltering/
A C G T N IUPAC Other GC GC_stdev
0.3198 0.1802 0.1803 0.3196 0.0134 0.0000 0.0000 0.3606 0.0295
Main genome scaffold total: 2563
Main genome contig total: 640009
Main genome scaffold sequence total: 1946.195 MB
Main genome contig sequence total: 1920.204 MB 1.335% gap
Main genome scaffold N/L50: 319/1.742 MB
Main genome contig N/L50: 89532/6.455 KB
Main genome scaffold N/L90: 1189/398.914 KB
Main genome contig N/L90: 302349/1.761 KB
Max scaffold length: 8.968 MB
Max contig length: 221.479 KB
Number of scaffolds > 50 KB: 2231
% main genome in scaffolds > 50 KB: 99.41%
Minimum Number Number Total Total Scaffold
Scaffold of of Scaffold Contig Contig
Length Scaffolds Contigs Length Length Coverage
-------- -------------- -------------- -------------- -------------- --------
All 2,563 640,009 1,946,194,716 1,920,204,323 98.66%
5 KB 2,563 640,009 1,946,194,716 1,920,204,323 98.66%
10 KB 2,562 640,006 1,946,187,472 1,920,197,421 98.66%
25 KB 2,515 639,525 1,945,239,894 1,919,272,300 98.67%
50 KB 2,231 628,381 1,934,714,323 1,909,285,595 98.69%
100 KB 1,920 612,641 1,912,895,667 1,888,172,579 98.71%
250 KB 1,452 572,562 1,835,768,641 1,812,795,694 98.75%
500 KB 1,051 514,160 1,690,623,403 1,670,101,324 98.79%
1 MB 624 411,623 1,378,771,933 1,362,408,122 98.81%
2.5 MB 179 200,851 683,668,414 675,737,218 98.84%
5 MB 31 57,855 193,639,010 191,346,369 98.82%
The reference was copied and renamed to data/Tconura_reference_genome/Tconura_ref.fasta in the working directory of CONURA_WGS
The goal is to decontaminate the reads that map to the reference genome using KneadData v0.12.0.
Indexing the reference
KneadData utilizes different programs for the different steps. Here, we will skip the trimming part of KneadData and only use their existing scripts to sort the alignments. First step is to build a index for the reference. This was done in an interactive session.
# Move to reference genome directory
cd data/Tconura_reference_genome/
# Index
bowtie2-build --threads 12 Tconura_ref.fasta Tconura_ref
# Move back to working directory for the analysis
cd ../..
Constructing new metadata file.
Originally we had all.pops.metadata.tsv and over time we have acquired allScots.Tcon.txt and samples_to_use.txt.
By using the sample_id's from samples_to_use.txt i map and extend the values from the other two files using R. Note, i removed the tabs from the .tsv file and only used the sample id's from samples_to_use.txt. Furthermore, i extended allScots.Tcon.txt with column values matching all.pops.metadata.tsv manually. In the end, i could just provide this metadata file. Meaning, i can probably edit the merging and trimming scripts using this file as the "base".
Either way, we now have one file holding info of all the samples that are not hybrids so we can easily group samples based on hostplant or population.
library(tidyverse)
# Load in data
trimmed <- read.table("trimmed_reads_meta.txt", header=TRUE, sep=",")
scot <- read.table("allScots.Tcon.txt", header=TRUE, sep="\t")
all <- read.table("all.pops.metadata.csv", header=TRUE, sep=",")
# Combine scots and all
df_combined <- bind_rows(scot, all)
# Map and join by sample_id.
df_trim <- trimmed %>%
inner_join(df_combined, by="sample_id")
# Write to csv file
write.csv(df_trim, "metadata-no-hybrids.csv", row.names=FALSE, quote=FALSE)
KneadData will use the reference to align the trimmed reads and determine which are "host contaminants", producing what they call clean reads. Overall, the program takes ca 3 hours per read. I ran it as an array.
#!/bin/bash
#SBATCH --job-name KneadData
#SBATCH --array=1-9%3
#SBATCH -A naiss2023-22-412
#SBATCH -p core -n 16
#SBATCH -t 03:35:00
#SBATCH --output=SLURM-%j-kneaddata.out
#SBATCH --error=SLURM-%j-kneaddata.err
#SBATCH --mail-user=andbou95@gmail.com
#SBATCH --mail-type=ALL
# Start time and date
echo "$(date) [Start]"
# Load in modules
module load bioinfo-tools
module load KneadData/0.12.0
# Create directory for decontaminated reads
outdir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/03-DECON"
if [ ! -d "$outdir" ]; then
mkdir -p "$outdir"
fi
# Path to trimmed reads and reference database of Tconura.
trim_dir="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/02-TRIM"
DB_REF="/crex/proj/snic2020-6-222/Projects/Tconura/working/Andre/CONURA_WGS/data/Tconura_reference_genome/Tconura_ref"
# SLURM array jobid
jobid=${SLURM_ARRAY_TASK_ID}
# Read the line corresponding to the SLURM array task ID from CHST_samples.txt
IFS="," read -r sample pop hp geo range < <(sed -n "${jobid}p" doc/CHST_samples.txt)
# R1 and R2 input
R1="${trim_dir}/${sample}_R1-trim.fastq.gz"
R2="${trim_dir}/${sample}_R2-trim.fastq.gz"
# Create specific folder for sample.
out="$outdir/${pop}_${sample}"
mkdir -p $out
# Host decontamination without trimming or removal of tandem repeats.
kneaddata \
--verbose \
--bypass-trim \
--bypass-trf \
--threads 16 \
--input1 $R1 \
--input2 $R2 \
--output $out \
--reference-db $DB_REF \
--output-prefix "${pop}_${sample}"
# End time and date
echo "$(date) [End]"
There seems to be an issue though, KneadData cannot handle spaces in the sequencer id's. This is something Illumina apparently implemented not too long ago. Which leads KneadData to not utilize their paired algorithms to determine contamination.
Trimmed read which has space between ..1608 and 1:N:0...:
@ST-E00214:275:HWLHMCCXY:6:1101:6502:1608 1:N:0:NCCGGAGA+NAAGATTA
Nevertheless, we can get an idea of the host contamination.
P12002_114:
Trimmed has 30169363 reads, 29905591 aligned to the genome. Meaning 0.99% is host contamination. Leaving us to 263772 reads to work with from read 1. The last two calls shows that KneadData clearly cant handle the spaces as the pair clearly exists. The end #0/1 is from KneadData.
# Number of trimmed reads in R1
andbou@r483: (assembly) CONURA_WGS: zcat 02-TRIM/P12002_114_R1-trim.fastq.gz | grep "^@" | wc -l
30169363
# Number of reads aligning to genome
andbou@r483: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_Tconura_ref_bowtie2_unmatched_1_contam.fastq | grep "^@" | wc -l
29905591
# Number of reads not aligning to genome
andbou@r483: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_unmatched_1.fastq | grep "^@" | wc -l
263772
# Unmatched read from input 1
andbou@rackham3: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_unmatched_1.fastq | grep "^@" | head -n 1
@ST-E00214:275:HWLHMCCXY:6:1101:5842:1801.1:N:0:NCCGGAGA+NAAGATTA#0/1
# Unmatched read from input 2.
andbou@rackham3: (assembly) CONURA_WGS: cat 03-DECON/CHST_P12002_114/CHST_P12002_114_unmatched_2.fastq | grep "6:1101:5842:1801"
@ST-E00214:275:HWLHMCCXY:6:1101:5842:1801.2:N:0:NCCGGAGA+NAAGATTA#0/2
SeqID fix
As described by the user, he used this sed command to fix this issue.
# Test file
zcat 02-TRIM/P12002_114_R1-trim.fastq.gz | head -n 8 > testR1.fastq
# Replace the space so it just ends with /1.
sed 's/ 1.*/\/1/g' < testR1.fastq > new.textR1.fastq
I will test it out using one P12002_114. If it works, i will redo all.
CONCLUSION and NEW APPROACH
Even with fixed seqids it did not work. We get the same results. Furthermore, Rachel means that the contamination should be 90% from previous studies. Hence we will use Kraken2 to determine which reads.
Using
03_removedandThenMaskedMicrobialContamination_reference_genome.fasta
as reference for Tconura which was found at:../../results/00_refGenome_pt_042/refGenome_microbFiltering/
The reference was copied and renamed to
data/Tconura_reference_genome/Tconura_ref.fasta
in the working directory ofCONURA_WGS
The goal is to decontaminate the reads that map to the reference genome using KneadData v0.12.0.
Indexing the reference
KneadData utilizes different programs for the different steps. Here, we will skip the trimming part of KneadData and only use their existing scripts to sort the alignments. First step is to build a index for the reference. This was done in an interactive session.
Constructing new metadata file.
Originally we had
all.pops.metadata.tsv
and over time we have acquiredallScots.Tcon.txt
andsamples_to_use.txt
. By using the sample_id's fromsamples_to_use.txt
i map and extend the values from the other two files using R. Note, i removed the tabs from the .tsv file and only used the sample id's from samples_to_use.txt. Furthermore, i extendedallScots.Tcon.txt
with column values matchingall.pops.metadata.tsv
manually. In the end, i could just provide this metadata file. Meaning, i can probably edit the merging and trimming scripts using this file as the "base".Either way, we now have one file holding info of all the samples that are not hybrids so we can easily group samples based on hostplant or population.
metadata-no-hybrids.csv
KneadData
KneadData will use the reference to align the trimmed reads and determine which are "host contaminants", producing what they call clean reads. Overall, the program takes ca 3 hours per read. I ran it as an array.
There seems to be an issue though, KneadData cannot handle spaces in the sequencer id's. This is something Illumina apparently implemented not too long ago. Which leads KneadData to not utilize their paired algorithms to determine contamination.
Trimmed read which has space between ..1608 and 1:N:0...:
@ST-E00214:275:HWLHMCCXY:6:1101:6502:1608 1:N:0:NCCGGAGA+NAAGATTA
I will have to edit each header and add a "/" or some other special character. These will be my guides: https://forum.biobakery.org/t/all-paired-end-read-unmatched/2895/8 https://forum.biobakery.org/t/paired-end-data-results-in-unpaired-output/928
Nevertheless, we can get an idea of the host contamination.
P12002_114: Trimmed has 30169363 reads, 29905591 aligned to the genome. Meaning 0.99% is host contamination. Leaving us to 263772 reads to work with from read 1. The last two calls shows that KneadData clearly cant handle the spaces as the pair clearly exists. The end
#0/1
is from KneadData.SeqID fix
As described by the user, he used this sed command to fix this issue.
I will test it out using one P12002_114. If it works, i will redo all.
CONCLUSION and NEW APPROACH
Even with fixed seqids it did not work. We get the same results. Furthermore, Rachel means that the contamination should be 90% from previous studies. Hence we will use Kraken2 to determine which reads.