Open ndreey opened 1 year ago
#!/bin/bash
echo " _ _ _ _____ _ _"
echo "| |_| |___ ___| |_ ___| _ |___| |_|_|___ "
echo "| . | | .'|_ -| _|___| |___| . | | |"
echo "|___|_|__,|___|_| |__|__| |___|_|_|_|"
echo "$(date) ndreey"
echo ""
echo ""
# The different replicants
for repli in "goldst" "raw" "x1000c" "x700c" "maxbin"; do
for i in {1..11}
do
# Different host-contamination level
hc_level=("00" "01" "02" "03" "04" "05" "06" "07" "08" "090" "095")
# Define prefix based on array id
hc_prefix=${hc_level[$i-1]}
# Set the path to the bins
binpath=/home/andbo/concoct_arena/${repli}/${hc_prefix}_concoct/fasta_bins
if [ ! -d "${repli}" ]; then
mkdir "${repli}"
fi
touch ${repli}/X${hc_prefix}_blast.tsv
echo "######################################################"
echo "################# blast-a-bin ########################"
echo "------------------------------------------------------"
echo "$(date) Accessed: ${repli} samples"
echo "$(date) Sample: X${hc_prefix} ENGAGED"
echo "######################################################"
echo ""
for bin in $(ls ${binpath}); do
echo ""
echo "$(date) BLASTn of ${repli}/X${hc_prefix}/${bin}"
blastn -db database/mockdb \
-query ${binpath}/${bin} \
-outfmt "6 qseqid sseqid pident evalue bitscore" \
-out tmp/tmp.tsv \
-evalue 1e-100 \
-num_threads 4
# Adding the bin variable
awk -v bin=b${bin} 'BEGIN {OFS="\t"} {$6=bin; print}' tmp/tmp.tsv > tmp/tmp2.tsv
# Adding to the sample summary file
echo "$(date) Writing results to ${repli}/X${hc_prefix}_blast.tsv"
cat tmp/tmp2.tsv >> ${repli}/X${hc_prefix}_blast.tsv
rm tmp/*
echo "$(date) BLASTn of ${repli}/X${hc_prefix}/${bin} COMPLETE"
echo ""
done
echo "######################################################"
echo "################# blast-a-bin ########################"
echo "------------------------------------------------------"
echo "$(date) Sample: X${hc_prefix} COMPLETE"
echo "######################################################"
echo ""
done
echo "$(date) ${repli} samples COMPLETE"
echo ""
done
echo " _ _ _ _____ _ _"
echo "| |_| |___ ___| |_ ___| _ |___| |_|_|___ "
echo "| . | | .'|_ -| _|___| |___| . | | |"
echo "|___|_|__,|___|_| |__|__| |___|_|_|_|"
echo "$(date) ndreey"
echo " DONE"
Each reference genome has a unique >sequence_header
. I can therefore acquire the most abundant genome for each bin using grep
on the header, determine which header was most abundant then map that to the genome. This will be done in R
.
In the end, I want to measure
raw vs maxbin
Each bin got assigned a group based on how many times it aligned against a reference genome and the score.
Proportion of bins that belong to group (comboDB)
Proportion of bins that belong to group (endoDB)
Number of endophytic bins for each sample
Because there was nothing to filter in X00 samples, x1000c, x700c, and maxbin use the X00 bins from raw (x1000c uses X00 and X01). However, for some reason, I had used the raw 1000c bins instead of raw 500c. Resulting in raw X00 differing from the others. This is now corrected.
Panic at the disco
The initial plan was to use AMBER to evaluate how well the post-sequencing host depletion was, but it failed as AMBER requires a gold standard binning mapping file. Therefore, I decided to put all my eggs in one basket and focus on the metaQUAST results on the assembly to see if the method had significance. However, I had missed that the
--min-contig
defaults to 500, meaning a lot of my data would be discarded (median = ~450). Especially when the median decreased significantly towards the higher HC levels. This failure resulted in an inefficient amount of data to determine anything. Furthermore, it dawned on me that the assembly metrics don't really tell if the method worked or not. The initial results really only confirmed how sequencing depth affects assemblies.Looks like binning is back on the menu boys!
As I really want to see if the method is valid or not, I dropped the assembly part and put all my remaining time into solving a way to evaluate the binning.
Programs
Running the bins through a classifier like Kraken2, Centrifuge, Kaiju or EukRep required large databases and time to get it set up on the HPC. Initial tries failed as I could not get the path to the already installed databases on the HPC to work. On my local PC, I tried with EukRep, but it was unsuccessful as the contigs were too short or did not cover enough markers because of the low sequencing depth. Therefore, classification was dropped.
The completeness of the bins can be determined through various programs. I gave BUSCO, CheckM2, and EukCC a try but received the same results. Not enough core genes or specific HMMs were not met. CheckM2 did, however managed to give data regarding completeness and contamination for each bin. I would need to verify the results, however, or read more about the algorithm as it is based on machine learning. This can still be of interest and seems to be the best solution using already-built programs.
blast-A-bin
A last-ditch effort was made by blasting random bins using the NCBI-BLASTn web tool. Some were too short, while others actually came back with results that seemed to be correct. One contig gave back a CDS for Tulasnella spp, and another gave for a Protobacteria... something. Manually pasting bins and evaluating which results fit best was not realistic. Therefore, I looked into running BLASTn locally with a custom database. Eureeka, it existed, the beautiful toolkit BLAST+.
Building a mock database
As all reads were made in silico using 100 specific genomes, i could create a database only comprising of these genomes to BLASTn against. I first concatenated all genomes into one fasta file called
reference.fa
. I then used the BLAST+ commandmakeblastdb -in reference.fa -dbtype nucl -parse_seqids -out mockdb
to generate a database.Script
I then created a bash script that generates a
.tsv
holding each replicant HC samples bin stats in long table format. The headings are:qseqid sseqid pident evalue bitscore bin
Example from
X00_blast.tsv
It seems that BLASTn can align the same bin contig against the same reference sequence but with a lesser score. My guess is that it frameshifts the sequence. This causes the
.tsv
to be bloated. Ideally, I would want it to take first, but I could not figure out how, so this will be dealt in post-processing downstream. Through some subsampling i decided to set theevalue
threshold to1e-100
to mitigate the bloating but still get solid results.