ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License
1 stars 0 forks source link

blast-a-bin #71

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Panic at the disco

The initial plan was to use AMBER to evaluate how well the post-sequencing host depletion was, but it failed as AMBER requires a gold standard binning mapping file. Therefore, I decided to put all my eggs in one basket and focus on the metaQUAST results on the assembly to see if the method had significance. However, I had missed that the --min-contig defaults to 500, meaning a lot of my data would be discarded (median = ~450). Especially when the median decreased significantly towards the higher HC levels. This failure resulted in an inefficient amount of data to determine anything. Furthermore, it dawned on me that the assembly metrics don't really tell if the method worked or not. The initial results really only confirmed how sequencing depth affects assemblies.

Looks like binning is back on the menu boys!

As I really want to see if the method is valid or not, I dropped the assembly part and put all my remaining time into solving a way to evaluate the binning.

Programs

Running the bins through a classifier like Kraken2, Centrifuge, Kaiju or EukRep required large databases and time to get it set up on the HPC. Initial tries failed as I could not get the path to the already installed databases on the HPC to work. On my local PC, I tried with EukRep, but it was unsuccessful as the contigs were too short or did not cover enough markers because of the low sequencing depth. Therefore, classification was dropped.

The completeness of the bins can be determined through various programs. I gave BUSCO, CheckM2, and EukCC a try but received the same results. Not enough core genes or specific HMMs were not met. CheckM2 did, however managed to give data regarding completeness and contamination for each bin. I would need to verify the results, however, or read more about the algorithm as it is based on machine learning. This can still be of interest and seems to be the best solution using already-built programs.

blast-A-bin

A last-ditch effort was made by blasting random bins using the NCBI-BLASTn web tool. Some were too short, while others actually came back with results that seemed to be correct. One contig gave back a CDS for Tulasnella spp, and another gave for a Protobacteria... something. Manually pasting bins and evaluating which results fit best was not realistic. Therefore, I looked into running BLASTn locally with a custom database. Eureeka, it existed, the beautiful toolkit BLAST+.

Building a mock database

As all reads were made in silico using 100 specific genomes, i could create a database only comprising of these genomes to BLASTn against. I first concatenated all genomes into one fasta file called reference.fa. I then used the BLAST+ command makeblastdb -in reference.fa -dbtype nucl -parse_seqids -out mockdb to generate a database.

Script

I then created a bash script that generates a .tsv holding each replicant HC samples bin stats in long table format. The headings are: qseqid sseqid pident evalue bitscore bin

Example from X00_blast.tsv

tig00000014_14_from_2639580_to_2640373_total_794    tig00000014_14  100.000 0.0 1467    b114.fa
tig00000014_14_from_3037225_to_3037967_total_743    tig00000014_14  100.000 0.0 1373    b114.fa
tig00000017_13_from_1432296_to_1433056_total_761    tig00000017_13  100.000 0.0 1406    b114.fa
tig00000017_13_from_1567194_to_1567925_total_732    tig00000017_13  100.000 0.0 1352    b114.fa
tig00000017_13_from_2062545_to_2063289_total_745    tig00000017_13  100.000 0.0 1376    b114.fa
tig00000017_13_from_2062545_to_2063289_total_745    tig00000159_9   85.503  0.0 774 b114.fa
tig00000017_13_from_2062545_to_2063289_total_745    tig00000017_3   79.922  9.47e-151   540 b114.fa
tig00000017_13_from_2062545_to_2063289_total_745    tig00019122 79.400  4.44e-144   518 b114.fa
tig00000017_13_from_2062545_to_2063289_total_745    tig00000004_11  78.285  1.64e-123   449 b114.fa

It seems that BLASTn can align the same bin contig against the same reference sequence but with a lesser score. My guess is that it frameshifts the sequence. This causes the .tsv to be bloated. Ideally, I would want it to take first, but I could not figure out how, so this will be dealt in post-processing downstream. Through some subsampling i decided to set the evalue threshold to 1e-100 to mitigate the bloating but still get solid results.

~  home / blast_arena ~ bash blast-a-bin.sh
 _   _         _       _____     _   _
| |_| |___ ___| |_ ___|  _  |___| |_|_|___
| . | | .'|_ -|  _|___|     |___| . | |   |
|___|_|__,|___|_|     |__|__|   |___|_|_|_|
Sun May 14 11:31:37 CEST 2023      ndreey

######################################################
################# blast-a-bin ########################
------------------------------------------------------
Sun May 14 11:31:37 CEST 2023    Accessed: goldst samples
Sun May 14 11:31:37 CEST 2023    Sample: X00 ENGAGED
######################################################

Sun May 14 11:31:37 CEST 2023        BLASTn of goldst/X00/0.fa
Sun May 14 11:31:37 CEST 2023        Writing results to goldst/X00_blast.tsv
Sun May 14 11:31:37 CEST 2023        BLASTn of goldst/X00/0.fa COMPLETE

Sun May 14 11:31:37 CEST 2023        BLASTn of goldst/X00/1.fa
Sun May 14 11:31:38 CEST 2023        Writing results to goldst/X00_blast.tsv
Sun May 14 11:31:38 CEST 2023        BLASTn of goldst/X00/1.fa COMPLETE
ndreey commented 1 year ago

blast-A-bin script

#!/bin/bash

echo " _   _         _       _____     _   _"
echo "| |_| |___ ___| |_ ___|  _  |___| |_|_|___ "
echo "| . | | .'|_ -|  _|___|     |___| . | |   |"
echo "|___|_|__,|___|_|     |__|__|   |___|_|_|_|"
echo "$(date)      ndreey"
echo ""
echo ""

# The different replicants
for repli in "goldst" "raw" "x1000c" "x700c" "maxbin"; do

    for i in {1..11}
    do

        # Different host-contamination level
        hc_level=("00" "01" "02" "03" "04" "05" "06" "07" "08" "090" "095")

        # Define prefix based on array id
        hc_prefix=${hc_level[$i-1]}

        # Set the path to the bins
        binpath=/home/andbo/concoct_arena/${repli}/${hc_prefix}_concoct/fasta_bins

        if [ ! -d "${repli}" ]; then
            mkdir "${repli}"
        fi

        touch ${repli}/X${hc_prefix}_blast.tsv
        echo "######################################################"
        echo "################# blast-a-bin ########################"
        echo "------------------------------------------------------"
        echo "$(date)    Accessed: ${repli} samples"
        echo "$(date)    Sample: X${hc_prefix} ENGAGED"
        echo "######################################################"
        echo ""

        for bin in $(ls ${binpath}); do
            echo ""
            echo "$(date)        BLASTn of ${repli}/X${hc_prefix}/${bin}"

            blastn -db database/mockdb \
                -query ${binpath}/${bin} \
                -outfmt "6 qseqid sseqid pident evalue bitscore" \
                -out tmp/tmp.tsv \
                -evalue 1e-100 \
                -num_threads 4

            # Adding the bin variable
            awk -v bin=b${bin} 'BEGIN {OFS="\t"} {$6=bin; print}' tmp/tmp.tsv > tmp/tmp2.tsv

            # Adding to the sample summary file
            echo "$(date)        Writing results to ${repli}/X${hc_prefix}_blast.tsv"        
            cat tmp/tmp2.tsv >> ${repli}/X${hc_prefix}_blast.tsv
            rm tmp/*

            echo "$(date)        BLASTn of ${repli}/X${hc_prefix}/${bin} COMPLETE"
            echo ""               
        done
        echo "######################################################"
        echo "################# blast-a-bin ########################"
        echo "------------------------------------------------------"
        echo "$(date)    Sample: X${hc_prefix} COMPLETE"
        echo "######################################################"
        echo ""
    done
        echo "$(date)    ${repli} samples COMPLETE"
        echo ""
done

echo " _   _         _       _____     _   _"
echo "| |_| |___ ___| |_ ___|  _  |___| |_|_|___ "
echo "| . | | .'|_ -|  _|___|     |___| . | |   |"
echo "|___|_|__,|___|_|     |__|__|   |___|_|_|_|"
echo "$(date)                   ndreey"
echo "                DONE"
ndreey commented 1 year ago

Downstream Analysis

Each reference genome has a unique >sequence_header. I can therefore acquire the most abundant genome for each bin using grep on the header, determine which header was most abundant then map that to the genome. This will be done in R.

In the end, I want to measure

image

ndreey commented 1 year ago

INITIAL late night RESULTS

Each bin got assigned a group based on how many times it aligned against a reference genome and the score.

Proportion of bins that belong to group (comboDB) proportion_combodb

Proportion of bins that belong to group (endoDB) proportion_nohostdb

Number of endophytic bins for each sample counts_line_comboDB

ndreey commented 1 year ago

Correcting X00

Because there was nothing to filter in X00 samples, x1000c, x700c, and maxbin use the X00 bins from raw (x1000c uses X00 and X01). However, for some reason, I had used the raw 1000c bins instead of raw 500c. Resulting in raw X00 differing from the others. This is now corrected.

image

image

image