ndreey opened 1 year ago
Let's take a deeper look into CONCOCT.
```
optional arguments:
  --coverage_file COVERAGE_FILE
                        specify the coverage file, containing a table where
                        each row correspond to a contig, and each column
                        correspond to a sample. The values are the average
                        coverage for this contig in that sample. All values
                        are separated with tabs.
  --composition_file COMPOSITION_FILE
                        specify the composition file, containing sequences in
                        fasta format. It is named the composition file since
                        it is used to calculate the kmer composition (the
                        genomic signature) of each contig.
  -l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
                        specify the sequence length threshold, contigs shorter
                        than this value will not be included. Defaults to
                        1000.
  -r READ_LENGTH, --read_length READ_LENGTH
                        specify read length for coverage, default 100
  -s SEED, --seed SEED  Specify an integer to use as seed for clustering. 0
                        gives a random seed, 1 is the default seed and any
                        other positive integer can be used. Other values give
                        ArgumentTypeError.
```
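The "genomic signature" referred to under `--composition_file` is the k-mer (tetranucleotide by default) frequency profile of each contig. A minimal sketch of how such a profile can be computed — illustrative only, not CONCOCT's actual implementation, which among other things merges reverse complements and applies pseudocounts:

```python
from itertools import product

def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies for a DNA sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other ambiguity codes
            counts[kmer] += 1
            total += 1
    return {km: c / total for km, c in counts.items()} if total else counts

profile = kmer_profile("ACGTACGTACGT")
print(profile["ACGT"])  # → 0.3333333333333333 (3 of the 9 windows in this toy contig)
```

Two contigs from the same genome tend to have similar profiles, which is why this vector can be clustered alongside coverage.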
The `*_coverage_table.tsv` is generated by `concoct_coverage_table.py`, which takes a BED file holding information about the contigs and a BAM file describing how the reads map to the assembly. Thus, the coverage information to be clustered is determined before the CONCOCT run.
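For reference, the tab-separated layout described in the help text (one row per contig, one column per sample) can be reproduced in a few lines. The contig names, sample names, coverage values, and the `cov_mean_` header prefix below are all made up for illustration, not necessarily CONCOCT's exact naming:

```python
import csv
import io

# Hypothetical mean coverages: rows are contigs, columns are samples
coverages = {
    "contig_1": {"sample_A": 12.5, "sample_B": 3.1},
    "contig_2": {"sample_A": 0.4,  "sample_B": 27.9},
}

samples = ["sample_A", "sample_B"]
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
writer.writerow(["contig"] + [f"cov_mean_{s}" for s in samples])
for contig, cov in coverages.items():
    writer.writerow([contig] + [cov[s] for s in samples])
print(buf.getvalue())
```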
```bash
# Build index of assembly
bowtie2-build --threads 6 ${assembly} ${hc_prefix}_index

# Map reads against the assembly
bowtie2 -p 6 -q \
    -x ${hc_prefix}_index \
    -1 output_R1.fastq \
    -2 output_R2.fastq \
    -S ${hc_prefix}_map.sam
echo "$(date) Bowtie2 Disengaged!"

echo "$(date) SAMtools Engaged!"
samtools view -@ 6 -b -S ${hc_prefix}_map.sam > ${hc_prefix}_map.bam
echo "$(date) SAMtools view complete"
samtools sort -@ 6 ${hc_prefix}_map.bam -o ${hc_prefix}_map_sorted.bam
# Index the sorted BAM; concoct_coverage_table.py expects indexed BAM files
samtools index ${hc_prefix}_map_sorted.bam
```
Hence, I should be able to use the filtered GSA to align the filtered reads when creating the BAM file.
The filtered GSA should be used as we now need to cluster based on sequence composition.
As mentioned, setting it to 500 resulted in 390 bins with a mean purity of 99.8%, all belonging to the host, showing that I should have set the benchmark at 500 to create a larger `bin_ref.fa`.
After the trim, R1 and R2 have mean lengths of 90 and 68 bp, respectively. If I understand it correctly, the `-r` option is a way to normalize the data. From the paper:
To make the over- and underestimation more equal, I'll set this to the mean of the means, 79.
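My reading of `-r` (an assumption on my part, not verified against CONCOCT's source) is that it feeds the classic coverage estimate, total mapped bases divided by contig length. A quick sketch with made-up read counts and contig length shows why leaving the default of 100 would inflate coverage when the real mean read length is 79:

```python
def est_coverage(n_reads, read_length, contig_length):
    """Classic per-contig coverage estimate: total mapped bases / contig length."""
    return n_reads * read_length / contig_length

n_reads, contig_length = 500, 10_000          # hypothetical values
true_cov = est_coverage(n_reads, 79, contig_length)      # actual mean read length
default_cov = est_coverage(n_reads, 100, contig_length)  # CONCOCT default -r 100
print(true_cov, default_cov)  # → 3.95 5.0, i.e. the default overestimates by ~27%
```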
I will set the seed to 13371337.
Determine whether a correlation exists between HC level and each metric.
Final approach before data analysis
Objective
The main objective of this study is to demonstrate a correlation between high levels of host contamination (HC) and poor assembly and binning performance. Additionally, the study aims to determine whether host bins can serve as a reference genome to reduce host contamination and improve assembly and binning performance. The proposed method involves assembling and binning the host-contaminated samples, concatenating the resulting host bins to create a reference genome, filtering out reads that map to this reference genome, reassembling the remaining reads, and then binning again. The goal is to evaluate whether this approach is a valid way to improve assembly and binning performance.
Where I will evaluate these metrics:
- Assembly
- Binning
Current status
After determining, installing, and troubleshooting the different programs, there is now a working (though not yet automated) workflow to generate the data.
Assembly
Each sample has been benchmarked using the gold standard assembly (GSA) with metaQUAST. Here, the sequencing depth for the sample and the proportion of microbes have been held constant. The only thing differing is the proportion of HC, which thus reduces the effective sequencing depth for each individual endophyte. I believe enough data has been accumulated to draw statistically significant conclusions on whether a correlation exists.
Binning
A benchmark for each HC level has been set using the trimmed reads, GSA, and gold standard bin map (GSB). However, the benchmark was set using contigs >= 1000 bp; in hindsight, I wonder if I should change this.
The number of contigs with length >= x can be found in the table, and Figure 1 shows the distribution.
Figure 1![image](https://user-images.githubusercontent.com/72344166/236179239-c59e0bcc-acbe-44f4-b234-147b208850b1.png)
The proportion of contigs with a length of 500 bp or less is 92.6%. When I binned using the default of 1000 bp, I discarded more than 99.8% of the data. This illuminates the limitation of the 2 Gb sequencing depth and raises the question of whether I should use a lower contig length threshold.
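The impact of a given threshold can be checked directly from the contig lengths, both by count and by summed base pairs; a small sketch (the lengths below are toy values, not the real assembly):

```python
def discarded_fraction(lengths, threshold):
    """Fraction of contigs below the threshold, by count and by summed bp."""
    short = [l for l in lengths if l < threshold]
    by_count = len(short) / len(lengths)
    by_bp = sum(short) / sum(lengths)
    return by_count, by_bp

# Toy contig lengths illustrating a short-contig-dominated assembly
lengths = [300, 450, 480, 520, 800, 15_000]
by_count, by_bp = discarded_fraction(lengths, 1000)
print(by_count, by_bp)
```

Reporting both fractions matters: an assembly can lose most of its contigs by count while losing a much smaller share of its total sequence.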
Evaluating bin removal
Using CONCOCT with the default threshold (-l 1000) resulted in 69 bins, where P. zijinensis was the most abundant genome. Every bin had a purity of 1.00 except for one at 0.961. These bins were concatenated into one fasta file, `bin_ref.fa`.

Assembly
It would be incorrect to remove the contigs of `bin_ref.fa` from the GSA, as that would not tell me how the assembly performed. Instead, I assembled the raw 095 reads (095_raw) and the filtered 095 reads (095_filt) using MEGAHIT. Utilizing bowtie2, SAMtools, and BEDtools, I could remove the reads aligning to `bin_ref.fa`. With the GSA, 095_raw, and 095_filt, a true comparison can be made. Therefore, perhaps this should be done for samples 06 to 095 (81-98% HC) to see if a trend in performance boost exists.

Binning
As AMBER needs a GSB, I cannot bin the 095_filt assembly because the contig names do not correspond with the GSB. Therefore, a Python script was used to filter away all the contigs belonging to `bin_ref.fa` from the GSA. Now the binned contigs are removed, and I can bin against it using the 095_filt reads... or can I?

Binning
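The Python filtering step mentioned above might look roughly like this — a sketch with toy records standing in for the real GSA and `bin_ref.fa`, not necessarily the script actually used:

```python
def read_fasta(lines):
    """Yield (name, sequence) tuples from an iterable of FASTA lines."""
    name, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(seq)
            name, seq = line[1:].split()[0], []
        elif line:
            seq.append(line)
    if name is not None:
        yield name, "".join(seq)

def filter_gsa(gsa_lines, bin_ref_lines):
    """Keep GSA records whose name does not appear in bin_ref.fa."""
    binned = {name for name, _ in read_fasta(bin_ref_lines)}
    return [(n, s) for n, s in read_fasta(gsa_lines) if n not in binned]

# Toy data standing in for the real files
gsa = ">c1\nAAAA\n>c2\nCCCC\n>c3\nGGGG\n".splitlines()
bin_ref = ">c2\nCCCC\n".splitlines()
kept = filter_gsa(gsa, bin_ref)
print([name for name, _ in kept])  # → ['c1', 'c3']
```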
It was here I noticed that I could not use CONCOCT with the same settings as before: because I had removed the binned +1000 bp contigs, CONCOCT could not determine the coverage. I therefore set `--length_threshold 500`. This resulted in a CONCOCT run that took 22 h 32 min on my 8-core, 16 GB RAM laptop and determined 390 new bins (all belonging to P. zijinensis) with purity ranging from 0.83 to 1 (mean 0.998).

As mentioned, this persuades me to re-evaluate my binning evaluation methodology.