ndreey commented 1 year ago

Binning with CONCOCT

What makes CONCOCT an unsupervised binner is that it bins contigs based on characteristics of the sequences themselves (sequence composition) and on coverage information, rather than on reference genomes or taxonomic labels. Tetranucleotide frequencies (TNFs) are how CONCOCT measures sequence composition, while the coverage of a contig reflects how many reads map to it.

CONCOCT requires two input files: the assembly (assembly.fa) and a coverage table.

Assembly: FASTA file containing all the contigs from assembling the PE reads.
Coverage table: A tab-separated file that gives the coverage of each contig, derived from the reads that mapped to it.

The pipeline

Map reads to the assembly using bowtie2 to generate a mapping.bam file.
Use samtools to sort and index the BAM file to generate sorted_map.bam and its index sorted_map.bam.bai
Run the cut_up_fasta.py module from concoct using the assembly.fa to generate sub_contigs.fa and sub_contigs.bed
Run concoct_coverage_table.py using sub_contigs.bed and sorted_map.bam (with its .bai index alongside) to generate coverage_table.tsv
Bin the contigs using concoct to generate a tsv file with these headings
- @@SEQUENCEID BINID TAXID _LENGTH
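
The cut-up and coverage-table steps above can be sketched roughly as below. The flags follow the usage shown in the CONCOCT documentation, but the 10 kb chunk size and the function name `make_coverage_table` are assumptions for illustration, not settings confirmed in this thread.

```shell
# Sketch of the cut-up and coverage-table steps. Nothing runs until the function is called.
# Assumes cut_up_fasta.py and concoct_coverage_table.py (shipped with CONCOCT) are on PATH.
make_coverage_table() {
    assembly=$1      # e.g. assembly.fa
    sorted_bam=$2    # e.g. sorted_map.bam, with sorted_map.bam.bai next to it

    # Cut contigs into 10 kb chunks (chunk size is an assumed value);
    # --merge_last appends the final short piece to the preceding chunk.
    cut_up_fasta.py "$assembly" -c 10000 -o 0 --merge_last \
        -b sub_contigs.bed > sub_contigs.fa || return 1

    # Per-chunk coverage, read from the sorted and indexed BAM.
    concoct_coverage_table.py sub_contigs.bed "$sorted_bam" > coverage_table.tsv
}
```

Usage would be something like `make_coverage_table assembly.fa sorted_map.bam`, run in the directory where the outputs should land.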

Bowtie2

Bowtie2 is a fast and memory-efficient tool for aligning reads to reference sequences. It achieves this by building an index of the reference (an FM-index) that is based on the Burrows-Wheeler Transform (BWT). In simple terms, an index is a data structure that allows faster retrieval by recording where data is located in a larger data set, much like the index of a course book, which lists each keyword in alphabetical order together with the pages where it appears. So instead of going through the whole book to find info on topoisomerase, one can use the index to find exactly where the word is mentioned.

With bowtie2-build, an index of the reference is created that is then used with bowtie2 to generate Sequence Alignment/Map (SAM) files so that we can get an idea of the coverage for each contig. To save space and time, we will convert the SAM to BAM, a compressed binary format, and then use SAMtools to generate an index file for faster access.
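
A minimal sketch of this mapping step, assuming bowtie2, bowtie2-build and samtools are on PATH. The index prefix `assembly_index`, the function name `map_reads` and the read file names in the comments are hypothetical; the default of 4 threads mirrors the CPU count in the runtime table further down.

```shell
# Sketch of the mapping step. Nothing runs until the function is called.
map_reads() {
    assembly=$1       # e.g. assembly.fa
    r1=$2             # forward reads (hypothetical name, e.g. reads_R1.fq.gz)
    r2=$3             # reverse reads (hypothetical name, e.g. reads_R2.fq.gz)
    threads=${4:-4}

    # Build the index from the assembly; "assembly_index" is an assumed prefix.
    bowtie2-build "$assembly" assembly_index || return 1

    # Align the paired-end reads and compress the SAM stream straight to BAM,
    # so the uncompressed SAM never has to be written to disk.
    bowtie2 -p "$threads" -x assembly_index -1 "$r1" -2 "$r2" \
        | samtools view -b -o mapping.bam -
}
```

Piping into `samtools view -b` is a design choice to skip the intermediate SAM file; writing the SAM with `-S` and converting afterwards should give the same BAM.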

SAMtools

SAMtools is one of the most referenced bioinformatics tools. It is a software suite built for analyzing and manipulating SAM and BAM files.

Some of the most commonly used SAMtools commands include:

samtools view: Converts SAM/BAM files to other formats or filters the data based on various criteria.
samtools sort: Sorts BAM files by reference position.
samtools index: Creates an index file for a sorted BAM file, which allows for faster random access to specific genomic regions.
samtools flagstat: Generates summary statistics about the alignment quality of a BAM file, such as the number of aligned and unaligned reads, and the percentage of properly paired reads.
samtools mpileup: Generates a summary of the read coverage at each genomic position in a BAM file.

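We will use SAMtools to sort and index the BAM file generated from the bowtie2 output. The BAM file has to be sorted by reference position before it can be indexed, and the index is what allows fast access to the alignments of a given contig.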
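
As a rough sketch, the sort and index step could look like this, assuming samtools is on PATH. The function name `sort_and_index` is hypothetical, and the `flagstat` call is an optional sanity check rather than something the pipeline requires.

```shell
# Sketch of the sort/index step. Nothing runs until the function is called.
sort_and_index() {
    bam=$1            # e.g. mapping.bam
    threads=${2:-4}

    # Sort alignments by reference position; indexing needs a coordinate-sorted BAM.
    samtools sort -@ "$threads" -o sorted_map.bam "$bam" || return 1

    # Writes sorted_map.bam.bai next to the sorted BAM.
    samtools index sorted_map.bam || return 1

    # Summary of mapped and properly paired reads, useful as a sanity check.
    samtools flagstat sorted_map.bam
}
```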

CONCOCT

As mentioned beforehand, CONCOCT is an unsupervised binner based on composition and coverage. The coverage_table.tsv will be used to cluster the contigs based on coverage. Apart from repetitive regions and ambiguous alignments, which can inflate the coverage of a contig, the coverage of a contig follows the abundance of the species it comes from. Contigs from the same genome should therefore have similar coverage depth, so clusters based on coverage depth can be acquired.
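
To make "coverage depth" concrete, here is a small sketch that averages per-position depth for each contig. It assumes the three-column layout (contig, position, depth) that `samtools depth` prints; the function name `mean_depth` and the toy values are made up, and this is not how concoct_coverage_table.py is implemented.

```shell
# Mean depth per contig from three-column input: contig, position, depth.
# Only positions present in the input are counted, so `samtools depth -a`
# (which also reports zero-depth positions) is needed for a true per-contig mean.
mean_depth() {
    awk -F'\t' '
        { sum[$1] += $3; n[$1]++ }
        END { for (c in sum) printf "%s\t%.2f\n", c, sum[c] / n[c] }
    ' "$@" | sort
}

# Toy input (made-up values): contig_1 is covered roughly ten times deeper than contig_2.
printf 'contig_1\t1\t40\ncontig_1\t2\t44\ncontig_2\t1\t4\ncontig_2\t2\t4\n' | mean_depth
# prints:
# contig_1	42.00
# contig_2	4.00
```

Two contigs with depths this far apart would most likely come from species with different abundances, which is the signal the coverage-based clustering relies on.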

The sub_contigs.fa will be used to refine the clustering using the TNFs of the assembly. It has been found that species have a codon bias, which is an evolutionary artifact as well as a way to control gene expression. This means that genomes have distinct TNFs, where greater phylogenetic distances correlate with greater differences in TNFs. Thus, CONCOCT groups contigs that have similar TNF profiles, using a Gaussian Mixture Model (GMM) to model the distribution of TNFs across contigs.
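
As an illustration of what a TNF profile is, the sketch below counts overlapping 4-mers in a single sequence. It is a simplified stand-in, not CONCOCT's own implementation: it does not merge reverse complements or handle ambiguous bases, and the function name `tnf` is hypothetical.

```shell
# Count overlapping 4-mers (tetranucleotides) in one sequence and print
# each observed 4-mer with its relative frequency.
tnf() {
    printf '%s\n' "$1" | awk '
        {
            seq = toupper($0)
            total = length(seq) - 3      # number of overlapping 4-mer windows
            for (i = 1; i <= total; i++) count[substr(seq, i, 4)]++
        }
        END { for (k in count) printf "%s\t%.3f\n", k, count[k] / total }
    ' | sort
}

tnf ACGTACGTAC
# prints:
# ACGT	0.286
# CGTA	0.286
# GTAC	0.286
# TACG	0.143
```

Each contig ends up as a vector of such frequencies, and contigs whose vectors are close together are candidates for the same bin.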

The combination of these factors will result in different bins representing different genomes (hopefully).

concoct --composition_file sub_contigs.fa --coverage_file coverage_table.tsv --basename concoct_bin/
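
Since the clustering is done on the cut-up sub-contigs, the CONCOCT documentation follows the command above with two helper scripts that merge the result back onto the original contigs and write one FASTA per bin. A sketch of that follow-up, assuming the default output name clustering_gt1000.csv (it changes with the length threshold) and a hypothetical function name `finish_binning`:

```shell
# Sketch of the post-clustering steps. Nothing runs until the function is called.
# Assumes merge_cutup_clustering.py and extract_fasta_bins.py (shipped with CONCOCT) are on PATH.
finish_binning() {
    assembly=$1               # original, uncut assembly, e.g. assembly.fa
    outdir=${2:-concoct_bin}  # directory used as --basename above

    # Map the sub-contig clustering back onto the original contigs.
    merge_cutup_clustering.py "$outdir/clustering_gt1000.csv" \
        > "$outdir/clustering_merged.csv" || return 1

    # Write one FASTA file per bin.
    mkdir -p "$outdir/fasta_bins" || return 1
    extract_fasta_bins.py "$assembly" "$outdir/clustering_merged.csv" \
        --output_path "$outdir/fasta_bins"
}
```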

ndreey commented 1 year ago

Because of the low simulated sequencing depth (1Gb) we are not achieving high alignment rates. Even the GSA is only reaching 16.29%...

GSA assembly: 16.29% alignment
Meta-sens assembly: 3.29% alignment

That is part of the problem, but I also ran CONCOCT incorrectly; I am now getting initial alignment rates of 57%.

ndreey commented 1 year ago

Getting this error when trying to run everything at once on Mjolnir

[bam_sort_core] merging from 0 files and 4 in-memory blocks...
samtools: /maps/projects/mjolnir1/apps/conda/concoct-1.1.0/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /maps/projects/mjolnir1/apps/conda/concoct-1.1.0/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /maps/projects/mjolnir1/apps/conda/concoct-1.1.0/bin/../lib/libncursesw.so.6: no version information available (required by samtools)

ndreey commented 1 year ago

Runtime info for sample 02

Run	CPU	GB	TIME
bowtie2	4	6	00:09:19
samtools	4	6	00:00:51
concoct	4	6	00:00:51

ndreey commented 1 year ago

Because of the issue with the host genome FASTA structure (#65) I have re-run CONCOCT, and the overall alignment rate is now 95-99%. Overall, it took 8 hrs, and samples 06, 07 and 090 failed because of "duplicates in header of SAM file".
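
One way to look into the failed samples, assuming the error refers to sequence names that appear more than once among the @SQ lines of the header (the thread does not confirm this). The function name `duplicate_sq_names` and the toy header are made up for illustration.

```shell
# Print every sequence name that occurs more than once among the @SQ lines
# of a SAM header read from stdin (or from files given as arguments).
duplicate_sq_names() {
    awk -F'\t' '
        $1 == "@SQ" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SN:/ && ++seen[substr($i, 4)] == 2) print substr($i, 4)
        }
    ' "$@"
}

# Typical use, assuming samtools is on PATH:
#   samtools view -H sorted_map.bam | duplicate_sq_names

# Toy header (made-up names): contig_1 is listed twice.
printf '@HD\tVN:1.6\n@SQ\tSN:contig_1\tLN:500\n@SQ\tSN:contig_2\tLN:300\n@SQ\tSN:contig_1\tLN:500\n' \
    | duplicate_sq_names
# prints: contig_1
```

If this prints anything for a failed sample, the duplicated names would point back at the FASTA headers of the reference used for mapping.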

ndreey / ghost-magnet

Binning: CONCOCT #62
