Open ndreey opened 1 year ago
Because of the low simulated sequencing depth (1Gb) we are not achieving high alignment rates. Even the GSA is only reaching 16.29%...
- GSA assembly: 16.29% alignment
- Meta-sens assembly: 3.29% alignment
Getting this error when trying to run everything at once on Mjolnir
[bam_sort_core] merging from 0 files and 4 in-memory blocks...
samtools: /maps/projects/mjolnir1/apps/conda/concoct-1.1.0/bin/../lib/libtinfow.so.6: no version information available (required by samtools)
samtools: /maps/projects/mjolnir1/apps/conda/concoct-1.1.0/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
samtools: /maps/projects/mjolnir1/apps/conda/concoct-1.1.0/bin/../lib/libncursesw.so.6: no version information available (required by samtools)
Runtime info for sample 02
Step | CPUs | Memory (GB) | Time (hh:mm:ss) |
---|---|---|---|
bowtie2 | 4 | 6 | 00:09:19 |
samtools | 4 | 6 | 00:00:51 |
concoct | 4 | 6 | 00:00:51 |
Because of the issue with the host genome FASTA structure (#65), I have re-run CONCOCT; the overall alignment rate is now 95-99%. In total, it took 8 hrs, and samples 06, 07 and 090 failed because of "duplicates in header of SAM file".
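To narrow down the "duplicates in header of SAM file" failures, one option is to look for reference names that occur more than once in the `@SQ` lines of the SAM header. The following is a minimal sketch, assuming the duplicates are repeated `SN:` values; the function name `duplicate_sq_names` is hypothetical and not part of any tool used here.

```python
from collections import Counter


def duplicate_sq_names(sam_lines):
    """Return reference names that occur more than once in @SQ header lines.

    Assumption: `sam_lines` is an iterable of SAM text lines, header first.
    """
    names = []
    for line in sam_lines:
        if not line.startswith("@"):
            break  # header lines come first; stop at the first alignment record
        if line.startswith("@SQ"):
            for field in line.rstrip("\n").split("\t")[1:]:
                if field.startswith("SN:"):
                    names.append(field[3:])
    return sorted(name for name, count in Counter(names).items() if count > 1)


if __name__ == "__main__":
    example_header = [
        "@HD\tVN:1.6\tSO:coordinate\n",
        "@SQ\tSN:contig_1\tLN:1200\n",
        "@SQ\tSN:contig_2\tLN:800\n",
        "@SQ\tSN:contig_1\tLN:1200\n",
    ]
    print(duplicate_sq_names(example_header))
```

If the list is non-empty, the repeated names most likely come from duplicated sequence identifiers in the reference FASTA that the index was built from.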
Binning with CONCOCT
CONCOCT is an unsupervised binner because it bins contigs using only characteristics of the sequences themselves: sequence composition and coverage information. It measures sequence composition with tetranucleotide frequencies (TNFs), while the coverage of a contig refers to the number of reads that map to it.
CONCOCT requires two input files: the `assembly.fa` and a coverage table.
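To make the coverage table less abstract, here is a minimal sketch of how a per-contig coverage value could be derived from alignments. It assumes coverage is summarised as mean depth (aligned bases divided by contig length) and that alignments use half-open, BED-style coordinates; the function name and the input layout are hypothetical and only serve the illustration.

```python
def mean_coverage(contig_lengths, alignments):
    """Mean depth per contig: total aligned bases / contig length.

    contig_lengths: dict mapping contig name -> length in bp
    alignments: iterable of (contig, start, end) with half-open coordinates
    """
    aligned_bases = {name: 0 for name in contig_lengths}
    for contig, start, end in alignments:
        aligned_bases[contig] += end - start
    return {
        name: aligned_bases[name] / length
        for name, length in contig_lengths.items()
    }


if __name__ == "__main__":
    lengths = {"contig_1": 100, "contig_2": 50}
    reads = [("contig_1", 0, 50), ("contig_1", 25, 75), ("contig_2", 0, 25)]
    print(mean_coverage(lengths, reads))
```

Dividing by contig length keeps long and short contigs comparable, which matters when the values are later used for clustering.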
The pipeline
1. `bowtie2` to generate a `mapping.bam` file.
2. `samtools` to sort and index the BAM file to generate `sorted_map.bam.bai`.
3. `cut_up_fasta.py` module from `concoct`, using the `assembly.fa` to generate `sub_contigs.fa` and `sub_contigs.bed`.
4. `concoct_coverage_table.py`, using the `sub_contigs.bed` and `sorted_map.bam.bai` files to generate `coverage_table.tsv`.
5. `concoct` to generate a TSV file with these headings: `@@SEQUENCEID BINID TAXID _LENGTH`
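The steps above can be sketched as a list of command lines assembled in Python. This is only a sketch under several assumptions: paired-end reads, a 10 kb chunk size with no overlap for `cut_up_fasta.py`, the output directory `concoct_run`, and the intermediate file name `mapping.sam` are all placeholders, and the helpers `build_commands` and `run_pipeline` are hypothetical. It stops at the raw `concoct` output and does not produce the final TSV.

```python
import subprocess


def build_commands(assembly, reads_1, reads_2, outdir="concoct_run", threads=4):
    """Return (argv, stdout_path) pairs; stdout_path is None when not redirected."""
    index = f"{outdir}/assembly_index"
    sam = f"{outdir}/mapping.sam"
    bam = f"{outdir}/sorted_map.bam"
    bed = f"{outdir}/sub_contigs.bed"
    sub_fa = f"{outdir}/sub_contigs.fa"
    cov = f"{outdir}/coverage_table.tsv"
    return [
        # Build the index, then align the reads against it.
        (["bowtie2-build", assembly, index], None),
        (["bowtie2", "-p", str(threads), "-x", index,
          "-1", reads_1, "-2", reads_2, "-S", sam], None),
        # Sort by reference position and index the sorted BAM.
        (["samtools", "sort", "-@", str(threads), "-o", bam, sam], None),
        (["samtools", "index", bam], None),
        # Cut contigs into chunks; the FASTA goes to stdout, the BED to -b.
        (["cut_up_fasta.py", assembly, "-c", "10000", "-o", "0",
          "--merge_last", "-b", bed], sub_fa),
        # Coverage per sub-contig is written to stdout.
        (["concoct_coverage_table.py", bed, bam], cov),
        (["concoct", "--composition_file", sub_fa,
          "--coverage_file", cov, "-b", f"{outdir}/"], None),
    ]


def run_pipeline(commands):
    """Run each command in order, redirecting stdout where requested."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdout_path, "w") as handle:
                subprocess.run(argv, check=True, stdout=handle)


if __name__ == "__main__":
    for argv, stdout_path in build_commands("assembly.fa", "R1.fq", "R2.fq"):
        suffix = f" > {stdout_path}" if stdout_path else ""
        print(" ".join(argv) + suffix)
```

Keeping command construction separate from execution makes it possible to print and inspect the commands before submitting them to the cluster.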
Bowtie2
Bowtie2 is a fast and memory-efficient tool for aligning reads to reference sequences. It achieves this by building an index of the reference based on the Burrows-Wheeler Transform (BWT). In simple terms, an index is a data structure that allows for faster retrieval of data by storing references to where the data is located in a larger data set. It is similar to the index of a course book, which lists each keyword in alphabetical order together with the page numbers where it appears. So instead of going through the whole book to find info on topoisomerase, one can go through the index to find exactly where the word is mentioned.
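To illustrate the idea, here is a toy Burrows-Wheeler Transform with a backward search that counts pattern occurrences without scanning the original text. It is a deliberately naive sketch (sorted rotations, linear-time occurrence counts) and not how Bowtie2 is implemented; it assumes the text does not contain the sentinel `$`.

```python
def bwt(text):
    """Burrows-Wheeler Transform via sorted rotations (toy implementation)."""
    s = text + "$"  # sentinel that sorts before A, C, G, T and lowercase letters
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_matches(bwt_string, pattern):
    """Count occurrences of `pattern` in the original text using backward search."""
    # first_row[c]: row of the first sorted rotation that starts with c
    first_row = {}
    for row, ch in enumerate(sorted(bwt_string)):
        first_row.setdefault(ch, row)

    def occ(ch, end):
        # Number of times ch occurs in bwt_string[:end] (naive, linear time).
        return bwt_string[:end].count(ch)

    lo, hi = 0, len(bwt_string)  # current range of matching rotations
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + occ(ch, lo)
        hi = first_row[ch] + occ(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)                        # annb$aa
    print(count_matches(transformed, "ana"))  # 2
```

The search processes the pattern from its last character to its first, narrowing a range of rows at each step, which is the same principle that lets an aligner look up a read without reading the whole reference.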
With `bowtie2-build`, an index of the reference is created that is then used with `bowtie2` to generate Sequence Alignment/Map (SAM) files so that we can get an idea of the coverage for each contig. To save space and time, we will convert the SAM to BAM, which involves compressing the data and generating an index file for faster access using `SAMtools`.
SAMtools
SAMtools is one of the most referenced bioinformatics tools. It is software built for analyzing and manipulating SAM and BAM files.
Some of the most commonly used SAMtools commands include:
- `samtools view`: Converts SAM/BAM files to other formats or filters the data based on various criteria.
- `samtools sort`: Sorts BAM files by reference position.
- `samtools index`: Creates an index file for a sorted BAM file, which allows for faster random access to specific genomic regions.
- `samtools flagstat`: Generates summary statistics about the alignment quality of a BAM file, such as the number of aligned and unaligned reads, and the percentage of properly paired reads.
- `samtools mpileup`: Generates a summary of the read coverage at each genomic position in a BAM file.
We will use `SAMtools` to sort and index the BAM file generated by `bowtie2`, so that alignments for a given contig can be retrieved quickly.
CONCOCT
As mentioned beforehand, CONCOCT is an unsupervised binner based on composition and coverage. The `coverage_table.tsv` will be used to cluster the contigs based on coverage. Apart from repetitive regions and ambiguous alignments, which inflate the coverage of a contig, the abundance of a species correlates with the coverage of its contigs; therefore, clusters based on coverage depth can be acquired.
The `sub_contigs.fa` will be used to refine the clustering using the TNFs of the assembly. What has been found is that species have a codon bias, which is an evolutionary artifact as well as a way to control gene expression. This means that genomes have distinct TNFs, where greater phylogenetic distances correlate with greater differences in TNFs. Thus, CONCOCT groups the contigs that have similar TNF profiles using a Gaussian Mixture Model (GMM) to model the distribution of TNFs across contigs.
The combination of these factors will result in different bins representing different genomes (hopefully).
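As a rough illustration of what a TNF profile looks like, the sketch below counts all overlapping 4-mers in a sequence and normalises them to frequencies. It is a simplification under stated assumptions: windows containing characters other than A, C, G and T are ignored, reverse-complement k-mers are not merged, and no pseudo-counts are added. The name `tnf_profile` is hypothetical, and this is not CONCOCT's own code.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tnf_profile(sequence):
    """Return the 256 tetranucleotide frequencies of a sequence, in KMERS order."""
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[kmer] for kmer in KMERS)  # skips windows with N etc.
    if total == 0:
        return [0.0] * len(KMERS)
    return [counts[kmer] / total for kmer in KMERS]


if __name__ == "__main__":
    profile = tnf_profile("ACGTACGT")
    print(len(profile))                   # 256
    print(profile[KMERS.index("ACGT")])   # 0.4
```

Each contig then becomes a fixed-length vector, and contigs with similar vectors are candidates for the same bin, which is the kind of input a mixture model can cluster.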