Open ndreey opened 1 year ago
The number of base pairs (bp) of the most abundant genome (g) in bin (b) is counted as True Positives (TP), while the number of bp belonging to other genomes is counted as False Positives (FP). False Negatives (FN) are obtained by subtracting TP from the total length of g.
To determine if the bin is chimeric, one can calculate the purity of the bin by taking (TP) / (TP + FP). If there are any FP, the purity will be less than 1.
To determine how complete the bin (b) is, one calculates the completeness as (TP) / (TP + FN). This gives the fraction of the reference genome g that b covers.
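As a minimal sketch of the TP/FP/FN definitions above, the snippet below computes purity and completeness for a single bin. The function name `bin_metrics`, the input dictionaries, and the example numbers are hypothetical and only serve as illustration; they are not taken from the pipeline.

```python
def bin_metrics(bp_per_genome, genome_lengths):
    """Purity and completeness of one bin.

    bp_per_genome:  {genome_id: bp of that genome found in the bin} (hypothetical input)
    genome_lengths: {genome_id: total length of the reference genome in bp}
    """
    # The most abundant genome in the bin defines the true positives.
    g = max(bp_per_genome, key=bp_per_genome.get)
    tp = bp_per_genome[g]
    # Everything else in the bin is a false positive.
    fp = sum(bp_per_genome.values()) - tp
    # The part of g that did not end up in this bin is a false negative.
    fn = genome_lengths[g] - tp
    return {
        "genome": g,
        "TP": tp,
        "FP": fp,
        "FN": fn,
        "purity": tp / (tp + fp),
        "completeness": tp / (tp + fn),
    }


# Made-up example: 800 bp from genome_A and 200 bp from genome_B in one bin.
example = bin_metrics(
    {"genome_A": 800, "genome_B": 200},
    {"genome_A": 1000, "genome_B": 5000},
)
```

Here the bin is chimeric (purity below 1) because 200 bp come from a second genome, and it covers 800 of the 1000 bp of genome_A.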
To determine the performance of the clustering, the Rand Index can be calculated, where 1 indicates 100% accuracy. The Rand Index measures how well the clustering agrees with the gold standard. In our case, TP are base pairs of the same genome that were binned together in the same b, and TN are base pairs belonging to different genomes that were placed in separate bins. In other words, TN is the number of bp that CONCOCT correctly identified as not belonging to the same genome and placed in separate bins. The Rand Index is thus calculated as (TP + TN) / (tot bp). Because the Rand Index can be greater than 0 by chance alone, the Adjusted Rand Index (ARI) is used to correct for this chance agreement.
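The following is a rough sketch of how the Rand Index and ARI could be computed from a genome-by-bin contingency table. It uses the standard pair-counting formulation, in which TP, TN and the denominator are counts of pairs of items rather than single items. Treating each bp as one item, the function name `rand_indices`, and the input format are assumptions made for illustration.

```python
from collections import defaultdict
from math import comb


def rand_indices(contingency):
    """Rand Index and Adjusted Rand Index from a contingency table.

    contingency: {(genome_id, bin_id): number of items (e.g. bp) shared}
                 (hypothetical input format)
    Assumes the denominators are non-zero, i.e. more than one item and
    not everything in a single genome and a single bin.
    """
    genome_totals = defaultdict(int)
    bin_totals = defaultdict(int)
    for (genome, bin_id), count in contingency.items():
        genome_totals[genome] += count
        bin_totals[bin_id] += count
    n = sum(genome_totals.values())

    total_pairs = comb(n, 2)
    # Pairs from the same genome placed in the same bin.
    tp = sum(comb(c, 2) for c in contingency.values())
    same_genome = sum(comb(c, 2) for c in genome_totals.values())
    same_bin = sum(comb(c, 2) for c in bin_totals.values())
    fp = same_bin - tp        # same bin, different genomes
    fn = same_genome - tp     # same genome, different bins
    tn = total_pairs - tp - fp - fn

    rand = (tp + tn) / total_pairs
    # Expected number of TP pairs under random assignment.
    expected = same_genome * same_bin / total_pairs
    ari = (tp - expected) / (0.5 * (same_genome + same_bin) - expected)
    return rand, ari
```

A perfect binning gives 1 for both values, while a binning that splits every genome evenly across bins can give a negative ARI even though the Rand Index stays above 0.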
How much of the genomic data was binned.
CONCOCT failed in binning 06, 07, 090. I think I will hold off on troubleshooting these (ERROR: something duplicate in alignment bowtie2). This was solved in #67.
Evaluating the performance of removing bins on the 08 and 095 samples should be a sufficient start.
How I will evaluate if removal of bins is a valid method to increase assembly and binning quality
As we are looking to see if host bin removal is a valid method, the focus will be on the high HC% samples (06-095).
SampleID | HC% | host_bins |
---|---|---|
06 | 81.8 | 48 |
07 | 87.8 | 58 |
08 | 92.3 | 58 |
090 | 96.4 | 56 |
095 | 98.3 | 70 |
Because the sequencing depth is quite low, completeness will not be a relevant metric for choosing bins. These three are relevant; however, coverage is not measured yet.
Therefore, I believe I will start with the highest HC%, 095. If time is available, it could be interesting to see if the validity of the method decreases with lower HC%.
Between Run 3 and the last run, each run removes twice as many bins as the previous run. n0 to n2 are set, then n3 to n5 follow the pattern n_i = 2 * n_(i-1).
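To make the doubling pattern concrete, here is a small sketch that generates the number of bins per run. The starting values for n0 to n2 are not given above, so the ones in the example are placeholders, and the optional `max_bins` cap (for instance the number of host bins in a sample) is an added assumption.

```python
def bins_per_run(fixed_counts, n_runs, max_bins=None):
    """Number of bins to remove in each run.

    fixed_counts: the manually set first counts (n0, n1, n2); must be non-empty
    n_runs:       total number of runs (6 for n0..n5)
    max_bins:     optional cap, e.g. the number of host bins (assumption)
    """
    counts = list(fixed_counts)
    # After the fixed runs, each run doubles the previous count.
    while len(counts) < n_runs:
        counts.append(counts[-1] * 2)
    if max_bins is not None:
        counts = [min(c, max_bins) for c in counts]
    return counts


# Placeholder starting values, capped at the 70 host bins of sample 095.
schedule = bins_per_run([1, 2, 5], n_runs=6, max_bins=70)
```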
Set B = {bin1, bin2, ..., binn}. B is sorted in descending order based on the metric (size).
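The sorting step can be sketched as below: bins are ranked by size in descending order and the first n are picked for removal. The function name `top_bins_by_size` and the bin sizes are made up for illustration.

```python
def top_bins_by_size(bin_sizes, n):
    """Return the n largest bins, largest first.

    bin_sizes: {bin_id: size in bp} (hypothetical input)
    """
    # Sort bin ids by their size, descending; ties keep insertion order.
    ranked = sorted(bin_sizes, key=bin_sizes.get, reverse=True)
    return ranked[:n]


# Made-up sizes: pick the two largest bins.
to_remove = top_bins_by_size({"bin1": 500, "bin2": 1200, "bin3": 800}, n=2)
```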
I will evaluate assembly and binning after all bins have been removed to get an idea if Bin by bit will work or be viable.
In CAMI II, they used;
Samples