ndreey commented 1 year ago

Samples

SampleID: Abundance I wanted to generate...
Size(Gb): Number of bases generated, in gigabases
Tot.Reads: Total number of reads generated (+ and - reads summed)
host.reads: Number of host reads generated (+ and - reads summed)
HC%: Percentage of host contamination in the generated sample

SampleID	Size(Gb)	Tot.Reads	host.reads	HC%
00	1.84	12325434	0	0.0
01	1.92	12804618	3251606	25.4
02	1.95	13063198	5606170	42.9
03	1.95	13045134	7389790	56.6
04	1.98	13233904	8787688	66.4
05	1.95	13053776	9912918	75.9
06	1.98	13252530	10837938	81.8
07	1.98	13218802	11611980	87.8
08	1.99	13288036	12269324	92.3
090	1.99	13310662	12834170	96.4
095	1.99	13316552	13087900	98.3
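
As a quick check on the table, the HC% column can be recomputed from the two read-count columns. This is a minimal sketch that assumes HC% is simply host.reads divided by Tot.Reads; the function name is hypothetical, and the read counts are copied from samples 01 and 095 above.

```python
def host_contamination_percent(total_reads, host_reads):
    """Return host reads as a percentage of all reads in a sample.

    Assumption: HC% = 100 * host.reads / Tot.Reads.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * host_reads / total_reads


# Read counts copied from the table (samples 01 and 095).
print(round(host_contamination_percent(12804618, 3251606), 1))   # 25.4
print(round(host_contamination_percent(13316552, 13087900), 1))  # 98.3
```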

ndreey commented 1 year ago

Evaluation of binning

The number of base pairs (bp) in bin (b) that belong to its most abundant genome (g) is counted as True Positives (TP), while the number of bp belonging to other genomes is counted as False Positives (FP). False Negatives (FN) are obtained by subtracting TP from the total length of g.

Purity (precision)

To determine if the bin is chimeric, one can calculate the purity of the bin by taking (TP) / (TP + FP). If there are any FP, the purity will be less than 1.

Completeness (recall)

To determine how complete the metagenome-assembled genome (bin b) is, one calculates the completeness as (TP) / (TP + FN). This gives the fraction of the reference genome g that b covers.
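
The TP, FP and FN definitions above can be turned into a small per-bin calculation. This is only a sketch: the function name, the genome names and the bp counts are made up for illustration, and it assumes the bp in a bin have already been assigned to reference genomes.

```python
def purity_and_completeness(bp_per_genome, genome_lengths):
    """Compute purity and completeness for a single bin.

    bp_per_genome:  dict mapping genome name -> bp of that genome in the bin
    genome_lengths: dict mapping genome name -> total length of the genome (bp)
    """
    if not bp_per_genome:
        raise ValueError("the bin contains no assigned bp")
    # The most abundant genome in the bin defines the true positives.
    dominant = max(bp_per_genome, key=bp_per_genome.get)
    tp = bp_per_genome[dominant]
    fp = sum(bp_per_genome.values()) - tp   # bp from all other genomes
    fn = genome_lengths[dominant] - tp      # bp of the genome missing from the bin
    purity = tp / (tp + fp)
    completeness = tp / (tp + fn)
    return dominant, purity, completeness


# Hypothetical bin: 0.9 Mb from genome_a and 0.1 Mb from genome_b.
example_bin = {"genome_a": 900_000, "genome_b": 100_000}
lengths = {"genome_a": 1_800_000, "genome_b": 2_000_000}
print(purity_and_completeness(example_bin, lengths))  # ('genome_a', 0.9, 0.5)
```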

NOTE: Because of the low sequencing depth (2 Gb), large genomes like the host genome greatly affect the average completeness.

Adjusted Rand Index (ARI)

To measure the performance of the clustering, the Rand Index can be calculated, where 1 indicates 100% accuracy. The Rand Index measures how well the clustering agrees with the gold standard. It is counted over pairs of base pairs: TP are pairs from the same genome that were binned together in the same b, and TN are pairs from different genomes that were placed in separate bins. TN therefore counts what CONCOCT correctly identified as not belonging to the same genome. The Rand Index is thus (TP + TN) / (total number of pairs). Because the Rand Index can be greater than 0 by chance, the Adjusted Rand Index (ARI) corrects for this chance agreement.
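
To make the pair counting concrete, here is a minimal sketch that computes both the Rand Index and the ARI from a genome-by-bin contingency table, using the standard pair-counting formulas. The function name and the example counts are hypothetical, and this is not necessarily how the evaluation tool used in this issue computes them.

```python
from collections import defaultdict
from math import comb


def rand_indices(contingency):
    """Return (Rand Index, Adjusted Rand Index).

    contingency maps (genome, bin) -> number of items (e.g. bp) of that
    genome placed in that bin.
    """
    genome_totals = defaultdict(int)
    bin_totals = defaultdict(int)
    for (genome, bin_id), count in contingency.items():
        genome_totals[genome] += count
        bin_totals[bin_id] += count

    n = sum(genome_totals.values())
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        raise ValueError("need at least two items to form a pair")

    # Pairs sharing both genome and bin (TP), genome only, or bin only.
    tp = sum(comb(c, 2) for c in contingency.values())
    same_genome = sum(comb(c, 2) for c in genome_totals.values())
    same_bin = sum(comb(c, 2) for c in bin_totals.values())
    fp = same_bin - tp       # same bin, different genomes
    fn = same_genome - tp    # same genome, different bins
    tn = total_pairs - tp - fp - fn

    rand = (tp + tn) / total_pairs

    # Chance correction: subtract the TP count expected at random.
    expected = same_genome * same_bin / total_pairs
    max_index = (same_genome + same_bin) / 2
    if max_index == expected:
        ari = 1.0  # only happens when both partitions are identical
    else:
        ari = (tp - expected) / (max_index - expected)
    return rand, ari


# Perfect binning: each genome ends up in its own bin.
print(rand_indices({("g1", "b1"): 3, ("g2", "b2"): 2}))  # (1.0, 1.0)
# Everything merged into one bin: Rand Index stays above 0, ARI drops to 0.
print(rand_indices({("g1", "b1"): 3, ("g2", "b1"): 2}))  # (0.4, 0.0)
```

The second example shows why the adjustment matters: merging everything into a single bin still scores 0.4 on the plain Rand Index.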

Percentage of binned bp

The percentage of all bp in the genomic data that was assigned to a bin.

ndreey commented 1 year ago

Benchmark: Initial thoughts

CONCOCT failed to bin samples 06, 07 and 090 (ERROR: something duplicate in alignment bowtie2). I think I will hold off on troubleshooting these. This was solved in #67.

Evaluating the performance of removing bins on the 08 and 095 samples should be a sufficient start. Open questions on how I will evaluate whether removal of bins is a valid method to increase assembly and binning quality:

Which HC levels should be evaluated? (0.8-0.95??)
Remove reads mapping to x bins?
- Start with 1 bin, then 2, 4, 8, 16, ..., n?
- Script a program that determines top 3 bins with most coverage?
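
The "top 3 bins with most coverage" idea could look roughly like the sketch below. Coverage has not been measured yet, so the bin names and values here are purely hypothetical placeholders, as is the function name.

```python
from heapq import nlargest


def top_bins(bin_metric, k=3):
    """Return the k bin names with the highest metric value, best first.

    bin_metric maps bin name -> metric (e.g. mean coverage or bin size).
    """
    return nlargest(k, bin_metric, key=bin_metric.get)


# Hypothetical coverage values, for illustration only.
coverage = {"bin_a": 12.5, "bin_b": 40.1, "bin_c": 3.3, "bin_d": 27.0}
print(top_bins(coverage))  # ['bin_b', 'bin_d', 'bin_a']
```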

ndreey commented 1 year ago

Benchmark

As we are looking to see if host bin removal is a valid method, the focus will be on the high HC% samples (06-095).

SampleID	HC%	host_bins
06	81.8	48
07	87.8	58
08	92.3	58
090	96.4	56
095	98.3	70

Because the sequencing depth is quite low, completeness is not a relevant metric for choosing bins. The following three are relevant; however, coverage is not measured yet.

Purity
Bin size (bp or seq??)
Coverage ??

Flowchart

Remove x bins from set B
Re-run "Measure Assembly and Binning Quality" (MABQ).
Analyse data from all MABQs to determine if bin removal correlated with better MABQ.

I will start with the sample with the highest HC%, 095. If time allows, it would be interesting to see whether the method becomes less valid at lower HC%.

Remove x bins from set B

Bin by bit

The benchmark (n0 = 0 bins removed), Run 1 (n1 = 1) and Run 2 (n2 = 2) are fixed. From Run 3 onwards, each run removes twice as many bins as the previous run (n_k = 2 * n_(k-1)), until the last run, which removes all bins.

Set B = {bin1, bin2, ..., binn}. B is sorted in descending order by the chosen metric (size).

Benchmark: 0 bins removed.
Run1: Bin1 removed.
Run2: Bin1 and Bin2 removed.
Run3: Bin1, Bin2, Bin3 and Bin4 removed.
Run4: Bin1, 2, 3, 4, 5, 6, 7 and Bin8 removed.
....
Run_n: Bin1, ..., Bin70 removed.

This should result in 8 runs (1, 2, 4, 8, 16, 32, 64 and 70 bins removed).
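
The doubling schedule can be generated with a few lines of Python. This is a sketch with hypothetical function names; it assumes the bins are already sorted in descending order by the chosen metric, and that the final run always removes every bin.

```python
def removal_counts(n_bins):
    """Number of bins removed in each run: 1, 2, 4, ... and finally n_bins."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    counts = []
    k = 1
    while k < n_bins:
        counts.append(k)
        k *= 2
    counts.append(n_bins)  # last run removes all bins
    return counts


def bins_per_run(sorted_bins):
    """Yield (run number, bins to remove), taking bins from the top of the list."""
    for run, count in enumerate(removal_counts(len(sorted_bins)), start=1):
        yield run, sorted_bins[:count]


# Sample 095 has 70 host bins, which gives 8 runs.
print(removal_counts(70))  # [1, 2, 4, 8, 16, 32, 64, 70]
```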

Removal of all bins

I will evaluate assembly and binning after all bins have been removed to get an idea of whether Bin by bit is viable.

ndreey commented 1 year ago

Evaluation of Assembly

In CAMI II, they used:

Strain recall: The fraction of all genomes that were assembled with a genome fraction > 90%
Precision: How many high-quality assemblies (genome fraction > 90%) were found compared to the gold standard assembly (GSA)
Mismatches per 100 kb
Duplication ratio
Misassemblies
Genome fraction
NGA50
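
As an illustration of the first metric as it is described above, strain recall can be computed from per-genome genome fractions. This is a sketch only: the function name and the genome fraction values are hypothetical, and the 90% threshold is the only criterion applied.

```python
def strain_recall(genome_fractions, threshold=90.0):
    """Fraction of genomes whose genome fraction (%) exceeds the threshold.

    genome_fractions maps genome name -> genome fraction in percent.
    """
    if not genome_fractions:
        raise ValueError("no genomes given")
    recovered = sum(1 for f in genome_fractions.values() if f > threshold)
    return recovered / len(genome_fractions)


# Hypothetical genome fractions, for illustration only.
fractions = {"g1": 95.2, "g2": 40.0, "g3": 99.1, "g4": 90.0}
print(strain_recall(fractions))  # 0.5
```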

ndreey / ghost-magnet

Evaluation #66
