ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

Evaluation #66

Open ndreey opened 1 year ago

ndreey commented 1 year ago

Samples

SampleID  Size(Gb)  Tot.Reads  host.reads  HC%
00        1.84      12325434   0           0.0
01        1.92      12804618   3251606     25.4
02        1.95      13063198   5606170     42.9
03        1.95      13045134   7389790     56.6
04        1.98      13233904   8787688     66.4
05        1.95      13053776   9912918     75.9
06        1.98      13252530   10837938    81.8
07        1.98      13218802   11611980    87.8
08        1.99      13288036   12269324    92.3
090       1.99      13310662   12834170    96.4
095       1.99      13316552   13087900    98.3
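
The HC% column appears to be host.reads divided by Tot.Reads, expressed as a percentage and rounded to one decimal. A minimal sketch that reproduces a few rows of the table under that assumption (the function name `host_percentage` is made up for illustration):

```python
def host_percentage(host_reads: int, total_reads: int) -> float:
    """Percentage of reads that are host reads, rounded to one decimal.

    Assumption: HC% = host.reads / Tot.Reads * 100, inferred from the table.
    """
    if total_reads == 0:
        raise ValueError("total_reads must be greater than zero")
    return round(100 * host_reads / total_reads, 1)


# (SampleID, Tot.Reads, host.reads) copied from three rows of the table
samples = [
    ("01", 12804618, 3251606),
    ("08", 13288036, 12269324),
    ("095", 13316552, 13087900),
]

for sample_id, total_reads, host_reads in samples:
    print(sample_id, host_percentage(host_reads, total_reads))
```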
ndreey commented 1 year ago

Evaluation of binning

The number of base pairs (bp) in bin (b) that belong to its most abundant genome (g) is counted as True Positives (TP), while the number of bp belonging to other genomes is counted as False Positives (FP). False Negatives (FN) are measured by subtracting TP from the total length of g.

Purity (precision)

To determine if the bin is chimeric, one can calculate the purity of the bin by taking (TP) / (TP + FP). If there are any FP, the purity will be less than 1.

Completeness (recall)

To determine how complete bin b is, one calculates the completeness by taking (TP) / (TP + FN). This gives the fraction of the reference genome g that is covered by b.

Percentage of binned bp

How much of the genomic data was binned.
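
A minimal sketch of the three metrics above, assuming the bp in a bin have already been counted per source genome. The function names, the input dictionaries and the example numbers are made up for illustration, and the denominator of the binned percentage (total bp in the data set) is an assumption, since it is not specified here.

```python
def bin_metrics(bp_per_genome: dict, genome_lengths: dict):
    """Return (most abundant genome, purity, completeness) for one bin.

    bp_per_genome:  genome name -> bp of that genome found in the bin
    genome_lengths: genome name -> total length of that genome in bp
    """
    # The most abundant genome g in the bin defines the true positives.
    g = max(bp_per_genome, key=bp_per_genome.get)
    tp = bp_per_genome[g]
    fp = sum(bp_per_genome.values()) - tp  # bp from any other genome
    fn = genome_lengths[g] - tp            # bp of g missing from the bin

    purity = tp / (tp + fp)        # precision
    completeness = tp / (tp + fn)  # recall
    return g, purity, completeness


def percent_binned(binned_bp: int, total_bp: int) -> float:
    """Percentage of bp assigned to any bin (denominator is an assumption)."""
    return 100 * binned_bp / total_bp


# Hypothetical bin: 800 bp from genomeA and 200 bp from genomeB.
g, purity, completeness = bin_metrics(
    {"genomeA": 800, "genomeB": 200},
    {"genomeA": 1000, "genomeB": 5000},
)
print(g, purity, completeness)  # genomeA 0.8 0.8
```

In this example the 200 bp from genomeB make the bin chimeric (purity 0.8), and 200 bp of genomeA are missing from the bin (completeness 0.8).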

ndreey commented 1 year ago

Benchmark: Initial thoughts

CONCOCT failed when binning samples 06, 07 and 090. I think I will hold off on troubleshooting these (ERROR: something duplicate in alignment bowtie2). This was solved in #67.

Evaluating the performance of removing bins on the 08 and 095 samples should be a sufficient start. What remains is to decide how I will evaluate whether removal of bins is a valid method to increase assembly and binning quality.

ndreey commented 1 year ago

Benchmark

As we are looking to see if host bin removal is a valid method, the focus will be on the high HC% samples (06-095).

SampleID HC% host_bins
06 81.8 48
07 87.8 58
08 92.3 58
090 96.4 56
095 98.3 70

Because the sequencing depth is quite low, completeness will not be a relevant metric for choosing bins. These three are relevant; however, coverage is not measured yet.

Flowchart

  1. Remove x bins from set B
  2. Re-run "Measure Assembly and Binning Quality" (MABQ).
  3. Analyse data from all MABQs to determine if bin removal correlated with better MABQ.

Therefore, I believe I will start with the highest HC%, 095. If time allows, it could be interesting to see whether the validity of the method decreases with lower HC%.

Remove x bins from set B

Bin by bit

The number of removed bins is fixed for the first runs (n0 = 0, n1 = 1, n2 = 2). From Run 3 onwards, each run removes twice as many bins as the previous run, n_k = 2 * n_(k-1), until the last run, which removes all bins.

Set B = {bin1, bin2, ..., binn}. B is sorted in descending order based on the metric (size).

  1. Benchmark: 0 bins removed.
  2. Run1: Bin1 removed.
  3. Run2: Bin1 and Bin2 removed.
  4. Run3: Bin1, Bin2, Bin3 and Bin4 removed.
  5. Run4: Bin1, 2, 3, 4, 5, 6, 7 and Bin8 removed ....
  6. Run_n: Bin1, ..., Bin70 removed.

This should result in 8 runs.
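
The doubling schedule can be sketched as follows, to check the number of runs for the 70 host bins of sample 095. This is only an illustration: `removal_schedule` and `bins_to_remove` are hypothetical helper names, and the bin names and sizes in the usage example are made up.

```python
def removal_schedule(n_bins: int) -> list:
    """Number of top-ranked bins removed per run, benchmark (0) first.

    Doubles from 1 (1, 2, 4, 8, ...) and ends with a run removing all bins.
    """
    schedule = [0]  # benchmark: no bins removed
    k = 1
    while k < n_bins:
        schedule.append(k)
        k *= 2
    if n_bins > 0:
        schedule.append(n_bins)  # last run: every bin removed
    return schedule


def bins_to_remove(bin_sizes: dict, k: int) -> list:
    """Names of the k largest bins, with set B sorted by size, descending."""
    ranked = sorted(bin_sizes, key=bin_sizes.get, reverse=True)
    return ranked[:k]


schedule = removal_schedule(70)
print(schedule)           # [0, 1, 2, 4, 8, 16, 32, 64, 70]
print(len(schedule) - 1)  # 8 runs, not counting the benchmark

# Hypothetical bin sizes in bp, for illustration only.
sizes = {"bin_a": 500, "bin_b": 1200, "bin_c": 900}
print(bins_to_remove(sizes, 2))  # ['bin_b', 'bin_c']
```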

Removal of all bins

I will evaluate assembly and binning after all bins have been removed to get an idea if Bin by bit will work or be viable.

ndreey commented 1 year ago

Evaluation of Assembly

In CAMI II, they used;