ndreey / ghost-magnet

Molecular Bioinformatics BSc thesis project at University of Skövde
MIT License

Benchmark: host-contamination analysis #41

Open ndreey opened 1 year ago

ndreey commented 1 year ago

It took 3 hours to generate 10 GB of NGS data (19 GB total) on my laptop (4 cores).

ndreey commented 1 year ago
  • How many samples do we require to make it statistically correct?
    • We are interested in host-contamination impact.

Replicates of each host-contamination (HC) level will not be required as we want to compare the different HC levels against the control (0% HC). With that said, we do have a small sample size (1 control, 4 treated). Depending on the variability of the data and the magnitude of the effects, this could limit the statistical power of the analysis.

However, a normal distribution will be difficult to assume for this data; therefore, non-parametric tests will be the chosen approach.
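The thread does not name a specific non-parametric test. As one possible illustration only, the sketch below runs an exact permutation test on Spearman's rank correlation between HC level and a performance metric, which is feasible here because five samples give only 120 permutations. The HC levels and metric values are hypothetical placeholders, not results from this project, and the function names are invented for the example.

```python
from itertools import permutations


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_exact(x, y):
    """Spearman's rho with an exact two-sided permutation p-value."""
    rx, ry = ranks(x), ranks(y)
    observed = pearson(rx, ry)
    perms = list(permutations(ry))
    # Count permutations at least as extreme as the observed correlation.
    extreme = sum(
        1 for p in perms if abs(pearson(rx, p)) >= abs(observed) - 1e-12
    )
    return observed, extreme / len(perms)


# Hypothetical HC levels (%) and placeholder metric values, for illustration.
hc_levels = [0, 25, 50, 75, 95]
metric = [0.90, 0.85, 0.70, 0.60, 0.40]

rho, p_value = spearman_exact(hc_levels, metric)
print(f"rho={rho:.2f}, p={p_value:.4f}")
```

With one sample per HC level, the smallest two-sided p-value this test can produce is 2/120, which is one concrete way to see the limited statistical power mentioned above.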

  • Does the randomness of the abundances of the specific species matter when their endophytic group abundance is already set?

Not for the aim of determining the impact of host-contamination. The overall performance of the assembly and binning will be measured. The hypothesis is that the overall performance becomes worse with higher HC. The only bin of particular interest is the host bin, since we want to find a way to identify the host contigs and remove them.

  • We want to determine the difference between 0% host-contamination and X% host-contamination.

    • Difference in assembly and genome binning performance.
    • Control vs different levels of treatment.

Yes, and this will be determined by following the same benchmarking method that was used in the second CAMI challenge.

Plant vs gold standard assembly genome_binning_plant_short_read_GSA

Plant vs MEGAHIT assembly genome_binning_plant_short_read_MA

  • We are also interested in how well the binner bins the host and how pure that bin is.

    • Because the aim is to identify the host bin and remove its contigs (or apply another filtering method).
    • So we can rerun the metagenomic analysis without the host-contamination.

Purity and completeness will be very useful metrics: purity shows how many non-host contigs would be removed along with the host bin, and completeness shows how much host-contamination we might be missing.
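A minimal sketch of how these two metrics could be computed for the host bin, assuming base-pair-weighted definitions (purity = host bp in the bin / total bp in the bin; completeness = host bp in the bin / total host bp in the assembly). The `Contig` record, the `host_bin_metrics` function, and the toy contigs are hypothetical and only illustrate the arithmetic; they are not taken from the project or from any benchmarking tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contig:
    name: str
    length: int       # contig length in base pairs
    true_source: str  # ground-truth label, e.g. "host"
    bin_id: str       # bin assigned by the binner


def host_bin_metrics(contigs, host_label, host_bin):
    """Base-pair-weighted purity and completeness of the host bin."""
    in_bin = [c for c in contigs if c.bin_id == host_bin]
    bin_bp = sum(c.length for c in in_bin)
    host_bp_in_bin = sum(
        c.length for c in in_bin if c.true_source == host_label
    )
    host_bp_total = sum(
        c.length for c in contigs if c.true_source == host_label
    )
    return {
        "purity": host_bp_in_bin / bin_bp if bin_bp else 0.0,
        "completeness": (
            host_bp_in_bin / host_bp_total if host_bp_total else 0.0
        ),
        # Non-host sequence that would be lost by removing the bin.
        "non_host_bp_removed": bin_bp - host_bp_in_bin,
        # Host sequence that would remain after removing the bin.
        "host_bp_missed": host_bp_total - host_bp_in_bin,
    }


# Toy contigs, invented for illustration only.
contigs = [
    Contig("c1", 6000, "host", "bin_1"),
    Contig("c2", 2000, "host", "bin_1"),
    Contig("c3", 2000, "endophyte_A", "bin_1"),
    Contig("c4", 2000, "host", "bin_2"),
    Contig("c5", 5000, "endophyte_A", "bin_2"),
]

metrics = host_bin_metrics(contigs, host_label="host", host_bin="bin_1")
print(metrics)
```

In this toy case, removing `bin_1` would discard 2000 bp of non-host sequence and leave 2000 bp of host sequence behind in `bin_2`.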

ndreey commented 1 year ago

PROGRAMS TO BE / ALREADY INSTALLED

Request correct versions and programs by March 26