usegalaxy-eu / project-ideas

A collection of project ideas suitable for Master and Bachelor students

Reproducing CAMI (metagenomics) challenges in Galaxy #24

Open bebatut opened 2 years ago


Global context

Metagenomics analysis relies on sophisticated computational approaches: assembly, binning, taxonomic classification, etc. Any downstream analyses (comparative analyses, etc.) are only meaningful if the outcome of these initial data processing methods makes sense. Despite the tremendous progress in recent years, none of these approaches can completely recover the complex information encoded in metagenomes. They all rely on simplifying assumptions that can lead to strong limitations.

When novel or improved methods are presented, their accuracy is often evaluated. But these evaluations are usually hard to compare, because there is no general standard for the assessment of computational methods in metagenomics. As a result, users may be poorly informed and may misinterpret computational predictions.

To tackle this problem, the initiative for the "Critical Assessment of Metagenome Interpretation" (CAMI) was founded in 2014. It evaluates methods in metagenomics independently, comprehensively and without bias. The initiative supplies users with exhaustive quantitative data about the performance of methods in all relevant scenarios. It therefore guides users in the selection and application of methods and in their proper interpretation. Furthermore, it provides valuable information to developers, allowing them to identify promising directions for their future work.

Project context

The 2nd CAMI offers several challenges: an assembly, a genome binning, a taxonomic binning and a taxonomic profiling challenge, on several multi-sample data sets from different environments, including long and short read data. Participants registered to download the challenge datasets. They ran different tools, with different parameters, on the different datasets. For reproducibility, participants could submit either a Docker container containing the complete workflow, a bioconda script or a software repository with detailed installation instructions, specifying all parameter settings and reference databases used. Altogether 5,002 submissions of 76 programs were received for the four challenge datasets, from 30 external teams and CAMI developers. The CAMI developers then evaluated the results using standardized metrics and interpreted the differences between them (Meyer et al, 2020).

In this project, we would like to show that Galaxy could be used as a platform to support the next CAMI challenges.

Objectives of the project

Proposed agenda for the project

  1. Read the CAMI 2 paper: Meyer et al, 2020
  2. List the tools for the different challenges and check if they are available in Galaxy (and which version)
  3. Get familiar with tool integration in Galaxy
  4. Select one of the challenges (assembly, profiling, genome binning, taxon binning, clinical pathogen detection) and run it
    1. Add the "winning" tools in Galaxy
    2. Add the input data in Galaxy
    3. Run the different tools on the data and try to identify the best set of parameters for each tool/version
    4. Compare the results to the ones in CAMI
    5. Share the workflows via IWC
  5. Run the other challenges in the same way
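
Step 2 of the agenda (listing the tools and checking which versions are in Galaxy) could be partly automated. The sketch below is only an illustration under stated assumptions: `find_tools` is a hypothetical helper, the tool names and versions in `sample_listing` are placeholders rather than real CAMI or Galaxy entries, and the listing is assumed to be a list of dictionaries with `name` and `version` keys, similar to what the Galaxy tools API returns (for example through BioBlend's `gi.tools.get_tools()`).

```python
from typing import Dict, Iterable, List


def find_tools(wanted: Iterable[str], galaxy_tools: List[dict]) -> Dict[str, List[str]]:
    """Map each wanted tool name to the versions found in a Galaxy tool listing.

    Matching is a case-insensitive substring test on the tool's display name,
    which is a simplification: real tool names may need manual curation.
    """
    found: Dict[str, List[str]] = {name: [] for name in wanted}
    for tool in galaxy_tools:
        tool_name = tool.get("name", "").lower()
        for name in found:
            if name.lower() in tool_name:
                version = tool.get("version", "unknown")
                if version not in found[name]:
                    found[name].append(version)
    return found


# Placeholder listing (hypothetical names and versions, for illustration only).
sample_listing = [
    {"id": "example_assembler", "name": "ExampleAssembler", "version": "1.0"},
    {"id": "example_assembler", "name": "ExampleAssembler", "version": "2.0"},
    {"id": "example_binner", "name": "ExampleBinner", "version": "0.3"},
]

report = find_tools(
    ["ExampleAssembler", "ExampleBinner", "ExampleProfiler"], sample_listing
)
for name, versions in report.items():
    print(name, "->", ", ".join(versions) if versions else "not found")
```

Tools reported as "not found" would be the candidates for integration in step 4.1; tools found only in an older version than the one used in CAMI would need an update.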

Prerequisites

Further reading and useful links