usegalaxy-eu / project-ideas

A collection of project ideas suitable for Master and Bachelor students
MIT License

Systematic tool testing / validation #40

Closed tuncK closed 5 months ago

tuncK commented 11 months ago


Supervisor: Tunc Kayikcioglu
For degree: Bachelor/Project/Master
Status: Open
Keywords: tools, testing, simulation, validation, QA, QC

Global Biological/Research context

Galaxy gives a broad audience graphical access to tools that are otherwise available only on the command line. For an external tool to be served by Galaxy, we need an XML file (a "wrapper") that describes which buttons and options the interface should offer, which help text should be displayed, and which command should be executed when the tool is invoked.
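
To make the three roles of a wrapper concrete, here is a minimal sketch that parses a toy wrapper with Python's standard library and pulls out the command, the input parameters and the help text. The tool id, parameter names and command line are invented for illustration and do not correspond to any real Galaxy tool; real wrappers are usually considerably longer.

```python
import xml.etree.ElementTree as ET

# Illustrative wrapper only: tool id, parameter names and command are made up.
WRAPPER_XML = """
<tool id="example_tool" name="Example tool" version="1.0.0">
    <command><![CDATA[example_tool --input '$input' --db '$db' > '$output']]></command>
    <inputs>
        <param name="input" type="data" format="fasta" label="Input sequences"/>
        <param name="db" type="select" label="Reference database"/>
    </inputs>
    <outputs>
        <data name="output" format="tabular"/>
    </outputs>
    <help>Short help text shown below the tool form.</help>
</tool>
"""


def summarize_wrapper(xml_text):
    """Return the pieces of a wrapper that matter for compatibility testing."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        "version": root.get("version"),
        # The command template that is filled in and executed on invocation.
        "command": root.findtext("command").strip(),
        # One (name, type) pair per form element offered to the user.
        "params": [(p.get("name"), p.get("type")) for p in root.iter("param")],
        "help": root.findtext("help").strip(),
    }


summary = summarize_wrapper(WRAPPER_XML)
print(summary["id"], summary["version"], summary["params"])
```

A summary like this could be a starting point for listing which wrapper versions expose a database parameter at all.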

In addition to the datasets provided by the user, some of the incorporated tools need access to external data sources, such as a reference database (DB) for lookups. Such reference data can be fetched at runtime, or we can explicitly decide to cache a local copy on the HPC system, which is especially beneficial if the data are large. The XML file should then contain instructions on how to locate these local data resources. For some tools, several versions of such reference DBs exist, and not every DB version is necessarily compatible with every release of the tool.

Objectives of the project

While we have already implemented the functions to execute some tools and to manage their DBs, we suspect that we are not fully aware of which tools can be used with which DBs. In the best case, an incompatible combination makes the tool fail with a fatal error; in the worst case, the tool exits successfully but silently produces numerically incorrect results. Your task will be to identify such failure cases.
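
The two failure modes can be told apart programmatically once a ground truth is available. The sketch below is one possible way to do this, assuming each run yields an exit code and a list of numeric output values; the function name `classify_run`, the status labels and the tolerance are illustrative choices, not part of Galaxy.

```python
import math


def classify_run(exit_code, observed, expected, rel_tol=1e-6):
    """Classify one tool run as 'fatal', 'silent_error' or 'ok'.

    exit_code: process exit status (0 means the tool claims success).
    observed:  numeric values produced by the tool.
    expected:  ground-truth values for the same simulated input.
    """
    if exit_code != 0:
        # Best case: the incompatibility is loud and easy to detect.
        return "fatal"
    if len(observed) != len(expected):
        return "silent_error"
    for obs, exp in zip(observed, expected):
        # abs_tol guards the comparison when the expected value is zero.
        if not math.isclose(obs, exp, rel_tol=rel_tol, abs_tol=1e-12):
            # Worst case: exit code 0, but the numbers are wrong.
            return "silent_error"
    return "ok"


print(classify_run(1, [], [0.5, 0.5]))          # fatal
print(classify_run(0, [0.6, 0.4], [0.5, 0.5]))  # silent_error
print(classify_run(0, [0.5, 0.5], [0.5, 0.5]))  # ok
```

What counts as an acceptable tolerance will depend on the tool and would need to be chosen per use case.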

Proposed agenda for the project

  1. Learn how to execute the tool(s) of interest via the Galaxy GUI and, if you are interested in a more automated approach, also via BioBlend.
  2. Test all versions of the tool with all DBs to identify fatal failures.
  3. Generate simulated input datasets with known ground truth.
  4. Analyse the test datasets with different tool and DB versions to quantify numerical accuracy.
  5. Propose hard constraints to be implemented in Galaxy to disable or discourage the use of bad tool and DB combinations.
  6. Check the relevant workflows (WFs) and Galaxy tutorials to see whether they still work after these changes.
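
Steps 2 and 5 of the agenda amount to sweeping a tool-version by DB-version matrix and turning the failures into a list of combinations to block. The sketch below illustrates the idea with a stand-in runner; the version strings and `fake_runner` are hypothetical, and in a real setup the runner would submit the job to Galaxy (for example through BioBlend) and inspect the result.

```python
from itertools import product

# Hypothetical version labels, for illustration only.
TOOL_VERSIONS = ["1.0", "2.0"]
DB_VERSIONS = ["2020", "2023"]


def fake_runner(tool_version, db_version):
    """Stand-in for a real job submission; pretends one combination breaks."""
    if tool_version == "1.0" and db_version == "2023":
        raise RuntimeError("unsupported database format")
    return "ok"


def sweep(tool_versions, db_versions, runner):
    """Run every tool/DB combination and record a status string for each."""
    results = {}
    for tool_v, db_v in product(tool_versions, db_versions):
        try:
            results[(tool_v, db_v)] = runner(tool_v, db_v)
        except Exception as exc:
            # Keep sweeping: one fatal failure must not hide the others.
            results[(tool_v, db_v)] = f"fatal: {exc}"
    return results


def blocklist(results):
    """Combinations that should be disabled or discouraged."""
    return sorted(combo for combo, status in results.items() if status != "ok")


results = sweep(TOOL_VERSIONS, DB_VERSIONS, fake_runner)
bad = blocklist(results)
print(bad)
```

Passing the runner in as an argument keeps the sweep logic testable without a Galaxy server; the same loop can later be pointed at a runner that also checks numerical accuracy against simulated ground truth.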

Prerequisites