nnf-cbn / 2019-unconference

Organisation of the 2019 unconference

BOF: How to select software for a specific job? #8

Open ftmashari opened 5 years ago

ftmashari commented 5 years ago

As high-throughput methods generate ever larger amounts of data, scientists' reliance on software is becoming more evident. Analysing a large biological dataset usually requires several software tools, and there are often multiple packages that perform the same task. To select the tool that best performs the task at hand, many rely on word-of-mouth recommendations, the impact of the journal that published the software, the number of citations, etc. However, it has been shown that none of these actually predicts the accuracy of software (see https://www.biorxiv.org/content/early/2017/01/02/092205.full.pdf). In this session, we will discuss good strategies for selecting software for a specific job.

ftmashari commented 5 years ago

Overall, it is a complicated problem. Benchmarks can help, but they cannot completely capture how accurate a package is compared to other packages. Benchmark results can also conflict with each other because they are run on different datasets.

bug1303 commented 5 years ago

We also discussed where to look for software/tools/resources.

There is, for example, https://bio.tools/ (an ELIXIR effort), which classifies tools with "Topic" tags and uses an ontology of tool functions. One can further filter by the data type a tool accepts as input or produces as output, or by whether it is a command-line tool or a web interface. However, choosing which tool is best for a given task is trickier. A ranking by citations or similar metrics is not necessarily helpful (see the paper referenced in the initial post). Every method usually comes with a benchmark showing that it is better than previous methods, but which one is best for your data and your use case may be different.

For some disciplines, there exist web tools which offer a meta approach, i.e., running multiple tools on the given data and finding the consensus of multiple predictions. However, this is not feasible in other areas.
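To make the "meta approach" concrete, here is a minimal sketch of one possible way to combine predictions: a strict majority vote per item. It does not describe how any particular web tool works. The function name `consensus`, the tool names, and the labels are all hypothetical and only serve as illustration.

```python
from collections import Counter


def consensus(predictions, min_agreement=0.5):
    """Majority-vote consensus over several tools (illustrative sketch).

    predictions: dict mapping a (hypothetical) tool name to a dict of
                 item id -> predicted label.
    Returns a dict of item id -> consensus label, or None when no label is
    supported by more than `min_agreement` of all tools.
    """
    # Collect every item that at least one tool made a prediction for.
    items = set().union(*(p.keys() for p in predictions.values()))
    result = {}
    for item in sorted(items):
        # Count the labels from the tools that reported this item.
        votes = Counter(p[item] for p in predictions.values() if item in p)
        label, count = votes.most_common(1)[0]
        # Tools that are silent on an item count against agreement.
        agreed = count / len(predictions) > min_agreement
        result[item] = label if agreed else None
    return result


# Made-up example: three tools predicting whether a protein has a signal peptide.
example = {
    "tool_a": {"prot1": "signal", "prot2": "none"},
    "tool_b": {"prot1": "signal", "prot2": "signal"},
    "tool_c": {"prot1": "signal", "prot2": "none"},
}
print(consensus(example))
```

Items where the tools disagree too much come back as `None`, which marks the cases where picking a single tool blindly is riskiest.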

There are also curated lists of "awesome" resources on GitHub, such as https://github.com/danielecook/Awesome-Bioinformatics, and similar lists for specific domains, such as https://github.com/seandavi/awesome-single-cell