nf-core / tools

Python package with helper tools for the nf-core community.
https://nf-co.re
MIT License

New tool: Required publication references #236

Open ewels opened 5 years ago

ewels commented 5 years ago

It would be nice to make it easier for people to know what should be referenced if they use a pipeline in a manuscript. For example, nf-core references <pipeline-name> could return a list of the references that you need to add into your paper. (alt names: nf-core refs, nf-core bib..?)

Different flags could give different output formats, but perhaps the default could be prose text. For example:

Data was processed using nf-core/rnaseq [pipeline DOI, nf-core paper]. This pipeline is built using Nextflow [Nextflow paper] and uses the following tools: FastQC (Quality control of raw data) [ref], TrimGalore! (Trimming of adapter sequence contamination) [ref], STAR (Alignment of RNA-seq reads to the reference genome) [ref] …etc

Need to think about where and how to capture this information in the pipeline files. For example, a simple YAML file could work nicely:

```yaml
tools:
  fastqc:
    name: FastQC
    description: Quality control of raw data
    ref: https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  trimgalore:
    name: Trim Galore!
    description: Trimming of adapter sequence contamination
    ref:
      - https://www.bioinformatics.babraham.ac.uk/projects/trim_galore/
      - 10.14806/ej.17.1.200
  star:
    name: STAR
    description: Alignment of RNA-seq reads to the reference genome
    ref: 10.1093/bioinformatics/bts635
```

Requirements:

Output options could be:

The Nextflow and nf-core references can be hardcoded. The workflow DOI can be lifted from README.md, I guess, or could potentially be added as a new workflow.metadata variable?

Thoughts / feedback?

Phil

maxulysse commented 5 years ago

I would also add the version of each tool.

drpatelh commented 5 years ago

It might be good to host a central database (e.g. yaml) of tools and their associated information. This can then be used to parse the conda yaml to create a tool-specific publication description that would be linked by release to the pipeline.

It would be much neater to just reference the pipeline in papers (if morally possible) - with a sentence pointing to the pipeline for all the tool-specific citations. I've often been asked to trim down text, and a decision may need to be made as to which tools you cite...

I generally provide a short description of the tool, version, reference and pubmed id. Maybe we can provide this as a file that gets bundled with the pipeline that can be linked on the pipeline home page?

sven1103 commented 5 years ago

Hm, I was just thinking that we get this information for free over the Anaconda API, right?

For example: https://api.anaconda.org/package/bioconda/samtools

Although package maintainers do not always fill in all the fields (which is bad!).

So instead of having another YAML file, we could use the environment.yml. If a package does not provide a description, it might be good practice to contact the package maintainer to add one?
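
For illustration, a minimal sketch of querying that endpoint (the URL is the one from the example above; which fields are actually populated, e.g. `summary` or `home`, depends on the recipe, so treat the field names as assumptions):

```python
# Sketch: fetch package metadata from the Anaconda API endpoint shown above.
# Which fields (summary, home, ...) are filled in depends on the recipe.
import requests

def anaconda_metadata(channel: str, package: str) -> dict:
    url = f"https://api.anaconda.org/package/{channel}/{package}"
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.json()

info = anaconda_metadata("bioconda", "samtools")
print(info.get("summary"))         # short description, if the maintainer provided one
print(info.get("home"))            # upstream homepage, if provided
print(info.get("latest_version"))
```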

maxulysse commented 5 years ago

We might want to add extra information, like an actual publication or DOI for the pipeline.

sven1103 commented 5 years ago

Hm, I see. There is no such thing as a tool registry with DOI and publication URIs, right? Maybe we need this...

ewels commented 5 years ago

> It might be good to host a central database (e.g. yaml) of tools and their associated information.

I see where you're going with this, however I quite like that all pipelines are totally self-sufficient currently. Especially if this will be used within tool execution, as many users run offline.

> It would be much neater to just reference the pipeline in papers (if morally possible)

I don't think that it is morally good to do this. If people decide that they need to do this then that can be on their shoulders, but I don't think that we should help them.

> I generally provide a short description of the tool, version, reference and pubmed id.

Yes - this is basically the information that I was thinking of listing (though DOI instead of PubMed). A table with this information would be a nice output option too though...

> Maybe we can provide this as a file that gets bundled with the pipeline that can be linked on the pipeline home page?

Yes, that could be very nice actually. We have an ACKNOWLEDGMENTS.txt file that we deliver with all data from our centre to try to help people to mention us in their paper. The pipelines could do the same here, so that it's obviously alongside the results files when the pipeline runs.

ewels commented 5 years ago

> Hm, I was just thinking that we get this information for free over the Anaconda API, right?

Not really - we're already using this for the nf-core licences command, but it doesn't have any info about publications that I'm aware of. It's specifically the DOI / publication reference that I'm thinking of here.

Tying the names in with environment.yml and potentially using the descriptions would be a nice idea though 👍 The summary field, where available, should contain this. It will not describe how the tool is used in the pipeline though, so not as good as a specific string.
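
A quick sketch of what tying the names in with environment.yml could look like, assuming the usual `channel::name=version` pin format (the file path and parsing details are illustrative only):

```python
# Sketch: pull tool names/versions out of a pipeline's environment.yml so they
# can be matched against the Anaconda / bio.tools lookups discussed here.
# Assumes the usual "channel::name=version" pin format.
import re
import yaml

def parse_environment(path="environment.yml"):
    with open(path) as fh:
        env = yaml.safe_load(fh)
    tools = []
    for dep in env.get("dependencies", []):
        if not isinstance(dep, str):
            continue  # skip nested sections such as pip:
        match = re.match(r"(?:(?P<channel>[^:]+)::)?(?P<name>[^=<>]+)=?(?P<version>.*)", dep)
        if match:
            tools.append((match["channel"] or "defaults", match["name"], match["version"]))
    return tools

for channel, name, version in parse_environment():
    print(channel, name, version)
```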

sven1103 commented 5 years ago

Maybe we should reactivate this discussion: https://github.com/nextflow-io/nextflow/issues/866

Tools and parameters that are used in Nextflow should be described in a structured way, so that humans and machines can work with them.

I could also see the tool metadata such as URI, URL, description and parameters combined there... Just brainstorming here.

drpatelh commented 5 years ago

How about tool-specific parameters? e.g. if you aren't using the defaults. I generally provide these as a double-quoted string for full traceability and reproducibility. Would it be enough to have these defined within main.nf, bearing in mind that they may also change between releases?

ewels commented 5 years ago

Yes, I wondered about putting this kind of information alongside the parameter schema described in that issue. However, parameters and tool metadata are distinct, so it may not make sense. For example, it could break parsing by the generic form-building tools discussed in that thread. A section of nextflow.config dedicated to describing tools could work though, especially alongside the feature request for parsing tool version numbers at run time. Any thoughts @pditommaso?

> How about tool-specific parameters

This is getting a bit off-topic now 😅 But yes, I think having them defined in main.nf is enough - this file is tagged with each release, so it's easy to find again. They're also in the trace and reports that are saved with the results. Personally, I think it improves code readability if they're in main.nf alongside the command template, instead of being held separately in a different location.

pditommaso commented 5 years ago

IMO maintaining a separate annotation file does not work, because it very easily gets out of sync with the actual tools used in the pipeline script.

Ideally these info should be inferred during process execution https://github.com/nextflow-io/nextflow/issues/879. Alternatively we could add an annotation in the module/process definition https://github.com/nextflow-io/nextflow/issues/984.

Otherwise the best approximation could be the Conda environment file, though if I'm understanding correctly, the problem is that it does not include the citation/paper DOI, right? Not sure, but I think that using the tool name and version it should be possible to infer the related metadata from bio.tools.

Pinging @bgruening and @ypriverol, who should know about the state of the art of bioconda/containers/bio.tools interoperability.

bgruening commented 5 years ago

@ewels @pditommaso we actually do include identifiers in the conda recipes, see here: https://github.com/bioconda/bioconda-recipes/blob/master/recipes/multiqc/meta.yaml#L137

This means you can infer this from the conda package or bio.tools. A DOI can, and should, be added to the conda package as well.

Does this answer your question?

sven1103 commented 5 years ago

Uh, this is actually very nice.

Just checked the API request for fasttree: https://bio.tools/api/tool/fasttree

Seems that we get the information we need from it, so no need to have an additional file.
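
For reference, a minimal sketch of pulling the publication DOIs from that endpoint (the `publication` / `doi` field names follow my reading of the bio.tools JSON; treat them as assumptions):

```python
# Sketch: pull publication DOIs from the bio.tools API endpoint shown above.
# The "publication" / "doi" field names are assumptions based on its JSON.
import requests

def biotools_dois(tool_id):
    url = f"https://bio.tools/api/tool/{tool_id}"
    response = requests.get(url, params={"format": "json"}, timeout=10)
    response.raise_for_status()
    return [pub["doi"] for pub in response.json().get("publication", []) if pub.get("doi")]

print(biotools_dois("fasttree"))
```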

ewels commented 5 years ago

Fantastic - this is great news! Many thanks @bgruening - I didn't know that this lookup existed. However, it looks like the identifier isn't given in the Anaconda API 😞 https://api.anaconda.org/package/bioconda/multiqc

Any ideas on how we can best fetch this information? If we can it would be great to use this method. If we want, we could even get the linter to warn if the biotools identifier is missing.

> A DOI can, and should, be added to the conda package as well.

Also under the identifiers section, as done here I guess? Cool! I'll add this to the MultiQC recipe.

bgruening commented 5 years ago

> Any ideas on how we can best fetch this information? If we can it would be great to use this method. If we want, we could even get the linter to warn if the biotools identifier is missing.

The short answer is that it's part of the tarball and, as such, part of the installation, afaik. The long answer is that we are working on a central service (bio.tools) to make this all much easier and also independent of conda ... so a unified interface to packages and containers.

> Also under the identifiers section, as done here I guess? Cool! I'll add this to the MultiQC recipe.

Yes :)

ewels commented 5 years ago

ok cool, thanks!

Then I wonder if the best bet is to just try pinging the bio.tools API with the conda package name if it's in the bioconda channel. I guess that the two will essentially always be the same... This won't match up versions and could, in some weird edge cases, give the wrong information, so not ideal. But I don't really fancy downloading and extracting all software just for this fast little utility command.

ewels commented 5 years ago

...could also just grab the raw bioconda meta.yaml directly from GitHub and parse the identifiers from that. But again, it will be tricky to match up versions, so not a whole lot better I guess.
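
A sketch of that idea, assuming the standard bioconda-recipes repository layout; since bioconda meta.yaml files are Jinja-templated, this just scans for the identifiers block rather than doing a full YAML parse:

```python
# Sketch: fetch the raw bioconda meta.yaml for a package from GitHub and pull
# out the extra/identifiers entries. Bioconda recipes are Jinja-templated YAML,
# so this scans lines instead of parsing the whole file as YAML.
import requests

RAW_URL = "https://raw.githubusercontent.com/bioconda/bioconda-recipes/master/recipes/{name}/meta.yaml"

def bioconda_identifiers(package):
    text = requests.get(RAW_URL.format(name=package), timeout=10).text
    identifiers, in_block = [], False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "identifiers:":
            in_block = True
        elif in_block and stripped.startswith("- "):
            identifiers.append(stripped[2:])  # e.g. "biotools:multiqc" or "doi:10.xxxx/yyyy"
        elif in_block:
            break  # end of the identifiers block
    return identifiers

print(bioconda_identifiers("multiqc"))
```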

bgruening commented 5 years ago

This depends on whether you always have internet access during the workflow run. I guess querying the API is OK. I suppose digging the information out of conda is also easy - that should already be available locally.

ewels commented 5 years ago

Ah true, there are two different use cases here. I was thinking primarily about a new nf-core references CLI tool which would run totally separately from the workflow.

For using the data within a workflow run (e.g. saving it to an ACKNOWLEDGMENTS.txt file), I think we need all of the data locally, because so many people run without internet access. How would we go about finding the information from a local conda install? I've had a quick dig around but haven't found the meta.yaml file yet.

ewels commented 5 years ago

...but we'd still need an internet connection for bio.tools. I think that this needs to be a separate CLI tool. If we want the output as a results file with the pipeline, then it should probably be a static file which is saved separately. If we want automation, the lint tool could check that it exists and is up to date (maybe on --release only for the latter).

bgruening commented 5 years ago

Have a look at miniconda3/pkgs/samtools-1.8-3/info/recipe/meta.yaml
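
Based on that hint, a minimal sketch of reading the identifiers from a locally installed package (the conda prefix path is just an example; the extra/identifiers layout follows the bioconda convention):

```python
# Sketch: read the identifiers from a locally installed conda package, using
# the info/recipe/meta.yaml path pointed out above.
import glob
import yaml

def local_identifiers(conda_prefix, package):
    results = []
    for path in glob.glob(f"{conda_prefix}/pkgs/{package}-*/info/recipe/meta.yaml"):
        with open(path) as fh:
            meta = yaml.safe_load(fh) or {}
        results.extend(meta.get("extra", {}).get("identifiers", []))
    return results

print(local_identifiers("/opt/miniconda3", "samtools"))
```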

ewels commented 3 years ago

This issue is getting much more manageable with DSL2 modules, where we have a meta file for each tool that includes DOI 🎉 (typically taken from Bioconda).

This could potentially be used both for a command line tool but also within pipelines as the meta file should be bundled within each pipeline.
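
As an illustration, a sketch of collecting DOIs from the bundled module meta.yml files (the modules/ glob and the tools -> name -> doi key layout follow the nf-core module template as I understand it; treat the details as assumptions):

```python
# Sketch: collect DOIs from the module meta.yml files bundled with a pipeline.
# Layout assumptions: modules/**/meta.yml with a "tools" list of {name: {doi: ...}}.
from pathlib import Path
import yaml

def module_dois(pipeline_dir="."):
    dois = {}
    for meta_path in Path(pipeline_dir).glob("modules/**/meta.yml"):
        meta = yaml.safe_load(meta_path.read_text()) or {}
        for tool in meta.get("tools", []) or []:
            for name, info in tool.items():
                if info and info.get("doi"):
                    dois[name] = info["doi"]
    return dois

for name, doi in module_dois().items():
    print(f"{name}: {doi}")
```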

jfy133 commented 1 year ago

Following on from: https://github.com/nf-core/tools/pull/2326 (which starts providing a framework to insert this into a MultiQC report):

@maxulysse and @mashehu have both said we should automate this even more, and that it should be possible via the DOIs in the meta.yml files.
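
One possible building block for that automation (not necessarily what was proposed) is turning a DOI into a formatted citation via doi.org content negotiation, a Crossref/DataCite feature rather than anything nf-core-specific; a minimal sketch:

```python
# Sketch: format a DOI as a citation via doi.org content negotiation.
# The citation style name is just an example.
import requests

def doi_to_citation(doi, style="apa"):
    response = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": f"text/x-bibliography; style={style}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.text.strip()

print(doi_to_citation("10.1093/bioinformatics/bts635"))  # the STAR DOI from the first comment
```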

From @maxulysse a conceptual plan:

Initial problems I see: