Galaxy tool bulk import

arash77 commented 6 months ago

This pull request adds a GitHub Action for bulk importing Galaxy tools from https://github.com/galaxyproject/galaxy_tool_metadata_extractor.

paulzierep commented 6 months ago

Thank you very much @arash77! We have some questions concerning the import logic:

We are using the bio.tools ID to match the entries in the Galaxy metadata to the RSEc metadata. What should we do with entries without a bio.tools ID? Leave them out? Use the tool's name / ID?
If there is a tool in the Galaxy metadata list that has a bio.tools ID that does not match any of the folders in the RSEc metadata, should we add the folder?

This is an example of an entry that is added:


{
    "Conda id": "abricate",
    "Conda version": "1.0.1",
    "Description": "Mass screening of contigs for antiobiotic resistance genes",
    "EDAM operation": "Antimicrobial resistance prediction",
    "EDAM topic": "Genomics, Microbiology",
    "Galaxy tool ids": "abricate, abricate_list, abricate_summary",
    "Galaxy wrapper id": "abricate",
    "Galaxy wrapper owner": "iuc",
    "Galaxy wrapper parsed folder": "https://github.com/galaxyproject/tools-iuc/tree/main/tools/abricate",
    "Galaxy wrapper source": "https://github.com/galaxyproject/tools-iuc/tree/master/tools/abricate/",
    "Galaxy wrapper version": "1.0.1",
    "No. of tool users (2022-2023) (usegalaxy.eu)": 1764,
    "Source": "https://github.com/tseemann/abricate",
    "Status": "Up-to-date",
    "ToolShed categories": "Sequence Analysis",
    "ToolShed id": "abricate",
    "Total tool usage (usegalaxy.eu)": 496717,
    "bio.tool description": "Mass screening of contigs for antimicrobial resistance or virulence genes.",
    "bio.tool id": "ABRicate",
    "bio.tool ids": "ABRicate",
    "bio.tool name": "ABRicate",
    "https://usegalaxy.eu": "(3/3)",
    "https://usegalaxy.fr": "(3/3)",
    "https://usegalaxy.org": "(3/3)",
    "https://usegalaxy.org.au": "(3/3)"
}

Currently, the job recreates each entry, but @bgruening commented that the JSON should be updated rather than recreated. One concern here is that our source, galaxy_tool_metadata_extractor, is still very much a work in progress, so the JSON name/value pairs will keep changing over the next months. For example, the "https://usegalaxy.eu", "https://usegalaxy.fr" ... columns will probably be merged into one column. How should this propagate over here? We could update only the changed JSON name/value pairs, if that is the preferred way, but a column that is removed upstream would then dangle around forever (which might not be a problem at all).
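To make the trade-off concrete, here is a minimal sketch of an "update rather than recreate" step. It is not the code in this PR; the function name `merge_entry` and the "Galaxy servers" key in the usage note are hypothetical. It overwrites changed pairs, keeps pairs that disappeared upstream, and reports them so they can at least be reviewed.

```python
from typing import Any


def merge_entry(
    existing: dict[str, Any], incoming: dict[str, Any]
) -> tuple[dict[str, Any], list[str]]:
    """Update an existing entry with freshly extracted metadata.

    Returns the merged entry and the list of keys that are present in the
    existing entry but no longer produced upstream (the "dangling" keys).
    """
    # Incoming values win; keys missing from `incoming` are kept untouched.
    merged = {**existing, **incoming}
    # Keys that would dangle around if nobody removes them explicitly.
    stale = sorted(set(existing) - set(incoming))
    return merged, stale
```

For example, if the per-server columns were replaced by a single (hypothetical) "Galaxy servers" column, `merge_entry` would keep "https://usegalaxy.eu" in the merged entry and list it as stale, which leaves the decision to drop it to a human or to a later cleanup step.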

Is there any metadata we should explicitly not port here?

@hmenager @bgruening @matuskalas what do you think?

bgruening commented 6 months ago

If there is a tool in the Galaxy metadata list that has a bio.tools ID that does not match any of the folders in the RSEc metadata, should we add the folder?

No, those cases are "interesting" and we should find out why they are not matching. E.g. is Galaxy wrongly annotated? Is the folder here missing?

We are using the bio.tools ID to match the entries in the Galaxy metadata to the RSEc metadata. What should we do with entries without a bio.tools ID? Leave them out? Use the tool's name / ID?

I'm not sure if the Galaxy tool name (what is that?) is a good choice. Ideally, it will be something like the conda package name (assuming this is closer to the upstream name).

paulzierep commented 6 months ago

I'm not sure if the Galaxy tool name (what is that?) is a good choice. Ideally, it will be something like the conda package name (assuming this is closer to the upstream name).

We could check whether the conda id matches the Galaxy wrapper id (to avoid generic conda ids such as python / wget / R), and take the conda id only in those cases.
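A minimal sketch of that check, using the key names from the example entry above. The function name is hypothetical, and the case-insensitive comparison is an assumption rather than something decided in this thread.

```python
from typing import Any, Optional


def conda_id_if_matches_wrapper(entry: dict[str, Any]) -> Optional[str]:
    """Return the conda id only when it equals the Galaxy wrapper id.

    A wrapper whose conda dependency is a generic package (python, wget, R)
    will not match its own wrapper id, so it yields None.
    """
    conda_id = (entry.get("Conda id") or "").strip()
    wrapper_id = (entry.get("Galaxy wrapper id") or "").strip()
    # Assumption: ids are compared case-insensitively.
    if conda_id and conda_id.lower() == wrapper_id.lower():
        return conda_id
    return None
```

With the abricate entry above this returns "abricate"; for a wrapper whose conda id is "python" it returns None, so the entry would be left without a fallback identifier.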

paulzierep commented 6 months ago

Adding the bulk import also to the content repository: https://github.com/research-software-ecosystem/content/pull/656

paulzierep commented 6 months ago

@arash77 we decided to store all tool JSON files that do not yet have a corresponding folder in the content/data dir in a dedicated galaxy-tools folder in this location: https://github.com/research-software-ecosystem/content/tree/master/datasets Can you add this to the CI?

hmenager commented 6 months ago

Thanks to all of you for this great work! There are multiple potential explanations for a Galaxy tool missing a bio.tools ID. I can think of a Galaxy annotation problem, or the actual tool not being in bio.tools. I would propose the following:

Import everything (with or without bio.tools IDs) to a dedicated folder. I initially suggested datasets/galaxy, but I would rather go somewhere else, as "datasets" has historically been used for many "test imports" and remains pretty much uncontrolled. So, maybe imports/galaxy?
Then move or copy the tools with a bio.tools ID to the "consolidated" data folder. I agree with @bgruening that we should not try to "guess" bio.tools IDs at this stage (we've done it for other sources, and it was a mistake). I would rather list all the Galaxy tools with no bio.tools ID and feed that information back to the bio.tools and Galaxy teams, so that the bio.tools entries can be created and the Galaxy tools can be linked at the source.
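The listing of Galaxy tools with no bio.tools ID could be produced along these lines. This is only a sketch: the function name is hypothetical, and it assumes the entries use the "bio.tool id" and "Galaxy wrapper id" keys shown in the example entry earlier in this thread.

```python
from typing import Any, Iterable


def wrappers_without_biotools_id(entries: Iterable[dict[str, Any]]) -> list[str]:
    """Return the Galaxy wrapper ids of entries that carry no bio.tools ID."""
    missing = []
    for entry in entries:
        biotools_id = (entry.get("bio.tool id") or "").strip()
        if not biotools_id:
            # Nothing is guessed here: the entry is only reported.
            missing.append(entry.get("Galaxy wrapper id", "<unknown>"))
    return sorted(missing)
```

The resulting list is what would be fed back to the bio.tools and Galaxy teams.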

arash77 commented 5 months ago

Currently, it imports the tools without bio.tools IDs into imports/galaxy, while the tools with bio.tools IDs are imported into the data folder.

hmenager commented 5 months ago

Thank you for the modifications @arash77! Would you be able to import everything to imports/galaxy? It would make it easier to count the number of entries for Galaxy. Plus, eventually, we would adopt this for all "import sources" (bioconda, debian med, biii, etc.).

arash77 commented 5 months ago

Now it imports all the tools into imports/galaxy and copies those with bio.tools IDs into the data folder, if the tool folder exists.
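For readers who want the resulting flow in one place, here is a rough sketch of the behaviour described above. It is not the code from this PR: the function name, the `<wrapper id>.galaxy.json` file name, and the lower-cased bio.tools ID as folder name are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Any


def import_galaxy_tools(entries: list[dict[str, Any]], repo_root: Path) -> list[str]:
    """Write every entry to imports/galaxy; copy it to data/<id> if that folder exists.

    Returns the wrapper ids that were also copied into the data folder.
    """
    imports_dir = repo_root / "imports" / "galaxy"
    data_dir = repo_root / "data"
    imports_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for entry in entries:
        wrapper_id = entry["Galaxy wrapper id"]
        file_name = f"{wrapper_id}.galaxy.json"  # assumed naming scheme
        payload = json.dumps(entry, indent=4, sort_keys=True)

        # Step 1: everything goes to imports/galaxy, with or without bio.tools ID.
        (imports_dir / file_name).write_text(payload, encoding="utf-8")

        # Step 2: copy only when a bio.tools ID is present and its folder exists.
        biotools_id = (entry.get("bio.tool id") or "").strip()
        if not biotools_id:
            continue
        tool_dir = data_dir / biotools_id.lower()  # assumed folder naming
        if tool_dir.is_dir():
            (tool_dir / file_name).write_text(payload, encoding="utf-8")
            copied.append(wrapper_id)
        # No folder is created for unmatched IDs; those cases stay visible.
    return copied
```

Entries whose bio.tools ID has no matching folder are deliberately left in imports/galaxy only, in line with the earlier comment that such mismatches should be investigated rather than papered over.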

research-software-ecosystem / utils

Galaxy tool bulk import #12