Closed arash77 closed 5 months ago
Thank you very much @arash77 ! We got some questions concerning the import logic:
This is an example of an entry that is added:
{
"Conda id": "abricate",
"Conda version": "1.0.1",
"Description": "Mass screening of contigs for antiobiotic resistance genes",
"EDAM operation": "Antimicrobial resistance prediction",
"EDAM topic": "Genomics, Microbiology",
"Galaxy tool ids": "abricate, abricate_list, abricate_summary",
"Galaxy wrapper id": "abricate",
"Galaxy wrapper owner": "iuc",
"Galaxy wrapper parsed folder": "https://github.com/galaxyproject/tools-iuc/tree/main/tools/abricate",
"Galaxy wrapper source": "https://github.com/galaxyproject/tools-iuc/tree/master/tools/abricate/",
"Galaxy wrapper version": "1.0.1",
"No. of tool users (2022-2023) (usegalaxy.eu)": 1764,
"Source": "https://github.com/tseemann/abricate",
"Status": "Up-to-date",
"ToolShed categories": "Sequence Analysis",
"ToolShed id": "abricate",
"Total tool usage (usegalaxy.eu)": 496717,
"bio.tool description": "Mass screening of contigs for antimicrobial resistance or virulence genes.",
"bio.tool id": "ABRicate",
"bio.tool ids": "ABRicate",
"bio.tool name": "ABRicate",
"https://usegalaxy.eu": "(3/3)",
"https://usegalaxy.fr": "(3/3)",
"https://usegalaxy.org": "(3/3)",
"https://usegalaxy.org.au": "(3/3)"
}
Currently, the job recreates each entry, but @bgruening commented that the JSON should be updated rather than recreated. One concern here is that our source, galaxy_tool_metadata_extractor, is still highly WIP, so the JSON name/value pairs will still change a lot over the next months. E.g., the "https://usegalaxy.eu", "https://usegalaxy.fr" ... columns will probably be changed to one column. How should this propagate over here? We could update only the changed JSON name/value pairs if that is the preferred way, but if a column is removed upstream, it will then dangle around forever (which might not be a problem at all).
Is there any metadata we should explicitly not port here?
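To make the "update instead of recreate" option concrete, here is a minimal sketch of what such a merge could look like. The function name `update_entry` and the "Available on" key in the usage example are hypothetical and not part of the extractor or of this PR; the sketch only assumes that each entry is a flat JSON object like the one above.

```python
def update_entry(existing: dict, incoming: dict):
    """Merge a freshly extracted entry into the stored one.

    Returns a tuple (merged, changed, stale):
      merged  - the stored entry with incoming name/value pairs applied
      changed - only the pairs that were added or whose value differs
      stale   - keys present in the stored entry but gone upstream;
                they are kept in `merged`, i.e. they "dangle"
    """
    merged = dict(existing)
    changed = {}
    for key, value in incoming.items():
        if key not in existing or existing[key] != value:
            changed[key] = value
            merged[key] = value
    stale = sorted(key for key in existing if key not in incoming)
    return merged, changed, stale


if __name__ == "__main__":
    stored = {"Conda version": "1.0.0", "https://usegalaxy.eu": "(3/3)"}
    # "Available on" is a made-up replacement column for illustration only.
    fresh = {"Conda version": "1.0.1", "Available on": "usegalaxy.eu"}
    merged, changed, stale = update_entry(stored, fresh)
    print(changed)  # pairs that would be written
    print(stale)    # columns that would dangle unless pruned explicitly
```

Reporting the stale keys separately would leave the decision to prune or keep removed columns to the job (or to a reviewer) instead of hard-coding it in the merge.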
@hmenager @bgruening @matuskalas what do you think ?
If there is a tool in the Galaxy metadata list that has a bio.tools ID that does not match with the folders in the RSEc metadata, should we add the folder ?
No, those cases are "interesting" and we should find out why they are not matching. E.g. is Galaxy wrongly annotated? Is the folder here missing?
We are using the bio.tools ID to match the entries in the Galaxy metadata to the RSEc metadata. What should we do with entries without a bio.tools ID? Leave them out? Use the tool's name / ID?
I'm not sure if the Galaxy tool name (what is that?) is a good choice. Ideally, it will be something like the conda package name (assuming this is more the upstream name).
We could check whether the conda id matches the Galaxy wrapper id (to avoid conda ids such as python / wget / R), and only take the conda id in those cases.
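The matching rule discussed here could be sketched as follows. This is only an illustration of the proposal, not the logic implemented in the PR: the function name `match_key` is hypothetical, the key names are taken from the example entry above, and the case-insensitive comparison is an assumption.

```python
from typing import Optional, Tuple


def match_key(entry: dict) -> Optional[Tuple[str, str]]:
    """Pick the identifier used to match a Galaxy entry to an RSEc folder.

    Prefers the bio.tools ID. Falls back to the conda id only when it
    equals the Galaxy wrapper id, which filters out generic dependencies
    such as python, wget or R. Returns None when neither rule applies.
    """
    biotools_id = (entry.get("bio.tool id") or "").strip()
    if biotools_id:
        return ("biotools", biotools_id)

    conda_id = (entry.get("Conda id") or "").strip()
    wrapper_id = (entry.get("Galaxy wrapper id") or "").strip()
    # Assumption: ids are compared case-insensitively.
    if conda_id and conda_id.casefold() == wrapper_id.casefold():
        return ("conda", conda_id)

    return None
```

Entries for which this returns None would be the "interesting" cases to list and report rather than to guess an ID for.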
Adding the bulk import also to the content repository: https://github.com/research-software-ecosystem/content/pull/656
@arash77 we decided to store all tool JSON files that do not yet have a corresponding folder in the content/data dir in a dedicated galaxy-tools folder in this location: https://github.com/research-software-ecosystem/content/tree/master/datasets Can you add this to the CI?
Thanks to all of you for this great work! There are multiple potential explanations for a Galaxy tool missing a bio.tools ID. I can think of a Galaxy annotation problem, or the actual tool not being in bio.tools. I would propose the following:
datasets/galaxy, but I would rather go somewhere else, as "datasets" has historically been used for many "test imports" and it remains pretty much uncontrolled. So, maybe imports/galaxy?
data folder.
I agree with @bgruening that we should not try to "guess" bio.tools IDs at this stage (we've done it for other sources, and it was a mistake). I'd rather be interested in listing all the Galaxy tools with no bio.tools IDs and feed the information back to bio.tools and Galaxy teams so that the bio.tools entries can be created and the Galaxy tools can be linked at the source.
Currently, it will import the file without bio.tools IDs into imports/galaxy, while the file with bio.tools IDs will be imported into the data folder.
Thank you for the modifications @arash77! Would you be able to import everything to imports/galaxy? It would make it easier to count the number of entries for Galaxy. Plus, eventually, we would adopt this for all "import sources" (bioconda, Debian Med, biii, etc.).
Now, it will import all the tools into imports/galaxy and copy those with bio.tools IDs into the data folder, if the tool folder exists.
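For readers who want to see the resulting behaviour in one place, the import step described above could be sketched roughly like this. It is not the code from the PR: the function name `import_tools`, the `<id>.galaxy.json` file naming, and the lowercasing of the bio.tools ID to find the tool folder are all assumptions made for illustration.

```python
import json
import shutil
from pathlib import Path


def import_tools(tools: list, imports_dir: Path, data_dir: Path) -> list:
    """Write every tool to imports_dir; copy matched ones into data_dir.

    A tool is copied only if it has a bio.tools ID and a folder for that
    ID already exists under data_dir. No folders are created in data_dir.
    Returns the list of paths copied into data_dir.
    """
    imports_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for tool in tools:
        # Assumption: the Galaxy wrapper id is used as the file stem.
        target = imports_dir / f"{tool['Galaxy wrapper id']}.galaxy.json"
        target.write_text(json.dumps(tool, indent=2, sort_keys=True))

        biotools_id = (tool.get("bio.tool id") or "").strip()
        if not biotools_id:
            continue  # stays in imports only; no ID is guessed

        # Assumption: tool folders are named after the lowercased ID.
        tool_dir = data_dir / biotools_id.lower()
        if tool_dir.is_dir():
            destination = tool_dir / f"{biotools_id.lower()}.galaxy.json"
            shutil.copy(target, destination)
            copied.append(destination)
    return copied
```

Keeping a full copy under imports/galaxy means the number of Galaxy entries can be counted there, independently of how many were matched to an existing tool folder.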
This pull request adds a GitHub action for bulk importing galaxy tools from https://github.com/galaxyproject/galaxy_tool_metadata_extractor.