Curation work to ensure all entries give "canonical" tool descriptions

research-software-ecosystem / content

A metadata commons to store research software metadata

Creative Commons Attribution 4.0 International

Curation work to ensure all entries give "canonical" tool descriptions #10

Open joncison opened 5 years ago

joncison commented 5 years ago

One of many issues around GitHub-based content management for bio.tools.

joncison commented 5 years ago

Curation work to remove remaining entry redundancy, ensuring a non-redundant set of “canonical” tool descriptions. This is mostly done, but see e.g. https://github.com/bio-tools/biotoolsRegistry/issues/282

joncison commented 5 years ago

@hansioan, can we make a definitive list of actions here? To my mind it's this:

[x] tools imported from cloudIFB (did these yesterday)
[x] tools imported from Galaxy Pasteur - need to speak with @hmenager about this, ideally today
[x] 100 other known duplicates? (do you have a list we can work on, Hans?)
[x] verification of all currently unverified IDs (see https://github.com/bio-tools/biotoolsRegistry/issues/357)
[x] resolving remaining redundancies & issues from the systematic ID check
[x] checking that no homepage URLs are broken (with tooling to auto-annotate ones which are down) (see https://github.com/bio-tools/biotoolsRegistry/issues/207)
[x] redundant descriptions / entries of highly prevalent tools, e.g. BLAST, HMMER? (need to check the big names)
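
The homepage check in the list above could be tooled roughly as follows. This is only a sketch under stated assumptions, not the tooling actually used for bio.tools: the entry layout (a dict with `biotoolsID` and `homepage` keys), the added `homepage_status` / `homepage_ok` annotations and the function names are all hypothetical.

```python
from typing import Callable, Dict, Iterable, List
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if no response was obtained."""
    request = Request(url, method="HEAD", headers={"User-Agent": "link-check-sketch"})
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (e.g. 404).
        return err.code
    except (OSError, ValueError):
        # DNS failure, refused connection, timeout, or a malformed URL.
        return 0


def annotate_homepages(
    entries: Iterable[Dict[str, str]],
    fetch_status: Callable[[str], int] = http_status,
) -> List[dict]:
    """Return copies of the entries, annotated with the state of their homepage.

    The keys 'homepage_status' and 'homepage_ok' are hypothetical annotations,
    not fields known to exist in bio.tools.
    """
    annotated = []
    for entry in entries:
        status = fetch_status(entry["homepage"])
        annotated.append(
            {**entry, "homepage_status": status, "homepage_ok": 200 <= status < 400}
        )
    return annotated
```

Passing the status function in as a parameter keeps the annotation logic testable without network access. Some servers reject `HEAD` requests, so a real checker would probably fall back to `GET` before declaring a homepage down.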

joncison commented 5 years ago

@bgruening @piotrgithub1 @matuskalas - Hans, Herve and I have been making a major push on content clean-up (mostly ID verification, tool names and redundancy removal) in preparation for the data dump (https://github.com/bio-tools/content/issues/2).

Bearing in mind that the vision for bio.tools is to provide "canonical" descriptions of unique tools, may I ask that, if you have a view on clean-ups that need doing in this regard, you let us know very soon? E.g. do we satisfy the requirement for integration of data from bioconda etc.?

We hope to get the clean-up complete by end of next week.

bgruening commented 5 years ago

@joncison what do you need? Imho we can deal with this after the push. Bioconda will deal with whatever bio.tools drops. Bioconda has already started to annotate packages with bio.tools IDs, so ideally the IDs should stay stable and the content should be YAML from our side. But otherwise, we will know more once we start working on it :)

joncison commented 5 years ago

I was wondering whether any of you already know of content issues that would make the integration hard, duplicates (which I think are now nearly all resolved) being an obvious case. We also need to do this clean-up for a paper that will soon be submitted (we're all co-authors), which is the main reason for doing it now. Rest assured the dump will go ahead ASAP.

bgruening commented 5 years ago

Thanks @joncison! My take on this is: we create the bot and the content-validation scripts, and if things fail because of duplicates or the like, we will know and can fix it.
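
A content-validation check for duplicates of the kind mentioned here might look something like the sketch below. It is illustrative only and not the bot or scripts the thread refers to: the entry layout (dicts with `biotoolsID`, `name` and `homepage` keys), the function names and the matching heuristic (same lower-cased name, or same homepage ignoring scheme, `www.` prefix and trailing slash) are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple
from urllib.parse import urlsplit


def normalise_homepage(url: str) -> str:
    """Reduce a URL to host + path so trivially different spellings compare equal.

    Lower-casing the path is a deliberate simplification for this heuristic.
    """
    parts = urlsplit(url.strip().lower())
    host = parts.netloc[4:] if parts.netloc.startswith("www.") else parts.netloc
    return host + parts.path.rstrip("/")


def find_duplicates(entries: Iterable[Dict[str, str]]) -> List[Tuple[str, str, List[str]]]:
    """Return (field, shared value, IDs) for every name or homepage used by more than one entry."""
    by_name = defaultdict(list)
    by_homepage = defaultdict(list)
    for entry in entries:
        by_name[entry["name"].strip().lower()].append(entry["biotoolsID"])
        by_homepage[normalise_homepage(entry["homepage"])].append(entry["biotoolsID"])

    problems = []
    for field, groups in (("name", by_name), ("homepage", by_homepage)):
        for value, ids in groups.items():
            if len(ids) > 1:
                problems.append((field, value, sorted(ids)))
    return problems


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    sample = [
        {"biotoolsID": "blast", "name": "BLAST", "homepage": "https://blast.ncbi.nlm.nih.gov/"},
        {"biotoolsID": "blast_ncbi", "name": "blast", "homepage": "http://www.blast.ncbi.nlm.nih.gov"},
        {"biotoolsID": "hmmer", "name": "HMMER", "homepage": "http://hmmer.org"},
    ]
    for problem in find_duplicates(sample):
        print(problem)
```

A check like this only flags candidates. Deciding which entry is the canonical one would still need a curator.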

joncison commented 5 years ago

Very good, that would trap any currently unknown issues (and soon we'll have fixed all the known ones). PS: for the validation angle we already have biotoolsLint (currently just harvesting ideas).

joncison commented 5 years ago

Quick update @bgruening and @hmenager: @hansioan and I are making sweeping progress on the above, but it's a huge job ... will keep you posted. The (clean) content dump will follow once we're done.

joncison commented 5 years ago

Quick update @bgruening @piotrgithub1: @hansioan and I are done with the clean-ups (huge job). The only thing left is a final verification of IDs (for things added in the last few weeks). Once that's done I'll close this issue. I'm not claiming all the content is now perfect, but it's a lot better than it was a couple of months ago in terms of redundancy, sensible IDs, ownership etc. cc @hmenager

joncison commented 5 years ago

UPDATE: all things mooted on Feb 5 have been done, but let's keep this open because there will no doubt be further improvements to make.