research-software-ecosystem / content

A metadata commons to store research software metadata
Creative Commons Attribution 4.0 International

Data dump to GitHub #2

Open joncison opened 5 years ago

joncison commented 5 years ago

From @joncison on September 4, 2018 11:27

Nightly dump of all content (in XML and JSON formats?) to GitHub, as a convenience (or at least, to begin with, just a one-off dump)

Copied from original issue: bio-tools/biotoolsRegistry#355

joncison commented 5 years ago

We already have a repo for this (https://github.com/bio-tools/bio.tools-content) but the name may be a bit crappy? How about:

Preferences? I'll need to spell out this is strictly for experimental purposes (like what I said here already).

And in what format:

Preferences?

I'd personally prefer XML because it will make the validation direct and easier (and avoid any drift to using not very rigorous JSON schema equivalents of biotoolsSchema etc.)

And what about the structure - I propose one folder per tool, where the folder name is the bio.tools toolID - which allows for adding other tool descriptors / files / formats under a common directory. Also one XML with everything in.
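A minimal sketch of the proposed layout, assuming a top-level folder called `data` and a descriptor named after the toolID. Both names are hypothetical; nothing in this comment fixes them.

```python
from pathlib import PurePosixPath


def tool_paths(tool_id, root="data"):
    """Return hypothetical repository paths for one bio.tools entry."""
    folder = PurePosixPath(root) / tool_id
    return {
        # One folder per tool, named after the bio.tools toolID.
        "folder": folder,
        # Assumed file name for the tool's XML descriptor.
        "descriptor": folder / f"{tool_id}.xml",
    }


paths = tool_paths("signalp")
print(paths["folder"])      # data/signalp
print(paths["descriptor"])  # data/signalp/signalp.xml
```

Other descriptors or formats for the same tool would sit next to the XML file in the same folder.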

Preferences?

cc @bgruening @hmenager @hansioan : what do you think?

joncison commented 5 years ago

From @bgruening on December 8, 2018 10:56

My gut feeling is https://github.com/bio-tools/tools. It's bio.tools, so tools makes a lot of sense :)

I would prefer YAML, as this is currently the easiest format for people to edit in an editor or browser. This can change if we dump the final version and when we have a curation interface, but for now I would prefer YAML. The shim is hopefully not complicated to write and would be used on CI to 1) convert it to XML and 2) validate any changes.

Thanks @joncison for working on this.

joncison commented 5 years ago

OK thanks!

Just a note that https://dev.bio.tools/api/tool/ is currently serving a mess (in all of XML, JSON and YAML :) ) - but this has been sorted locally.

I plan to play more with shims next week, so let's see how this goes ...

And @bgruening - what about the directory structure; are you happy with folders as we talked about previously ?

joncison commented 5 years ago

From @scapella on December 8, 2018 20:22

IMHO I'd go for either https://github.com/bio-tools/resources or https://github.com/bio-tools/content, as there is more content than just tools.

Regarding the structure of the repo, I'd suggest having a general folder for bio.tools in the repo and then one per tool. How do you want to handle versions? Different subfolders in the same tool folder?

Cheers,

Salva

joncison commented 5 years ago

From @hansioan on December 10, 2018 11:39

@joncison @hmenager @bgruening Why not all of them? I would prefer it to be JSON of course :) , but perhaps the best is to have all three (JSON, XML, YAML). bio.tools supports that.

https://bio.tools/api/t?page=1&format=json
https://bio.tools/api/t?page=1&format=yaml
https://bio.tools/api/t?page=1&format=xml

In the case of biotoolsSchema xml for now we only have that on a per tool basis (example shown on dev but will soon work on production too) https://dev.bio.tools/api/signalp?format=xml
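The paged URLs above can be generated rather than typed by hand. This is only a sketch: the endpoint and the `page` and `format` parameters come from the URLs in this comment, while the helper name `list_url` is made up for illustration.

```python
from urllib.parse import urlencode

BASE_URL = "https://bio.tools/api/t"  # endpoint quoted in this comment
FORMATS = ("json", "yaml", "xml")     # the three formats mentioned above


def list_url(page, fmt):
    """Build the URL for one page of the tool list in the given format."""
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    return f"{BASE_URL}?{urlencode({'page': page, 'format': fmt})}"


for fmt in FORMATS:
    print(list_url(1, fmt))
```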

joncison commented 5 years ago

From @jlgelpi on December 10, 2018 11:45

I would go for a single format for the repository (one that can be easily checked against a schema). Having several formats may introduce inconsistencies. Perhaps we can accept any input format (XML, JSON, YAML) and convert the data on the pull request.

joncison commented 5 years ago

Please let us know what you think @hmenager then I'll write back addressing all comments above ...

joncison commented 5 years ago

From @bgruening on December 10, 2018 11:59

And @bgruening - what about the directory structure; are you happy with folders as we talked about previously?

Yes. Folders are good.

How do you want to handle versions? different subfolders in the same tools folders?

Most likely. Would make sense. Whatever we do, we can change this easily later on. So nothing is set in stone imho.

I would prefer it to be JSON of course

@hansioan any reason? JSON is a subset of YAML so that should be fine for both worlds and conversion is easy.

Perhaps we can accept any input format (XML, json, yaml) and convert the data on the pull request.

I guess the idea was to accept only one format and then add all validation on CI. This validation could happen by intermediate conversion to XML if @joncison thinks that's best. I would prefer only one format in the main repo so as not to confuse users, but if other formats are needed we can have a bot that converts them automatically and syncs them to a bio.tools-json repo etc. ...
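A rough sketch of what the conversion half of such a CI shim could look like, using only the Python standard library. The field names in the example entry are illustrative and the element mapping is an assumption, not the biotoolsSchema mapping. Schema validation itself is left out, since the standard library has no XSD validator and an external one would be needed.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(tag, value):
    """Recursively convert parsed JSON (dicts, lists, scalars) into an XML element."""
    element = ET.Element(tag)
    if isinstance(value, dict):
        for key, child in value.items():
            # A list becomes one repeated child element per item;
            # lists nested directly inside lists are not handled in this sketch.
            items = child if isinstance(child, list) else [child]
            for item in items:
                element.append(to_xml(key, item))
    elif isinstance(value, bool):
        element.text = "true" if value else "false"
    elif value is not None:
        element.text = str(value)
    return element


# Hypothetical entry; JSON is used as input because it parses with the stdlib.
entry = json.loads(
    '{"name": "ExampleTool", "biotoolsID": "exampletool",'
    ' "toolType": ["Command-line tool", "Web application"]}'
)
xml_text = ET.tostring(to_xml("tool", entry), encoding="unicode")
print(xml_text)
```

The keys must already be valid XML element names for this to work, which is one reason a real shim would need more care than this.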

joncison commented 5 years ago

From @redmitry on December 10, 2018 12:10

Hello,

I know that for a mere human being the form ?page=1&format=json is the natural way, as it allows an ordinary browser to be used for GET requests, but in terms of REST architecture it is better to use headers:

Accept: application/json
Range: tools=10-30

Response:
Content-Type: application/json
Content-Range: tools 10-30/20000

The advantage of standard HTTP pagination is that a client knows the total size from the beginning (headers come before the body) and can calculate the number of pages in the table while loading only one page.

Of course, nothing prevents anyone from implementing both forms.
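To illustrate the point about knowing the total up front, here is a small sketch that parses a `Content-Range` value like the one above and derives the page count. It assumes the custom `tools` unit follows the same inclusive first-last/total convention as HTTP byte ranges; the function names are hypothetical.

```python
import re

CONTENT_RANGE = re.compile(
    r"^(?P<unit>\w+) (?P<first>\d+)-(?P<last>\d+)/(?P<total>\d+)$"
)


def parse_content_range(header):
    """Split 'unit first-last/total' into (unit, first, last, total)."""
    match = CONTENT_RANGE.match(header.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range: {header!r}")
    return (
        match["unit"],
        int(match["first"]),
        int(match["last"]),
        int(match["total"]),
    )


def page_count(header):
    """Number of pages needed to cover the total, given one page's range."""
    _, first, last, total = parse_content_range(header)
    page_size = last - first + 1         # both ends of the range are inclusive
    return -(-total // page_size)        # ceiling division without floats


print(parse_content_range("tools 10-30/20000"))  # ('tools', 10, 30, 20000)
print(page_count("tools 10-30/20000"))           # 953
```

Note that the range 10-30 covers 21 items, so 20000 tools need 953 pages of that size.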

joncison commented 5 years ago

From @hansioan on December 10, 2018 12:20

Regarding the versions... Technically, in bio.tools we don't have fine-grained tracking of tool versions. In the new schema 3.0, which will go in this week if not today, we allow multiple version annotations per tool. The reason for this is that if there is no difference in annotation for a set of tool versions, they all go in the same tool annotation. If there IS a (significant) difference in tool functionality, and thus in annotation, between different tool versions, then that tool, along with the version, will go into a separate tool entry, given its own tool ID, with separate annotation and so on...

Thus I am not really sure if any of the folder structure is needed, at least not at the core of the tool descriptions. We can have the option of providing multiple copies of the same tool, separated by version, but that should be something that results from the initial structure, and not something that IS the initial structure.
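The rule described above can be sketched as a grouping step: versions that share an annotation collapse into one entry, and a version whose annotation differs gets its own entry. This is an illustration only. The data, the `group_versions` name and the use of a plain string as the "annotation" are all hypothetical, and how a separate entry gets its tool ID is not modelled.

```python
def group_versions(releases):
    """Group (version, annotation) pairs into entries by identical annotation.

    Returns a list of {"annotation": ..., "versions": [...]} dicts in
    first-seen order. Annotations must be hashable (here: plain strings).
    """
    entries = {}
    for version, annotation in releases:
        entries.setdefault(annotation, []).append(version)
    return [
        {"annotation": annotation, "versions": versions}
        for annotation, versions in entries.items()
    ]


# Hypothetical release history for a made-up tool.
releases = [
    ("1.2.23", "sequence trimming"),
    ("1.2.24", "sequence trimming"),
    ("2.0.0", "sequence trimming and filtering"),
]
for entry in group_versions(releases):
    print(entry["versions"], "->", entry["annotation"])
```

With this input the two 1.2.x releases share one entry and 2.0.0 ends up in a second one.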

joncison commented 5 years ago

From @scapella on December 10, 2018 12:49

@hansioan We have been arguing for quite a while about versioning and the best way to track that information. I think keeping it in mind from the very beginning of this effort might save us from investing much more time and effort at a later stage.

I agree with you that when there are no changes among versions, it is easy to handle. However, when there are major changes among versions of the same program, they should be modelled in the same entry rather than in an independent entry. For instance, if I look for trimAl and then go to the bio.tools entry, I'd like to have access to the latest version. It is quite likely I'm not aware that there are two versions and would look for the generic tool name.

Cheers,

Salva

joncison commented 5 years ago

From @hansioan on December 10, 2018 13:15

Yes, but having version-specific information for each tool gets us back to 2 years ago, when tools were accessed like https://bio.tools/toolid/version

This way was basically creating a tool whenever a new version appeared, and in 90% of all cases there were no (zero) differences between the annotations, except for the version property. We had a very famous example of a tool that appeared over 10 times in bio.tools with the same annotation, because the people were just going in and updating the version information whenever they released a new version (e.g. new tool between tool version 1.2.23 and 1.2.24).

There is no good way to do separate versions for each tool except modeling this in the API request, and even if there was we would still have to store versioned tools in the database.

While this can certainly apply to things like conda, containers and other projects that require exact versions, I don't think it applies as much to bio.tools. We must remember that 90% of our users just want to find a tool that meets their scientific requirements (focus on find).

All this being said, I am not opposed to having a good solution that can work for everyone, it is just something which is complicated and not in our list of main tasks right now. We have opened the code and once all the remaining plumbing tasks are done and we are ready to accept pull requests, perhaps this can be one of the initial tasks for contributors.

joncison commented 5 years ago

From @bgruening on December 10, 2018 13:17

@redmitry JSON will be served from the web service. But this issue is about the data storage in GitHub; no one disagrees that JSON will be served from the web service, imho.

@scapella @hansioan don't let the version discussion stop the initial drop, please. Versioning can be added at any time. Subfolders are easy to add for anyone if they care about the version or if the difference between the tools/benchmarks are too big. You simply need to adjust your bio.tools-github-parser to traverse recursively to all folders. And people that don't need this, can simply assume that the latest version is in the root dir. Really not a big deal. We can add this later if we need to.
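A sketch of the two ways a parser could read a tool folder under this scheme: root only (treated as the latest version) or recursively through any version subfolders. The `*.json` pattern and the folder names are assumptions, since no file format or naming has been agreed here.

```python
import tempfile
from pathlib import Path


def find_descriptors(tool_dir, recursive=False, pattern="*.json"):
    """List descriptor files in a tool folder.

    With recursive=False only the root of the folder is searched;
    with recursive=True version subfolders are searched as well.
    """
    matcher = tool_dir.rglob if recursive else tool_dir.glob
    return sorted(matcher(pattern))


# Build a throwaway layout: one root descriptor plus one version subfolder.
with tempfile.TemporaryDirectory() as tmp:
    tool = Path(tmp) / "exampletool"
    (tool / "1.0").mkdir(parents=True)
    (tool / "exampletool.json").write_text("{}")
    (tool / "1.0" / "exampletool.json").write_text("{}")

    print(len(find_descriptors(tool)))                  # 1
    print(len(find_descriptors(tool, recursive=True)))  # 2
```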

joncison commented 5 years ago

From @hmenager on December 10, 2018 14:23

@joncison As far as I'm concerned YAML would probably be the best choice, because:

  • it is the easiest format to track changes with git,
  • it is easier to manually edit for many people.

For the repository, I would go for either https://github.com/bio-tools/tools or https://github.com/bio-tools/content - not resources.

joncison commented 5 years ago

From @scapella on December 10, 2018 15:33

I'm fine with keeping the versioning stuff on the radar, but not with letting it stop the dumping process.

Salva

joncison commented 5 years ago

Quick update - will be revisiting this in new year - but for now a few points:

Let's keep this issue for the data dump and use this for technical discussions about a GitHub-based content architecture.

Pls. bear in mind the priority on the DK side is getting the deployment and open-dev process sorted, critical / high priority issues scheduled for the 2019 Q1 release, the website redesign, and other features with direct impact on end-users.

The new content architecture under discussion would be awesome, but depends on other components including an independent curators interface e.g. based on edamToolAnnotator and independent validation mechanism, e.g. biotoolsLint. It's a lot of work, hence a matter of priorities.

joncison commented 5 years ago

UPDATE

Let's keep this thread specifically for issues around the data dump using e.g. https://github.com/bio-tools/content/issues/7 for discussion around formats/transforms.

@hansioan & I have started on curation work to assure all entries give "canonical" tool descriptions, which is a pre-requisite really to the complete dump.

cc @piotrgithub1 @hmenager @bgruening

joncison commented 5 years ago

PS. one huge help @bgruening would be to work with @piotrgithub1 to get the local deployment working - as all the community bio.tools dev will depend on it. I know you started looking at this with @hansioan in FR last year.

bgruening commented 5 years ago

Yeah, I think we got the backend to start up in a redistributable conda environment.

joncison commented 5 years ago

UPDATE

Thanks to @hansioan & @piotrgithub1 we now have 2000 tools in JSON format in https://github.com/bio-tools/content

PJ says ... "I have added 2000 tools from the bio.tools registry to the GitHub repo in a prettified JSON format (https://github.com/bio-tools/content). The root folder is called ‘data’ for lack of a better name and, as we have previously agreed, almost everything about this can change. I invite you to try it out and we should organize a call in a couple of weeks (18th onwards?) to summarize the experiences and present and/or evaluate ideas on the data interoperability within the platform’s systems."

The curation work to assure all entries give "canonical" tool descriptions is almost done - some of the entries you see in the repo might disappear, or have their name / IDs changed over the next weeks.

cc @osallou @hmenager @bgruening @scapella

joncison commented 5 years ago

@piotrgithub1 now that the big curation work to ensure all entries give "canonical" tool descriptions is done (or "done enough" for now), can we dump everything (all 12,000+ entries) in JSON to https://github.com/bio-tools/content/tree/master/data ? I understand @hansioan has something to do the dump automatically?

On the call today there was general agreement it would be nice to do this, esp. in preparation for All-hands. Prob. easiest to delete what's there / start again (IDs, hence file names, will have changed in some cases) cc @bgruening @scapella @osallou
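For what it's worth, a "delete what's there / start again" dump could be sketched as below. This is not the existing dump tool. It assumes each entry is a dict carrying its ID under a `biotoolsID` key and that each entry is written to `<ID>.json` directly under the data folder; both are guesses about the layout.

```python
import json
import shutil
import tempfile
from pathlib import Path


def dump_entries(entries, data_dir):
    """Replace the contents of data_dir with one prettified JSON file per entry."""
    if data_dir.exists():
        shutil.rmtree(data_dir)  # start again, so files for renamed IDs do not linger
    data_dir.mkdir(parents=True)
    count = 0
    for entry in entries:
        target = data_dir / f"{entry['biotoolsID']}.json"  # assumed key and file name
        text = json.dumps(entry, indent=4, sort_keys=True, ensure_ascii=False)
        target.write_text(text + "\n", encoding="utf-8")
        count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "data"
    dump_entries([{"biotoolsID": "old_id", "name": "Example"}], data)
    # A second dump after an ID change leaves no stale file behind.
    dump_entries([{"biotoolsID": "new_id", "name": "Example"}], data)
    print(sorted(p.name for p in data.iterdir()))  # ['new_id.json']
```

Sorting keys and using a fixed indent keeps the files stable between dumps, so git diffs only show real changes.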

joncison commented 5 years ago

nudge-nudge @piotrgithub1 @hansioan - in case you think it'd be good to drop more files in time for All-hands