gnames / gnverifier

GNverifier verifies scientific names against more than 100 biodiversity databases
https://verifier.globalnames.org
MIT License

repeatability of results: online (gnames apis) vs standalone tools (gndiff & gnparser) #87

Open abubelinha opened 2 years ago

abubelinha commented 2 years ago

Over the last few weeks I opened many issues and questions about gnverifier / gnames / resolver / gnparser / gndiff, trying to tune them for my use cases, which I hope are similar to those of many other users.

As those issues get solved, the results returned by the apis improve. But improving also implies "changing", and I think this is a major issue for some use cases.

When it comes to publishing scientific results (theses, reports, papers, whatever), repeatability is a must. If I need to publish a curated list of scientific names, I can describe my protocol (e.g. #85), and I can also provide my data sources as attached files, but there is no way I can provide the software I used to process those data following my protocol, because it was an api running on a remote server. And there is no way to change this, since old apis and servers need to be removed. Their names datasources also need to be updated, so even the same api version might return different results because of those updates.

On the other hand, as far as I can tell, if I do my work using a particular release version of gnparser and gndiff, my results will be 100% repeatable in the future as they work completely offline, am I right?

I am currently using the online resolver/verifier for several use cases. For many of them a changing, up-to-date online api is the best option (e.g. daily checking of the names of new specimens entering a collection). But for published works, a protocol where I download a given version of a datasource and process it offline is a much better option.

In this sense, I see gndiff+gnparser as the most important gnames' tools for scientific publications. I open this issue not only for encouraging you to further develop them, but also to raise the question about what to do (as of today) if I want to publish some work and describe a protocol which was based on results returned by a current or past gnames api version.

Is there any way of citing "I used this version of gnverifier api" and also providing some kind of link (github? edit: I have seen a couple of Zenodo links cited here) which exactly reflects its code at that time, so that whoever wants to repeat my results in the future can download the exact version from github and install everything needed to repeat my work? That is, an exact replica of gnames services at a given moment in time (of course, given that I also downloaded the current gnames database dump at that time, stored it in a permanent repository somewhere, and provided a link in my publication).

I know that would imply a lot of work and nobody would take the time to do that in practice. But in theory, would it be possible? As of today, can we state that a work which used gnverifier api results is "in theory" repeatable 10 years from now, or is this not possible?

And if it is, I would suggest not only to document the how-to, but also that apis could somehow return the how-to info if we request it (some sort of citation parameters, providing necessary links to github, db dumps, etc).
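To make the suggestion concrete, here is a minimal sketch of what such "citation info" could look like if a user assembled it today by hand. Everything in it is an assumption for illustration: the `CitationInfo` record and its field names are hypothetical, and no gnames api currently returns anything like this. The only idea it shows is pairing a tool version and source link with a checksum of the archived database dump, so a reader can confirm they have the same file.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass(frozen=True)
class CitationInfo:
    # Hypothetical record: these field names are illustrative only and are
    # not part of any existing gnames / gnverifier response.
    tool: str
    tool_version: str
    source_url: str
    dump_url: str
    dump_sha256: str


def citation_json(info: CitationInfo) -> str:
    """Serialize the record with sorted keys so the output is stable."""
    return json.dumps(asdict(info), indent=2, sort_keys=True)
```

The resulting JSON could be stored next to the archived dump in the same permanent repository and linked from the publication.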

dimus commented 2 years ago

It is indeed a problem. And it is not only the code, because the database evolves as well, although it has mostly stayed backward compatible so far. However, nothing prevents a situation where an important feature would break that backward compatibility. So I guess a solution would be

  1. Figure out how to monitor database versioning. The database is actually defined by this internal package, https://github.com/gnames/gnidump, which is the equivalent of walking around the house alone in pajamas (no docs, bad architecture, no versions), so it would need to be improved. It would need to get to v1, and every time there is a breaking change in the database, increase the major version number to v2, v3 etc.
  2. Add version number to sql dump file at http://opendata.globalnames.org/dumps/
  3. gnames version should return its own version + version of gnmatcher
  4. Every major version of database dumps has one latest file (something like dump-v1.3.6, dump-v2.0.2)

That gives a theoretical possibility to put together a verification system using a particular version of gnames + gnmatcher + database.
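As a rough sketch of point 4, assuming dump files were named exactly like the examples above (`dump-v1.3.6`, `dump-v2.0.2`), picking the one latest file per major version could look like this. The naming pattern and the function names are assumptions for illustration; the dumps are not versioned this way today.

```python
import re
from typing import Dict, Iterable, Tuple

Version = Tuple[int, int, int]

# Assumed naming scheme, taken from the "something like dump-v1.3.6" example.
_DUMP_RE = re.compile(r"^dump-v(\d+)\.(\d+)\.(\d+)$")


def parse_dump_version(name: str) -> Version:
    """Extract (major, minor, patch) from a name such as 'dump-v1.3.6'."""
    match = _DUMP_RE.match(name)
    if match is None:
        raise ValueError(f"not a versioned dump name: {name!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def latest_per_major(names: Iterable[str]) -> Dict[int, str]:
    """Map each major version to the dump name with the highest version."""
    latest: Dict[int, Tuple[Version, str]] = {}
    for name in names:
        version = parse_dump_version(name)
        major = version[0]
        # Tuples compare numerically, so (1, 10, 0) sorts after (1, 3, 6).
        if major not in latest or version > latest[major][0]:
            latest[major] = (version, name)
    return {major: name for major, (_, name) in latest.items()}
```

Comparing parsed integer tuples rather than raw strings matters here, because a plain string sort would place `dump-v1.10.0` before `dump-v1.3.6`.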

It does not solve the problem of data changing all the time, but I think that for most data-sources data change is cumulative, so results should be close, albeit sometimes not identical.

abubelinha commented 2 years ago

Quite a lot of work.

So to be realistic, I think we are much closer to a day where I can create a replicable protocol using this combination:

All these are versionable, downloadable, easily citable and standalone executable. I will closely follow gndiff evolution ;)

dimus commented 2 years ago

Usually I use the formula: work/users_num

I think it is something that everybody who publishes their results would need, so I think it is not so much work in the end. I'll keep it open and close when the system is in place

abubelinha commented 2 years ago

> I think it is something that everybody who publishes their results would need, so I think it is not so much work in the end. I'll keep it open and close when the system is in place

Great. Not sure if you now mean the gnverifier option, the gndiff option, or both. But any advance would be good for "theoretical" repeatability.

As for really practical repeatability, I think the gndiff approach is the only good one: it would be easy to replicate something as long as you use the same offline tools, but hardly anybody would take on the task of replicating the whole gnames services as they were at some time in the past, just to review the quality of a small experiment or checklist.

dimus commented 2 years ago

for gndiff it should be easy, it has no remote dependencies, so just its version defines the result

abubelinha commented 2 years ago

Yes I agree. Version plus a given combination of request parameters, since it would be best to give users the option to define as much as possible the matching behaviour (of course with default values for everything, to avoid undesired CLI complexity).

Either that or using an editable default config file, so users can see the default values and modify them as needed.
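A minimal sketch of that idea, "version plus a given combination of parameters defines the result", might look like the following. The parameter names in `DEFAULTS` are hypothetical placeholders and are not real gndiff options; the point is only that defaults are visible, overrides are explicit, and the final combination can be reduced to one fingerprint to report in a publication.

```python
import hashlib
import json
from typing import Any, Dict, Optional

# Hypothetical defaults for illustration; these are NOT actual gndiff settings.
DEFAULTS: Dict[str, Any] = {
    "match_mode": "exact_then_fuzzy",
    "max_edit_distance": 1,
}


def resolve_params(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Merge user overrides into the defaults, rejecting unknown keys."""
    params = dict(DEFAULTS)
    for key, value in (overrides or {}).items():
        if key not in DEFAULTS:
            raise KeyError(f"unknown parameter: {key}")
        params[key] = value
    return params


def run_fingerprint(tool_version: str, params: Dict[str, Any]) -> str:
    """Hash the tool version and the resolved parameters into one identifier."""
    payload = json.dumps({"version": tool_version, "params": params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Two runs that share a fingerprint used the same tool version and the same effective settings, whether those settings came from defaults or from overrides.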

abubelinha commented 2 years ago

Somehow related, but a bit off-topic. I have seen some Zenodo links related to your work (i.e. https://doi.org/10.5281/zenodo.5111543). A couple of questions:

Just looking for advice so I might decide to use github and/or zenodo for versioning a checklist in the future.

Thanks a lot in advance

dimus commented 2 years ago

Someone wanted to cite gnames, so I created Zenodo link for that purpose. Being lazy, I prefer to avoid unnecessary work, so I decided not to update these links, until someone requests a change again :)

abubelinha commented 2 years ago

OK. I thought you used some kind of auto-backup from github and zenodo.

As for the difference between github tree v.xxx and release v.xxx, do you have any opinion?

dimus commented 2 years ago

I think tree/vx.x.x and vx.x.x mean the same. For github links I usually use something like https://github.com/gnames/gnverifier/releases/tag/v0.8.2