plazi / arcadia-project


taxonomicTreatment @ BLR: comments by Rich Pyle #112

Open myrmoteras opened 4 years ago

myrmoteras commented 4 years ago

From: Donat Agosti <agosti@amnh.org>
Sent: Wednesday, October 9, 2019 2:23 AM
To: Richard Pyle <deepreef@bishopmuseum.org>
Subject: taxonomic treatments in BLR

Hi Rich,

We decided to go live with taxonomic treatments as a publication subtype at Zenodo, where we deposit each treatment together with extensive metadata, various download formats and a DataCite DOI. If you have a quiet moment, please have a look at some random treatments (https://zenodo.org/communities/biosyslit/search?page=1&size=20&subtype=taxonomictreatment) and let me know whether there is something that is utterly wrong, something that could be better, or whether you could live with it.

All will eventually include a link to the related item in Zoobank, if not already.

The metadata serves to discover the treatments. The file to upload, with all the taxonomically relevant information, is the attached XML file. Essentially, the goal is to be sure, once the processing and QC are over, that a new taxonomic name is available. This QC is in place already, and we are pondering the idea of adding a flag that states that the name is available. See e.g. https://zenodo.org/search?page=1&size=20&custom=%5Bdwc:taxonomicStatus%5D:sp.%20nov.

Thanks for your feedback

Donat

myrmoteras commented 4 years ago

Thanks, Donat! This is great! I’m frantically trying to get ready to go to Berlin on Friday (Lepidoptera workshop next week). It requires human evaluation to make an assertion that a name is truly available (at least until we revise the Code), but I think this is very-much on the right track!

A few comments/questions:

Looking at this record: https://zenodo.org/record/3477515#.XZ-h_ehKiHs Why are there extra spaces within the ZooBank LSID? The XML looks like this:

urn:lsid:zoobank.org:act: D3322 A 0B-3133-4BA9-A752-F5009 E 03B1ED
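A minimal sketch of how a consumer of the XML might repair an LSID like the one above, assuming the stray spaces are the only damage. The function name `normalize_zoobank_lsid` is hypothetical and not part of any Plazi or ZooBank tooling; it simply removes whitespace and checks that what follows the `urn:lsid:zoobank.org:act:` prefix parses as a UUID.

```python
import re
import uuid

ZOOBANK_PREFIX = "urn:lsid:zoobank.org:act:"


def normalize_zoobank_lsid(raw: str) -> str:
    """Remove all whitespace from an LSID and check that its tail is a UUID.

    Hypothetical helper for illustration only.
    """
    compact = re.sub(r"\s+", "", raw)
    if not compact.lower().startswith(ZOOBANK_PREFIX):
        raise ValueError(f"not a ZooBank act LSID: {raw!r}")
    tail = compact[len(ZOOBANK_PREFIX):]
    # uuid.UUID raises ValueError if the tail is not a well-formed UUID.
    parsed = uuid.UUID(tail)
    return ZOOBANK_PREFIX + str(parsed).upper()


print(normalize_zoobank_lsid(
    "urn:lsid:zoobank.org:act: D3322 A 0B-3133-4BA9-A752-F5009 E 03B1ED"
))
# urn:lsid:zoobank.org:act:D3322A0B-3133-4BA9-A752-F5009E03B1ED
```

Cleaning on the consumer side is only a workaround; the spaces would still need to be removed where the XML is generated.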

Near the bottom of this page, in the section for the treatment.html file, there is “md5:463fa4d7fc49248c4838520f9ebe0001” What does that UUID refer to? Same for md5:1139dc5433be34e952ee881f58d10ca0 under the XML treatment. Are these unique identifiers for the binary files themselves?

Do you have any control over the filenames that are downloaded? Or is that entirely a Zenodo thing? If you have control, might make it easier to embed one of the UUIDs in the file name.

This is maybe the most concerning thing:

**I thought we agreed to share the UUIDs whenever possible?** Why mint a new UUID for the PLAZI treatment when you could just reuse the existing UUID from ZooBank? I thought that was the whole point of our conversation at TDWG to determine that TNU=Treatment, and therefore should share the same UUIDs whenever possible.

That’s all I have time for now, but I’ll look a bit more closely later this weekend.

Aloha, Rich

myrmoteras commented 4 years ago

Hi Rich,

please find my replies below.

> Thanks, Donat! This is great! I’m frantically trying to get ready to go to Berlin on Friday (Lepidoptera workshop next week). It requires human evaluation to make an assertion that a name is truly available (at least until we revise the Code), but I think this is very-much on the right track.
>
> A few comments/questions:
>
> Looking at this record: https://zenodo.org/record/3477515#.XZ-h_ehKiHs Why are there extra spaces within the ZooBank LSID? The XML looks like this:
>
> urn:lsid:zoobank.org:act: D3322 A 0B-3133-4BA9-A752-F5009 E 03B1ED
>
> Near the bottom of this page, in the section for the treatment.html file, there is “md5:463fa4d7fc49248c4838520f9ebe0001” What does that UUID refer to? Same for md5:1139dc5433be34e952ee881f58d10ca0 under the XML treatment. Are these unique identifiers for the binary files themselves?

These guys are binary hashes (I think MD5) of the (immutable) primary deposition files, not identifiers outright. In IMF, we use the former for the latter, but that's IMF specific, and we do so only for the (often copyrighted and thus closed access) PDF articles proper. Treatment UUIDs are derived from in-PDF word positions and the PDF hash, though, and in a well-defined way.
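As a rough illustration of what such a checksum is for, the sketch below recomputes the MD5 of a downloaded file and compares it with the `md5:...` string shown on the record page. The helper names `md5_of_file` and `matches_checksum` are made up for this example; only Python's standard `hashlib` is used.

```python
import hashlib


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, listed_checksum):
    """Compare a local file against a checksum string such as 'md5:463f...'.

    Hypothetical helper: assumes the 'md5:<hex>' notation seen on the page.
    """
    algorithm, _, expected = listed_checksum.partition(":")
    if algorithm != "md5" or not expected:
        raise ValueError(f"unsupported checksum format: {listed_checksum!r}")
    return md5_of_file(path) == expected.lower()
```

A call such as `matches_checksum("treatment.html", "md5:463fa4d7fc49248c4838520f9ebe0001")` would return `True` only if the local copy is byte-for-byte identical to the deposited file, which is why the value describes a file's content rather than naming the treatment.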

> Do you have any control over the filenames that are downloaded? Or is that entirely a Zenodo thing? If you have control, might make it easier to embed one of the UUIDs in the file name.

We (Plazi) have no control over who requests any content from Zenodo, sorry, and we do not even have any respective statistics, either. We create depositions on Zenodo, but who downloads them and how they use them is beyond our control. Maybe Alex has some stats available in that regard ...

> This is maybe the most concerning thing:
>
> I thought we agreed to share the UUIDs whenever possible? Why mint a new UUID for the PLAZI treatment when you could just reuse the existing UUID from ZooBank? I thought that was the whole point of our conversation at TDWG to determine that TNU=Treatment, and therefore should share the same UUIDs whenever possible.

Well, the HTTP URI is still the Plazi internal UUID. And said UUID is generated from the IMF (original PDF publication converted for analysis) in a well-defined way. We do import and resolve the ZooBank UUID if available as well, but please understand that any serious software has to be self-contained regarding its data IDs ... I'd be a pretty bad software architect if I didn't follow that principle. However, it is quite possible to change the Zenodo upload to prefer any ZooBank UUID (if available), of course, and the uploads do include ZooBank LSIDs wherever available. Let's discuss that in Leiden, looking forward to sharing another pint with you ...

> That’s all I have time for now, but I’ll look a bit more closely later this weekend.

If UUIDs are the only thing that jumped your eye thus far, I guess we can start mass-uploading. Unless you spot anything else, of course.

Best, Guido

myrmoteras commented 4 years ago

Hi Guido,

> These guys are binary hashes (I think MD5) of the (immutable) primary deposition files, not identifiers outright.

Got it! That makes perfect sense.

> Treatment UUIDs are derived from in-PDF word positions and the PDF hash, though, and in a well-defined way.

That's unfortunate. I see the value in doing this (basically same reason Dima would hash UUIDs out of name-strings). But for the same reasons that the binary hashes of the deposition files are not, strictly speaking, identifiers, we shouldn't really treat these UUIDs as identifiers for treatments either.
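To make the distinction concrete, here is a sketch of a content-derived UUID. The thread does not describe Plazi's actual derivation, so everything here is an assumption for illustration: the namespace, the input format, and the function name `derive_treatment_uuid` are invented, and the standard name-based `uuid.uuid5` stands in for whatever scheme is really used.

```python
import uuid

# Hypothetical namespace, invented for this sketch only.
SKETCH_NAMESPACE = uuid.uuid5(
    uuid.NAMESPACE_URL, "https://example.org/treatment-sketch"
)


def derive_treatment_uuid(pdf_md5: str, first_word: int, last_word: int) -> uuid.UUID:
    """Derive a UUID deterministically from a PDF hash and a word range.

    Illustrative assumption, not the real Plazi algorithm.
    """
    name = f"{pdf_md5.lower()}:{first_word}-{last_word}"
    return uuid.uuid5(SKETCH_NAMESPACE, name)


a = derive_treatment_uuid("463fa4d7fc49248c4838520f9ebe0001", 120, 480)
b = derive_treatment_uuid("463fa4d7fc49248c4838520f9ebe0001", 120, 480)
c = derive_treatment_uuid("463fa4d7fc49248c4838520f9ebe0001", 120, 481)
print(a == b)  # True
print(a == c)  # False
```

The same inputs always reproduce the same UUID, which is convenient for re-processing, but any change to the inputs (a different PDF rendition or a shifted word boundary) yields a different UUID for what a taxonomist would consider the same treatment.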

> We (Plazi) have no control over who requests any content from Zenodo

I was talking about the filenames used for the downloads, not metrics on who is downloading or how much/etc. The annoyance is that Zenodo assigns the same name to all the XML downloads. So if you fetch more than one they get appended with "(1)", "(2)", etc. Not a major issue, but still (slightly) annoying.

> Well, the HTTP URI is still the Plazi internal UUID. And said UUID is generated from the IMF (original PDF publication converted for analysis) in a well-defined way.

Understood (see above).

> We do import and resolve the ZooBank UUID if available as well, but please understand that any serious software has to be self-contained regarding its data IDs

Obviously! That's why we all mint our own. But what we all discussed in Venice(?) TDWG was that, whenever we can recognize that a UUID already exists for an object, we should all share it. Otherwise, there's practically no point in using UUIDs in the first place. For example, I will adopt all PLAZI UUIDs for treatments that are not already in GNUB. There were always going to be exceptions, but it's just a shame that now the exceptions will be the shared UUIDs, and the "normal" will be cross-linked (effectively redundant) UUIDs.

> However, it is quite possible to change the Zenodo upload to prefer any ZooBank UUID (if available), of course, and the uploads do include ZooBank LSIDs wherever available.

The Zenodo representation doesn't matter. The dream/hope was that we'd all move to UUIDs to make it much easier to share identifiers (the reason I've pushed so hard for UUID over other identifiers is that they can be globally unique even if minted separately, yet still shared across all the data silos if/when identity discovery is made). The plan from Venice was for GNUB/PLAZI to set an example for the rest of the community. It's not a big deal, and now that you've minted so many new ones already, there's no point in changing things. I'll still adopt PLAZI treatment UUIDs for GNUB when GNUB doesn't have them already, and cross-link them when they do have them already -- it just would have been nice to have the latter as the exception, rather than the norm. In any case, it will all get sorted out eventually.

> Let's discuss that in Leiden, looking forward to sharing another pint with you ...

Unfortunately, I won't be in Leiden. I will be in Berlin for a Lepidoptera workshop next week, but I have a prior conflict that can't be changed during BiodiversityNext.

> If UUIDs are the only thing that jumped your eye thus far, I guess we can start mass-uploading. Unless you spot anything else, of course.

I'll try to give it a closer look. Are you mostly interested in feedback on the XML schema, or in how it's implemented? If the former, do you have a Schema template file you can send me? If the latter, then there may be other things like the broken up ZooBank UUID as noted in my previous email. I'll find time over the weekend to look more closely.

All the UUID stuff aside (which is relatively minor, in the grand scheme of things), the rest of it looks very promising!

Aloha, Rich

myrmoteras commented 4 years ago

An issue has been submitted to the Zenodo GitHub repository: https://github.com/zenodo/zenodo/issues/1885

The download files for the deposits all have the same name, e.g. treatment.html for taxonomic treatments (e.g. https://zenodo.org/record/3477523), but it seems that in other cases they have individual names (see e.g. https://zenodo.org/record/3479976#.XaBKVkYzZPY: big_344740.jpg).

We have a suggestion from outside to rename these files so that they reflect either the UUID of the treatment or the Zenodo DOI, so that somebody who downloads all of them does not have to rename them. Is this possible, and if so, is this done during the upload?
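Until the files are named individually at the source, a downloader could work around the clash locally. The sketch below is one possible approach, not anything Zenodo or Plazi provides: the hypothetical helper `unique_download_name` pulls the numeric record ID out of a record URL of the form used in this thread and prefixes it to the generic filename.

```python
import re


def unique_download_name(record_url: str, filename: str) -> str:
    """Build a per-record filename from a Zenodo record URL.

    Hypothetical helper; assumes URLs of the form https://zenodo.org/record/<id>.
    """
    match = re.search(r"/record/(\d+)", record_url)
    if match is None:
        raise ValueError(f"no record ID found in {record_url!r}")
    return f"zenodo_{match.group(1)}_{filename}"


print(unique_download_name("https://zenodo.org/record/3477523", "treatment.html"))
# zenodo_3477523_treatment.html
```

Using the record ID keeps the example independent of how treatment UUIDs are assigned; a treatment UUID or the DOI suffix could be substituted in the same way.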