API upload changes entries

myrmoteras commented 6 years ago

When a deposit that is already in BLR is uploaded via GGI, where it has been marked up, the entries are changed:

All the author affiliations are deleted
The document is changed to closed access
The author names are changed

I think that at least open access should be kept, and the author affiliations should not be deleted.

In case of conflicts, should the entries from the article be kept?

gsautter commented 6 years ago

In principle, I completely agree ... however, there's this one problem:

suppose the article comes from GGI batch and goes onto Zenodo
you then open it in GGI on your machine and make some corrections, say to the author names
after that, you save it back to the server, triggering an update on Zenodo

Now what should the Zenodo uploader do? Retain what's already on Zenodo, i.e., the errors you just corrected, or write your update through to Zenodo, like currently implemented?

This is hardly a bug, more like a general conundrum ... the "latest change wins" approach is quite frequently used in resolving related conflicts in data synchronization. This doesn't mean I'm not aware of the problem, but it does mean I currently cannot think of a solution that will cover all the cases.
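A minimal sketch of the "latest change wins" rule, to make the trade-off concrete. The `MetadataRecord` class, the field names, and all values are hypothetical illustrations, not the actual GGI or Zenodo data model:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class MetadataRecord:
    # Hypothetical container; not the real GGI or Zenodo data model.
    fields: dict
    modified: datetime


def latest_change_wins(local: MetadataRecord, remote: MetadataRecord) -> MetadataRecord:
    """Return whichever record was modified most recently, as a whole."""
    return local if local.modified >= remote.modified else remote


# Made-up example values, for illustration only.
zenodo = MetadataRecord(
    fields={"access_right": "open", "affiliations": ["Plazi"], "authors": ["Smith, J."]},
    modified=datetime(2018, 1, 10),
)
ggi = MetadataRecord(
    fields={"access_right": "closed", "affiliations": [], "authors": ["Smith, John"]},
    modified=datetime(2018, 3, 5),
)

winner = latest_change_wins(ggi, zenodo)
print(winner.fields["authors"])       # the corrected author name goes through
print(winner.fields["affiliations"])  # the affiliations are overwritten as well
```

Because the newer record replaces the older one wholesale, the rule cannot tell a deliberate correction (the author name) from a value that was simply never extracted (the affiliations).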

gsautter commented 6 years ago

It might be an option to go differently about different elements in the metadata, but that would have to be adjusted continuously as we continue to expand the range of what metadata we extract. Not an ideal approach at all ...

gsautter commented 6 years ago

To illustrate the aforementioned conundrum a bit further, just imagine comparing two sets of metadata X and Y for the same publication, stemming from two respective locations A and B that synchronize with one another. Now if X contains some element E that is missing in Y, there are two options:

E originally was present in Y, but was removed there because it was erroneous. Now that correction is to be propagated to X.
E is a justified part of X, and the creators of Y simply failed to add it (maybe because their tools don't observe X). Now E has to be added to Y.

Looking at the data in X and Y alone, there is just no way of telling those two diametrically different situations apart. Now from a different point of view, Linked Open Data is a great tool for datasets to combine and fill in their mutual gaps. However, contradictions in such dataset merge-ups will result in the very same conundrum outlined above.
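The two cases can be replayed in a few lines. This is only an illustrative sketch; the dictionaries and the `diff_missing` helper are hypothetical stand-ins for the metadata sets X and Y:

```python
def diff_missing(x: dict, y: dict) -> set:
    """Return the keys that are present in x but absent from y."""
    return set(x) - set(y)


# Case 1: E was present on both sides, then removed from Y as erroneous.
x1 = {"title": "T", "E": "value"}
y1 = {"title": "T", "E": "value"}
del y1["E"]

# Case 2: Y never had E; it was added to X only.
x2 = {"title": "T"}
y2 = {"title": "T"}
x2["E"] = "value"

# Both histories end in exactly the same pair of states.
print((x1, y1) == (x2, y2))  # True
print(diff_missing(x1, y1) == diff_missing(x2, y2))  # True
```

Any rule that only sees the final X and Y must treat both cases alike, so it either re-adds a removed error in case 1 or drops a justified element in case 2.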

That all said, I for myself tend to think that in our aspiration to contribute to a world of useful Linked Open Data, we should prefer being somewhat sparse over being sometimes wrong ... However, this is only my personal take on the issue, up for discussion.

myrmoteras commented 6 years ago

A solution would be to download the article directly from Zenodo, including all the metadata that has been added to the deposit. There is already a function to open a file from a URL, and the search function in "add metadata" allows searching and sometimes retrieves the data. With this, the file can be downloaded and it opens. Retrieving the metadata, however, does not seem to work. This might be linked to this issue: https://github.com/plazi/Plazi-Communications/issues/639

gsautter commented 6 years ago

Well, yes ... if only the server knew where the original PDF came from ...

Of course it's quite possible to pull metadata alongside a PDF imported from Zenodo, but that won't be able to resolve the above conundrum ... when comparing metadata, there is just no way of telling which version is correct in case of a difference. Might be possible if one is empty and the other is not, but even that approach comes with the danger of failing/refusing to delete wrongfully added data.
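A minimal sketch of the "one is empty and the other is not" rule, and of the danger just mentioned. The function name `fill_empty`, the field names, and the values are hypothetical:

```python
def fill_empty(target: dict, source: dict) -> dict:
    """Copy a value from source only where target has no value.

    Note: None, "", [] and other falsy values all count as "empty" here.
    """
    merged = dict(target)
    for key, value in source.items():
        if not merged.get(key):
            merged[key] = value
    return merged


# Made-up example: the GGI user deliberately removed a wrong affiliation.
ggi_update = {"authors": ["Smith, John"], "affiliations": []}
zenodo_current = {"authors": ["Smith, J."], "affiliations": ["Wrong Institute"]}

merged = fill_empty(ggi_update, zenodo_current)
print(merged["authors"])       # the GGI correction is kept
print(merged["affiliations"])  # the deleted affiliation comes back from Zenodo
```

The rule keeps existing Zenodo data from being wiped by an incomplete extraction, but for the same reason it cannot carry a deliberate deletion through.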

I'd like to discuss this topic in the November meeting.

myrmoteras commented 6 years ago

The situation we are dealing with in this thread is the following: we (Plazi) upload a PDF to BLR. At some later date, we grab this PDF, run it through GGI, and save it in TB.

This is the case for all the articles we upload for which we mint the DOI, such as Revue Suisse de Zoologie.

That means that in these cases we know the origin, because we uploaded the PDF in the first place and then downloaded it to process in GGI. This seems to be straightforward.

We can discuss in Bern. https://docs.google.com/document/d/1A3n_hdRStvYUcEIeGG2aywSIun8jTrvmdZUvAX9DaRY/edit#heading=h.twwcwz52a96

gsautter commented 6 years ago

OK, I see some options for filtering here, e.g. for journal name, for Zenodo side creation date, etc.

However, that also means we have to define elements that are to be taken over from the Zenodo metadata. And it also means that GGI side corrections to these fields would not go through to Zenodo.
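One way this filtering could look, as a rough sketch only. The set of protected fields, the field names, and the `merge_for_upload` function are assumptions made for illustration; the journal filter uses the journal named earlier in this thread:

```python
# Hypothetical configuration: which deposits the filter applies to,
# and which fields are taken over from the Zenodo metadata.
PLAZI_MINTED_JOURNALS = {"Revue Suisse de Zoologie"}
ZENODO_OWNED = {"access_right", "affiliations"}


def merge_for_upload(ggi: dict, zenodo: dict) -> dict:
    """Build the metadata to write to Zenodo from the GGI-side record."""
    if zenodo.get("journal") not in PLAZI_MINTED_JOURNALS:
        # No filter match: the GGI update is written through unchanged.
        return dict(ggi)
    merged = dict(ggi)
    for key in ZENODO_OWNED:
        if key in zenodo:
            merged[key] = zenodo[key]  # the Zenodo value wins for this field
    return merged


# Made-up example values.
zenodo = {
    "journal": "Revue Suisse de Zoologie",
    "access_right": "open",
    "affiliations": ["Old Institute"],
    "authors": ["Smith, J."],
}
ggi = {
    "journal": "Revue Suisse de Zoologie",
    "access_right": "closed",
    "affiliations": ["New Institute"],
    "authors": ["Smith, John"],
}

result = merge_for_upload(ggi, zenodo)
print(result["access_right"])  # kept from Zenodo
print(result["authors"])       # GGI correction goes through
print(result["affiliations"])  # GGI correction is blocked
```

The last line shows the cost described above: a GGI-side correction to a protected field never reaches Zenodo.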

Requires discussion.

plazi / Biodiversity-Literature-Repository

API upload changes entries #14