punkish / zenodeo

`nodejs` interface to Zenodo/BLR community
https://zenodeo.org
Creative Commons Zero v1.0 Universal

*Deleted* attribute #20

Open gsautter opened 5 years ago

gsautter commented 5 years ago

Deleted attribute

(2) I have expressed my opinion on that above, and also outlined how you can easily remove deleted elements on your end (via the updateTime attribute). Plus, we do keep a ledger in the event log table, which is append-only. It's just not part of the stats or TB exports.

Note that without the deleted (or call it redacted or noLongerUsed or whatever) attribute, I will not be able to remove these from my database. The updateTime attribute does me no good because it is the timestamp of the update of the entire treatment. Besides the programming complexity it introduces in my pipeline, it is completely useless to me when it comes to removing a part of the treatment. Imagine the following:

Someone reports that a particular treatment has a wrong materialsCitation or a wrong author or a wrong figureCitation. Turns out, for whatever reason, GGI marked up the text wrongly. You go back and fix the error and rerun. The new version has all the right markup, but the author is now different. That is, a piece of text that was earlier identified as an author is no longer an author and a new piece of text is the author. Or perhaps there are now simply two authors instead of the three that were there before. (I am only using authors as an example. The same logic would apply to materialsCitation, treatmentCitation, figureCitation, bibRefCitation.) There is some logic (perhaps the updateTime) by which my program is able to download not just the new treatments (since the last download and extract) but also the old but now modified treatments. My program has to automatically detect the changes and redo all the tables. In this example, I have to leave the treatment and its materialsCitation, treatmentCitation, figureCitation, and bibRefCitation unchanged but mark one of its authors as no longer being an author. I cannot do this with updateTime.

Adding a simple attribute to each part that I am tracking, an attribute that reflects the current state of that part, whether it is valid or not (call it deleted or not, redacted or not, no-longer-used or not, whatever) is the only way my program can do this in the multitude of tables and the related indexes that need to be updated automatically.

In fact, without this attribute, I actually have only a partial need for the GUIDs. I can update the parts but I can no longer remove them from display.
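A minimal sketch of the difference, under stated assumptions: the record shape (`guid`, `name`, `deleted`) and the helper names `upsert_parts` and `visible` are hypothetical and not taken from zenodeo or the Plazi exports. It shows that a GUID-keyed upsert can hide a part only when the feed flags it; a part that is silently dropped from the new version stays visible.

```python
import copy


def upsert_parts(local, incoming):
    """Upsert parts by GUID, honoring a hypothetical 'deleted' flag."""
    for part in incoming:
        row = dict(part)
        row["deleted"] = bool(part.get("deleted", False))
        local[row["guid"]] = row
    return local


def visible(local):
    """GUIDs of parts that should still be displayed."""
    return sorted(guid for guid, row in local.items() if not row["deleted"])


# First extraction: three authors (illustrative data only).
first_extract = [
    {"guid": "A1", "name": "Smith"},
    {"guid": "A2", "name": "Jones"},
    {"guid": "A3", "name": "Lee"},
]
local_a = upsert_parts({}, first_extract)
local_b = copy.deepcopy(local_a)

# Rerun where the removed author is flagged instead of dropped.
rerun_with_flag = [
    {"guid": "A1", "name": "Smith"},
    {"guid": "A2", "name": "Jones"},
    {"guid": "A3", "name": "Lee", "deleted": True},
]
# Rerun where the removed author is simply absent.
rerun_without_flag = [
    {"guid": "A1", "name": "Smith"},
    {"guid": "A2", "name": "Jones"},
]

visible_with_flag = visible(upsert_parts(local_a, rerun_with_flag))
visible_without_flag = visible(upsert_parts(local_b, rerun_without_flag))
```

With the flag, the third author drops out of `visible_with_flag`; without it, the upsert never touches that row, so it remains in `visible_without_flag`.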

Putting a login barrier is really simple. Apache has a simple server based login that can be used without creating a user accounts system. On the other hand, I don't think a login is even required. Our whole premise is that all our data are open. We are adding semantic intelligence to the data anyway, and reflecting how parts of the treatment are identified is a part of that. Everyone should have access to whatever I can access.

Now, I understand that if you remove something from a treatment, your new version simply doesn't have that fragment. It is really removed. The problem is data integrity. The data we are putting out is no longer guaranteed to be persistent and consistent. I mean, stuff has simply vanished from the new version. That can be mystifying to downstream consumers of the data. In my view, once data has been extracted and put out in public, stuff should never be removed from it. Instead, it should be marked as no longer being used.

So, how do we deal with the case when, let's say, someone issues a take-down notice and says this treatment should not be out in the open at all? Well, for one, we are only extracting treatments from open licensed articles, so no one can issue a take-down. Two, if someone does issue a take-down, we have to put a placeholder telling subsequent users that something was there but no longer is, for whatever reason. My proposal for a "deleted" attribute is simply that, a placeholder marker.

But then, as I said, if we collectively decide that we are not going to do the following – instead of removing something we will add an attribute indicating that that part is no longer used – then I will simply not be able to remove it from my database because I can't detect the absence of something.

Originally posted by @punkish in https://github.com/punkish/zenodeo/issues/14#issuecomment-515294056

gsautter commented 5 years ago

Thanks for explaining to me how to set up a login in Apache ... wasn't aware that's even possible. And I've never heard of .htaccess, either.

gsautter commented 5 years ago

Regarding data integrity, I'd find it more disturbing to still find something after deleting it ... does your machine hold on to deleted files even after you empty your recycling bin?

Processing-wise, such behavior would complicate just about everything, as you always have to check everything for the presence of some deleted attribute before doing anything with it.

gsautter commented 5 years ago

Handling the aforementioned author would work like this in my proposal, assuming you store ID and update time of the parent treatment with all the derived elements:

The stats engine works like this, for instance, and it works perfectly fine.
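One possible reading of this proposal, sketched under assumptions: each derived row stores the parent treatment's ID and update time, and after loading a new version of a treatment, any row for that treatment still carrying an older update time is taken to be gone. The `authors` table, its columns, and the `replace_authors` helper are hypothetical; this is not the stats engine's actual code.

```python
import sqlite3


def replace_authors(conn, treatment_id, update_time, authors):
    """Load one version of a treatment's authors and drop stale rows."""
    # Upsert every author present in this version, stamped with its update time.
    conn.executemany(
        "INSERT OR REPLACE INTO authors (guid, treatment_id, update_time, name) "
        "VALUES (?, ?, ?, ?)",
        [(guid, treatment_id, update_time, name) for guid, name in authors],
    )
    # Rows still carrying an older stamp were not part of this version.
    conn.execute(
        "DELETE FROM authors WHERE treatment_id = ? AND update_time < ?",
        (treatment_id, update_time),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE authors ("
    "guid TEXT PRIMARY KEY, treatment_id TEXT, update_time INTEGER, name TEXT)"
)

# First version: three authors. Second version: the third one is gone.
replace_authors(conn, "T1", 100, [("A1", "Smith"), ("A2", "Jones"), ("A3", "Lee")])
replace_authors(conn, "T1", 200, [("A1", "Smith"), ("A2", "Jones")])

remaining = [
    row[0]
    for row in conn.execute(
        "SELECT guid FROM authors WHERE treatment_id = ? ORDER BY guid", ("T1",)
    )
]
```

This relies on every derived element of a treatment being re-sent with each update, so that absence from the new version can be inferred from the stale timestamp.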

gsautter commented 5 years ago

Finally, we have that public ledger of everything we've ever done ... check out http://tb.plazi.org/GgServer/xmlHistory/730087F21E00FF81FF61FC34FDA561A5 - it has these deleted attributes, including user and time, on all removed elements that ever existed, plus full provenance on all extant elements, including all changes to element attributes.