punkish commented 5 years ago

Update: added a couple of important points to the text below Update 2: added a complete working example in the next comment

@gsautter

cc @tcatapano, @myrmoteras

I have a proposal that will involve minimal changes on your part but will greatly improve the data integrity for times to come.

Just like you assign a GUID to every treatment, it would be great if you could assign a GUID to every component part of the treatment as well, for example, materialCitations, treatmentCitations, figureCitation and such.

The GUID of the parent treatment and the GUIDs of the component parts do not have to have any relationship. They would exist only as a contract between us (the data providers) and the subsequent users of our data, a contract that guarantees that our data, once created, will always be trackable via those IDs, even if some of their info changes.

(Basically I need an ID for each component that is unique and immutable.) This ID would also be the primary way to make sure that we are talking about exactly the same component part even when referring to different databases.

We would never delete either a treatment or any of its related parts though we could add a field that says that it is redacted or redundant (effectively deleted). Modifying an existing part would not be a problem because its the existing GUID would ensure the info goes to the right place. And adding a new part would not be a problem as it would have its own GUID. The fictitious but realistic plan below shows how:

materialCitationId: 7E9081B5-9A6D-4CC1-A8C3-47F69FB4198D latitude: -81.435 longitude: 23.221

-- edit --

materialCitationId: 7E9081B5-9A6D-4CC1-A8C3-47F69FB4198D latitude: -81.635 longitude: 24.021

-- add --

materialCitationId: 7E9081B5-9A6D-4CC1-A8C3-47F69FB4198D latitude: -81.635 longitude: 24.021

materialCitationId: D1A5279D-B27D-4CD4-A05E-EFDD53D08E8D latitude: 17.3362 longitude: 43.143

materialCitationId: 7E9081B5-9A6D-4CC1-A8C3-47F69FB4198D latitude: -81.635 longitude: 24.021

materialCitationId: D1A5279D-B27D-4CD4-A05E-EFDD53D08E8D latitude: 17.3362 longitude: 43.143

-- delete --

materialCitationId: 7E9081B5-9A6D-4CC1-A8C3-47F69FB4198D latitude: -81.635 longitude: 24.021 deleted: true reason: "because blah blah"

materialCitationId: D1A5279D-B27D-4CD4-A05E-EFDD53D08E8D latitude: 17.3362 longitude: 43.143

The deleted and reason fields are embellishments that could be very useful over time. But my main intent really is to have a unique ID for every component part. I can create one myself, but that doesn't serve the purpose of linking it to the source. So, the best place for such an ID to begin is your workflow.

Plus, this is the only efficient way that I can implement updating a treatment and all its related parts once it has been created, without breaking any dependent relationship.

What do you think?

punkish commented 5 years ago

Here is a complete working example. All the magic is possible because every materialsCitation also has a unique ID.

We start with the following data. data_t0 is an array of treatments (like a table of treatments) whose every element is a record (like a row in a table, single treatment). The fourth element in every record is actually a reference to a another array of materialsCitations for this treatment.

const data_t0 = [
    {
        tId: '7E9081B59A6D4CC1A8C347F69FB4198D', tTitle: 'Jungle Fever', aName: 'Spike Lee',
        m: [
            {mId: '5EDEB36C9006467A8D04AFB6F62CD7D2', tId: '7E9081B59A6D4CC1A8C347F69FB4198D', lat: '22.345', lon: '21.546'},
            {mId: '283B67B2430F4E6F97E619041992C1B0', tId: '7E9081B59A6D4CC1A8C347F69FB4198D', lat: '81.432', lon: '-17.532'},
            {mId: 'B59511BD6A5F4DF09ECF562A108D8A2E', tId: '7E9081B59A6D4CC1A8C347F69FB4198D', lat: '22.442', lon: '21.687'}
        ]
    },
    {
        tId: '2F2B15A526154748BDABA124210F15EC', tTitle: 'Femme Fatal', aName: 'Brian de Palma',
        m: [
            {mId: '07D267AB712645EB8FFC51994D05F0B2', tId: '2F2B15A526154748BDABA124210F15EC', lat: '123.542', lon: '12.522'}
        ]
    },
    {
        tId: '0c74f13ffa834c489b3368921dd72463', tTitle: 'Toy Story 4', aName: 'Josh Cooley',
        m: [
            {mId: 'b4b2fb69c6244e5eb0698e0c6ec66618', tId: '0c74f13ffa834c489b3368921dd72463', lat: '-56.967', lon: '67.376'},
        ]
    }
];

A few days later, we have some changes. One treatment has few edits, and its related materialsCitations too have edits. In fact, a new one has been added as well, and one of them has been deleted. Also, another treatment has been completely deleted.

const data_t1 = [
    {
        // treatment edited
        tId: '7E9081B59A6D4CC1A8C347F69FB4198D', tTitle: 'Do The Right Thing', aName: 'Mr. Spike Lee',
        m: [
            // unchanged
            {mId: '5EDEB36C9006467A8D04AFB6F62CD7D2', tId: '7E9081B59A6D4CC1A8C347F69FB4198D', lat: '22.345', lon: '21.546'},

            // edited
            {mId: '283B67B2430F4E6F97E619041992C1B0', tId: '7E9081B59A6D4CC1A8C347F69FB4198D', lat: '83.433', lon: '-16.874'},

            // deleted
            {mId: 'B59511BD6A5F4DF09ECF562A108D8A2E', tId: '7E9081B59A6D4CC1A8C347F69FB4198D', lat: '22.442', lon: '21.687', del: 'true'},

            // added
            {mId: '677E2553DD4D43B09DA77414DB1EB8EA', tId: '7E9081B59A6D4CC1A8C347F69FB4198D', lat: '-45.348', lon: '-121.998'}
        ]
    },

    // entire treatment deleted
    {
        tId: '0c74f13ffa834c489b3368921dd72463', tTitle: 'Toy Story 4', aName: 'Josh Cooley', del: 'true',
        m: [
            {mId: 'b4b2fb69c6244e5eb0698e0c6ec66618', tId: '0c74f13ffa834c489b3368921dd72463', lat: '-56.967', lon: '67.376'},
        ]
    }
];

Given the following program

const createTable = function() {
    db.prepare(`
        CREATE TABLE IF NOT EXISTS t (
            id INTEGER PRIMARY KEY, 
            tId TEXT UNIQUE, 
            tTitle TEXT,
            aName TEXT,
            del TEXT DEFAULT 'false'
        )
    `).run();

    db.prepare(`
        CREATE TABLE IF NOT EXISTS m (
            id INTEGER PRIMARY KEY, 
            mId TEXT, 
            tId TEXT,
            lat TEXT, 
            lon TEXT,
            del TEXT DEFAULT 'false',
            UNIQUE (mId, tId)
        )
    `).run();
};

const upsertRows = function() {
    const upsert_t = db.prepare(`
        INSERT INTO t (tId, tTitle, aName, del) VALUES(?, ?, ?, ?) 
            ON CONFLICT(tId) DO UPDATE SET 
                tTitle=excluded.tTitle,
                aName=excluded.aName,
                del=excluded.del
    `);

    const upsert_m = db.prepare(`
        INSERT INTO m (mId, tId, lat, lon, del) VALUES(?, ?, ?, ?, ?) 
            ON CONFLICT(mId, tId) DO UPDATE SET 
                lat=excluded.lat,
                lon=excluded.lon,
                del=excluded.del
    `);

    const upt = function(t) {
        upsert_t.run(t.tId, t.tTitle, t.aName, t.del || 'false');
        t.m.forEach(upm);
    };

    const upm = function(m) {
        upsert_m.run(m.mId, m.tId, m.lat, m.lon, m.del || 'false');
    };

    data_t0.forEach(upt);
    data_t1.forEach(upt);
};

createTable();
upsertRows();

We get our database called test.sqlite.

$ sqlite3 test.sqlite 
SQLite version 3.28.0 2019-04-16 19:49:53
Enter ".help" for usage hints.
sqlite> SELECT * FROM t;
id          tId                               tTitle              aName          del       
----------  --------------------------------  ------------------  -------------  ----------
1           7E9081B59A6D4CC1A8C347F69FB4198D  Do The Right Thing  Mr. Spike Lee  false     <-- edited
2           2F2B15A526154748BDABA124210F15EC  Femme Fatal         Brian de Palm  false     
3           0c74f13ffa834c489b3368921dd72463  Toy Story 4         Josh Cooley    true      <-- deleted
sqlite> SELECT * FROM m;
id          mId                               tId                               lat         lon         del       
----------  --------------------------------  --------------------------------  ----------  ----------  ----------
1           5EDEB36C9006467A8D04AFB6F62CD7D2  7E9081B59A6D4CC1A8C347F69FB4198D  22.345      21.546      false
2           283B67B2430F4E6F97E619041992C1B0  7E9081B59A6D4CC1A8C347F69FB4198D  83.433      -16.874     false   <-- edited
3           B59511BD6A5F4DF09ECF562A108D8A2E  7E9081B59A6D4CC1A8C347F69FB4198D  22.442      21.687      true    <-- deleted
4           07D267AB712645EB8FFC51994D05F0B2  2F2B15A526154748BDABA124210F15EC  123.542     12.522      false     
5           b4b2fb69c6244e5eb0698e0c6ec66618  0c74f13ffa834c489b3368921dd72463  -56.967     67.376      false     
6           677E2553DD4D43B09DA77414DB1EB8EA  7E9081B59A6D4CC1A8C347F69FB4198D  -45.348     -121.998    false   <-- added
sqlite>

myrmoteras commented 5 years ago

@punkish check this XML version of including the IDs https://raw.githubusercontent.com/plazi/EJT-testbed/master/incoming/ejt-502_trietsch_miko_deans_xml_id_raw.xml?token=ABDFPJAS7O7DEQSUYOFTHMK5CXKLK

this is from this article: http://treatment.plazi.org/GgServer/summary/3F5A2711FFCC9800FFB2FFF8FFA6FFCD

that shows you that the IDs you want are already in the system

myrmoteras commented 5 years ago

see eg <materialsCitation id="FBA2E422FFC49808FF58FD06FC0FFCAD" collectedFrom="myrmecophile" collectionCode="MNHN" collectorName="H. Donisthorpe" country="United Kingdom" location="UNITED KINGDOM" pageId="8" pageNumber="9" specimenCode="EY22475, EY22463, EY22465" specimenCount="1" specimenCount-male="1" typeStatus="syntype">

punkish commented 5 years ago

cool! but two responses:

One, I don't really want an IDs for every item, word or line. I only care about the IDs for the things I care about, such as materialsCitation, treatmentCitation, figureCitation and bibRefCitation. Of course, I can just ignore all other IDs.

Two, you say that such IDs are *already there in the system. Then how come the XMLs I have don't have such IDs? Here is a materialsCitation from a real treatment XML D709975BC52956782DB3FD0E36B5FE66 and you can see, it doesn't have the ID. For my system to work, all and every component that I want to extract has to have such an ID. So, if this is something you all have recently started doing (and great job, if yes!!), you will probably need to process the entire corpus so all the historic XMLs also get such an ID.

Please clarify.

<materialsCitation collectingDate="2015-11-10" collectionCode="MALE" collectorName="Doe" country="Australia" latitude="-27.616667" location="Venman Bushland National Park" longLatPrecision="1236" longitude="153.2" municipality="Material" pageId="17" pageNumber="518" specimenCount="1" stateProvince="Queensland" typeStatus="holotype">

punkish commented 5 years ago

see eg <materialsCitation id="FBA2E422FFC49808FF58FD06FC0FFCAD" collectedFrom="myrmecophile" collectionCode="MNHN" collectorName="H. Donisthorpe" country="United Kingdom" location="UNITED KINGDOM" pageId="8" pageNumber="9" specimenCode="EY22475, EY22463, EY22465" specimenCount="1" specimenCount-male="1" typeStatus="syntype">

perfect! see the second part of my comment above

myrmoteras commented 5 years ago

no, this is not something new. It has been in the system. @gsautter can explain. It is just a more comprehensive view that is normally not shown. That also means, that we could export an XML version, as discssed before, that has only minimal structural elements taged, and those without ID,s and all predefinedd semantic elements (treatment, materialsCitation, ?) including their IDs.

One thing which came up in the discussion is that when you expose the ID, to what they resolve. I am not sure whether we have this in place for MC. Agains, @gsautter knows all.

myrmoteras commented 5 years ago

right now, you can get the ID by an export of the data from GGI.

punkish commented 5 years ago

One thing which came up in the discussion is that when you expose the ID, to what they resolve. I am not sure whether we have this in place for MC. Agains, @gsautter knows all.

At this time I have no plans to expose this ID. But, even if I do, the user will not have to worry about it, much like the user doesn't have to worry about the treatmentIds. I might use them in a link or to identify a record externally. But, the main use is to maintain the internal data integrity.

punkish commented 5 years ago

right now, you can get the ID by an export of the data from GGI.

Please note that I will need this comprehensive version of XMLs not just when I get the initial complete dump, but also for every subsequent update for new or modified XMLs. Unique IDs in every component part is pretty much the only way I can ensure data integrity.

I am now going to rewrite my database code with the new schema and then wait till I get the go-ahead from @gsautter to download the new dump with the IDs.

punkish commented 5 years ago

important

I realize, one thing I don't see in your existing comprehensive data export is how you treat deleted portions. Perhaps you haven't yet encountered them yet. But planning for that possibility, perhaps you would consider what I suggested: keep the "deleted" component (or even a complete treatment) but add an attribute called deleted and set its value to true. Perhaps even add a reaonsDeleted for completeness and add some descriptive text as to why it was deleted.

If you ever delete something (without a mechanism like above), I would have no way of knowing that that component has been deleted, and my database will continue to keep it as if it still existed.

punkish commented 5 years ago

one more very important note:

this ID (let's call it a partId), has to be present in every part of a treatment that I take out. So, for example, it has to be there for all of the following:

treatmentAuthor (each of them, not just the list)
materialsCitation
treatmentCitation
figureCitation
bibRefCitation

punkish commented 5 years ago

@gsautter and @myrmoteras

I will benefit greatly from a test dataset (perhaps 10-20K XMLs, not more than that), that will allow me to test my database schema, indexes, and insert-update queries. This dataset will have to conform to the suggestions made in this thread.

I have already prepared the program to generate the new schema and to do the inserts, but I can only proceed once I have the test data (and then, subsequently, the entire data).

Many thanks,

punkish commented 5 years ago

here is a complete proposal

Proposal to Improve Data Integrity

Treatment

Wrap each treatment in a tag named treatment
Embed all treatment attributes in the treatment tag
Add a GUID to each treatment tag
Add a BOOLEAN deleted attribute to the treatment tag only if the treatment is ever deleted, withdrawn or redacted

Related Parts

the related parts of a treatment are
- treatmentAuthor
- treatmentCitation
- materialsCitation
- figureCitation
- bibRefCitation
Wrap each related part in a tag named after the part
Embed all related part attributes in the related part tag
Add a GUID to each related part tag
Add a BOOLEAN deleted attribute to the related part tag only if the related part is ever deleted, withdrawn or redacted

Treatment Authors

While a list of all authors is embedded as an attribute in the treatment tag, we need each authors as a separate entity. Currently, individual authors are extracted from the mods\:namePart tag inside the mods\:role mods\:roleTerm tag where the role is "Author". It would be helpful if each author can be wrapped in a treatmentAuthor tag.

Treatment Citations

Currently each treatment citation is componsed from various tags and attribututes inside the treatmentCitation tag. Instead, the complete treatementCitation should be added as an attribute of the treatmentCitation tag.

Materials Citations

nothing more to be done here other than adding a GUID and a 'deleted' attribute, if required

Figure Citations

nothing more to be done here other than adding a GUID and a 'deleted' attribute, if required

BibRef Citations

nothing more to be done here other than adding a GUID and a 'deleted' attribute, if required

gsautter commented 5 years ago

After a good bit of implementation, I hope to have a variation of GG XML that will make @punkish 's life easier. A full dump is in progress, but here's an advance example: http://tb.plazi.org/GgServer/zenodeo/BDA70EC9F8ABAED6C2B7628596A1714A

I've added the UUIDs to the elements @punkish requires them for (it's easy to add them in other places as well now), and added a filter removing all heading and emphasis elements to reduce complexity (also easy to extend or alter if required).

punkish commented 5 years ago

here is a rather ambitious way of keeping track of data changes over time (but note, this still doesn't implement a way to go back to a prior state – that is just way too complicated)

https://punkish.org/Maintaining-data-integrity

cc @gsautter @mguidoti @myrmoteras @tcatapano

gsautter commented 5 years ago

The dump has finished packing as well now: http://tb.plazi.org/GgServer/dumps/plazi.zenodeo.zip

This is the full collection, though, not a random sample, since incremental improvement and adjustment of the output as we proceed would require re-creating the same random sample multiple times, which is, as you might imagine, not as easy a thing to do.

gsautter commented 5 years ago

Regarding tracking of deletions: I think that including deleted treatments in any kind of export would be off the point ... say we delete some treatments in reaction to some author's complaint, we cannot simply flag them and keep them accessible anyway, can we?

Here is an alternative approach:

We (next to) never delete an article, not at last because we cannot ever delete them from Zenodo once they are up there.
From when treatments are updated (you have the updateTime attribute), you can track when articles are updated via the masterDocId attribute.
With the latter, you can check which treatments of the updated articles are still extant and if there are any that are no longer; the latter are the ones that were deleted.

Alternatively, you can use the stats API to obtain a list of articles updated since you last checked, and then do the treatment diff based upon that; the stats API also gives you the extant treatments for a given article UUID.

punkish commented 5 years ago

Regarding tracking of deletions: I think that including deleted treatments in any kind of export would be off the point ... say we delete some treatments in reaction to some author's complaint, we cannot simply flag them and keep them accessible anyway, can we?

I should have been more clear, sorry. Once a treatment (or any of its parts) would be marked as "deleted" (or withdrawn, redacted, or anything similar), it would go out of circulation for all public. It would stay in the database (much like you mention that once in Zenodo, stuff doesn't really go away), but it would only be accessible to a restricted set of people with privileged access to the data.

The idea is to track even our own activity, like a ledger of what we do. This can only make our data better because (imagine, after a considerable period of time) we can see how things have evolved.

Anyway, this is something to be discussed more.

punkish commented 5 years ago

The dump has finished packing as well now: http://tb.plazi.org/GgServer/dumps/plazi.zenodeo.zip

This is the full collection, though, not a random sample, since incremental improvement and adjustment of the output as we proceed would require re-creating the same random sample multiple times, which is, as you might imagine, not as easy a thing to do.

many thanks for this. Will try it out.

gsautter commented 5 years ago

The point is that the dumps are public, not only for internal use ... their purpose is to provide bulk access to our treatment collection, e.g. for hackatons, etc.

punkish commented 5 years ago

The point is that the dumps are public, not only for internal use ... their purpose is to provide bulk access to our treatment collection, e.g. for hackatons, etc.

well, let's visit this later, but the above is hardly a constraint. It would be solved easily by creating a private dump/version for internal consumption. But anyway, we can discuss this at a later time.

gsautter commented 5 years ago

Putting up a login barrier is a good bit of effort ... but anyway, there is a better option: Mainly for replication purposes, the Remote Event Service in the back-end keeps a log of all document events that occur, including both updates and deletions. The caveat at this point is that this log comes as TSV (intended mainly for consumption by a corresponding component in another back-end installation), so I'd have to write some code molding that into XML for you to consume. Rather that than including deleted data in some dump, which would mean turning a ton of logic upside-down where there is no real need to do so.

punkish commented 5 years ago

Putting up a login barrier is a good bit of effor

Yes, absolutely agree. However, there are easier ways than a regular login system. We can talk about it later.

Let me process this batch and see if it accomplishes the data integrity goal I was hoping to achieve. It is a bit of work processing more than a gig of files and I am on my way from Spain to Germany. Will do this in the next couple of days and report back.

I know in advance that it will be crucial for me to know of any deletions in order to stay in sync with what you have, but again, let's cross that bridge in a bit. I still have work to do before I get there.

Many thanks again.

punkish commented 5 years ago

The dump has finished packing as well now: http://tb.plazi.org/GgServer/dumps/plazi.zenodeo.zip

This is the full collection, though, not a random sample, since incremental improvement and adjustment of the output as we proceed would require re-creating the same random sample multiple times, which is, as you might imagine, not as easy a thing to do.

cc @myrmoteras @tcatapano

Hi @gsautter,

Here are my belated notes, suggestions, requirements I really need, and some inconsistencies I can live with but are worth documenting, with respect to the special treatments dump you prepared for me:

Individual parts of a treatment still don't have a GUID as I had proposed above. Without GUIDs for these parts, I cannot track them if they are changed or deleted.
As proposed earlier, the individual parts need to have a flag that reflects their state in case they are deleted or redacted. This is the only way I can track them.
If the XML is being simplified, I would suggest to keep the formatting info such as 'emphasis', 'number' and other semantically meaningful tags but remove the attributes such as 'box'. But this is not essential as I can also just ignore this info.
Add a full citation for the article containing the treatment to the document attributes, perhaps to an attribute called 'masterDocCitation'. This will ensure a consistent, acceptable citation style available to everyone without having to construct one with individual pieces of information such as author, year, journal info, etc.
Is the 'docAuthor' and 'docDate' always the same for the masterDoc as well? If not, then add a 'masterDocAuthor' and 'masterDocDate' to the document attributes.

5a. Change 'docDate' to 'docYear' because really, this field is only storing the year, not the full date.

Shouldn't all documents have an 'authorityName' and 'authorityYear'? See 03AB8782012CFFFCFF1AFDD1FB38EFE9 which has neither.

gsautter commented 5 years ago

(1) I added IDs to the subSubSections now, see http://treatment.plazi.org/GgServer/zenodeo/03AB8782012CFFFCFF1AFDD1FB38EFE9 , the dump will get them with the next export.

gsautter commented 5 years ago

(6) This was simply old(er) markup from when we didn't ad the authorities yet (April 2016). I now ran the respective gizmo on the parent article, so the authority information is there. If you detect similar treatments, please let me know, so I can run that gizmo on them as well ... merely a single-line command to the server side batch.

gsautter commented 5 years ago

(3) If you want, a can re-add emphasis (you once complained about the complexity it adds to your paths), but number is plainly a pain in the back ... a leftover from a tagger that builds quantities from them. All these elements mark is sequences of digits, which are very easy to discover by other means.

gsautter commented 5 years ago

(5) You are assuming correctly. docAuthor and docDate are the same as for the parent article. The naming of docDate is a legacy thing and should be docYear, agreed. Had you only told me that 10 years ago ... will be a lot of effort to change now, especially in all the data, but also in XSLTs, etc. The actual date now comes in docPubDate in articles we have it for.

gsautter commented 5 years ago

(4) Adding a full citation might be possible, yes. But I wouldn't like it as much to add that kind of redundancy to the XMLs proper. You can have it from the JSON, however, as the backing TB stats readily provide a full reference string, just need to add it to the request.

gsautter commented 5 years ago

(2) I have expressed my opinion on that above, and also outlined how you can easily remove deleted elements on your end (via the updateTime attribute). Plus, we do keep a ledger in the event log table, which is append-only. It's just not part of the stats or TB exports.

punkish commented 5 years ago

IDs for every part of the treatment that I am extracting

(1) I added IDs to the subSubSections now, see http://treatment.plazi.org/GgServer/zenodeo/03AB8782012CFFFCFF1AFDD1FB38EFE9 , the dump will get them with the next export.

I was under the impression that the "special" export you made for me already had the GUIDs for every part as per the comment from @myrmoteras on June 21 above (see https://github.com/punkish/zenodeo/issues/14#issuecomment-504349112). Now that I have discovered that GUIDs were not in it then I am curious, what was special about it?

In any case, please note that I don't just want GUIDs in every subSubSection. I want GUIDs in every part that I extract and store in a separate relational table. I have listed those parts in my comments above (https://github.com/punkish/zenodeo/issues/14#issuecomment-504410952 and https://github.com/punkish/zenodeo/issues/14#issuecomment-504990312). I am listing them again here below:

Treatment
treatmentAuthor
treatmentCitation
materialsCitation
figureCitation
bibRefCitation

The unique combinations of the Treatment GUID and the GUID of any of the tracked parts of the treatment are the only (relatively easy) way I can track subsequent edits or deletes to these parts.

punkish commented 5 years ago

Ability to track parts of a treatment

(6) This was simply old(er) markup from when we didn't ad the authorities yet (April 2016). I now ran the respective gizmo on the parent article, so the authority information is there. If you detect similar treatments, please let me know, so I can run that gizmo on them as well ... merely a single-line command to the server side batch.

This is precisely the reason I need to be able to track parts of a treatment. If someone detects and reports an error in a treatment, and then you rerun your extraction process on that treatment to correct the error, I need to be able to change only that treatment and its related parts. I can only do that if I can uniquely identify them. Timestamp is not the way to do that. The only sure way and one that makes a programmer's life easy is by being able to identify them uniquely with a primary key (in this case, the PK is the combo of the treatment ID and the ID of that specific part of the treatment).

Also, please note that under ideal circumstances, I really won't detect any such errors because the extraction process will run automatically without any intervention from me. In this case, I found it purely by chance because I wanted to examine my extraction logic on a single XML and I randomly happened to pick one that had this error. Under ideal circumstances, my extraction process will just silently move on.

punkish commented 5 years ago

Want semantic tags, don't want positional OCR attribs

(3) If you want, a can re-add emphasis (you once complained about the complexity it adds to your paths), but number is plainly a pain in the back ... a leftover from a tagger that builds quantities from them. All these elements mark is sequences of digits, which are very easy to discover by other means.

I do want all the semantic (and formative tags) so tags like <bold>, <emphasis>, <underline>, <location>, <typeStatus>, <paragraph> are useful to me. Actually, the only things that are noise are the box info. But, as I mentioned above, I can easily ignore it so perhaps it may just be easier to not mess with this and just leave it as it it. We have bigger problems to solve.

In other words, from my side, I would forget about removing any info. I would only focus on adding the important info such as the GUIDs and other state info (more on that in a bit).

punkish commented 5 years ago

Full citation

(4) Adding a full citation might be possible, yes. But I wouldn't like it as much to add that kind of redundancy to the XMLs proper. You can have it from the JSON, however, as the backing TB stats readily provide a full reference string, just need to add it to the request.

If you don't want to add the full citation in the XML then that is fine. I will try to reconstruct it from the various bits of the XML, but I will be embedding the logic in Zenodeo so it will be only available to those who use Zenodeo. The citation won't be available to those who download the XML (for whatever reason).

Note that I am not messing around with JSON and TB stats now. I have enough on my plate with extracting data from the XML and making the web app from it. Injecting another parallel data source into the process makes no sense.

punkish commented 5 years ago

Deleted attribute

(2) I have expressed my opinion on that above, and also outlined how you can easily remove deleted elements on your end (via the updateTime attribute). Plus, we do keep a ledger in the event log table, which is append-only. It's just not part of the stats or TB exports.

Note that without the deleted (or call it redacted or noLongerUsed or whatever) attribute, I will not be able to remove these from my database. Using the updateTime attribute doesn't do any good to me because it is the timestamp of the update of the entire treatment. Besides the programming complexity it introduces in my pipeline, it is completely useless for me when it comes to removing a part of the treatment. Imagine the following:

Someone reports that a particular treatment has a wrong materialsCitation or a wrong author or a wrong figureCitation. Turns out, for whatever reason, GGI marked up the text wrongly. You go back and fix the error and rerun. The new version now has all the right markup but now the author is different. That is, a piece of text that was earlier identified as author is no longer the author and a new piece of text is the author. Or perhaps, there are now simply two authors instead of the three that were there before. (I am only using authors as an example. The same logic would apply to materialsCitation, treatmentCitation, figureCitation, bibRefCitation). There is some logic (perhaps the updateTime) by which my program is able to download not just the new treatments (since the last download and extract) but also the old but now modified treatments. My program has to automatically detect the changes and redo all the tables. In this example, I have to not change the treatment and its materialsCitation, treatmentCitation, figureCitation, and bibRefCitation but mark one of its authors as no longer being an author. I cannot do this with updateTime.

Adding a simple attribute to each part that I am tracking, an attribute that reflects the current state of that part, whether it is valid or not (call it deleted or not, redacted or not, no-longer-used or not, whatever) is the only way my program can do this in the multitude of tables and the related indexes that need to be updated automatically.

In fact, without this attribute, I actually have only a partial need for the GUIDs. I can update the parts but I can no longer remove them from display.

Putting a login barrier is really simple. Apache has a simple server based login that can be used without creating a user accounts system. On the other hand, I don't think a login is even required. Our whole premise is that all our data are open. We are adding semantic intelligence to the data anyway, and reflecting how parts of the treatment are identified is a part of that. Everyone should have access to whatever I can access.

Now, I understand that if you remove something from a treatment, your new version simply doesn't have that fragment. it is really removed. The problem is data integrity. The data we are putting out is no longer guaranteed to be persistent and consistent. I mean, stuff has simply vanished from the new version. That can be mystifying to downstream consumers of the data. In my view, once data has been extracted and put out in public, stuff should never be removed from it. Instead, it should be marked as no longer being used.

So, how do we deal with the case when, let's say, someone issues a take-down notice and says this treatment should not be out in the open at all. Well, for one, we are only extracting treatments from open licensed articles, so no one can issue a take-down. Two, if someone does issue a takedown, we have to put a placeholder telling subsequent users that there was something there but is no longer there because of whatever reason. My proposal for a "deleted" attribute is simply that, a placeholder marker.

But then, as I said, if we collectively decide that we are not going to do the following – instead of removing something we will add an attribute indicating that that part is no longer used – then I will simply not be able to remove it from my database because I can't detect the absence of something.

gsautter commented 5 years ago

I factored this out to the five tickets linked above this post.

punkish / zenodeo

A proposal to improve data integrity #14

Proposal to Improve Data Integrity

Treatment

Related Parts

Treatment Authors

Treatment Citations

Materials Citations

Figure Citations

BibRef Citations

IDs for every part of the treatment that I am extracting

Ability to track parts of a treatment

Want semantic tags, don't want positional OCR attribs

Full citation

Deleted attribute