plazi / arcadia-project


page-based WADM test #207

Open myrmoteras opened 1 year ago

myrmoteras commented 1 year ago

Dec 14, 2022

Good evening Guido,

Here's a page-based WADM test. We've deleted most of the annotations that were on the page, so you can see what's going on, and then:

• created one new annotation
• modified one other (it had an OCR error: 1 is changed to roman I)
• left one annotation un-modified

Three files are attached: the original incoming WADM; an output version with the creator/generator at the top of the file; another version with the creator/generator in-line per annotation.

"https://plazi.annostor.org/task/generalTidy" is a place-holder for a task identifier

we've used a dummy human contributor ID, but perhaps you could supply a more representative default agent ?

probably we'd prefer to use the format with one instance of the creator/generator to keep things light-weight and scalable, but please could you specify which version you would like
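For concreteness, here's a minimal sketch of the light-weight variant, with one creator/generator stated at the top of the file; the AnnotationCollection wrapper, the dummy ORCID, and the exact field layout are illustrative assumptions rather than the actual annostor output (the task URL is the placeholder mentioned above):

```json
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "type": "AnnotationCollection",
  "generator": { "id": "https://plazi.annostor.org/task/generalTidy", "type": "Software" },
  "creator": { "id": "https://orcid.org/0000-0000-0000-0000", "type": "Person" },
  "items": [
    { "type": "Annotation", "plazi:annotType": "collectingCountry" }
  ]
}
```

The in-line variant would instead repeat the same "creator"/"generator" pair inside every annotation object.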

All best, peterc

Attachment: plazi_input.zip

myrmoteras commented 1 year ago

Hi Peter,

here's a page-based WADM test

sorry it took me so long to get back to this ... was bogged down in other things, as well as Christmas and respective preparations ... starting fresh into a fresh year.

Looking at the stats for the example IMF (https://tb.plazi.org/GgServer/dioStats/stats?outputFields=doc.articleUuid+doc.name+doc.doi+doc.uploadUser+doc.uploadDate+doc.updateUser+doc.updateDate+bibRef.refString&groupingFields=doc.articleUuid+doc.name+doc.doi+doc.uploadUser+doc.uploadDate+doc.updateUser+doc.updateDate+bibRef.refString&FP-doc.articleUuid=221B321F2A22FFAEE63688265A3EFF93&format=HTML), I also see that the bibRefs don't seem to be marked correctly in this one, so it might not be the best example for this exercise ...

we've deleted most of the annotations that were on the page, so you can see what's going on, and then:

Fair enough for this very example, but in actual deployment, this would indicate a lot of annotation removals ... so please keep in mind to send back all the annotations, except for the ones that were actually removed.

created one new annotation

That would be the "Germany" collecting country in "plazi_return_example_1.json", right? A few things about this example:

  1. The type (the value under "plazi:annotType") should be "collectingCountry", not "plazi:collectingCountry", as in the outbound files ... or we might change it in the outbound data we deliver to you as well, even though that would imply a Plazi schema or namespace that doesn't exist, so my preferred option is to omit the "plazi:" prefix on the annotation types (see the sketch after this list).

  2. While in principle everything is there to do a write-back (annotation type and bounding box, as well as verbatim text for validation), it wouldn't work in this case because the indicated bounding box is in the middle of nowhere (shown as the black rectangle in the top left corner of the attached screenshot, page edges cut off).

  3. We'll have to devise a policy regarding how to handle (UU)IDs that were minted on your end, like "dc0f864a-f97c-4f7f-b492-3424a1f84e48" in this example ... this is basically because as soon as the annotation is added to the IMF, it implicitly gets a UUID that is composed from its type and bounding box in combination with the UUID of the underlying IMF proper ... would it cause any kind of problems if the annotation comes back to you with a different ID (one in the general scheme of the other annotation IDs) from the export that follows an update on the Plazi end?
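To make points 1-3 concrete, here is a sketch of how the new annotation might look with the prefix dropped; apart from the "collectingCountry" type, the "Germany" text, and the UUID quoted above, everything here (the FragmentSelector target in particular) is generic WADM with placeholder source and coordinates, not the exact Plazi or annostor layout:

```json
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "id": "urn:uuid:dc0f864a-f97c-4f7f-b492-3424a1f84e48",
  "type": "Annotation",
  "plazi:annotType": "collectingCountry",
  "body": { "type": "TextualBody", "value": "Germany" },
  "target": {
    "source": "https://example.org/imf/221B321F2A22FFAEE63688265A3EFF93/page-1",
    "selector": {
      "type": "FragmentSelector",
      "conformsTo": "http://www.w3.org/TR/media-frags/",
      "value": "xywh=120,340,64,18"
    }
  }
}
```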

modified one other (it had an OCR error: 1 is changed to roman I)

I take it this is referring to the very first bibRef at the very top of plazi_input.json and plazi_return_example_1.json? In general, adjusting the text like this should work as long as modifications are minor, as then the context can serve as a means of anchoring the change in the surrounding words and help unambiguously associate the modifications with individual words in the page of the underlying IMF.

However, I hope you're aware you'll have to make that textual adjustment to all annotations spanning the modified words, as otherwise there is a lot of ambiguity regarding which variant of the transcript text for the overlapping area of the page is the valid one ...

A more targeted way of transferring text modifications like this might be to add a dedicated "ocrCorrection" annotation with the bounding box of just the word in question and the adjusted transcript, which then becomes the only one pertaining to the actual document text on import on the Plazi end ... such annotations would also most likely be incorporated in the document page proper by adjusting the transcript of the underlying words, rather than being added to the page as actual annotations. That is, such annotations would be a vehicle for transferring OCR edits, and would already be incorporated in the text proper when the Plazi system sends fresh WADM back to you after the local update.
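As a sketch of what such a dedicated annotation might look like for the "1" to roman "I" fix discussed above (treating "ocrCorrection" as an annotation type is hypothetical, and the source and box coordinates are placeholders; the idea is simply a body carrying the corrected transcript and a target covering just the affected word):

```json
{
  "@context": "http://www.w3.org/ns/anno.jsonld",
  "type": "Annotation",
  "plazi:annotType": "ocrCorrection",
  "body": { "type": "TextualBody", "value": "I" },
  "target": {
    "source": "https://example.org/imf/221B321F2A22FFAEE63688265A3EFF93/page-1",
    "selector": {
      "type": "FragmentSelector",
      "conformsTo": "http://www.w3.org/TR/media-frags/",
      "value": "xywh=88,412,6,14"
    }
  }
}
```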

left one annotation un-modified

I take it this is referring to the "Anthribidae" taxonomicName in plazi_input.json and plazi_return_example_1.json? I cannot seem to spot any changes to the attributes, etc., at least ... however, you seem to have added the "generator" and "creator" properties nonetheless ... why is that? Especially since these properties don't seem to be universally present in all the annotations in plazi_return_example_1.json and plazi_return_example_2.json ... what's the underlying logic of the presence or absence of these two properties in the annotations you return to the Plazi system?

three files are attached: the original incoming WADM; an output version with the creator/generator at the top of the file; another version with creator/generator in-line per annotation

"https://plazi.annostor.org/task/generalTidy" is a place-holder for a task identifier

Fair enough ... do you plan on having the generator/creator present in every write-back, regardless of actual modifications?

Maybe it would be possible to add the generator/creator only to the annotations that were actually modified? After all, a single write-back might well subsume contributions from multiple users ...

we've used a dummy human contributor ID, but perhaps you could supply a more representative default agent?

Well, I don't think it really matters what we use for testing this ... would be good to have the ORCIDs of the actual contributors in later live data, though, so a dummy ORCID seems like a sensible choice to me ...

probably we'd prefer to use the format with one instance of the creator/generator to keep things light-weight and scalable, but please could you specify which version you would like.

See above ... a single write-back might well subsume contributions from multiple users, and we might want to retain the ability to credit them individually.
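To illustrate the point, a sketch of two modified annotations from a single write-back, each credited to a different contributor (dummy ORCIDs) while sharing the same task generator; the layout is an assumption, following the same pattern as the examples above:

```json
[
  {
    "type": "Annotation",
    "plazi:annotType": "collectingCountry",
    "creator": { "id": "https://orcid.org/0000-0000-0000-0001", "type": "Person" },
    "generator": { "id": "https://plazi.annostor.org/task/generalTidy", "type": "Software" }
  },
  {
    "type": "Annotation",
    "plazi:annotType": "bibRef",
    "creator": { "id": "https://orcid.org/0000-0000-0000-0002", "type": "Person" },
    "generator": { "id": "https://plazi.annostor.org/task/generalTidy", "type": "Software" }
  }
]
```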

Best, Guido

myrmoteras commented 1 year ago

@dfdan @peterc@blipcreative.com Looking at the document used for this example, I strongly recommend that we only use documents that we have approved for this purpose. That means they need to have the required granularity and are quality controlled. Before a task is started, the document used needs to be agreed to by Plazi.

Here is a list of such test files.

myrmoteras commented 1 year ago

Good afternoon Guido,

Thanks for this. I think it's important to keep in view here that we need first to validate the information we are moving back and forth, and then work up to using the mechanism with automation to demonstrate meaningful enrichment ... there's a bit of a chicken-and-egg situation in that DF needs to have a model for the structure it's going to generate before writing code to do it: sure, we want to get to this point asap, but until then we have to simulate this and make data manually to return to you. So at this stage we are not attempting to hand-craft an authoritative contribution and return hundreds of annotations: this was not intended as a valid enrichment which should over-write a TB record.

Keeping this in mind (I've numbered your responses below ... forgive recourse to HTML colour):

  1. I guess that doesn't affect us here, but maybe Donat would find other examples.
  2. Obviously in a real enrichment it's unlikely that an expert would delete loads of annotations, so of course we'd return to TB everything that hadn't been touched.
  3. We maintain 'plazi:' prepended internally, but can strip this upon output if you prefer.
  4. This is fine - we just made a simple box by hand; obviously when it's generated by Mirador it'll fit the respective text.
  5. If you need to replace the IDs which we have minted for new annotations generated in manual enrichment tasks, we'd like to receive back the official Plazi UUIDs ... this could be done as part of an acknowledgement after you've processed the update into the IMF - let's discuss (a hypothetical sketch of such an acknowledgement follows below).
  6. We could consider adding something like an "ocrCorrection" motivation when we return - but to be consistent I guess we should also then consider adding "boxCorrection" etc. for other operations: we'd need a vocabulary of motivations, and I think we would be taking the lid off something quite complex and time-consuming, so DF would resist this right now. Since we receive WADM from TB, annostor doesn't have word-position information to determine which characters in overlapping annotations would need to be updated, so making other affected annotations consistent will have to be done in TB.
  7. and 8. If one or more changes to annotations have been made, then an enrichment task PID is minted ... probably all you need to know for TB is that the task happened; what the PID is; plus a timestamp; also that it should be credited to a contributor (or machine). We'd prefer to mark each annotation affected by that task ... but TB should at least keep the PID intact in perpetuity, and I guess you'll want to maintain annotation credits in various locations, with people's ORCIDs.

Let's try and get a conversation tomorrow? Morning is bad for me: I can probably fit in with your schedule p.m., though we'll have to check with Dan for availability later in the day.

Best, peterc
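P.S. Following up on point 5: a purely hypothetical sketch of what such an acknowledgement could contain. None of these field names is an agreed format, the timestamp is a placeholder, and the returned Plazi UUID is a dummy; only the annostor-minted UUID is taken from the example above.

```json
{
  "task": "https://plazi.annostor.org/task/generalTidy",
  "processed": "2023-01-01T00:00:00Z",
  "idMap": [
    {
      "annostorId": "dc0f864a-f97c-4f7f-b492-3424a1f84e48",
      "plaziId": "urn:uuid:00000000-0000-0000-0000-000000000000"
    }
  ]
}
```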
