kingsdigitallab / crossreads

Palaeographical environment for CROSSREADS project

Re-encode the word IDs in the annotations files #49

Closed geoffroy-noel-ddh closed 6 months ago

geoffroy-noel-ddh commented 7 months ago

Re-encode the word IDs in the annotations files to match the recent change from base52 to base100.

geoffroy-noel-ddh commented 7 months ago

There's now a Python script in the repo to convert the annotations files (/tools/idconvertor/convert.py). It seems to work well in general, but there are some odd cases reported in the output log.

http-sicily-classics-ox-ac-uk-inscription-isic001408-isic001408-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001420-isic001420-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001420-isic001420_copy2-jpg.json
  10 annotations = 0 converted + 10 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001420-isic001420_copy4-jpg.json
  11 annotations = 0 converted + 11 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001435-isic001435-jpg.json
  14 annotations = 13 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001435-isic001435_copy3-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001439-isic001439-jpg.json
  WARNING: word_id (CAjpQ) encoded with unexpected base (0) (expected 52 or 100)
  30 annotations = 29 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001445-isic001445-jpg.json
  16 annotations = 15 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001447-isic001447-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001448-isic001448-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001463-isic001463-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001464-isic001464-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001465-isic001465-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001471-isic001471-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001472-isic001472-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001473-isic001473-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001474-isic001474-jpg.json
  WARNING: word_id (CAnaK) encoded with unexpected base (0) (expected 52 or 100)
  12 annotations = 11 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001478-isic001478-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001481-isic001481-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001483-isic001483-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001485-isic001485-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001488-isic001488-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001568-isic001568-jpg.json
  8 annotations = 0 converted + 8 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic003031-isic003031-jpg.json
  16 annotations = 0 converted + 16 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic003107-isic003107-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic003363-isic003363-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic003364-isic003364-jpg.json
  WARNING: word_id (EfJHm) encoded with unexpected base (0) (expected 52 or 100)
  55 annotations = 54 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic003375-isic003375-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic003474-isic003474-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020288-isic020288-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020292-isic020292-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020298-isic020298-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020300-isic020300-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020304-isic020304-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020306-isic020306-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020313-isic020313-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020317-isic020317-jpg.json
  WARNING: word_id (boiDG) encoded with unexpected base (0) (expected 52 or 100)
  12 annotations = 11 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic020319-isic020319-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020320-isic020320-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020322-isic020322-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020323-isic020323-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020368-isic020368-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020370-isic020370-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020371-isic020371-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020371-isic020371_rear-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020445-isic020445-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020445-isic020445_copy2-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020597-isic020597-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020598-isic020598-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic020600-isic020600-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic030002-isic001408-jpg.json
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGA) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  13 annotations = 0 converted + 13 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic030002-isic030002-jpg.json

^ in the above:

geoffroy-noel-ddh commented 7 months ago

After manual verification, all the files reporting 0 converted annotations without a WARNING had no actual link to a sign in the text, which explains why no word ID had to be converted.

Here's the list after removing the files that we know were correctly processed:

http-sicily-classics-ox-ac-uk-inscription-isic001435-isic001435-jpg.json
  # Data issue? One odd annotation on the edge of the object, might be an editorial error...
  14 annotations = 13 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001439-isic001439-jpg.json
  # TRANSPLANT: Looks like an annotation of a 'Ε' in a word in 1472 has landed in the 1439 file.
  WARNING: word_id (CAjpQ) encoded with unexpected base (0) (expected 52 or 100)
  30 annotations = 29 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001445-isic001445-jpg.json
  # ? A 'Μ' , which is described but not bound to a sign in the text (textTarget=Null)
  16 annotations = 15 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001474-isic001474-jpg.json
  # TRANSPLANT
  WARNING: word_id (CAnaK) encoded with unexpected base (0) (expected 52 or 100)
  12 annotations = 11 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic001568-isic001568-jpg.json
  # OK: None of the annotations were bound to the text
  8 annotations = 0 converted + 8 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic003031-isic003031-jpg.json
  # OK: None of the annotations were bound to the text
  16 annotations = 0 converted + 16 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic003364-isic003364-jpg.json
  # TRANSPLANT
  WARNING: word_id (EfJHm) encoded with unexpected base (0) (expected 52 or 100)
  55 annotations = 54 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic020317-isic020317-jpg.json
  # TRANSPLANT
  WARNING: word_id (boiDG) encoded with unexpected base (0) (expected 52 or 100)
  12 annotations = 11 converted + 1 unchanged.
http-sicily-classics-ox-ac-uk-inscription-isic030002-isic001408-jpg.json
  # MISMATCH between the inscription ID (30002) and the image (1408)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGA) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHGU) encoded with unexpected base (0) (expected 52 or 100)
  WARNING: word_id (BwHFq) encoded with unexpected base (0) (expected 52 or 100)
  13 annotations = 0 converted + 13 unchanged.
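
A first pass over a log like the one above can be automated. The following is only a minimal sketch: it assumes the log shape quoted in this thread (a file name at the start of a line, followed by zero or more indented report lines), and the function names `triage` and `unchanged_count` are hypothetical, not part of convert.py.

```python
import re

# Assumed shape of a summary line, copied from the log quoted in this thread.
SUMMARY = re.compile(r"(\d+) annotations = (\d+) converted \+ (\d+) unchanged")


def triage(log_text):
    """Group indented report lines under their file name.

    Files with no report lines at all are dropped from the result.
    """
    reports = {}
    current = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace():
            # Indented line: a WARNING or a summary for the current file.
            if current is not None:
                reports[current].append(line.strip())
        else:
            # Unindented line: the name of the next annotation file.
            current = line.strip()
            reports[current] = []
    return {name: lines for name, lines in reports.items() if lines}


def unchanged_count(lines):
    """Return the 'unchanged' figure from a file's summary line, or None."""
    for line in lines:
        match = SUMMARY.search(line)
        if match:
            return int(match.group(3))
    return None


sample = """\
http-sicily-classics-ox-ac-uk-inscription-isic001408-isic001408-jpg.json
http-sicily-classics-ox-ac-uk-inscription-isic001439-isic001439-jpg.json
  WARNING: word_id (CAjpQ) encoded with unexpected base (0) (expected 52 or 100)
  30 annotations = 29 converted + 1 unchanged.
"""

for name, lines in triage(sample).items():
    warnings = sum(1 for entry in lines if entry.startswith("WARNING"))
    print(name, "warnings:", warnings, "unchanged:", unchanged_count(lines))
```

This only narrows the list down to files with something reported; deciding whether a case is OK, a TRANSPLANT or a MISMATCH still needs the manual check described here.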

TRANSPLANT case:

This looks like an Annotator bug, where an annotation belongs to a file about inscription X but actually refers to context in inscription Y. That context comprises the sign, the word ID and the link to the image file.

In the first manifestation above, the annotation CAjpQ:1 (Ε) also belongs to the correct file (http-sicily-classics-ox-ac-uk-inscription-isic001439-isic001439-jpg.json:308).

Probable cause: a selected annotation survives navigation to another inscription in the Annotator, is transplanted there and saved as is.

Remediation: it should be easy to remove the misplaced annotations with a script, or just ignore them in the search indexing script. Ideally we should find which sequence of interactions causes this bug so it can be permanently fixed rather than patched.
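
A minimal sketch of how such a script might flag transplanted annotations. It rests on two assumptions that are inferred from the IDs quoted in this thread rather than taken from convert.py: old word IDs are base 52 over the alphabet A-Z then a-z, and the decoded number is the inscription number followed by a four-digit token part. The helper names are hypothetical.

```python
import re
import string

# Assumption: base-52 word IDs use A-Z followed by a-z as digits 0..51.
BASE52 = string.ascii_uppercase + string.ascii_lowercase


def decode52(word_id):
    """Decode a base-52 word ID into an integer (raises on other symbols)."""
    value = 0
    for char in word_id:
        value = value * 52 + BASE52.index(char)
    return value


def inscription_number(filename):
    """Inscription number from a name like '...-inscription-isic001439-...'."""
    match = re.search(r"inscription-isic(\d+)", filename)
    if match is None:
        raise ValueError(f"unrecognised annotation file name: {filename}")
    return int(match.group(1))


def looks_transplanted(word_id, filename):
    """True if the ID decodes to a different inscription than the file's.

    Assumption: the last four decimal digits of the decoded ID are the
    token part and the leading digits are the inscription number.
    """
    return decode52(word_id) // 10000 != inscription_number(filename)


name = "http-sicily-classics-ox-ac-uk-inscription-isic001439-isic001439-jpg.json"
print(decode52("CAjpQ"), looks_transplanted("CAjpQ", name))
```

Under these assumptions CAjpQ decodes to 14720020, which points at 1472 rather than 1439, so the annotation would be flagged in the 1439 file. The same rule could be used to skip such annotations during search indexing instead of deleting them.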

MISMATCH case:

Similar to the previous case, but here the annotation file is based on a mismatch between the inscription ID and the image ID. The word IDs correspond to the image ID. Each annotation is also found in http-sicily-classics-ox-ac-uk-inscription-isic001408-isic001408-jpg.json.

Very odd situation, not sure what could cause it.

Remediation: this file could be deleted, or its annotations ignored using the same rule as suggested in the previous case.
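
Files of this kind can be spotted from the file name alone. A minimal sketch, assuming the naming pattern seen in this thread (`...-inscription-isicNNNNNN-isicNNNNNN[_suffix]-jpg.json`); the function names are hypothetical.

```python
import re

# Assumed naming pattern: inscription ID first, then the image ID.
PATTERN = re.compile(r"inscription-isic(\d+)-isic(\d+)")


def ids_from_filename(filename):
    """Return (inscription_id, image_id) as integers, or None if no match."""
    match = PATTERN.search(filename)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_mismatched(filename):
    """True when the inscription ID and the image ID in the name differ."""
    ids = ids_from_filename(filename)
    return ids is not None and ids[0] != ids[1]


for name in [
    "http-sicily-classics-ox-ac-uk-inscription-isic030002-isic001408-jpg.json",
    "http-sicily-classics-ox-ac-uk-inscription-isic001420-isic001420_copy2-jpg.json",
]:
    print(name, is_mismatched(name))
```

Suffixes such as `_copy2` or `_rear` after the image ID are ignored by the pattern, so only a genuine difference between the two numbers is reported.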

simonastoyanova commented 7 months ago

Error in 1435 seems to be a box to the side of the inscription, with no letter there and no binding to the text; I've seen it appear before as an artefact and always delete it. Not sure where it's coming from; I assumed testing. I will leave it as is for investigation. Screenshot: Screenshot 2024-02-29 at 17 09 16

Error in 1439: the rogue ID CAjpQ is the base 52 one; not sure why it didn't convert to BvUAU, which is its base 100 equivalent. In any case, it's a token from 1472; not sure why it showed up in 1439. It shows the correct ID in 1472.

Error in 1445: not sure, there are two 'Μ' characters in the text, one bound to token BsyAo and one to BsyAΤ.

Error in 1474: the annotation of 'Α' is present both with the old ID (CAnaK) from 1473 (new ID BveAy; all correct in the annotations in 1473) and with its correct ID BvoAe.

1568 isn't visible in the annotator due to upper case I in filename, now corrected. The annotations look linked to tokens, I will check when it's visible again.

3031 fine now.

Error in 3364: ID EfJHm is base 52, from a token in 3363, where it now has the new ID DkeAy.

Error in 20317: ID boiDG is base 52, from a token in 20313, where it now has the new ID UfeAy.

Error with 30002 and 1408: no idea why the base 52 tokens were transplanted to 30002, the annotations list for 30002 now has the correct token ids.
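
For reference, the old/new ID pairs quoted above can be compared numerically. This is a sketch under an assumption that is not confirmed by convert.py: the first 52 symbols of both alphabets are A-Z then a-z. The remaining 48 symbols of the base 100 alphabet are not given in this thread, so IDs using them are rejected here.

```python
import string

# Assumption: digits 0..51 are A-Z then a-z in both encodings.
# Digits 52..99 of the base-100 alphabet are unknown and not handled.
SYMBOLS = string.ascii_uppercase + string.ascii_lowercase


def decode(word_id, base):
    """Decode a word ID positionally in the given base (52 or 100)."""
    value = 0
    for char in word_id:
        digit = SYMBOLS.find(char)
        if digit < 0:
            raise ValueError(f"symbol {char!r} is outside the assumed alphabet")
        value = value * base + digit
    return value


# (old base-52 ID, new base-100 ID) pairs as reported in this comment.
pairs = [
    ("CAjpQ", "BvUAU"),
    ("CAnaK", "BveAy"),
    ("EfJHm", "DkeAy"),
    ("boiDG", "UfeAy"),
]

for old, new in pairs:
    print(old, decode(old, 52), "->", new, decode(new, 100))
```

Under this assumption CAjpQ decodes to 14720020 and BvUAU to 147200020. In each pair the two numbers start with the same inscription number and differ by one extra digit in the token part, which suggests (but does not prove) that the re-encoding also widened the token number rather than only changing the base.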

geoffroy-noel-ddh commented 6 months ago

Thank you for reporting on those cases! I'll close this issue as it pertains to the re-encoding, which is done. I've opened new issues to address the bug.