Closed by edstratford 1 year ago
Looks like 2 whole words from the same text that somehow had the same discourse uuid. Good catch! I’ll sort this out. Thanks.
It also looks like this does a good job of identifying the 60 texts that have really bad, messed-up discourse hierarchies...
Gertrudius: In that case, I'll go ahead and exclude those texts from the procedure for fixing object_on_tablet for now so I can move forward with that. I'll just remove the exclusion when we have that issue sorted out.
The issue covers a few different phenomena:
1, 2, 4, and 5 have been fixed. Some texts had their entire discourse hierarchy fixed, but not all of them, out of concern for time.
See https://docs.google.com/spreadsheets/d/1plDfXGIgckiY61_gltKKKYGzSZqxr3vouLVEEyGi50g/edit#gid=0 for tracking of errors.
At this time all instances that are actual errors have been corrected. Some texts have had their discourse hierarchy completely cleaned. There is a key in the google sheet showing this:
KEY:
Green - text fixed and discourse hierarchy correct throughout
Red - couldn't find
Orange - legitimate example of word breaking across line
Yellow - error fixed, but text discourse hierarchy not fully corrected
Purple - damage across different lines - fixed - discourse hierarchy not addressed beyond that
Brown - didn't seem to be a problem
Created new issue to mop up hierarchy issues with link to google sheet. This issue closed.
Gertrudius mid-May 2023:
I spotted a pretty big issue when doing some error checking on my fix iteration procedures: epigraphic units with the same discourse_uuid do not share the same parent_uuid. I found this by checking for duplicate char_on_line values grouped by discourse_uuid rather than by parent_uuid.
SELECT *
FROM text_epigraphy
WHERE discourse_uuid IN (
    SELECT discourse_uuid
    FROM text_epigraphy
    WHERE discourse_uuid IS NOT NULL
      AND parent_uuid IN (SELECT uuid FROM text_epigraphy WHERE type = "line")
    GROUP BY discourse_uuid, char_on_line
    HAVING COUNT(char_on_line) != 1
    ORDER BY COUNT(char_on_line) DESC
);
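The same condition can also be tested directly by counting distinct parents per discourse_uuid. The following is only a minimal sketch against a toy in-memory SQLite table: the column names are taken from the query above, but the table definition, the sample rows, and the helper name `find_split_discourse_units` are assumptions made for illustration, not the project's actual schema or tooling.

```python
import sqlite3


def find_split_discourse_units(conn):
    """Return (discourse_uuid, parent_count) for every discourse unit whose
    epigraphic units sit under more than one line."""
    query = """
        SELECT discourse_uuid, COUNT(DISTINCT parent_uuid) AS parent_count
        FROM text_epigraphy
        WHERE discourse_uuid IS NOT NULL
          AND parent_uuid IN (SELECT uuid FROM text_epigraphy WHERE type = 'line')
        GROUP BY discourse_uuid
        HAVING COUNT(DISTINCT parent_uuid) > 1
        ORDER BY parent_count DESC, discourse_uuid
    """
    return conn.execute(query).fetchall()


# Toy data (hypothetical): two lines, two words; "word-b" spans both lines.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE text_epigraphy (
        uuid TEXT PRIMARY KEY,
        type TEXT,
        parent_uuid TEXT,
        discourse_uuid TEXT,
        char_on_line INTEGER
    )
    """
)
rows = [
    ("line-1", "line", None, None, None),
    ("line-2", "line", None, None, None),
    ("sign-1", "sign", "line-1", "word-a", 1),
    ("sign-2", "sign", "line-1", "word-a", 2),
    ("sign-3", "sign", "line-1", "word-b", 3),
    ("sign-4", "sign", "line-2", "word-b", 1),
]
conn.executemany("INSERT INTO text_epigraphy VALUES (?, ?, ?, ?, ?)", rows)

split_units = find_split_discourse_units(conn)
print(split_units)
```

Rows returned this way are candidates rather than confirmed errors: a word that legitimately breaks across a line will also have two parents, so each hit still needs a manual check.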