oaregithub / oare_sql


Deal with improper discourse_uuid assigned to two different words #12

Closed edstratford closed 1 year ago

edstratford commented 1 year ago

Gertrudius mid-May 2023:

I spotted a pretty big issue while error-checking my fix-iteration procedures: some epigraphic units with the same discourse_uuid do not share the same parent_uuid. I found it by checking for duplicate char_on_line values among rows that share a discourse_uuid, rather than among rows that share a parent_uuid.

SELECT *
FROM text_epigraphy
WHERE discourse_uuid IN (
  SELECT discourse_uuid
  FROM text_epigraphy
  WHERE discourse_uuid IS NOT NULL
    AND parent_uuid IN (SELECT uuid FROM text_epigraphy WHERE type = 'line')
  GROUP BY discourse_uuid, char_on_line
  HAVING COUNT(char_on_line) != 1
  ORDER BY COUNT(char_on_line) DESC
);
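A more direct way to state the same condition (one discourse_uuid attached to units under more than one line) is to count distinct parent_uuid values per discourse_uuid. The following is only a minimal sketch against a toy in-memory SQLite table, not the project's actual schema or tooling: the table models just the columns named in the query above, the sample rows are invented, and the helper name discourse_uuids_spanning_lines is hypothetical.

```python
import sqlite3

# Toy stand-in for text_epigraphy. Only the columns mentioned in the thread
# are modelled; the real table is assumed to have more.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE text_epigraphy (
        uuid TEXT PRIMARY KEY,
        type TEXT,
        parent_uuid TEXT,
        discourse_uuid TEXT,
        char_on_line INTEGER
    )
    """
)
rows = [
    ("line-1", "line", None, None, None),
    ("line-2", "line", None, None, None),
    # word-a: both signs sit on the same line (consistent).
    ("sign-1", "sign", "line-1", "word-a", 1),
    ("sign-2", "sign", "line-1", "word-a", 2),
    # word-b: signs sit on two different lines (should be flagged).
    ("sign-3", "sign", "line-1", "word-b", 3),
    ("sign-4", "sign", "line-2", "word-b", 1),
]
conn.executemany("INSERT INTO text_epigraphy VALUES (?, ?, ?, ?, ?)", rows)


def discourse_uuids_spanning_lines(conn):
    """Return discourse_uuids whose units hang off more than one line."""
    query = """
        SELECT discourse_uuid
        FROM text_epigraphy
        WHERE discourse_uuid IS NOT NULL
          AND parent_uuid IN (SELECT uuid FROM text_epigraphy WHERE type = 'line')
        GROUP BY discourse_uuid
        HAVING COUNT(DISTINCT parent_uuid) > 1
        ORDER BY discourse_uuid
    """
    return [row[0] for row in conn.execute(query)]


flagged = discourse_uuids_spanning_lines(conn)
print(flagged)
```

Like the original query, this flags every discourse_uuid that spans lines, so its results still need manual review to separate real errors from words that legitimately break across lines.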

edstratford commented 1 year ago

Looks like 2 whole words from the same text somehow ended up with the same discourse uuid. Good catch! I’ll sort this out. Thanks,

It also looks like this does a good job of identifying the 60 texts that have really bad, messed-up discourse hierarchies...

edstratford commented 1 year ago

Gertrudius: In that case, I'll go ahead and exclude those texts from the procedure for fixing object_on_tablet for now so I can move forward with that. I'll just remove the exclusion when we have that issue sorted out.
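A temporary exclusion like this could be written as a NOT IN filter that is simply deleted once the data is clean. The sketch below is an illustration only: it assumes text_epigraphy has a text_uuid column linking each unit to its text, uses invented rows in an in-memory SQLite table, and the names EXCLUDED_TEXTS and texts_to_process are hypothetical rather than part of the actual object_on_tablet procedure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; text_uuid is an assumption, not confirmed by the thread.
conn.execute(
    """
    CREATE TABLE text_epigraphy (
        uuid TEXT PRIMARY KEY,
        text_uuid TEXT,
        type TEXT,
        parent_uuid TEXT,
        discourse_uuid TEXT
    )
    """
)
rows = [
    # text-1: one word, all signs on one line (clean).
    ("t1-line-1", "text-1", "line", None, None),
    ("t1-sign-1", "text-1", "sign", "t1-line-1", "word-a"),
    ("t1-sign-2", "text-1", "sign", "t1-line-1", "word-a"),
    # text-2: one discourse_uuid shared by signs on two lines (excluded).
    ("t2-line-1", "text-2", "line", None, None),
    ("t2-line-2", "text-2", "line", None, None),
    ("t2-sign-1", "text-2", "sign", "t2-line-1", "word-b"),
    ("t2-sign-2", "text-2", "sign", "t2-line-2", "word-b"),
]
conn.executemany("INSERT INTO text_epigraphy VALUES (?, ?, ?, ?, ?)", rows)

# Texts containing at least one discourse_uuid that spans several parents.
EXCLUDED_TEXTS = """
    SELECT DISTINCT text_uuid
    FROM text_epigraphy
    WHERE discourse_uuid IN (
        SELECT discourse_uuid
        FROM text_epigraphy
        WHERE discourse_uuid IS NOT NULL
        GROUP BY discourse_uuid
        HAVING COUNT(DISTINCT parent_uuid) > 1
    )
"""


def texts_to_process(conn):
    """List texts that are safe to process while the exclusion is in place."""
    query = f"""
        SELECT DISTINCT text_uuid
        FROM text_epigraphy
        WHERE text_uuid NOT IN ({EXCLUDED_TEXTS})
        ORDER BY text_uuid
    """
    return [row[0] for row in conn.execute(query)]


safe_texts = texts_to_process(conn)
print(safe_texts)
```

Keeping the exclusion in a single named subquery means it can be dropped in one place once the affected texts are fixed.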

edstratford commented 1 year ago

The issue covers a few different phenomena:

  1. Words that legitimately break across lines
  2. Errors (mostly in the TMH volume) that arose from problems with the discourse hierarchy
  3. undeterminedSigns across multiple lines that are lumped into one word
  4. A few instances of numbers on 2 consecutive lines that were improperly processed
  5. A few more various errors

1, 2, 4, and 5 have been fixed. Some texts had their entire discourse hierarchy fixed, but not all, out of concern for time.

See https://docs.google.com/spreadsheets/d/1plDfXGIgckiY61_gltKKKYGzSZqxr3vouLVEEyGi50g/edit#gid=0 for tracking of errors.

edstratford commented 1 year ago

At this time all instances that are actual errors have been corrected. Some texts have had their discourse hierarchy completely cleaned. There is a key in the google sheet showing this:

KEY:
  Green - text fixed and discourse hierarchy correct throughout
  Red - couldn't find
  Orange - legitimate example of word breaking across line
  Yellow - error fixed, but text discourse hierarchy not fully corrected
  Purple - damage across different lines - fixed - discourse hierarchy not addressed beyond that
  Brown - didn't seem to be a problem

edstratford commented 1 year ago

Created new issue to mop up hierarchy issues with link to google sheet. This issue closed.