Insert text_discourse rows for broken regions in text_epigraphy that are missing

edstratford commented 1 year ago

Text_discourse represents language (words, phrases, sentences, etc.). Text_epigraphy represents the physical markings on the tablet. In text_epigraphy, we use the region to designate things that don’t fit neatly into the category of line or inside of lines (seal impressions, rulings, large breaks of an unknown # of lines).

We also need region in text_discourse to represent the one thing that won’t fit into words, numbers, phrases, clauses, sentences, or paragraphs --- large breaks in the text, where the thread of the conversation or text gets lost.

Currently, there are about 3050 instances of breaks of 1 or more lines or breaks of an unknown number of lines (a region with text_markup.type 'broken). Most (2720) of these have a corresponding region in text_disocurse. These text_discourse regions have explicit_spelling and transcription content of ‘(large break)’ or ‘(# broken lines)’ .

We need to insert the remaining 330 or so of these with appropriate explicit_spelling and transcription content. The query below selects these at the top and the ones in good order below (for comparison).

these regions DO take a word_on_tablet increment (as long as that column is still in use).

Parent_uuid for all should be the discourseUnit -- for any break of more than 1 line, this will be the rule. In the future, 1 or 2 line breaks can be reviewed to see if they remain in a paragraph where the thread of the conversation is clearly on the same topic (or in debt notes, etc. where the structure of the text is obvious.

SELECT te.id, te.uuid, te.text_uuid, te.object_on_tablet, tm.*, td.id, td.uuid, td.type, td.obj_in_text, td.parent_uuid, td.explicit_spelling, td.transcription FROM text_epigraphy te INNER JOIN text_markup tm ON tm.reference_uuid = te.uuid AND tm.type IN ('undeterminedLines','broken') LEFT JOIN text_discourse td ON te.discourse_uuid = td.uuid ORDER by td.type, td.explicit_spelling, te.text_uuid;

Will require discourse_uuid on the text_epigraphy rows, and incrementing of the obj_in_text, word_on_tablet, child_num.

(FOR LATER: -> In cases where the region clearly straddles two known paragraphs (such as when two broken lines clearly have the transition between two predictable sections of a debt note -- perhaps in this case again, it should be the child of the discourse unit, and the two paragraph sections break off and resume on either side of it... MAKE DETERMINATION)

Gertrudius commented 1 year ago

There are 11 undeterminedLines or broken that have an explicit_spelling and transcription value of (broken area). I'm assuming these should be brought into conformity with the (large break) and (# broken lines) paradigms.

edstratford commented 1 year ago

Correct. Please change as described.

Gertrudius commented 1 year ago

There are a number of texts that have this issue that have no content at all in text_discourse, I'll leave those alone for the moment so we can address the issue of adding discourseUnit and any other relevant items to discourse separately.

SELECT te.uuid AS te_uuid, te.text_uuid, te.type AS te_type, tm.type AS tm_type, tm.num_value, td.uuid, td.type AS td_type, td.obj_in_text, td.word_on_tablet, td.child_num, td.parent_uuid AS td_parent_uuid, td.explicit_spelling, td.transcription FROM text_epigraphy AS te LEFT JOIN text_discourse AS td ON te.discourse_uuid = td.uuid LEFT JOIN text_markup AS tm ON te.uuid = tm.reference_uuid AND tm.type IN ("undeterminedLines","broken") WHERE te.text_uuid NOT IN (SELECT text_uuid FROM text_discourse) ORDER BY te.text_uuid, te.object_on_tablet;

Gertrudius commented 1 year ago

While working on this I ran into another side issue. We have a number of duplicate object_on_tablet values in text_epigraphy, 29,266 to be exact, ranging from 2 to 5 duplicate values. I noticed this problem when I ran into columns that had been inserted without the correct object_on_tablet value, though that only makes up 4,455 of them.

SELECT *, COUNT(te.object_on_tablet) AS count FROM text_epigraphy AS te GROUP BY te.text_uuid, te.object_on_tablet HAVING COUNT(te.object_on_tablet) > 1;

SELECT *, COUNT(te.object_on_tablet) AS count FROM text_epigraphy AS te WHERE type = "column" GROUP BY te.text_uuid, te.object_on_tablet HAVING COUNT(te.object_on_tablet) > 1;

edstratford commented 1 year ago

Rerun above query - this last problem might be solved.

edstratford commented 1 year ago

This issue and sub issues have been moved to oare_sql.

oaregithub / oare_mono

Insert text_discourse rows for broken regions in text_epigraphy that are missing #1578