oaregithub / oare_sql

Insert text_discourse rows for broken regions in text_epigraphy that are missing #16

Open edstratford opened 1 year ago

edstratford commented 1 year ago

Text_discourse represents language (words, phrases, sentences, etc.). Text_epigraphy represents the physical markings on the tablet. In text_epigraphy, we use the region to designate things that don’t fit neatly into the category of line or inside of lines (seal impressions, rulings, large breaks of an unknown # of lines).

We also need region in text_discourse to represent the one thing that won’t fit into words, numbers, phrases, clauses, sentences, or paragraphs --- large breaks in the text, where the thread of the conversation or text gets lost.

Currently, there are about 3050 instances of breaks of 1 or more lines or breaks of an unknown number of lines (a region with text_markup.type 'broken'). Most (2720) of these have a corresponding region in text_discourse. These text_discourse regions have explicit_spelling and transcription content of ‘(large break)’ or ‘(# broken lines)’.

We need to insert the remaining 330 or so of these with appropriate explicit_spelling and transcription content. The query below returns these rows at the top (their text_discourse columns come back NULL from the LEFT JOIN) and the rows already in good order beneath them, for comparison.

These regions DO take a word_on_tablet increment (as long as that column is still in use).

Parent_uuid for all should be the discourseUnit; for any break of more than 1 line, this will be the rule. In the future, 1 or 2 line breaks can be reviewed to see if they remain in a paragraph where the thread of the conversation is clearly on the same topic (or in debt notes, etc., where the structure of the text is obvious).

SELECT te.id, te.uuid, te.text_uuid, te.object_on_tablet, tm.*, td.id, td.uuid, td.type, td.obj_in_text, td.parent_uuid, td.explicit_spelling, td.transcription
FROM text_epigraphy te
INNER JOIN text_markup tm ON tm.reference_uuid = te.uuid AND tm.type IN ('undeterminedLines','broken')
LEFT JOIN text_discourse td ON te.discourse_uuid = td.uuid
ORDER BY td.type, td.explicit_spelling, te.text_uuid;
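To isolate only the rows that still need a text_discourse region, the same join can be filtered on td.uuid IS NULL. The following is a minimal sketch run against an in-memory SQLite database rather than the real OARE database; the table definitions are cut down to the columns the query touches, and the sample rows ('e1', 'd1', the 'damage' markup type, and so on) are invented purely for illustration.

```python
import sqlite3

# Toy schema: only the columns the query needs. The real tables have more.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text_epigraphy (
    uuid TEXT PRIMARY KEY, text_uuid TEXT,
    object_on_tablet INTEGER, discourse_uuid TEXT);
CREATE TABLE text_markup (reference_uuid TEXT, type TEXT);
CREATE TABLE text_discourse (
    uuid TEXT PRIMARY KEY, type TEXT,
    explicit_spelling TEXT, transcription TEXT);

-- Invented sample rows: e1 already has a region, e2 is missing one,
-- e3 carries an unrelated (made-up) markup type and must be ignored.
INSERT INTO text_discourse VALUES
    ('d1', 'region', '(large break)', '(large break)');
INSERT INTO text_epigraphy VALUES
    ('e1', 't1', 5, 'd1'), ('e2', 't1', 9, NULL), ('e3', 't2', 3, NULL);
INSERT INTO text_markup VALUES
    ('e1', 'undeterminedLines'), ('e2', 'broken'), ('e3', 'damage');
""")

# Same join as in the issue, restricted to epigraphy rows with no discourse row.
missing = conn.execute("""
SELECT te.uuid, te.text_uuid, te.object_on_tablet, tm.type
FROM text_epigraphy te
INNER JOIN text_markup tm
    ON tm.reference_uuid = te.uuid
   AND tm.type IN ('undeterminedLines', 'broken')
LEFT JOIN text_discourse td ON te.discourse_uuid = td.uuid
WHERE td.uuid IS NULL
ORDER BY te.text_uuid, te.object_on_tablet
""").fetchall()

print(missing)
```

With the sample data above, only the 'e2' row is returned.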

The insert will require setting discourse_uuid on the text_epigraphy rows and incrementing obj_in_text, word_on_tablet, and child_num.
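The bookkeeping for one inserted region might look roughly like the sketch below. It is an assumption-heavy illustration, not the actual migration: it uses an in-memory SQLite database, a trimmed text_discourse schema, a hypothetical helper named insert_break_region, and a toy text whose words hang directly off the discourseUnit (real texts nest words inside phrases, sentences, and paragraphs). It also assumes that "incrementing" means shifting every later row in the same text (or, for child_num, every later sibling) up by one.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text_discourse (
    uuid TEXT PRIMARY KEY, text_uuid TEXT, type TEXT, parent_uuid TEXT,
    obj_in_text INTEGER, word_on_tablet INTEGER, child_num INTEGER,
    explicit_spelling TEXT, transcription TEXT);
CREATE TABLE text_epigraphy (
    uuid TEXT PRIMARY KEY, text_uuid TEXT, discourse_uuid TEXT);

-- Invented toy text: one discourseUnit with two words as direct children.
INSERT INTO text_discourse VALUES
    ('du', 't1', 'discourseUnit', NULL, 1, NULL, NULL, NULL, NULL),
    ('w1', 't1', 'word', 'du', 2, 1, 1, 'word-1', 'word-1'),
    ('w2', 't1', 'word', 'du', 3, 2, 2, 'word-2', 'word-2');
-- The broken region in text_epigraphy that has no discourse row yet.
INSERT INTO text_epigraphy VALUES ('e_break', 't1', NULL);
""")


def insert_break_region(conn, text_uuid, epigraphy_uuid, parent_uuid,
                        obj_in_text, word_on_tablet, child_num, label):
    """Hypothetical helper: make room at the given position, insert the
    region, and link the epigraphy row to it, all in one transaction."""
    new_uuid = str(uuid.uuid4())
    with conn:  # commits on success, rolls back if any statement fails
        conn.execute(
            "UPDATE text_discourse SET obj_in_text = obj_in_text + 1 "
            "WHERE text_uuid = ? AND obj_in_text >= ?",
            (text_uuid, obj_in_text))
        conn.execute(
            "UPDATE text_discourse SET word_on_tablet = word_on_tablet + 1 "
            "WHERE text_uuid = ? AND word_on_tablet >= ?",
            (text_uuid, word_on_tablet))
        conn.execute(
            "UPDATE text_discourse SET child_num = child_num + 1 "
            "WHERE parent_uuid = ? AND child_num >= ?",
            (parent_uuid, child_num))
        conn.execute(
            "INSERT INTO text_discourse VALUES (?, ?, 'region', ?, ?, ?, ?, ?, ?)",
            (new_uuid, text_uuid, parent_uuid, obj_in_text,
             word_on_tablet, child_num, label, label))
        conn.execute(
            "UPDATE text_epigraphy SET discourse_uuid = ? WHERE uuid = ?",
            (new_uuid, epigraphy_uuid))
    return new_uuid


# Place the break between the two words, as a child of the discourseUnit.
new_uuid = insert_break_region(conn, "t1", "e_break", "du", 3, 2, 2, "(large break)")

rows = conn.execute(
    "SELECT uuid, type, obj_in_text, word_on_tablet, child_num "
    "FROM text_discourse ORDER BY obj_in_text").fetchall()
linked = conn.execute(
    "SELECT discourse_uuid FROM text_epigraphy WHERE uuid = 'e_break'").fetchone()[0]
```

Running the shifts and the insert inside a single transaction keeps the numbering consistent if any step fails partway through.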

(FOR LATER: In cases where the region clearly straddles two known paragraphs (such as when two broken lines clearly contain the transition between two predictable sections of a debt note), perhaps it should again be the child of the discourse unit, with the two paragraph sections breaking off and resuming on either side of it. MAKE DETERMINATION.)

edstratford commented 1 year ago

Gertrudius late Dec 2022:

There are 11 undeterminedLines or broken that have an explicit_spelling and transcription value of (broken area). I'm assuming these should be brought into conformity with the (large break) and (# broken lines) paradigms.

edstratford commented 1 year ago

Stratford: Correct. Please change as described.
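One way the '(broken area)' rows could be brought into line is sketched below. Several things here are assumptions and not confirmed in this thread: the choice between the two labels is made from a hypothetical num_value column on text_markup (a known line count gives '(N broken lines)', an unknown count gives '(large break)'), the helper conforming_label is invented, and the data lives in an in-memory SQLite database with made-up rows. Singular wording for a one-line break is not handled.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text_epigraphy (uuid TEXT PRIMARY KEY, discourse_uuid TEXT);
-- num_value is an ASSUMED column holding the number of broken lines, if known.
CREATE TABLE text_markup (reference_uuid TEXT, type TEXT, num_value INTEGER);
CREATE TABLE text_discourse (
    uuid TEXT PRIMARY KEY, type TEXT,
    explicit_spelling TEXT, transcription TEXT);

-- Invented rows: d1 and d2 use the stray label, d3 is already conforming.
INSERT INTO text_discourse VALUES
    ('d1', 'region', '(broken area)', '(broken area)'),
    ('d2', 'region', '(broken area)', '(broken area)'),
    ('d3', 'region', '(large break)', '(large break)');
INSERT INTO text_epigraphy VALUES ('e1', 'd1'), ('e2', 'd2'), ('e3', 'd3');
INSERT INTO text_markup VALUES
    ('e1', 'broken', NULL),
    ('e2', 'undeterminedLines', 3),
    ('e3', 'broken', NULL);
""")


def conforming_label(num_lines):
    """Assumed rule: unknown line count -> '(large break)',
    known count -> '(N broken lines)'."""
    if num_lines is None:
        return "(large break)"
    return f"({num_lines} broken lines)"


# Find discourse regions that still carry the nonconforming label.
stale = conn.execute("""
SELECT td.uuid, tm.num_value
FROM text_discourse td
INNER JOIN text_epigraphy te ON te.discourse_uuid = td.uuid
INNER JOIN text_markup tm
    ON tm.reference_uuid = te.uuid
   AND tm.type IN ('undeterminedLines', 'broken')
WHERE td.explicit_spelling = '(broken area)'
""").fetchall()

with conn:  # apply all relabels in one transaction
    for discourse_uuid, num_lines in stale:
        label = conforming_label(num_lines)
        conn.execute(
            "UPDATE text_discourse SET explicit_spelling = ?, transcription = ? "
            "WHERE uuid = ?",
            (label, label, discourse_uuid))

labels = dict(conn.execute("SELECT uuid, explicit_spelling FROM text_discourse"))
```

With only 11 affected rows, reviewing the `stale` list by hand before running the update would be a reasonable safeguard.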

Gertrudius commented 5 months ago

It appears that we have a couple of cases where a single region in text_discourse is referenced (via discourse_uuid) by multiple broken/undeterminedLines regions in text_epigraphy. This seems to occur when one broken region ends a side and another begins the next side. Now obviously it's not essential to have a separate discourse region for each if there is no intervening text, but is that an organizational paradigm we plan to continue to support in the future?

-- Count how many broken/undeterminedLines epigraphy rows point at each discourse region.
-- Only grouped columns and aggregates are selected, so this also runs under ONLY_FULL_GROUP_BY.
SELECT td.uuid, td.type, td.obj_in_text, td.parent_uuid, td.explicit_spelling, td.transcription, COUNT(td.uuid) AS this_count
FROM text_epigraphy te
INNER JOIN text_markup tm ON tm.reference_uuid = te.uuid AND tm.type IN ('undeterminedLines','broken')
LEFT JOIN text_discourse td ON te.discourse_uuid = td.uuid
GROUP BY td.uuid, td.type, td.obj_in_text, td.parent_uuid, td.explicit_spelling, td.transcription
ORDER BY this_count DESC;