oaregithub / oare_mono


DRAFT Dirty parser #1296

Open edstratford opened 2 years ago

edstratford commented 2 years ago

There are lots of explicit_spellings that still need to be added to the dictionary. Many are verbal forms, which can (to some degree) be recognized and analyzed by a script. I will try to lay out the rules for that analysis here.

STEP 1
Change the cuneiform spelling into a pseudo-bound transcription. To do this:
1) Convert all á, à to plain a, and likewise for e, i, u.
2) Convert any dash that has a consonant (bdgḫklmnpqrsṣštṭwz) before it and a vowel after it into a single quote (e.g. t-a > t'a).
3) Eliminate any remaining dashes.
4) Convert any instance of three vowels in a row to vowel-'-vowel (e.g. aaa > a'a).
5) Delete the second of any remaining two vowels in a row (e.g. aa > a).
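A minimal Python sketch of the five Step 1 rules, for illustration only (the script discussed in this thread is PHP and is not shown here). The function name `to_pseudo_transcription` is hypothetical. It assumes that rule 4 applies to any three adjacent vowels and drops the middle one, and it leaves uppercase logograms and "." separators untouched.

```python
import re
import unicodedata

# Consonant inventory taken from rule 2; normalized so that ḫ, ṣ, š, ṭ are
# single precomposed code points inside the regex character class.
CONSONANTS = unicodedata.normalize("NFC", "bdgḫklmnpqrsṣštṭwz")
VOWELS = "aeiu"

# Rule 1: acute/grave vowels become plain vowels.
ACCENT_MAP = str.maketrans("áàéèíìúù", "aaeeiiuu")


def to_pseudo_transcription(spelling: str) -> str:
    """Reduce a dash-separated spelling to a rough pseudo-transcription."""
    text = unicodedata.normalize("NFC", spelling).translate(ACCENT_MAP)
    # Rule 2: consonant-dash-vowel becomes consonant-quote-vowel.
    text = re.sub(rf"(?<=[{CONSONANTS}])-(?=[{VOWELS}])", "'", text)
    # Rule 3: drop every remaining dash.
    text = text.replace("-", "")
    # Rule 4 (assumed reading): three vowels in a row -> first ' third.
    text = re.sub(rf"([{VOWELS}])[{VOWELS}]([{VOWELS}])", r"\1'\2", text)
    # Rule 5: of two adjacent vowels, keep only the first.
    text = re.sub(rf"([{VOWELS}])[{VOWELS}]", r"\1", text)
    return text


print(to_pseudo_transcription("ta-áš-pu-ra-am"))
```

The rules run strictly in the listed order, so rule 5 only ever sees vowel pairs that rule 4 left behind.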

This will not be a final form. It ignores the possibility of a doubled consonant unless it's already in the spelling (rare), doubled alephs won't show up either, and some alephs that belong between vowels will be lost. But with this pseudo-transcription we can start making guesses.

edstratford commented 2 years ago

STILL DRAFTING
STEP 2
(This will only work for strong finite verbs.)
Declare the following variables:
Morphological Form
Stem
Tense
Person
Gender
Grammatical Number
Clitic 1 (for ventive)
Clitic 2 (for suffix pronouns)
Clitic 2 Person
Clitic 2 Grammatical Number
Clitic 2 Gender
Clitic 2 Case
Clitic 3 (for second suffix pronoun)
Clitic 3 Person
Clitic 3 Grammatical Number
Clitic 3 Gender
Clitic 3 Case
Clitic 4 (for -ma/-man/-min)
Pass
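One possible way to hold these variables in a script is a small record type. The sketch below uses Python dataclasses; the class and field names (`VerbParse`, `CliticPronoun`, `passed`) are illustrative choices, not names from the project, and every field starts empty because the later loops fill them in.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CliticPronoun:
    """Features of one suffix pronoun (Clitic 2 or Clitic 3)."""
    form: Optional[str] = None
    person: Optional[str] = None
    number: Optional[str] = None
    gender: Optional[str] = None
    case: Optional[str] = None


@dataclass
class VerbParse:
    """Working analysis of one pseudo-transcription."""
    morphological_form: Optional[str] = None
    stem: Optional[str] = None
    tense: Optional[str] = None
    person: Optional[str] = None
    gender: Optional[str] = None
    number: Optional[str] = None
    clitic_1: Optional[str] = None  # ventive
    clitic_2: CliticPronoun = field(default_factory=CliticPronoun)
    clitic_3: CliticPronoun = field(default_factory=CliticPronoun)
    clitic_4: Optional[str] = None  # -ma / -man / -min
    passed: bool = True  # set to False when a form is dropped


parse = VerbParse(stem="G", person="Third Person")
parse.clitic_2.case = "Accusative"
print(parse)
```

Grouping the four pronoun features into one nested record keeps Clitic 2 and Clitic 3 symmetric, which is a design choice of this sketch rather than something the draft requires.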

STEP 3
(Also: any form that is less than 5 characters long will automatically fail.)
Start the sorting through analysis of the pseudo-transcription (PT).
**Note: assignment of some variables is temporary, for example assigning Stem = 'G' before final determination of Gt or Gtn. This is revised through the process.
--First loop
If PT REGEXP '^i' THEN Stem = 'G', Person = 'Third Person', Gender = 'Masculine', Grammatical Number = 'Singular' AND PT = PT minus first character
If PT REGEXP '^ta' THEN Stem = 'G', Person = 'Second Person', Gender = 'Masculine', Grammatical Number = 'Singular' AND PT = PT minus first two characters
If PT REGEXP '^a' THEN Stem = 'G', Person = 'First Person', Gender = 'Common', Grammatical Number = 'Singular' AND PT = PT minus first character
If PT REGEXP '^ni' THEN Stem = 'G', Person = 'First Person', Gender = 'Common', Grammatical Number = 'Plural' AND PT = PT minus first two characters
If PT REGEXP '^u' THEN Stem = 'D', Person = 'Third Person', Gender = 'Masculine', Grammatical Number = 'Singular' AND PT = PT minus first character
If PT REGEXP '^tu' THEN Stem = 'D', Person = 'Second Person', Gender = 'Masculine', Grammatical Number = 'Singular' AND PT = PT minus first two characters
If PT REGEXP '^nu' THEN Stem = 'D', Person = 'First Person', Gender = 'Common', Grammatical Number = 'Plural' AND PT = PT minus first two characters
ELSE DROP form (Pass = "Fail")
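The first loop can be sketched as a table lookup in Python. This is only an illustration: `PREFIX_RULES` and `first_loop` are hypothetical names, the sketch strips exactly as many characters as the prefix is long, and it assumes the D-stem second-person prefix is 'tu' so that it can be told apart from G-stem 'ta'. It takes the first matching row and does not attempt to weigh competing analyses.

```python
# (prefix, Stem, Person, Gender, Grammatical Number), in the order listed above.
PREFIX_RULES = [
    ("i", "G", "Third Person", "Masculine", "Singular"),
    ("ta", "G", "Second Person", "Masculine", "Singular"),
    ("a", "G", "First Person", "Common", "Singular"),
    ("ni", "G", "First Person", "Common", "Plural"),
    ("u", "D", "Third Person", "Masculine", "Singular"),
    ("tu", "D", "Second Person", "Masculine", "Singular"),  # assumed prefix
    ("nu", "D", "First Person", "Common", "Plural"),
]


def first_loop(pt):
    """Return provisional features and the remaining PT, or None on failure."""
    if len(pt) < 5:
        return None  # forms shorter than 5 characters fail automatically
    for prefix, stem, person, gender, number in PREFIX_RULES:
        if pt.startswith(prefix):
            return {
                "Stem": stem,
                "Person": person,
                "Gender": gender,
                "Grammatical Number": number,
                "PT": pt[len(prefix):],  # strip the whole prefix
            }
    return None  # no prefix matched: Pass = "Fail"


print(first_loop("nišaqal"))
```

Keeping the rules in a data table rather than a chain of if statements makes it easier to reorder or extend them as the draft changes.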

--Second loop
IF PT REGEXP 'ma$' THEN Clitic 4 = 'ma' AND PT = PT minus last two characters
IF PT REGEXP 'man$' THEN Clitic 4 = 'man' AND PT = PT minus last three characters
IF PT REGEXP 'min$' THEN Clitic 4 = 'min' AND PT = PT minus last three characters

--Third loop
IF PT REGEXP 'am$' THEN Clitic 1 = 'am' AND PT = PT minus last two characters
--ni (for subjunctive?)
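The second and third loops both strip an ending from the PT, which might look like this in Python. The function names are hypothetical, and the sketch assumes the -am ending is the ventive recorded in Clitic 1; the open question about -ni is left out.

```python
import re


def strip_enclitic(pt):
    """Second loop: remove a final -ma, -man or -min (Clitic 4)."""
    match = re.search(r"(man|min|ma)$", pt)
    if match:
        return pt[:match.start()], match.group(1)
    return pt, None


def strip_ventive(pt):
    """Third loop: remove a final -am (assumed to be the ventive, Clitic 1)."""
    if pt.endswith("am"):
        return pt[:-2], "am"
    return pt, None


rest, clitic_4 = strip_enclitic("išpuramma")
rest, clitic_1 = strip_ventive(rest)
print(rest, clitic_1, clitic_4)
```

Each function returns the shortened PT together with the clitic it found, so the loops can be chained in whatever order the final rules settle on.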

--Fourth loop
IF PT REGEXP 'ka$' THEN Clitic 2 Person = 'Second Person', Clitic 2 Grammatical Number = 'Singular', Clitic 2 Gender = 'Masculine', Clitic 2 Case = 'Accusative' AND PT = PT minus last 2 characters
IF PT REGEXP 'ki$' THEN Clitic 2 Person = 'Second Person', Clitic 2 Grammatical Number = 'Singular', Clitic 2 Gender = 'Feminine', Clitic 2 Case = 'Accusative' AND PT = PT minus last 2 characters
IF PT REGEXP 'kunuti$' THEN Clitic 2 Person = 'Second Person', Clitic 2 Grammatical Number = 'Plural', Clitic 2 Gender = 'Masculine', Clitic 2 Case = 'Dative' AND PT = PT minus last 6 characters
IF PT REGEXP 'kunu$' THEN Clitic 2 Person = 'Second Person', Clitic 2 Grammatical Number = 'Plural', Clitic 2 Gender = 'Masculine', Clitic 2 Case = 'Accusative' AND PT = PT minus last 4 characters
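As with the prefixes, the suffix-pronoun rules of the fourth loop fit a lookup table. The Python sketch below is illustrative only; `SUFFIX_RULES` and `fourth_loop` are hypothetical names, and only the four endings drafted so far are covered.

```python
# (suffix, Person, Grammatical Number, Gender, Case) for Clitic 2.
SUFFIX_RULES = [
    ("kunuti", "Second Person", "Plural", "Masculine", "Dative"),
    ("kunu", "Second Person", "Plural", "Masculine", "Accusative"),
    ("ka", "Second Person", "Singular", "Masculine", "Accusative"),
    ("ki", "Second Person", "Singular", "Feminine", "Accusative"),
]


def fourth_loop(pt):
    """Return (remaining PT, Clitic 2 features), or (pt, None) if no suffix matches."""
    for suffix, person, number, gender, case in SUFFIX_RULES:
        if pt.endswith(suffix):
            features = {
                "Clitic 2": suffix,
                "Clitic 2 Person": person,
                "Clitic 2 Grammatical Number": number,
                "Clitic 2 Gender": gender,
                "Clitic 2 Case": case,
            }
            return pt[:-len(suffix)], features
    return pt, None


print(fourth_loop("išpurkunuti"))
```

Longer endings are listed first so that future additions which share a tail with a shorter ending are still matched in full.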

Gertrudius commented 2 years ago

I've gotten a script written for step one. You didn't mention anything about logograms in step 1, so I left them as they are with the "." in between two of them.

Gertrudius commented 2 years ago

I've also done a rough cut of the logic for the first loop of step 3. There are a couple of issues there: we'll need a way to flag a given spelling as needing verbal logic applied to it, and there is some overlap between prefixes that cannot be resolved with an equally weighted if/elseif clause.

Gertrudius commented 2 years ago

So I've been working through some of the logic necessary for identifying verbal forms, and while doing that I had an idea for a quick and dirty way to maybe identify where some of these spellings go. I set up a script that applies the same logic as above to an array of all of the explicit_spellings from dictionary_spelling, then compared the results with the entries in the spreadsheet to see if there was any correspondence. You can see the results here: https://docs.google.com/spreadsheets/d/11n-CCstX43CijXGTd3ICV-kFIfs9NBibAcCkuoNsSho/edit?usp=sharing

edstratford commented 2 years ago

@Gertrudius -- you said you had scripted something to reduce forms for step one of this issue--is that scripted in PHP? Thanks

Gertrudius commented 2 years ago

Yeah, it's a PHP script.