iljackb / Mixtepec_Mixtec

Mostly XML (TEI) markup of Mixtepec-Mixtec Language resources
problems with tagging <m> within strings #90

Open iljackb opened 4 years ago

iljackb commented 4 years ago

In issue #88 we concluded that, to make the content more searchable and usable, we would not keep the <c>'s from the transcriptions: all <c>'s would be removed except those on a morpho-semantically significant tone, and those would be changed to <m>. That leaves the following structure:

            <u who="#TS" xml:id="d1e112" n="2" start="1.48" end="2.98" xml:lang="mix">
               <seg xml:lang="mix" xml:id="d1e113" notation="orth" type="S">
                  <w xml:id="d1e114" synch="#T14">sketa</w>
                  <w xml:id="d1e116" synch="#T19">ntikii</w>
               </seg>
               <seg xml:lang="mix" xml:id="d1e118" notation="ipa" type="S" sameAs="#d1e113">
                  <w xml:id="d1e119" synch="#T14" sameAs="#d1e114">skɛ<m xml:id="d1e225">˥</m>t̪a<m xml:id="d1e120">↘</m></w>
                  <w xml:id="d1e132" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
               </seg>
            </u>

However, while this is an improvement, it is still problematic: when searching for phonological content, it is not possible to search for full phonetic strings wherever there is an <m> (which also means that the tone encoded therein is particularly significant).

So there are three possible solutions I can envision:

  1. Live with it

  2. Copy the string into an attribute like @orig and search for phonetics in the attribute values (though that contradicts the usage in this project, in which I'm using @orig to keep track of where I've normalized)

  3. Make another copy of the IPA contents without the <m>'s. However, this raises the following questions:

    • These copies would have to be linked to either the orthographic or the original IPA contents. Which would be best to point to? Could we instead also have the orth <seg> point to the copy?

    • They would have to be typed, which is a problem given that @type is already used to classify the type of segment (thus @subtype wouldn't be consistent) and @notation is still ="ipa"

Below is an example in which I use @function="full" on the <seg> and which also points to the orthographic <seg>:

           <u who="#TS" xml:id="d1e112" n="2" start="1.48" end="2.98" xml:lang="mix">
              <seg xml:lang="mix" xml:id="d1e113" notation="orth" type="S">
                 <w xml:id="d1e114" synch="#T14">sketa</w>
                 <w xml:id="d1e116" synch="#T19">ntikii</w>
              </seg>
              <seg xml:lang="mix" xml:id="d1e118" notation="ipa" type="S" sameAs="#d1e113">
                 <w xml:id="d1e119" synch="#T14" sameAs="#d1e114">skɛ<m xml:id="d1e225">˥</m>t̪a<m xml:id="d1e120">↘</m></w>
                 <w xml:id="d1e132" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
              </seg>
              <seg xml:lang="mix" xml:id="d1e128" notation="ipa" type="S" sameAs="#d1e113" function="full">
                 <w xml:id="d1e129" synch="#T14" sameAs="#d1e114">skɛ˥t̪a↘</w>
                 <w xml:id="d1e142" synch="#T19" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
              </seg>            
           </u>

With this encoding, a search for all phonetic strings would have to match both @notation="ipa" and @function="full". To get the full phonetic string for a given word (to copy into a dictionary, for example), the query would have to match the same attributes and also follow the pointer to the @xml:id of a <w> that is a child of <seg notation="orth">.
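
As a rough sketch of the lookup described above, here is one way it could look in Python with the standard-library ElementTree rather than XQuery. The @function="full" attribute is the proposal from this issue, not an established TEI usage, and the sample assumes no TEI namespace is declared, as in the example. The helper name `orth_ipa_pairs` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the predeclared xml: prefix to this namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Trimmed copy of the example: orthographic <seg> plus the proposed "full" IPA <seg>.
SAMPLE = """<u who="#TS" xml:id="d1e112" xml:lang="mix">
  <seg xml:id="d1e113" notation="orth" type="S">
    <w xml:id="d1e114">sketa</w>
    <w xml:id="d1e116">ntikii</w>
  </seg>
  <seg xml:id="d1e128" notation="ipa" type="S" sameAs="#d1e113" function="full">
    <w xml:id="d1e129" sameAs="#d1e114">skɛ˥t̪a↘</w>
    <w xml:id="d1e142" sameAs="#d1e116">nd̪i↘kiː↘↗ꜛ</w>
  </seg>
</u>"""


def orth_ipa_pairs(u):
    """Pair each flattened IPA word with the orthographic <w> its @sameAs points to."""
    # Index the orthographic words by xml:id so pointers can be resolved.
    orth = {w.get(XML_ID): w.text for w in u.findall("./seg[@notation='orth']/w")}
    pairs = []
    # Match both conditions named in the issue: notation="ipa" and function="full".
    for w in u.findall("./seg[@notation='ipa'][@function='full']/w"):
        target = w.get("sameAs", "").lstrip("#")
        pairs.append((orth.get(target), w.text))
    return pairs


pairs = orth_ipa_pairs(ET.fromstring(SAMPLE))
```

Each tuple in `pairs` holds the orthographic form and its full phonetic string, which is roughly what a dictionary export would need.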

What do you think @Laurent?

laurentromary commented 4 years ago

Now that I think about it, hadn't we managed to implement an XSLT search that flattens strings?

iljackb commented 4 years ago

I have already done it myself! But the problem isn't how to do it, it's how to encode and annotate it in a way that allows for easy access but also maximally accurate annotation.

iljackb commented 4 years ago

Actually, I remember what you were talking about: it was something to retrieve the content, but it was based on searching for the translations. The goal, and the basis of this issue, is to figure out a way to search the Mixtec itself, specifically the phonetic and/or orthographic strings.

laurentromary commented 4 years ago

That's what I mean: if we can manage to search under decent conditions, I would rather not delete too much fine-grained markup...

laurentromary commented 4 years ago

It should be feasible to adapt the search function to flatten the content. I can see several techniques. Can you show me how you do it currently?

On 30 Oct 2019 at 11:44, Jack Bowers notifications@github.com wrote:

actually I remember what you were talking about it was something to retrieve the content, but it was based on searching for the translations, the goal is to be able to search the Mixtec, specifically the phonetic and/or orthographic strings.

iljackb commented 4 years ago

Sorry, I originally misunderstood your first comment. What I said I had done was just to make a flat copy, to convert the phonetics that had <c>'s around every character.

So the only thing I do to search the strings is basic XQuery (I generally use XQuery to search and only use XSLT to convert into another format). I search as follows, e.g.: //seg[@notation='ipa']/w[contains(.,'skɛ˥t̪a↘')] (which isn't possible unless I make that flattened copy).

laurentromary commented 4 years ago

So one possibility is to replace the “.” with a function that flattens the content of the element. This is where I see a technical solution. Do you know how to write a function? This would call (the empty string is significant since, by default, it is a white space).
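
A minimal sketch of this flattening idea, written in Python with the standard-library ElementTree so that it can be run as-is. It is not the function referred to above; the names `flatten` and `find_words` are hypothetical. The point it illustrates is the one made here: the descendant text is joined with an explicit empty separator, so the inline <m> markup does not split the searchable string. In XPath 2.0 the analogous expression would be along the lines of string-join(.//text(), '').

```python
import xml.etree.ElementTree as ET

# IPA <seg> from the example, with the tone marks wrapped in <m>.
SAMPLE = """<seg notation="ipa" type="S">
  <w xml:id="d1e119">skɛ<m xml:id="d1e225">˥</m>t̪a<m xml:id="d1e120">↘</m></w>
  <w xml:id="d1e132">nd̪i↘kiː↘↗ꜛ</w>
</seg>"""


def flatten(elem):
    """Join all descendant text with an empty separator, ignoring inline markup."""
    return "".join(elem.itertext())


def find_words(seg, needle):
    """Return the <w> children whose flattened text contains the search string."""
    return [w for w in seg.findall("./w") if needle in flatten(w)]


seg = ET.fromstring(SAMPLE)
# The search string spans two <m> boundaries in the first word.
hits = find_words(seg, "skɛ˥t̪a↘")
```

With this approach the fine-grained <m> markup can stay in the single IPA <seg>, and only the search step needs to know about flattening.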

iljackb commented 4 years ago

I wouldn't know how to do that. I assume this is with XSLT, not XQuery? I like making things XQuery-friendly because in Oxygen you can do 'search whole project' and it gathers results from files in different folders, but in XSLT you have to specify a single directory (unless I'm mistaken).

iljackb commented 4 years ago

I'm thinking it may also be possible to search using "string-join" in XQuery but I'm not sure yet...

laurentromary commented 4 years ago

That would be XPath, which is both XQuery and XSLT friendly.

laurentromary commented 4 years ago

I have not mastered XQuery, but I could check easily.
