sillsdev / ptx2pdf

XeTeX based macro package for typesetting USFM formatted (Paratext output) scripture files

Strongs matching doesn't play as nicely with "Match based on stems" for Biblical key terms as it could. #770

Open davidg-sil opened 2 years ago

davidg-sil commented 2 years ago

The 'Match based on stems' flag in Paratext reduces false-positive matches in key-terms checking: only word forms that have been marked as having a certain stem will match.

Unfortunately, for producing nice output in PTXprint, using this flag means that the stem appears in the Biblical terms renderings list with no asterisk or other marker. Paratext users can eventually get used to this, but it would confuse most people. For example, in Romani, 'give' is used in many phrasal renderings (such as 'give hand' = assist, 'give way' = release), and its stem is 'd-'. Thus the key term for release might have the rendering "d drom (give way)".

Paratext obviously contains code that searches the word entries and realises that 'd' is the stem in this case. It would be nice if Paratext provided some way to specify that this word is a stem, but since it doesn't, it would be good for PTXprint to notice the above-mentioned flag and provide some mechanism to look through the word forms and spot that 'd' should have a * or hyphen appended to it.

davidg-sil commented 2 years ago

How's this for hand-waving pseudo-code:

If the relevant flag is set, build a dict of stems from the wordforms database, recording whether each stem accepts prefixes, suffixes, or circumfixes (hopefully this is clear enough to keep everyone happy; otherwise ask the user, I guess!).
When strongs-ifying the lines from the key terms, find 'words' that might be stems.
- If there's one item, assume it is the relevant stem and attach *(s) as appropriate.
- If there's more than one, check some kind of 'I need help' editable list (with context), adding to it as needed.
- Also add to that editable list if there's a stem that can also stand freely, since it might be a totally different word.
Feed the result into the 'replace * with' replacements.
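
As a rough illustration of the marking step above, here is a minimal Python sketch. It is not PTXprint code: the function name `mark_stems`, the `free_words` argument, and the choice to leave a rendering unmarked whenever it is flagged for review are all assumptions made for the example. The stem dict is assumed to map each stem to the affix kinds it takes.

```python
def mark_stems(rendering, stems, free_words=frozenset()):
    """Return (marked_rendering, needs_review).

    stems: stem -> set of affix kinds it takes, e.g. {"d": {"Suffix"}}.
    free_words: forms that can also stand alone (hypothetical input).
    """
    # Keep any parenthesised gloss out of the stem search.
    head, sep, gloss = rendering.partition("(")
    tokens = head.split()
    candidates = [i for i, tok in enumerate(tokens) if tok in stems]

    ambiguous = len(candidates) > 1
    also_free = any(tokens[i] in free_words for i in candidates)
    needs_review = ambiguous or also_free

    # Only mark automatically when exactly one unambiguous stem is found.
    if len(candidates) == 1 and not needs_review:
        i = candidates[0]
        kinds = stems[tokens[i]]
        before = "*" if "Prefix" in kinds else ""
        after = "*" if "Suffix" in kinds else ""
        tokens[i] = f"{before}{tokens[i]}{after}"

    marked = " ".join(tokens)
    if sep:
        marked = f"{marked} {sep}{gloss}".strip()
    return marked, needs_review


print(mark_stems("d drom (give way)", {"d": {"Suffix"}}))
```

Renderings that come back with `needs_review` set would go to the editable 'I need help' list rather than being marked automatically.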

mhosken commented 2 years ago

Where is the data? What files? Please contribute example files. Currently this issue is not actionable.

I suspect that this information is held in FLEx. In that case, I rule it out of scope for PTXprint to interact with FLEx data. It may be that another tool can be written that will pull data out of FLEx in a form that we can use in PTXprint to enhance the string matching. PTXprint's matching capability is limited and automated. As soon as you start asking users to resolve issues, that places the problem beyond the capabilities that we want to see in PTXprint. Again, feel free to write a tool that goes through and interacts with the user to resolve all the ambiguities.

davidg-sil commented 2 years ago

The 'wordforms database' is in the Paratext project folder, at ~/Paratext9Projects/Project/WordAnalyses.xml. Example entries:

  <Entry Word="ćhudine">
    <Analysis>
      <Lexeme>Stem:ćhud</Lexeme>
      <Lexeme>Suffix:in</Lexeme>
      <Lexeme>Suffix:e</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="das">
    <Analysis>
      <Lexeme>Stem:d</Lexeme>
      <Lexeme>Suffix:as</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="dasas">
    <Analysis>
      <Lexeme>Stem:d</Lexeme>
      <Lexeme>Suffix:as</Lexeme>
      <Lexeme>Suffix:as</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="daxni">
    <Analysis>
      <Lexeme>Stem:daxn</Lexeme>
      <Lexeme>Suffix:i</Lexeme>
    </Analysis>
  </Entry>
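
For anyone wanting to experiment, entries shaped like these could be turned into a stem dict along the following lines. This is only a sketch under assumptions: the `<WordAnalyses>` root element and the function name `collect_stems` are made up for the example, and only the `Stem:` and `Suffix:` labels are confirmed by the entries above (any other label, such as a prefix, is simply recorded under whatever name it carries).

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample shaped like the entries above; the root element name is an assumption.
SAMPLE = """<WordAnalyses>
  <Entry Word="das">
    <Analysis>
      <Lexeme>Stem:d</Lexeme>
      <Lexeme>Suffix:as</Lexeme>
    </Analysis>
  </Entry>
  <Entry Word="daxni">
    <Analysis>
      <Lexeme>Stem:daxn</Lexeme>
      <Lexeme>Suffix:i</Lexeme>
    </Analysis>
  </Entry>
</WordAnalyses>"""


def collect_stems(xml_text):
    """Map each stem to the set of affix kinds seen alongside it."""
    stems = defaultdict(set)
    root = ET.fromstring(xml_text)
    for analysis in root.iter("Analysis"):
        # Each Lexeme looks like "Kind:form", e.g. "Stem:d" or "Suffix:as".
        parts = [
            lex.text.split(":", 1)
            for lex in analysis.iter("Lexeme")
            if lex.text and ":" in lex.text
        ]
        affix_kinds = {kind for kind, _ in parts if kind != "Stem"}
        for kind, form in parts:
            if kind == "Stem":
                stems[form].update(affix_kinds)
    return dict(stems)


print(collect_stems(SAMPLE))  # {'d': {'Suffix'}, 'daxn': {'Suffix'}}
```

Using `root.iter(...)` means the sketch does not depend on the exact name of the root element in the real file.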

Incomplete Rendering example (TermRenderings.xml)

 <TermRendering Id="ἀπολύω" Guess="false">
    <Renderings>d * drom (a da cuiva drumul / let go)</Renderings>