openscriptures / morphhb

Open Scriptures Hebrew Bible
https://hb.openscriptures.org

Ketiv/Qere without catchWord? #80

Closed jonathanrobie closed 2 years ago

jonathanrobie commented 3 years ago

In most instances of Ketiv/Qere, the pattern seems to be:

For instance:

          <w lemma="3318" morph="HVhv2ms" id="01Pdv">הוצא</w>
          <note type="variant">
            <catchWord>הוצא</catchWord>
            <rdg type="x-qere">
              <w lemma="3318" morph="HVhv2ms" id="01S7t">הַיְצֵ֣א</w>
            </rdg>
          </note>

Am I right to expect that for all readings?

The following 9 instances do not follow this pattern because the note does not contain a catchWord:

<note type="variant">
  <rdg type="x-qere">
    <w lemma="6635 b" n="0.0" morph="HNcbpa" id="12kTe">צְבָא֖וֹת</w>
  </rdg>
</note>
<note type="variant">
  <rdg type="x-qere">
    <w lemma="1121 a" n="1.1.0" morph="HNcmpc/Sp3ms" id="12uR9">בָּנָי/ו֙</w>
  </rdg>
</note>
<note  type="variant">
  <rdg type="x-qere">
    <w lemma="1121 a" morph="HNcmpc" id="07Kvd">בְּנֵ֣י</w>
  </rdg>
</note>
<note type="variant">
  <rdg type="x-qere">
    <w lemma="6578" n="0" morph="HNp" id="10dp9">פְּרָֽת</w>
  </rdg>
</note>
<note type="variant">
  <rdg type="x-qere">
    <w lemma="376" n="1.0" morph="HNcmsa" id="10Gmc">אִ֖ישׁ</w>
  </rdg>
</note>
<note type="variant">
  <rdg type="x-qere">
    <w lemma="935" n="1.0" morph="HVqrmpa" id="24hu2">בָּאִ֖ים</w>
  </rdg>
</note>
<note type="variant">
  <rdg type="x-qere">
    <w lemma="l" n="1.2.0" morph="HR/Sp3fs" id="24CRs">לָ/הּ֙</w>
  </rdg>
</note>
<note type="variant">
  <rdg type="x-qere">
    <w lemma="413" n="0.0" morph="HR/Sp1cs" id="08k23">אֵלַ֖/י</w>
  </rdg>
</note>
<note type="variant">
  <rdg type="x-qere">
    <w lemma="413" n="0.1" morph="HR/Sp1cs" id="086C3">אֵלַ֔/י</w>
  </rdg>
</note>
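
A minimal sketch of how such notes could be located, assuming Python's standard `xml.etree.ElementTree` and element names without a namespace prefix, as in the snippets above (the real files may declare a namespace, which would need to be added to the tag names). The wrapping `verse` and its `osisID` are made up for illustration; the two notes are copied from this thread.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper verse: the two notes come from different places in the
# thread and are combined here only to exercise both cases.
SAMPLE = """<verse osisID="Sample.1.1">
  <w lemma="3318" morph="HVhv2ms" id="01Pdv">הוצא</w>
  <note type="variant">
    <catchWord>הוצא</catchWord>
    <rdg type="x-qere">
      <w lemma="3318" morph="HVhv2ms" id="01S7t">הַיְצֵ֣א</w>
    </rdg>
  </note>
  <note type="variant">
    <rdg type="x-qere">
      <w lemma="6578" n="0" morph="HNp" id="10dp9">פְּרָֽת</w>
    </rdg>
  </note>
</verse>"""


def qere_without_catchword(root):
    """Return the ids of qere words whose variant note has no catchWord child."""
    ids = []
    for note in root.iter("note"):
        if note.get("type") != "variant":
            continue
        if note.find("catchWord") is None:
            ids.extend(w.get("id") for w in note.iter("w"))
    return ids


print(qere_without_catchword(ET.fromstring(SAMPLE)))
```

Run over a whole book, the same function would list every qere that has no ketiv to attach to.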
DavidTroidl commented 3 years ago

There are qere elements without corresponding ketiv in the WLC source.

jonathanrobie commented 3 years ago

What I really need right now is an algorithm to construct the Qere reading of a text. I'm sure someone else is doing that. Is this documented somewhere?

I also see 23 instances where the catchWord corresponds to more than one preceding w element, but it's not clear how I am supposed to identify the start and end of the Ketiv reading without using string operations and comparing as I go. Here are some:

  <out count="1" osisID="1Sam.9.1">
    <w lemma="c/1961" morph="HC/Vqw3ms" id="09wci">וַֽ/יְהִי</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
    <w lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="3225" morph="HNp" id="09jgC">ימין</w>
    <note type="variant">
      <catchWord>מ/בן־ימין</catchWord>
      <rdg type="x-qere">
        <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>
      </rdg>
    </note>
  </out>
  <out count="2" osisID="1Sam.20.2">
    <w lemma="c/559" morph="HC/Vqw3ms" id="09EWb">וַ/יֹּ֨אמֶר</w>
    <w lemma="l" morph="HR/Sp3ms" id="09R77">ל֣/וֹ</w>
    <w lemma="2486" n="1.2.0" morph="HTj/Sh" id="09rin">חָלִילָ/ה֮</w>
    <w lemma="3808" morph="HTn" id="09kCJ">לֹ֣א</w>
    <w lemma="4191" n="1.2" morph="HVqi2ms" id="09b5f">תָמוּת֒</w>
    <w lemma="2009" n="1.1.1.1" morph="HTm" id="091nK">הִנֵּ֡ה</w>
    <w lemma="l" morph="HR/Sp3ms" id="09UFD">ל/ו</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="6213 a" morph="HVqp3ms" id="09atM">עשה</w>
    <note type="variant">
      <catchWord>ל/ו־עשה</catchWord>
      <rdg type="x-qere">
        <w lemma="3808" morph="HTn" id="09uGu">לֹֽא</w>
        <seg type="x-maqqef">־</seg>
        <w lemma="6213 a" morph="HVqi3ms" id="09dRu">יַעֲשֶׂ֨ה</w>
      </rdg>
    </note>
  </out>
  <out count="3" osisID="1Sam.24.9">
    <note>KJV:1Sam.24.8</note>
    <w lemma="c/6965 b" morph="HC/Vqw3ms" id="09WJN">וַ/יָּ֨קָם</w>
    <w lemma="1732" n="1.1.1.0" morph="HNp" id="09BZ4">דָּוִ֜ד</w>
    <w lemma="310 a" morph="HR" id="09TCU">אַחֲרֵי</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="3651 c" n="1.1.1" morph="HD" id="09qjr">כֵ֗ן</w>
    <w lemma="c/3318" n="1.1.0" morph="HC/Vqw3ms" id="09wRN">וַ/יֵּצֵא֙</w>
    <w lemma="4480 a" morph="HR" id="0984p">מן</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="d/4631" morph="HTd/Ncfsa" id="09Mw5">ה/מערה</w>
    <note type="variant">
      <catchWord>מן־ה/מערה</catchWord>
      <rdg type="x-qere">
        <w lemma="m/d/4631" n="1.1" morph="HR/Td/Ncfsa" id="098Qr">מֵֽ/הַ/מְּעָרָ֔ה</w>
      </rdg>
    </note>
  </out>
  <out count="4" osisID="Isa.44.24">
    <w lemma="3541" morph="HD" id="23SM6">כֹּֽה</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="559" morph="HVqp3ms" id="237aJ">אָמַ֤ר</w>
    <w lemma="3068" n="1.1.0" morph="HNp" id="232XC">יְהוָה֙</w>
    <w lemma="1350 a" n="1.1" morph="HVqrmsc/Sp2ms" id="23g2n">גֹּאֲלֶ֔/ךָ</w>
    <w lemma="c/3335" n="1.0" morph="HC/Vqrmsc/Sp2ms" id="23BLL">וְ/יֹצֶרְ/ךָ֖</w>
    <w lemma="m/990" n="1" morph="HR/Ncfsa" id="23jck">מִ/בָּ֑טֶן</w>
    <w lemma="595" morph="HPp1cs" id="23Mxs">אָנֹכִ֤י</w>
    <w lemma="3068" n="0.2.0" morph="HNp" id="23y1n">יְהוָה֙</w>
    <w lemma="6213 a" morph="HVqrmsa" id="23FQH">עֹ֣שֶׂה</w>
    <w lemma="3605" n="0.2" morph="HNcmsa" id="23Xv1">כֹּ֔ל</w>
    <w lemma="5186" morph="HVqrmsa" id="23yTs">נֹטֶ֤ה</w>
    <w lemma="8064" n="0.1.0" morph="HNcmpa" id="23ZN9">שָׁמַ֨יִם֙</w>
    <w lemma="l/905" n="0.1" morph="HR/Ncmsc/Sp1cs" id="237wi">לְ/בַדִּ֔/י</w>
    <w lemma="7554" morph="HVqrmsc" id="23ogM">רֹקַ֥ע</w>
    <w lemma="d/776" n="0.0" morph="HTd/Ncbsa" id="23sdY">הָ/אָ֖רֶץ</w>
    <w lemma="4325" morph="HNcmpc" id="23Pzf">מי</w>
    <w lemma="854" morph="HR/Sp1cs" id="23yH3">את/י</w>
    <note type="variant">
      <catchWord>מי את/י</catchWord>
      <rdg type="x-qere">
        <w lemma="m/854" n="0" morph="HR/R/Sp1cs" id="23BvR">מֵ/אִתִּֽ/י</w>
      </rdg>
    </note>
  </out>
  <out count="5" osisID="Isa.52.5">
    <w lemma="c/6258" morph="HC/D" id="23c5v">וְ/עַתָּ֤ה</w>
    <w lemma="4100" morph="HTi" id="235WL">מי</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="l" morph="HR/Sp1cs" id="23JfN">ל/י</w>
    <note type="variant">
      <catchWord>מי־ל/י</catchWord>
      <rdg type="x-qere">
        <w lemma="4100" morph="HTi" id="23VCc">מַה</w>
        <seg type="x-maqqef">־</seg>
        <w lemma="l" morph="HR/Sp1cs" id="23aeF">לִּ/י</w>
      </rdg>
    </note>
  </out>
  <out count="6" osisID="2Chr.34.6">
    <w lemma="c/b/5892 b" morph="HC/R/Ncfpc" id="14Mu8">וּ/בְ/עָרֵ֨י</w>
    <w lemma="4519" morph="HNp" id="14kJu">מְנַשֶּׁ֧ה</w>
    <w lemma="c/669" n="1.0.0" morph="HC/Np" id="143VV">וְ/אֶפְרַ֛יִם</w>
    <w lemma="c/8095" n="1.0" morph="HC/Np" id="14yH2">וְ/שִׁמְע֖וֹן</w>
    <w lemma="c/5704" morph="HC/R" id="14xfs">וְ/עַד</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="5321" n="1" morph="HNp" id="14TB9">נַפְתָּלִ֑י</w>
    <w lemma="b/2022" morph="HR/Ncmsc" id="14iDc">ב/הר</w>
    <w lemma="1004 b" morph="HNcmpc/Sp3mp" id="14rFV">בתי/הם</w>
    <note type="variant">
      <catchWord>ב/הר בתי/הם</catchWord>
      <rdg type="x-qere">
        <w lemma="b/2719" n="0.0" morph="HR/Ncfpc/Sp3mp" id="14CXE">בְּ/חַרְבֹתֵי/הֶ֖ם</w>
      </rdg>
    </note>
  </out>
  <out count="7" osisID="2Kgs.6.25">
    <w lemma="c/1961" morph="HC/Vqw3ms" id="12sGR">וַ/יְהִ֨י</w>
    <w lemma="7458" morph="HNcmsa" id="12FKm">רָעָ֤ב</w>
    <w lemma="1419 a" n="1.1.0" morph="HAamsa" id="12eb4">גָּדוֹל֙</w>
    <w lemma="b/8111" n="1.1" morph="HR/Np" id="12fqb">בְּ/שֹׁ֣מְר֔וֹן</w>
    <w lemma="c/2009" n="1.0" morph="HC/Tm" id="12DUs">וְ/הִנֵּ֖ה</w>
    <w lemma="6696 a" morph="HVqrmpa" id="12NwS">צָרִ֣ים</w>
    <w lemma="5921 a" n="1" morph="HR/Sp3fs" id="12SdY">עָלֶ֑י/הָ</w>
    <w lemma="5704" morph="HR" id="121bU">עַ֣ד</w>
    <w lemma="1961" morph="HVqc" id="12jno">הֱי֤וֹת</w>
    <w lemma="7218 a" morph="HNcmsc" id="12E7d">רֹאשׁ</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="2543" n="0.1.0" morph="HNcbsa" id="12z2B">חֲמוֹר֙</w>
    <w lemma="b/8084" morph="HR/Acbpa" id="12WGp">בִּ/שְׁמֹנִ֣ים</w>
    <w lemma="3701" n="0.1" morph="HNcmsa" id="12CRu">כֶּ֔סֶף</w>
    <w lemma="c/7255" n="0.0.0" morph="HC/Ncmsc" id="12fwL">וְ/רֹ֛בַע</w>
    <w lemma="d/6894" morph="HTd/Ncmsa" id="12FAr">הַ/קַּ֥ב</w>
    <w lemma="2755" morph="HNcmsc" id="12TtK">חרי</w>
    <w lemma="3123" morph="HNcfpa" id="12vw5">יונים</w>
    <note type="variant">
      <catchWord>חרייונים</catchWord>
      <rdg type="x-qere">
        <w lemma="1686" n="0.0" morph="HNcmpa" id="12Bqs">דִּבְיוֹנִ֖ים</w>
      </rdg>
    </note>
  </out>
  <out count="10" osisID="Ezek.42.9">
    <w lemma="c/m/8478" morph="HC/R/R/Sd" id="26Mwv">ו/מ/תחת/ה</w>
    <w lemma="3957" morph="HNcfpa" id="26DfV">לשכות</w>
    <note type="variant">
      <catchWord>ו/מ/תחת/ה לשכות</catchWord>
      <rdg type="x-qere">
        <w lemma="c/m/8478" n="1.0" morph="HC/R/R" id="262Lg">וּ/מִ/תַּ֖חַת</w>
        <w lemma="d/3957" morph="HTd/Ncfpa" id="26CYE">הַ/לְּשָׁכ֣וֹת</w>
      </rdg>
    </note>
  </out>

  <out count="11" osisID="Judg.16.25">
    <w lemma="c/1961" n="1.2.0" morph="HC/Vqw3ms" id="07JGX">וַֽ/יְהִי֙</w>
    <w lemma="3588 a" morph="HC" id="07rb5">כי</w>
    <w lemma="2896 a" morph="HVqp3ms" id="07saF">טוב</w>
    <note type="variant">
      <catchWord>כי טוב</catchWord>
      <rdg type="x-qere">
        <w lemma="k/2896 a" morph="HR/Vqc" id="07URH">כְּ/ט֣וֹב</w>
      </rdg>
    </note>
  </out>
  <out count="12" osisID="2Sam.21.12">
    <w lemma="c/3212" morph="HC/Vqw3ms" id="10ngQ">וַ/יֵּ֣לֶךְ</w>
    <w lemma="1732" n="1.2.2" morph="HNp" id="106FS">דָּוִ֗ד</w>
    <w lemma="c/3947" n="1.2.1.0" morph="HC/Vqw3ms" id="106U7">וַ/יִּקַּ֞ח</w>
    <w lemma="853" morph="HTo" id="1075S">אֶת</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="6106" morph="HNcfpc" id="10Fo2">עַצְמ֤וֹת</w>
    <w lemma="7586" n="1.2.1" morph="HNp" id="10Cb1">שָׁאוּל֙</w>
    <w lemma="c/853" morph="HC/To" id="10z1M">וְ/אֶת</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="6106" n="1.2.0" morph="HNcfpc" id="10e7f">עַצְמוֹת֙</w>
    <w lemma="3083" morph="HNp" id="10pst">יְהוֹנָתָ֣ן</w>
    <w lemma="1121 a" n="1.2" morph="HNcmsc/Sp3ms" id="10THR">בְּנ֔/וֹ</w>
    <w lemma="m/854" n="1.1" morph="HR/R" id="10aXS">מֵ/אֵ֕ת</w>
    <w lemma="1167" n="1.0" morph="HNcmpc" id="10JeL">בַּעֲלֵ֖י</w>
    <w lemma="3003" morph="HNp" id="10Kff">יָבֵ֣ישׁ</w>
    <w lemma="1568" n="1" morph="HNp" id="10YxZ">גִּלְעָ֑ד</w>
    <w lemma="834 a" morph="HTr" id="10jf2">אֲשֶׁר֩</w>
    <w lemma="1589" morph="HVqp3cp" id="10cuN">גָּנְב֨וּ</w>
    <w lemma="853" n="0.1.1.0" morph="HTo/Sp3mp" id="109eN">אֹתָ֜/ם</w>
    <w lemma="m/7339" morph="HR/Ncfsc" id="10rVU">מֵ/רְחֹ֣ב</w>
    <w lemma="1052+" morph="HNp HNp HTr" id="103wq">בֵּֽית שַׁ֗ן אֲשֶׁ֨ר</w>
    <seg type="x-maqqef">־</seg>
    <w lemma="8518" morph="HVqp3cp/Sp3mp" id="104P3">תלו/ם</w>
    <note type="variant">
      <catchWord>תלו/ם</catchWord>
      <rdg type="x-qere">
        <w lemma="8518" morph="HVqp3cp/Sp3mp" id="102PP">תְּלָא֥וּ/ם</w>
      </rdg>
    </note>
    <w lemma="8033" morph="HD" id="10nn6">שם</w>
    <w lemma="d/6430" morph="HTd/Ngmpa" id="10srQ">ה/פלשתים</w>
    <note type="variant">
      <catchWord>שם ה/פלשתים</catchWord>
      <rdg type="x-qere">
        <w lemma="8033" n="0.1.0" morph="HD/Sd" id="10gnL">שָׁ֨מָּ/ה֙</w>
        <w lemma="6430" n="0.1" morph="HNgmpa" id="10u2Q">פְּלִשְׁתִּ֔ים</w>
      </rdg>
    </note>
  </out>
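
For reference, the backward matching described above could be sketched as follows. This is only an illustration in Python (`xml.etree.ElementTree`, no namespace assumed), not code from the repository; the function name `ketiv_elements` is hypothetical. It strips spaces from the catchWord and consumes preceding siblings from the end until the string is used up.

```python
import xml.etree.ElementTree as ET

SAM_9_1 = """<verse osisID="1Sam.9.1">
  <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
  <w lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-maqqef">־</seg>
  <w lemma="3225" morph="HNp" id="09jgC">ימין</w>
  <note type="variant">
    <catchWord>מ/בן־ימין</catchWord>
    <rdg type="x-qere">
      <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>
    </rdg>
  </note>
</verse>"""

ISA_44_24 = """<verse osisID="Isa.44.24">
  <w lemma="d/776" n="0.0" morph="HTd/Ncbsa" id="23sdY">הָ/אָ֖רֶץ</w>
  <w lemma="4325" morph="HNcmpc" id="23Pzf">מי</w>
  <w lemma="854" morph="HR/Sp1cs" id="23yH3">את/י</w>
  <note type="variant">
    <catchWord>מי את/י</catchWord>
    <rdg type="x-qere">
      <w lemma="m/854" n="0" morph="HR/R/Sp1cs" id="23BvR">מֵ/אִתִּֽ/י</w>
    </rdg>
  </note>
</verse>"""


def ketiv_elements(parent, note):
    """Return the siblings before `note` whose text spells out its catchWord."""
    catch = note.findtext("catchWord")
    if catch is None:
        return []  # a qere with no ketiv: nothing to match
    remaining = catch.replace(" ", "")  # word spaces are not in the element text
    siblings = list(parent)
    found = []
    i = siblings.index(note) - 1
    while remaining and i >= 0:
        text = siblings[i].text or ""
        if not text or not remaining.endswith(text):
            return []  # mismatch: give up rather than guess
        remaining = remaining[: len(remaining) - len(text)]
        found.append(siblings[i])
        i -= 1
    return list(reversed(found)) if not remaining else []


for source in (SAM_9_1, ISA_44_24):
    verse = ET.fromstring(source)
    note = verse.find("note")
    print([e.get("id") for e in ketiv_elements(verse, note)])
```

The maqqef `seg` has no id, so it shows up as `None` in the first list. The need for this kind of string comparison is exactly what an explicit Ketiv marker would remove.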
jonathanrobie commented 3 years ago

Wouldn't this be a lot easier if all readings were put in an app element with a rdg element for each reading?

<app>
      <rdg type="x-ketiv">
        <w lemma="3588 a" morph="HC" id="07rb5">כי</w>
        <w lemma="2896 a" morph="HVqp3ms" id="07saF">טוב</w>
      </rdg>
      <rdg type="x-qere">
        <w lemma="k/2896 a" morph="HR/Vqc" id="07URH">כְּ/ט֣וֹב</w>
      </rdg>
</app>
DavidTroidl commented 3 years ago
is not an OSIS element. This would also disrupt the choice of using elements for the qere, because that is really what they are. I checked the schema and found that @type is allowed on elements. So I would propose putting @type="x-ketiv" on the ketiv elements. If that seems reasonable, I may take some time to implement.
DavidTroidl commented 3 years ago

Sorry, the elements disappeared. app is not an OSIS element. We are using note elements for qere, because that is what they are. @type is allowed on a w element, so we could use @type="x-ketiv". Does that make sense?

jonathanrobie commented 3 years ago

Yes, I think that could work, if I understand correctly. As far as timing goes, it's possible that I can produce this and do a pull request, not promising just yet, but that could happen.

You are proposing this?

    <w lemma="c/m/8478" morph="HC/R/R/Sd" id="26Mwv"  type="x-ketiv">ו/מ/תחת/ה</w>
    <w lemma="3957" morph="HNcfpa" id="26DfV"  type="x-ketiv">לשכות</w>
    <note type="variant">
      <catchWord>ו/מ/תחת/ה לשכות</catchWord>
      <rdg type="x-qere">
        <w lemma="c/m/8478" n="1.0" morph="HC/R/R" id="262Lg">וּ/מִ/תַּ֖חַת</w>
        <w lemma="d/3957" morph="HTd/Ncfpa" id="26CYE">הַ/לְּשָׁכ֣וֹת</w>
      </rdg>
    </note>
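
With that marking, selecting the Ketiv would no longer need the catchWord at all. A small sketch, assuming Python's `xml.etree.ElementTree` and no namespace on the element names; the trailing unmarked word is added from the same verse so that both cases appear.

```python
import xml.etree.ElementTree as ET

VERSE = """<verse osisID="Ezek.42.9">
  <w lemma="c/m/8478" morph="HC/R/R/Sd" id="26Mwv" type="x-ketiv">ו/מ/תחת/ה</w>
  <w lemma="3957" morph="HNcfpa" id="26DfV" type="x-ketiv">לשכות</w>
  <note type="variant">
    <catchWord>ו/מ/תחת/ה לשכות</catchWord>
    <rdg type="x-qere">
      <w lemma="c/m/8478" n="1.0" morph="HC/R/R" id="262Lg">וּ/מִ/תַּ֖חַת</w>
      <w lemma="d/3957" morph="HTd/Ncfpa" id="26CYE">הַ/לְּשָׁכ֣וֹת</w>
    </rdg>
  </note>
  <w lemma="d/428" n="1" morph="HTd/Pdxcp" id="26FUJ">הָ/אֵ֑לֶּה</w>
</verse>"""

root = ET.fromstring(VERSE)

# Direct children only: words inside the note's rdg are not matched by "w".
ketiv_ids = [w.get("id") for w in root.findall("w[@type='x-ketiv']")]
plain_ids = [w.get("id") for w in root.findall("w") if w.get("type") != "x-ketiv"]

print(ketiv_ids)
print(plain_ids)
```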
DavidTroidl commented 3 years ago

@ just stands for attribute, so it would be type="x-ketiv". On the 1Sam.9.1 example, I think you would also have to do the maqqef: <seg type="x-ketiv">־</seg>

jonathanrobie commented 3 years ago

Sorry, the elements disappeared. app is not an OSIS element. We are using note elements for qere, because that is what they are. @type is allowed on a w element, so we could use @type="x-ketiv". Does that make sense?

rdgGroup is an OSIS element. Could we use that instead of app?

Perhaps:

   <verse osisID="Ezek.42.9">
      <rdgGroup type="variant">
         <rdg type="x-ketiv">
            <w lemma="c/m/8478" morph="HC/R/R/Sd" id="26Mwv">ו/מ/תחת/ה</w>
            <w lemma="3957" morph="HNcfpa" id="26DfV">לשכות</w>            
         </rdg>
         <rdg type="x-qere">
            <w lemma="c/m/8478" n="1.0" morph="HC/R/R" id="262Lg">וּ/מִ/תַּחַת</w>
            <w lemma="d/3957" morph="HTd/Ncfpa" id="26CYE">הַ/לְּשָׁכ֣וֹת</w>
         </rdg>
      </rdgGroup>
      <w lemma="d/428" n="1" morph="HTd/Pdxcp" id="26FUJ">הָ/אֵ֑לֶּה</w>
      <rdgGroup type="variant">
         <rdg type="x-ketiv">
            <w lemma="d/3996" morph="HTd/Ncmsa" id="26pCg">ה/מבוא</w>
         </rdg>
         <rdg type="x-qere">
            <w lemma="d/935" n="0.2.0" morph="HTd/Ncmsa" id="26wTv">הַ/מֵּבִיא֙</w>
         </rdg>
      </rdgGroup>
      <w lemma="m/d/6921" n="0.2" morph="HR/Td/Ncmsa" id="26rrD">מֵֽ/הַ/קָּדִ֔ים</w>
      <w lemma="b/935" morph="HR/Vqc/Sp3ms" id="261fi">בְּ/בֹא֣/וֹ</w>
      <w lemma="l/2007" n="0.1" morph="HR/Pp3fp" id="26i6h">לָ/הֵ֔נָּה</w>
      <w lemma="m/d/2691 a" n="0.0" morph="HR/Td/Ncbsa" id="26rx5">מֵֽ/הֶ/חָצֵ֖ר</w>
      <w lemma="d/2435" n="0" morph="HTd/Aafsa" id="26sM6">הַ/חִצֹנָֽה</w>
      <seg type="x-sof-pasuq">׃</seg>
   </verse>   
DavidTroidl commented 3 years ago

The rdgGroup is not allowed as a direct child of a verse element. It appears to be designed to be inside a note element. In addition, I would really prefer the format as it stands. Adding the type="x-ketiv" should resolve the issue.

jonathanrobie commented 3 years ago

OK, I think I will try to create a patch that makes that change. I am on vacation until Monday, then on the road on Tuesday and Wednesday, so I may or may not get to it before then.

Jonathan

On Fri, Jul 9, 2021 at 7:10 PM David Troidl @.***> wrote:

The rdgGroup is not allowed as a direct child of a verse element. It appears to be designed to be inside a note element. In addition, I would really prefer the format as it stands. Adding the type="x-ketiv" should resolve the issue.


jonathanrobie commented 2 years ago

I have a transformation that seems to work, but there's a problem with maqqef, which already uses the @type attribute.

I can't do this because an element cannot have two type attributes:

  <w type="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-ketiv" type="x-maqqef">־</seg>
  <w type="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>

I could provide one type attribute with two tokens, but that's likely to mess up existing software if it is assuming there can only be one value in the type attribute:

  <w type="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-ketiv x-maqqef">־</seg>
  <w type="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>

Or I could simply leave out the ketiv marking for the maqqef:

  <w type="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-maqqef">־</seg>
  <w type="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>

Or I could add some other attribute. Certainly not k, but let me use that as a placeholder to illustrate what this would look like:

<verse osisID="1Sam.9.1">
  <w lemma="c/1961" morph="HC/Vqw3ms" id="09wci">וַֽ/יְהִי</w>
  <seg type="x-maqqef">־</seg>
  <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
  <w k="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg k="x-ketiv" type="x-maqqef">־</seg>
  <w k="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>
  <note type="variant">
    <catchWord>מ/בן־ימין</catchWord>
    <rdg type="x-qere">
      <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>
    </rdg>
  </note>
  <note n="a">Adaptations to a Qere which L and BHS, by their design, do not indicate.</note>

Is there a suitable attribute name to use instead of k here? Is there a better option that I have not considered?

I can generate a patch quickly once we agree on the format.

DavidTroidl commented 2 years ago

I would think the best way to do it would be the two token option:

  <w type="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-ketiv x-maqqef">־</seg>
  <w type="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>

If you feel that would be problematic, the next best option would be to leave it out altogether. As long as that doesn't hinder your efforts. One thing to watch is that you maintain the white space within the verse elements.

pdurusau commented 2 years ago

David, can you say a word or two about why maqqef has a type attribute for maqqef? Maqqef is U+05BE so why is the type attribute required on the seg that encloses it? Apologies if this is well known but Jonathan asked me to take a look at the issue and that would free up the type attribute for x-ketiv with one token.

DavidTroidl commented 2 years ago

It is essentially the same reason Jonathan wants to identify the ketiv elements. We want to be able to deal with the OSIS elements without having to analyze the character content.

pdurusau commented 2 years ago

David, likely my bad but a maqqef in a ketiv doesn't have any character content to be analyzed does it? If I understand your rule, you want to avoid the sometimes problematic parsing of Hebrew by assigning it a fixed value? As an attribute. Yes?

OK, that's understandable but a maqqef, in my ignorance, has no such other parsing. There may be traditions where it has varying meanings, I simply mean to say I am not aware of them.

It would be the difference in part of speech for words in English, your non-parsing of character content, and separating marking a comma with an element, using an attribute that reads k="comma". I'm not seeing what the consistency adds, sorry.

jonathanrobie commented 2 years ago

I think I agree with Patrick. Each seg type simply names the character that the element contains. I think it's just as easy to test the character as the attribute.

  <seg type="x-maqqef">־</seg>
  <seg type="x-sof-pasuq">׃</seg>
  <seg type="x-pe">פ</seg>

In a path expression, I can say if (seg='פ') or I can say if (seg/@type='x-pe'), they are equivalent, and I don't think the one that uses the @type attribute is simpler.
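
The same comparison can be tried outside XQuery. A small sketch in Python (`xml.etree.ElementTree`), using a bare `verse` wrapper that is made up for the example:

```python
import xml.etree.ElementTree as ET

SEGS = ET.fromstring(
    "<verse>"
    '<seg type="x-maqqef">־</seg>'
    '<seg type="x-sof-pasuq">׃</seg>'
    '<seg type="x-pe">פ</seg>'
    "</verse>"
)

# Test the character content directly.
by_text = [s for s in SEGS.findall("seg") if s.text == "פ"]
# Test the attribute instead.
by_type = SEGS.findall("seg[@type='x-pe']")

print(by_text == by_type)
```

Both selections pick out the same single element, which is the sense in which the two tests are equivalent here.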

To me, that's different from the Ketiv case. If I want to ignore the Ketiv and follow the Qere, I have to parse the catchWord and work backwards to identify the elements that correspond to it using string operations. Nothing explicitly marks the elements that correspond to the Ketiv reading. The query I do to identify these nodes is more complex than if (seg='פ'), and I don't want to do this every time I need to ignore the Ketiv in a query. Or is there a simpler trick I am missing?

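(: Walk backwards from $base, collecting the preceding siblings whose
   string values make up the end of $catchword. :)
declare function local:get-ketiv($base, $catchword)
{
  let $prev := $base/preceding-sibling::*[1]
  let $prevstring := fn:string($prev)
  where $prev and fn:ends-with($catchword, $prevstring)
  return (
    $prev
    ,
    if ($prevstring != $catchword)
    (: normalize-space drops the word space left at the end of the remainder :)
    then local:get-ketiv($prev, fn:normalize-space(fn:substring($catchword, 1, fn:string-length($catchword) - fn:string-length($prevstring))))
    else ()
  )
};

(: Append the "x-ketiv" token to the type attribute of each Ketiv element. :)
declare updating function local:mark-ketiv($variant)
{
  for $ketiv in local:get-ketiv($variant, $variant/catchWord)
  return (
    delete node $ketiv/@type,
    insert node attribute type { fn:string-join(($ketiv/@type, "x-ketiv")," ") } into $ketiv
  )
};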
jonathanrobie commented 2 years ago

David, likely my bad but a maqqef in a ketiv doesn't have any character content to be analyzed does it? If I understand your rule, you want to avoid the sometimes problematic parsing of Hebrew by assigning it a fixed value? As an attribute. Yes?

OK, that's understandable but a maqqef, in my ignorance, has no such other parsing. There may be traditions where it has varying meanings, I simply mean to say I am not aware of them.

It would be the difference in part of speech for words in English, your non-parsing of character content, and separating marking a comma with an element, using an attribute that reads k="comma". I'm not seeing what the consistency adds, sorry.

I wonder if @DavidTroidl 's concern is CSS selectors, which work with attribute values but not element content? But in that case, I would think Ketiv would also be important. Ketiv is often formatted differently.

DavidTroidl commented 2 years ago

These arguments seem to make sense, but the point was raised earlier about necessitating a rewrite of existing applications. I don't think a second token on the maqqef type should have any impact, but the suggested changes certainly would. Testing an attribute value for x-maqqef or for x-ketiv should both work.

jonathanrobie commented 2 years ago

I can give you a pull request that does it that way. That meets my actual need, making it easy to filter out the Ketiv and follow the Qere.

The issue with existing applications is this: depending on how you parse and how you test, existing code may break when multiple tokens are put into a single attribute. Most applications do not expect that. But it's easy enough to fix such applications.
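
To make the risk concrete, here is a sketch of the two kinds of test an application might use. The function names are hypothetical and do not come from any existing OSHB tool.

```python
def has_type_exact(value, wanted):
    """Brittle test: assumes the attribute holds exactly one value."""
    return value == wanted


def has_type_token(value, wanted):
    """Tolerant test: treats the attribute as a whitespace-separated list."""
    return wanted in (value or "").split()


single = "x-maqqef"
double = "x-ketiv x-maqqef"

print(has_type_exact(single, "x-maqqef"), has_type_exact(double, "x-maqqef"))
print(has_type_token(single, "x-maqqef"), has_type_token(double, "x-maqqef"))
```

Code written in the first style stops recognizing the maqqef once a second token is added; code written in the second style keeps working.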

If OSIS allowed it, I would still prefer an explicit parallel reading, but this is your repository, your design sense should rule.

DavidTroidl commented 2 years ago

I apologize for my mistake. Unfortunately the two tokens on the @type do not validate. Can we just drop the x-ketiv from the maqqef?

jonathanrobie commented 2 years ago

I apologize for my mistake. Unfortunately the two tokens on the @type do not validate.

Ouch. I should have validated before issuing a pull request, my bad.

Can we just drop the x-ketiv from the maqqef?

If we do, neither CSS nor programs have an easy way to see exactly what is in the Ketiv vs. Qere. To me, the main use cases for marking up Ketiv or Qere are:

Am I missing important use cases? For my syntax trees and queries I generally want to follow the Qere as the main reading. If we want to make this possible and make it easy to format like the editions Wikipedia describes, we need to mark the Ketiv somehow. I'm trying to get a feel for what others do here. According to Wikipedia:

Modern editions of the Chumash and Tanakh include information about the qere and ketiv, but with varying formatting, even among books from the same publisher. Usually, the qere is written in the main text with its vowels, and the ketiv is in a side- or footnote (as in the Gutnick and Stone editions of the Chumash, from Kol Menachem[15] and Artscroll,[16] respectively). Other times, the ketiv is indicated in brackets, in-line with the main text (as in the Rubin edition of the Prophets, also from Artscroll).

If we want to follow that approach, then we could put the Ketiv in a note and put the Qere in the mainline text, but I think you would probably dislike that because you think of the Ketiv as the main reading and the Qere as mere commentary. For the record, that would look like this, and it would make it easy to format a text as described above:

  <w lemma="c/1961" morph="HC/Vqw3ms" id="09wci">וַֽ/יְהִי</w>
  <seg type="x-maqqef">־</seg>
  <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
  <note type="variant">
    <rdg type="x-ketiv">
      <w lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
      <seg type="x-maqqef">־</seg>
      <w lemma="3225" morph="HNp" id="09jgC">ימין</w>
    </rdg>
  </note>
  <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>

Another approach would be to get rid of @type attributes that simply say that a maqqef is a maqqef, a sof-pasuq is a sof-pasuq, a samekh is a samekh, a pe is a pe, etc. They do not add anything to the semantics. Are they needed for CSS selectors? They certainly are not needed for programs or queries. And for CSS selectors, I would think it is more important to say whether it is Ketiv or not, since Ketiv is often in a different font. If we do that, it would look like this:

  <w lemma="c/1961" morph="HC/Vqw3ms" id="09wci">וַֽ/יְהִי</w>
  <seg type="x-maqqef">־</seg>
  <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
  <w type="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-ketiv">־</seg>
  <w type="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>
  <note type="variant">
    <catchWord>מ/בן־ימין</catchWord>
    <rdg type="x-qere">
      <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>
    </rdg>
  </note>

To me, the cleanest approach would be to wrap both the Ketiv and Qere in rdg elements, but not in a note. @pdurusau, you are the OSIS expert, is there a way to do that? I suggested this earlier, but the rdgGroup element was not allowed there in the OSIS schema:

   <verse osisID="Ezek.42.9">
      <rdgGroup type="variant">
         <rdg type="x-ketiv">
            <w lemma="c/m/8478" morph="HC/R/R/Sd" id="26Mwv">ו/מ/תחת/ה</w>
            <w lemma="3957" morph="HNcfpa" id="26DfV">לשכות</w>            
         </rdg>
         <rdg type="x-qere">
            <w lemma="c/m/8478" n="1.0" morph="HC/R/R" id="262Lg">וּ/מִ/תַּחַת</w>
            <w lemma="d/3957" morph="HTd/Ncfpa" id="26CYE">הַ/לְּשָׁכ֣וֹת</w>
         </rdg>
      </rdgGroup>
      <w lemma="d/428" n="1" morph="HTd/Pdxcp" id="26FUJ">הָ/אֵ֑לֶּה</w>
      <rdgGroup type="variant">
         <rdg type="x-ketiv">
            <w lemma="d/3996" morph="HTd/Ncmsa" id="26pCg">ה/מבוא</w>
         </rdg>
         <rdg type="x-qere">
            <w lemma="d/935" n="0.2.0" morph="HTd/Ncmsa" id="26wTv">הַ/מֵּבִיא֙</w>
         </rdg>
      </rdgGroup>
      <w lemma="m/d/6921" n="0.2" morph="HR/Td/Ncmsa" id="26rrD">מֵֽ/הַ/קָּדִ֔ים</w>
      <w lemma="b/935" morph="HR/Vqc/Sp3ms" id="261fi">בְּ/בֹא֣/וֹ</w>
      <w lemma="l/2007" n="0.1" morph="HR/Pp3fp" id="26i6h">לָ/הֵ֔נָּה</w>
      <w lemma="m/d/2691 a" n="0.0" morph="HR/Td/Ncbsa" id="26rx5">מֵֽ/הֶ/חָצֵ֖ר</w>
      <w lemma="d/2435" n="0" morph="HTd/Aafsa" id="26sM6">הַ/חִצֹנָֽה</w>
      <seg type="x-sof-pasuq">׃</seg>
   </verse>   

Crosswire uses a seg for marking variants. I tried this in Amos, and the following seems to validate just fine:

        <verse osisID="Amos.9.6">
          <w lemma="d/1129" morph="HTd/Vqrmsa" id="30Rr8">הַ/בּוֹנֶ֤ה</w>
          <w lemma="b/8064" n="1.1.0" morph="HRd/Ncmda" id="30q3q">בַ/שּׁמַ֨יִם֙</w>
          <seg type="x-variant" subType="x-ketiv">
            <w lemma="4609 b" morph="HNcfpc/Sp3ms" id="30zBj">מעלות/ו</w>
          </seg>
          <seg type="x-variant" subType="x-qere">
            <w lemma="4609 b" n="1.1" morph="HNcfpc/Sp3ms" id="30mM8">מַעֲלוֹתָ֔י/ו</w>
          </seg>
          <w lemma="c/92" n="1.0" morph="HC/Ncfsc/Sp3ms" id="30Boi">וַ/אֲגֻדָּת֖/וֹ</w>
       !!! SNIP !!!

That validates (except for the id attributes on w elements, see https://github.com/openscriptures/morphhb/issues/84) and does not interfere with using a seg element and a @type attribute for the maqqef.
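
With that seg-based layout (a proposal in this thread, not the format currently in the files), choosing a reading becomes a simple filter. A sketch in Python with `xml.etree.ElementTree`, no namespace assumed; the function name `reading_ids` is hypothetical.

```python
import xml.etree.ElementTree as ET

VERSE = """<verse osisID="Amos.9.6">
  <w lemma="d/1129" morph="HTd/Vqrmsa" id="30Rr8">הַ/בּוֹנֶ֤ה</w>
  <w lemma="b/8064" n="1.1.0" morph="HRd/Ncmda" id="30q3q">בַ/שּׁמַ֨יִם֙</w>
  <seg type="x-variant" subType="x-ketiv">
    <w lemma="4609 b" morph="HNcfpc/Sp3ms" id="30zBj">מעלות/ו</w>
  </seg>
  <seg type="x-variant" subType="x-qere">
    <w lemma="4609 b" n="1.1" morph="HNcfpc/Sp3ms" id="30mM8">מַעֲלוֹתָ֔י/ו</w>
  </seg>
  <w lemma="c/92" n="1.0" morph="HC/Ncfsc/Sp3ms" id="30Boi">וַ/אֲגֻדָּת֖/וֹ</w>
</verse>"""


def reading_ids(verse, prefer="x-qere"):
    """Return word ids in order, keeping only the preferred variant reading."""
    ids = []
    for child in verse:
        if child.tag == "w":
            ids.append(child.get("id"))
        elif child.tag == "seg" and child.get("type") == "x-variant":
            if child.get("subType") == prefer:
                ids.extend(w.get("id") for w in child.findall("w"))
    return ids


verse = ET.fromstring(VERSE)
print(reading_ids(verse, "x-qere"))
print(reading_ids(verse, "x-ketiv"))
```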

@DavidTroidl Am I missing any possibilities or use cases? What would you prefer?

@pdurusau What would you suggest? Does the OSIS schema allow us any kind of direct representation of Ketiv and Qere as parallel readings?

jonathanrobie commented 2 years ago

Once we agree on what to do, I should be able to turn this around quickly so we don't have XML that won't validate.

DavidTroidl commented 2 years ago

In developing the OSHB, one of our primary concerns was how to faithfully represent the text, within the limitations of OSIS. We had numerous extended discussions, to resolve various issues. I am in no way saying that the ketiv is the "preferred reading". The point is that the ketiv is an actual part of the consonantal text, the qere is a Massoretic note on that text. Of course, vowel points and cantillation are too. But this gives the underlying logic for the format we are using. Setting apart maqqef, paseq, etc. and assigning attributes to them, I'm sure goes back to one of these discussions, but I don't recall all the details. Now that they are there, removing the attributes would wreak havoc on existing software developed for dealing with the OSHB. Looking at the schema, I find that the w and seg elements both have a subType attribute. Would it work to use @subType at least on the maqqef, or possibly on the w too?

jonathanrobie commented 2 years ago

To me, @subType only makes sense when it describes a subtype of the @type attribute.

Of all the approaches discussed so far, I think this is the one that allows the maqqef attribute, validates under OSIS, and clearly identifies both Ketiv and Qere:

        <verse osisID="Amos.9.6">
          <w lemma="d/1129" morph="HTd/Vqrmsa" id="30Rr8">הַ/בּוֹנֶ֤ה</w>
          <w lemma="b/8064" n="1.1.0" morph="HRd/Ncmda" id="30q3q">בַ/שּׁמַ֨יִם֙</w>
          <seg type="x-variant" subType="x-ketiv">
            <w lemma="4609 b" morph="HNcfpc/Sp3ms" id="30zBj">מעלות/ו</w>
          </seg>
          <seg type="x-variant" subType="x-qere">
            <w lemma="4609 b" n="1.1" morph="HNcfpc/Sp3ms" id="30mM8">מַעֲלוֹתָ֔י/ו</w>
          </seg>
          <w lemma="c/92" n="1.0" morph="HC/Ncfsc/Sp3ms" id="30Boi">וַ/אֲגֻדָּת֖/וֹ</w>

It is also the approach that is documented in Crosswire's documentation.

Would you be OK with that?

I could fix the ID validation problem in a separate pull request.

DavidTroidl commented 2 years ago

Maqqefs that appear in ketivs would be a valid subtype of maqqefs in general. That would narrow down the usage to the maqqef only. The seg suggestion would again wreak havoc with existing software.

jonathanrobie commented 2 years ago

Sounds like existing software means we can't change much beyond adding an attribute. If so, I think we are probably looking at solutions that are not semantically clean, but can work in a program. Would it be possible to include the developers of this other software in the conversation? Or perhaps that's you? I'd like to know if any more wiggle room is possible.

Using @subType without a @type is odd, but I don't know what @type would work for both the w elements and the seg elements.

  <w lemma="c/1961" morph="HC/Vqw3ms" id="09wci">וַֽ/יְהִי</w>
  <seg type="x-maqqef">־</seg>
  <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
  <w subType="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-maqqef" subType="x-ketiv">־</seg>
  <w subType="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>
  <note type="variant">
    <catchWord>מ/בן־ימין</catchWord>
    <rdg type="x-qere">
      <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>
    </rdg>
  </note>

I could do that. It would work. It would probably confuse some people. It's not semantically clean.

DavidTroidl commented 2 years ago

Actually, my last suggestion was intended to confine the subType to the maqqef and leave the type="x-ketiv" on the w elements. This would be cleaner and still accomplish the objective. The subType="x-ketiv" would then be a refinement of type="x-maqqef".

jonathanrobie commented 2 years ago

Ah, I misunderstood you. I agree that would be cleaner for maqqefs used within ketiv.

What about <seg> elements that are not ketiv? What type and subtype would a maqqef have in that case?

DavidTroidl commented 2 years ago

That's akin to asking what a w element would be, when it's not a ketiv. It would be just an ordinary word, or in this case an ordinary maqqef.

jonathanrobie commented 2 years ago

I just created a pull request that fixes the id problem and this one; it now validates. This example shows how I treated maqqef inside and outside of the ketiv. Did I understand your intent correctly?

<verse osisID="1Sam.9.1">
          <w ID="i09wci" lemma="c/1961" morph="HC/Vqw3ms">וַֽ/יְהִי</w><seg type="x-maqqef">־</seg><w ID="i09MpA" lemma="376" morph="HNcmsa">אִ֣ישׁ</w>
          <w ID="i09Una" type="x-ketiv" lemma="m/1121 a" morph="HR/Np">מ/בן</w><seg subType="x-maqqef" type="x-ketiv">־</seg><w ID="i09jgC" type="x-ketiv" lemma="3225" morph="HNp">ימין</w><note type="variant"><catchWord>מ/בן־ימין</catchWord><rdg type="x-qere"><w ID="i09EC9" lemma="m/1144" n="1.0.1" morph="HR/Np">מִ/בִּנְיָמִ֗ין</w></rdg></note>
          <note n="a">Adaptations to a Qere which L and BHS, by their design, do not indicate.</note>
DavidTroidl commented 2 years ago

The id fix was only meant to be a temporary stopgap for testing. I am still discussing the ID format with the authors of the IDs. Making the maqqef the subType is still going to cause problems. I don't want to commit either of these to the files.

jonathanrobie commented 2 years ago

OK.

Let me know if there's anything useful I can do.

On Sun, Jul 18, 2021 at 3:07 PM David Troidl @.***> wrote:

The id fix was only meant to be a temporary stopgap for testing. I am still discussing the ID format with the authors of the IDs. Making the maqqef the subType is still going to cause problems. I don't want to commit either of these to the files.


DavidTroidl commented 2 years ago

Thanks for your help. I corrected the two token issue using type="x-maqqef" subType="x-ketiv". This validates, using the temporary id correction.
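
For anyone wanting to follow the Qere with this marking, a minimal sketch might look like the following. It assumes the format described in this comment (type="x-ketiv" on w, type="x-maqqef" subType="x-ketiv" on a maqqef inside a ketiv), Python's `xml.etree.ElementTree`, and element names without a namespace; the function names are hypothetical and the id format follows the earlier examples in this thread.

```python
import xml.etree.ElementTree as ET

VERSE = """<verse osisID="1Sam.9.1">
  <w lemma="c/1961" morph="HC/Vqw3ms" id="09wci">וַֽ/יְהִי</w>
  <seg type="x-maqqef">־</seg>
  <w lemma="376" morph="HNcmsa" id="09MpA">אִ֣ישׁ</w>
  <w type="x-ketiv" lemma="m/1121 a" morph="HR/Np" id="09Una">מ/בן</w>
  <seg type="x-maqqef" subType="x-ketiv">־</seg>
  <w type="x-ketiv" lemma="3225" morph="HNp" id="09jgC">ימין</w>
  <note type="variant">
    <catchWord>מ/בן־ימין</catchWord>
    <rdg type="x-qere">
      <w lemma="m/1144" n="1.0.1" morph="HR/Np" id="09EC9">מִ/בִּנְיָמִ֗ין</w>
    </rdg>
  </note>
</verse>"""


def is_ketiv(el):
    """True for a ketiv word (type) or a maqqef inside a ketiv (subType)."""
    return el.get("type") == "x-ketiv" or el.get("subType") == "x-ketiv"


def qere_reading(verse):
    """Return the verse's elements with each ketiv replaced by its qere."""
    out = []
    for child in verse:
        if child.tag in ("w", "seg"):
            if not is_ketiv(child):
                out.append(child)
        elif child.tag == "note" and child.get("type") == "variant":
            for rdg in child.findall("rdg"):
                if rdg.get("type") == "x-qere":
                    out.extend(rdg)  # the w and seg children of the reading
    return out


print([e.get("id") for e in qere_reading(ET.fromstring(VERSE))])
```

No catchWord parsing is needed: the ketiv elements are skipped by attribute alone, and the ordinary maqqef (which has no id, hence `None` in the output) is kept.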

jonathanrobie commented 2 years ago

Thanks - I had misunderstood your earlier comment. This will work for me.

DavidTroidl commented 2 years ago

Glad we came up with something that works.