sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

Further <ls> cleanup #136

Closed funderburkjim closed 1 year ago

funderburkjim commented 2 years ago

This issue is devoted to continuing (from #134) the cleanup of ls markup in mw.txt. Two fertile areas:

876 matches in 874 lines for "<ls[^<]* and"
Example: 
OLD
<ls>Mn. ix, 49 and 51.</ls>  
NEW
<ls>Mn. ix, 49</ls> and <ls n="Mn. ix,">51.</ls>

1098 matches in 1085 lines for "<ls[^<]*;"
Example:
OLD
<ls>Mn. iii, 257; v, 73</ls>
NEW
<ls>Mn. iii, 257</ls>; <ls n="Mn.">v, 73</ls>
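
A rough sketch of how the second pattern could be counted and split. It assumes mw.txt lines are available as plain strings and that the abbreviation is the text before the first space, which will not hold for multi-word abbreviations. The function names are illustrative only and are not the scripts in the issue136 directory.

```python
import re

# Pattern from this comment: an <ls> element whose text contains ';'
LS_SEMI = re.compile(r"<ls[^<]*;")
# A plain <ls>...</ls> element with no attributes
LS_PLAIN = re.compile(r"<ls>([^<]*)</ls>")


def count_matches(lines, pattern):
    """Return (number of matches, number of lines with at least one match)."""
    n_matches = 0
    n_lines = 0
    for line in lines:
        found = pattern.findall(line)
        if found:
            n_matches += len(found)
            n_lines += 1
    return n_matches, n_lines


def split_semicolon(line):
    """Turn '<ls>A. x; y</ls>' into '<ls>A. x</ls>; <ls n="A.">y</ls>'.

    Assumption: the abbreviation is the text before the first space.
    """
    def repl(m):
        parts = [p.strip() for p in m.group(1).split(";") if p.strip()]
        if len(parts) < 2 or " " not in parts[0]:
            return m.group(0)  # nothing to split; leave unchanged
        abbrev = parts[0].split(" ", 1)[0]
        out = ["<ls>%s</ls>" % parts[0]]
        out += ['<ls n="%s">%s</ls>' % (abbrev, p) for p in parts[1:]]
        return "; ".join(out)

    return LS_PLAIN.sub(repl, line)
```

The " and " pattern needs a different rule for the n attribute (n="Mn. ix," in the example above), so it is not covered by this sketch.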

gasyoun commented 2 years ago

fertile areas

:100:

funderburkjim commented 1 year ago

This week's changes to mw.txt, primarily to ls markup, are now completed. The work is done in issue136 directory.

The sequence of changes is in files change_1.txt through change_4.txt, with corresponding notes in the readme.txt file of the issue136 directory.

The change_all.txt file lists all 4221 lines changed in mw.txt.

The ls markup in mw.txt has received considerable attention in the last few weeks. Thanks to @Andhrabharati and @gasyoun for pointing out areas that needed attention.

At the moment, I don't have in mind further lines of improvement to ls markup in mw.txt, and will likely return to ls markup improvements in pw and pwg.

Andhrabharati commented 1 year ago

Perhaps @funderburkjim could consider removing the 3 <s1> tags in the <ls> strings--

change <ls>Kielhorn., <s1 slp1="mahABAzya">Mahābhāṣya</s1>, vol. i, preface, p.9 f.</ls> as <ls>Kielhorn., Mahābhāṣya, vol. i, preface, p.9 f.</ls>

change <ls>VS. (<s1 slp1="kARva">Kāṇva</s1>) ii, 24</ls> as <ls>VS. (Kāṇva.) ii, 24</ls>

change <ls>YajurV. <s1 slp1="parIS">Parīś</s1>. xv</ls> as <ls>YajurV. Parīś. xv</ls>

Andhrabharati commented 1 year ago

And also correct the space-between-digits [0-9] [0-9] errors inside the <ls> (45 occurrences, which link to a wrong place) and <pc> (4 occurrences, which lead to a wrong page) blocks.
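
A minimal sketch of how such cases might be located, assuming both elements appear with explicit closing tags (a <pc> in a metaline without a closing tag would not be caught). The names ELEMENT, SPACED_DIGITS and spaced_digit_elements are made up for this illustration.

```python
import re

# Text of an <ls> or <pc> element (attributes allowed on the opening tag)
ELEMENT = re.compile(r"<(ls|pc)\b[^>]*>([^<]*)</\1>")
# Two digits separated by a single space, e.g. "2 7"
SPACED_DIGITS = re.compile(r"[0-9] [0-9]")


def spaced_digit_elements(line):
    """Return (tag, text) for each <ls>/<pc> element containing 'digit space digit'."""
    return [
        (m.group(1), m.group(2))
        for m in ELEMENT.finditer(line)
        if SPACED_DIGITS.search(m.group(2))
    ]
```

Each hit still needs a manual look, since the regex cannot tell which digit grouping was intended.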

Andhrabharati commented 1 year ago

Some more minor corrections, on a quick look--

  1. change <ls>R , i, 27, 7</ls> as <ls>R. i, 27, 7</ls>

  2. change <ls>Daś. -</ls> as <ls>Daś.</ls>

  3. change <ls>L. -</ls> as <ls>L.</ls>

  4. change <ls>Rājat. -</ls> as <ls>Rājat.</ls>

  5. change <ls>R. -</ls> as <ls>R.</ls>

  6. change <ls>L., also</ls> as <ls>L.</ls>, also

  7. change <ls>ŚBr., as</ls> as <ls>ŚBr.</ls>, as

  8. change <ls>Beta bengalensis</ls> as <bot>Beta bengalensis</bot>

  9. change <ls>T., but according to</ls>; <ls>Uṇ. i, 67</ls> as <ls>T.</ls>, but according to <ls>Uṇ. i, 67</ls>

  10. change <ls>MBh. etc</ls>, as <ls>MBh.</ls> etc.

  11. change <ls>Kāv. etc.</ls> as <ls>Kāv.</ls> etc.

It is also worth mentioning that many punctuation errors, as in cases 9 & 10 above, are seen throughout the text; and there are a handful of cases of hyphen marks at wrong places, which indicate the <hom> numbers of the next entry word.
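
One possible way to look for more cases like 6, 7 and 9 to 11 above is to flag any <ls> whose text contains an ordinary English word. This is only a sketch: the word list in SUSPECT is a guess based on those cases and would need extending after looking at real hits.

```python
import re

# An <ls> element, with or without attributes
LS = re.compile(r"<ls\b[^>]*>([^<]*)</ls>")
# Hypothetical list of English words that should not sit inside an <ls>
SUSPECT = re.compile(r"\b(also|as|but|etc|according)\b")


def suspicious_ls(line):
    """Return the <ls> elements in `line` whose text contains a suspect word."""
    return [m.group(0) for m in LS.finditer(line) if SUSPECT.search(m.group(1))]
```

A fixed word list is used rather than "any lowercase word" because roman numerals such as ix or xv are lowercase words too.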

funderburkjim commented 1 year ago

@Andhrabharati I will take a look at these flaws; is there a systematic way to find other instances like 9 and 10?

Andhrabharati commented 1 year ago

a handful cases of hyphen marks at wrong places which indicate the <hom> numbers of the next entry word are present.

Though these are not related to <ls> items, they are more important as concerning the HWs (and metalines) themselves; some (6) are to be found by the regex -[0-9].<info.

And there is a <hom>-related issue, #131, that @funderburkjim has yet to look at!

funderburkjim commented 1 year ago

further cleanup.

These take into account suggestions since previous commit. The change transactions are change_5.txt. All in all, about 800 lines were changed.

A large number of 'new' ls abbreviations were added to mwauth as 'Unknown'. (Also, the Maṇḍ. abbreviation was given a tooltip -- see the 'pywork' commit above.)

@Andhrabharati could help by providing tooltips for these Unknown cases. ls_abbrev_instances_unknown.txt file has instances of most of these Unknown cases.

An attempt was made to define programmatically a 'normal' ls instance. Using this rule, there remain about 40 'abnormal' instances identified in file lsabnormal_5.txt.

If there are no more correction suggestions for 'ls' in mw, this issue can be closed and I'll take a look at the hom issue mentioned above.

Andhrabharati commented 1 year ago

I see that the space-between-digits and other corrections related to <ls> items have all been taken care of now.

The remaining point in this issue is https://github.com/sanskrit-lexicon/MWS/issues/136#issuecomment-1186091325, which could be done before going to #131, or to be kept in mind while doing the <hom> corrections.

I would prefer them being corrected here itself, as these are not marked <hom> explicitly.

So Jim can decide the action accordingly to close this issue.

Andhrabharati commented 1 year ago

A large number of 'new' ls abbreviations were added to mwauth as 'Unknown'. (Also, the Maṇḍ. abbreviation was given a tooltip -- see the 'pywork' commit above.)

@Andhrabharati could help by providing tooltips for these Unknown cases. ls_abbrev_instances_unknown.txt file has instances of most of these Unknown cases.

An attempt was made to define programmatically a 'normal' ls instance. Using this rule, there remain about 40 'abnormal' instances identified in file lsabnormal_5.txt.

If @funderburkjim likes to do it here itself, I can surely help resolving these, but I would suggest doing this piece of work while some action is taken on the issue #135 (which is related to the same and also has some more relevant points).

Hoping to hear Jim's opinion.

Andhrabharati commented 1 year ago

Some small extra corrections related to spaces:

  1. There are 112 double space instances in the mw.txt, that are to be made single spaces.
  2. 10 cases of " ,", 9 cases of " ;" and 10 cases of " )" to have the preceding space deleted.
  3. 3 dangling > to be deleted.
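
The first two items could be handled along these lines. This is a sketch only: tidy_spaces is a made-up name, the dangling > cases are not covered, and a blanket substitution may touch places where the spacing is intentional, so the resulting diff should be reviewed.

```python
import re

# Two or more consecutive spaces
DOUBLE_SPACE = re.compile(r"  +")
# One or more spaces directly before ',', ';' or ')'
SPACE_BEFORE_PUNCT = re.compile(r" +([,;)])")


def tidy_spaces(line):
    """Collapse runs of spaces and drop spaces before ',', ';' and ')'."""
    line = DOUBLE_SPACE.sub(" ", line)
    return SPACE_BEFORE_PUNCT.sub(r"\1", line)
```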

Andhrabharati commented 1 year ago

In the tooltip.txt,

99.98 ib. int the same place [Cologne Addition] Title

to be corrected as

99.98 ib. in the same place [Cologne Addition] Title

Andhrabharati commented 1 year ago

There are two instances of ** under <L>229710 and <L>237718 in the mw.txt, that may be deleted.

Incidentally, the <ls>ĀpastPray.</ls>, which is at <L>237718, is without a tooltip, but is not listed in either abbrevlist_unknown.txt or in ls_abbrev_instances_unknown.txt, though present in both mwauth.txt and tooltip.txt.

When I checked these 4 files against each other, I noticed that both mwauth.txt and tooltip.txt have 168 "Unknown reference" entries still to be expanded, whereas abbrevlist_unknown.txt and ls_abbrev_instances_unknown.txt each list just 147.

What is the reason for the difference of 21 between the two sets of files?

The 21 additional entries in tooltip.txt are--

  1. ĀpGṛh.
  2. ĀpastPray.
  3. Śak. (Chézy)
  4. Śak. (Pi.)
  5. AV. Paipp.
  6. AV., SBE.
  7. Kaegi, Der Ṛgveda
  8. Ludwig, RV.
  9. Muir's Sanskrit Texts
  10. Muir, S. T.
  11. Pañc. B.
  12. Pat. (K.)
  13. R. (B)
  14. R. (B.)
  15. R. G.
  16. R. [B.]
  17. RV. AnuvAnukr.
  18. SV.Anukr.
  19. Uttamac.²
  20. YajurV. Parīś.
  21. Zachariae, Beiträge

Of these, Śak. (Pi.) occurs 18 times!!
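
For reference, a comparison of this kind can be sketched as a set difference, assuming the abbreviations have already been read out of each file into a list (the file formats are not assumed here, and compare_abbrevs is a hypothetical helper).

```python
def compare_abbrevs(a, b):
    """Return (only_in_a, only_in_b) as sorted lists of abbreviations."""
    set_a, set_b = set(a), set(b)
    return sorted(set_a - set_b), sorted(set_b - set_a)
```

With the Unknown entries of tooltip.txt as a and those of abbrevlist_unknown.txt as b, the first list would be expected to hold the 21 extra entries (168 - 147) if nothing is missing in the other direction.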

Andhrabharati commented 1 year ago

Resolving 9 out of 10 instances of <ls n="Unknown">:

  1. under <L>67611, <ls n="Unknown">lii, 19</ls> to be made as <ls n="AV.Pariś.">lii, 19</ls>, taking the prev. ls item (AV.Pariś.) as the ref. [cf. PWG entry of govITI.]

  2. under <L>71651, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candraketu and the Ind. St. 14 thereupon (p. 159).]

  3. under <L>71652, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candrakeSa and the Ind. St. 14 thereupon (p. 159).]

  4. under <L>71680, <ls n="Unknown">xv</ls> to be made as <ls n="Vīrac.">xv.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candracUqa and the Ind. St. 14 thereupon (p. 159).]

  5. under <L>71827, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candravikrama and the Ind. St. 14 thereupon (p. 159).]

  6. under <L>71874, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candrasena and the Ind. St. 14 thereupon (p. 159).]

  7. under <L>84603, <s1 slp1="saMgIta-darpaRa">Saṃgīta-darpaṇa</s1>, <ls n="Unknown">vi</ls> to be made as <ls>Saṃgīta-darpaṇa, vi</ls>

  8. under <L>95073.91, <ls n="Unknown">52, 5</ls> to be made as <ls n="R.">2, 52, 5</ls>; this is a print correction, and has the prev. ls item (MBh. &c.) as the ref. [cf. pwk entry darh having "— 5) दृढ꣫ , दृळ्ह꣫" and the PWG entry darh having "°स्थूण R. 2, 105, 16. नौ 2, 52, 5."]

  9. under <L>95074.05, <ls n="Unknown">52, 5</ls> to be made as <ls n="R.">2, 52, 5</ls>; this is a print correction, and has the prev. ls item (MBh. &c.) as the ref. [cf. pwk entry darh having "— 5) दृढ꣫ , दृळ्ह꣫" and the PWG entry darh having "°स्थूण R. 2, 105, 16. नौ 2, 52, 5."]
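
The "take the previous ls item" step used in most of the cases above could be sketched as follows. It assumes the abbreviation is the first space-delimited token of the preceding <ls> (or of its n attribute), and the function name is made up. As cases 8 and 9 show, the preceding item is not always the right reference, so each proposal still has to be checked against PWG or pwk.

```python
import re

# An <ls> element with an optional n="..." attribute
LS = re.compile(r'<ls(?: n="([^"]*)")?>([^<]*)</ls>')


def propose_unknown(line):
    """For each <ls n="Unknown">, propose the abbreviation of the preceding <ls>.

    Returns a list of (text, proposed_abbreviation) pairs; the proposal is
    None when no earlier <ls> is present on the line.
    """
    proposals = []
    prev_abbrev = None
    for m in LS.finditer(line):
        n, text = m.group(1), m.group(2)
        if n == "Unknown":
            proposals.append((text, prev_abbrev))
        else:
            source = n if n is not None else text
            prev_abbrev = source.split(" ", 1)[0]
    return proposals
```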

Shouldn't the last two "दृढ (or दृळ्ह॑)" be marked as or-group candidates?

There are plenty more of such "unmarked groups", separated out as diff. HWs in the whole data of mw.txt. https://github.com/sanskrit-lexicon/MWS/issues/132#issuecomment-1163134486

Andhrabharati commented 1 year ago

@funderburkjim

While you're on this MW work, would you mind generating the IAST version of mw.txt again, so that I can do better (or rather, faster) work using it? The version I have is more than a year old (Apr 2021), and a great many updates have been made to the text during this period.

funderburkjim commented 1 year ago

2nd batch of corrections.

These take into account the preceding comments by @Andhrabharati. About 170 lines changed in mw.txt. Change transaction details are in change_6.txt.

Note 1: The one remaining n="Unknown" was solved:

 <L>81877<pc>433,1<k1>tattvaboDa
   knowledge or understanding of truth, <ls n="Sarvad.">xii, 46</ls>
   [cf. PWG, and MW tattvaprakASa]

Note 2: Shouldn't the last two "दृढ (or दृळ्ह॑)" be marked as or-group candidates?

  They are already so marked in
   <L>95073.9<pc>490,2<k1>df|a and <L>95074<pc>490,2<k1>dfQa
  which have the 'or' markup: <info or="95074,dfQa;95073.9,df|a"/>
  The `or` markup is not repeated for the '2a' subsidiary entries.
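
As an illustration, the members of an or-group can be read out of this markup as below. This is a sketch based only on the single example above; parse_or_group is a hypothetical name.

```python
import re

# The or-group element as it appears in the example above
INFO_OR = re.compile(r'<info or="([^"]*)"/>')


def parse_or_group(line):
    """Return [(L-number, key1), ...] from an <info or="..."/> element, or []."""
    m = INFO_OR.search(line)
    if not m:
        return []
    # Members are separated by ';', and each member is 'L-number,key1'
    return [tuple(item.split(",", 1)) for item in m.group(1).split(";")]
```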

The iast version of revised mw.txt is temp_mw_6_iast.zip.

The unknown ls abbreviations file is revised and contains 170 items: abbrevlist_unknown.txt

Instances of the abbreviations with unknown tooltips are in ls_abbrev_instances_unknown1.txt and ls_abbrev_instances_unknown1_iast.txt, based on temp_mw_6_iast.txt.

funderburkjim commented 1 year ago

We can discuss tooltips for the unknown literary source abbreviations under #135. The best format for me would be via an edit of abbrevlist_unknown.txt, where each Unknown reference text is replaced by the appropriate tooltip text for the abbreviation. The ls_abbrev_instances_unknown1_iast.txt file might be helpful in examining the cases.

Perhaps now we can consider this #136 closeable?

Andhrabharati commented 1 year ago

@funderburkjim

Wonderful updates! And, thanks for the IAST file.

About to finish resolving the unknown reference entities (just another 15 remaining). Will post the results in #135.

Andhrabharati commented 1 year ago

And you can close this issue now.