Closed funderburkjim closed 1 year ago
This week's changes to mw.txt, primarily to ls markup, are now complete. The work is in the issue136 directory.
The sequence of changes is in files change_1.txt through change_4.txt, with corresponding notes in the readme.txt file of the issue136 directory.
The change_all.txt file lists all 4221 lines changed in mw.txt.
The ls markup in mw.txt has received considerable attention in the last few weeks. Thanks to @Andhrabharati and @gasyoun for pointing out areas that needed attention.
At the moment, I don't have in mind further lines of improvement to ls markup in mw.txt, and will likely return to ls markup improvements in pw and pwg.
Probably @funderburkjim may consider removing the 3 <slp1> tags in the <ls> strings--
change <ls>Kielhorn., <s1 slp1="mahABAzya">Mahābhāṣya</s1>, vol. i, preface, p.9 f.</ls>
as <ls>Kielhorn., Mahābhāṣya, vol. i, preface, p.9 f.</ls>
change <ls>VS. (<s1 slp1="kARva">Kāṇva</s1>) ii, 24</ls>
as <ls>VS. (Kāṇva.) ii, 24</ls>
change <ls>YajurV. <s1 slp1="parIS">Parīś</s1>. xv</ls>
as <ls>YajurV. Parīś. xv</ls>
And also correct the space-between-digits `[0-9] [0-9]` errors inside the <ls> (45 occurrences, which link to a wrong place) and <pc> (4 occurrences, which lead to a wrong page) blocks.
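A minimal sketch of how such space-between-digits cases could be located, assuming each <ls> item is closed by </ls> on the same line and each <pc> value runs up to the next tag. The function name and the sample line are hypothetical, not taken from mw.txt.

```python
import re

# Content of one <ls>...</ls> element (assumed not to span lines).
LS = re.compile(r"<ls\b[^>]*>(.*?)</ls>")
# A <pc> value, assumed to run up to the next tag (as in metalines).
PC = re.compile(r"<pc>([^<]*)")
# A single space sitting between two digits, as posted above.
DIGIT_GAP = re.compile(r"[0-9] [0-9]")

def find_digit_gaps(line):
    """Return (tag, content) pairs whose content has a digit-space-digit gap."""
    hits = []
    for tag, pattern in (("ls", LS), ("pc", PC)):
        for m in pattern.finditer(line):
            if DIGIT_GAP.search(m.group(1)):
                hits.append((tag, m.group(1)))
    return hits

# Hypothetical line, for illustration only.
sample = "<L>1<pc>43 3,1<k1>x <ls>R. i, 2 7, 7</ls> and <ls>MBh. i, 27</ls>"
print(find_digit_gaps(sample))
```

Each hit still needs a manual look, since the regex cannot tell which digit grouping was intended in print.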
Some more minor corrections, on a quick look--
change <ls>R , i, 27, 7</ls>
as <ls>R. i, 27, 7</ls>
change <ls>Daś. -</ls>
as <ls>Daś.</ls>
change <ls>L. -</ls>
as <ls>L.</ls>
change <ls>Rājat. -</ls>
as <ls>Rājat.</ls>
change <ls>R. -</ls>
as <ls>R.</ls>
change <ls>L., also</ls>
as <ls>L.</ls>, also
change <ls>ŚBr., as</ls>
as <ls>ŚBr.</ls>, as
change <ls>Beta bengalensis</ls>
as <bot>Beta bengalensis</bot>
change <ls>T., but according to</ls>; <ls>Uṇ. i, 67</ls>
as <ls>T.</ls>, but according to <ls>Uṇ. i, 67</ls>
change <ls>MBh. etc</ls>,
as <ls>MBh.</ls> etc.
change <ls>Kāv. etc.</ls>
as <ls>Kāv.</ls> etc.
It is also worth mentioning that many punctuation errors, as in cases 9 & 10 above, are seen throughout the text; and there are a handful of cases of misplaced hyphen marks which indicate the <hom> numbers of the next entry word.
@Andhrabharati I will take a look at these flaws. Is there a systematic way to find other instances like 9 and 10?
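One possible approach, sketched under assumptions: flag every <ls> item whose content ends in trailing material of the kind seen in the corrections listed above (`etc`, `also`, `as`, `but according to`, or a dangling hyphen). The word list is taken only from those examples and is surely incomplete; the function name is hypothetical.

```python
import re

# Content of one <ls>...</ls> element (assumed not to span lines).
LS = re.compile(r"<ls\b[^>]*>(.*?)</ls>")
# Trailing material that does not belong in a citation. The word list is an
# assumption drawn from the examples in this thread, not a complete rule.
SUSPECT_TAIL = re.compile(
    r"(?:[,\s]\s*(?:etc\.?|also|as|but according to)|\s-)$"
)

def suspicious_ls(line):
    """Return the full <ls>...</ls> items whose content has a suspect tail."""
    return [m.group(0) for m in LS.finditer(line)
            if SUSPECT_TAIL.search(m.group(1))]

sample = "<ls>MBh. etc</ls>, <ls>R. i, 27, 7</ls>, <ls>L., also</ls> <ls>Daś. -</ls>"
for item in suspicious_ls(sample):
    print(item)
```

A broader rule such as "ends in a lowercase word" would also flag roman numerals like `xv`, which is why the sketch sticks to an explicit word list.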
a handful cases of hyphen marks at wrong places which indicate the <hom> numbers of the next entry word are present.
Though these are not related to <ls> items, they are more important, as they concern the HWs (and metalines) themselves; some (6) can be found with the regex `-[0-9].<info`.
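As a rough sketch, the regex can be applied line by line to report candidate line numbers. The pattern is used exactly as posted, so the unescaped `.` matches any character; the function name and the sample lines are hypothetical.

```python
import re

# Regex as posted in the thread; the "." is unescaped and matches any character.
HOM_HYPHEN = re.compile(r"-[0-9].<info")

def find_hom_hyphens(lines):
    """Return 1-based numbers of the lines that match the posted regex."""
    return [i for i, line in enumerate(lines, 1) if HOM_HYPHEN.search(line)]

# Hypothetical lines, for illustration only.
sample = [
    'a kind of plant -2.<info lex="m"/>',
    'a kind of plant.<info lex="m"/>',
]
print(find_hom_hyphens(sample))
```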
And there is a <hom> related issue, #131, that @funderburkjim has yet to look at!
These take into account suggestions since previous commit. The change transactions are change_5.txt. All in all, about 800 lines were changed.
A large number of 'new' ls abbreviations were added to mwauth as 'Unknown'. (Also, the Maṇḍ. abbreviation was given a tooltip -- see the 'pywork' commit above.)
@Andhrabharati could help by providing tooltips for these Unknown cases. ls_abbrev_instances_unknown.txt file has instances of most of these Unknown cases.
An attempt was made to define programmatically a 'normal' ls instance. Using this rule, there remain about 40 'abnormal' instances identified in file lsabnormal_5.txt.
If there are no more correction suggestions for 'ls' in mw, this issue can be closed and I'll take a look at the hom issue mentioned above.
I see that the space-between-digits and other corrections related to <ls> items have all been taken into account now.
The remaining point in this issue is https://github.com/sanskrit-lexicon/MWS/issues/136#issuecomment-1186091325, which could be done before going to #131, or kept in mind while doing the <hom> corrections.
I would prefer them being corrected here itself, as these are not marked <hom> explicitly.
So Jim can decide on the action accordingly to close this issue.
A large number of 'new' ls abbreviations were added to mwauth as 'Unknown'. (Also, the Maṇḍ. abbreviation was given a tooltip -- see the 'pywork' commit above.)
@Andhrabharati could help by providing tooltips for these Unknown cases. ls_abbrev_instances_unknown.txt file has instances of most of these Unknown cases.
An attempt was made to define programmatically a 'normal' ls instance. Using this rule, there remain about 40 'abnormal' instances identified in file lsabnormal_5.txt.
If @funderburkjim would like to do it here itself, I can surely help resolve these, but I would suggest doing this piece of work when some action is taken on issue #135 (which is related to the same and also has some more relevant points).
Hope to hear Jim's opinion.
Some small extra corrections related to spaces:
cases of `,`, 9 cases of `;` and 10 cases of `)` to have the preceding space deleted.
`>` to be deleted.
In the tooltip.txt,
99.98 ib. int the same place [Cologne Addition] Title
to be corrected as
99.98 ib. in the same place [Cologne Addition] Title
There are two instances of ** under <L>229710 and <L>237718 in the mw.txt, that may be deleted.
Incidentally, the <ls>ĀpastPray.</ls>, which is at <L>237718, is without a tooltip, but is not listed in either abbrevlist_unknown.txt or in ls_abbrev_instances_unknown.txt, though present in both mwauth.txt and tooltip.txt.
When I checked for equivalence among these 4 files, I noticed that both mwauth.txt and tooltip.txt have 168 "Unknown reference" entries still to be expanded, whereas both abbrevlist_unknown.txt and ls_abbrev_instances_unknown.txt list just 147.
What is the reason for the difference of 21 between the two sets of files?
The 21 additional entries in tooltip.txt are--
ĀpGṛh. ĀpastPray. Śak. (Chézy) Śak. (Pi.) AV. Paipp. AV., SBE. Kaegi, Der Ṛgveda Ludwig, RV. Muir's Sanskrit Texts Muir, S. T. Pañc. B. Pat. (K.) R. (B) R. (B.) R. G. R. [B.] RV. AnuvAnukr. SV.Anukr. Uttamac.² YajurV. Parīś. Zachariae, Beiträge
In these, Śak. (Pi.) occurs 18 times!!
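A comparison like this can be sketched as a set difference, once the abbreviations have been extracted from each file. The file formats are not described in this thread, so the sketch works on plain lists; the function name and the toy lists are hypothetical (`Abbr1.` is a made-up placeholder).

```python
def missing_from(reference, subset):
    """Abbreviations present in `reference` but absent from `subset`, sorted."""
    return sorted(set(reference) - set(subset))

# Toy data only; the real lists would come from tooltip.txt and
# abbrevlist_unknown.txt, whose layouts are not assumed here.
tooltip_unknown = ["ĀpGṛh.", "Abbr1.", "Śak. (Pi.)", "ĀpastPray."]
abbrevlist_unknown = ["Abbr1."]

extra = missing_from(tooltip_unknown, abbrevlist_unknown)
print(len(extra), extra)
```

Running the same function in the other direction would show whether any abbreviation is listed as unknown without a corresponding tooltip entry.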
<ls n="Unknown">:
under <L>67611, <ls n="Unknown">lii, 19</ls> to be made as <ls n="AV.Pariś.">lii, 19</ls>, taking the prev. ls item (AV.Pariś.) as the ref. [cf. PWG entry of govITI.]
under <L>71651, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candraketu and the Ind. St. 14 thereupon (p. 159).]
under <L>71652, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candrakeSa and the Ind. St. 14 thereupon (p. 159).]
under <L>71680, <ls n="Unknown">xv</ls> to be made as <ls n="Vīrac.">xv.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candracUqa and the Ind. St. 14 thereupon (p. 159).]
under <L>71827, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candravikrama and the Ind. St. 14 thereupon (p. 159).]
under <L>71874, <ls n="Unknown">xxx.</ls> to be made as <ls n="Vīrac.">xxx.</ls>, taking the prev. ls item (Vīrac.) as the ref. [cf. pwk entry candrasena and the Ind. St. 14 thereupon (p. 159).]
under <L>84603, <s1 slp1="saMgIta-darpaRa">Saṃgīta-darpaṇa</s1>, <ls n="Unknown">vi</ls> to be made as <ls>Saṃgīta-darpaṇa, vi</ls>
under <L>95073.91, <ls n="Unknown">52, 5</ls> to be made as <ls n="R.">2, 52, 5</ls>; this is a print correction, and has the prev. ls item (MBh. &c.) as the ref. [cf. pwk entry darh having "— 5) दृढ꣫ , दृळ्ह꣫" and the PWG entry darh having "°स्थूण R. 2, 105, 16. नौ 2, 52, 5."]
under <L>95074.05, <ls n="Unknown">52, 5</ls> to be made as <ls n="R.">2, 52, 5</ls>; this is a print correction, and has the prev. ls item (MBh. &c.) as the ref. [cf. pwk entry darh having "— 5) दृढ꣫ , दृळ्ह꣫" and the PWG entry darh having "°स्थूण R. 2, 105, 16. नौ 2, 52, 5."]
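The "take the previous ls item as the ref." step used in most of the cases above could be sketched as follows. This only proposes candidates: as the last two cases show, the preceding item is not always the right source, so every proposal still needs checking against PWG/pwk. The function name is hypothetical, and treating the first whitespace-delimited token as the abbreviation is an assumption that fails for multi-word abbreviations such as `YajurV. Parīś.`; the sample line is invented.

```python
import re

# One <ls> item, with an optional n="..." attribute.
LS = re.compile(r'<ls(?: n="([^"]*)")?>(.*?)</ls>')

def propose_unknown_fixes(line):
    """For each <ls n="Unknown"> item, pair its text with the abbreviation
    of the nearest preceding <ls> item on the same line (or None)."""
    previous = None
    proposals = []
    for m in LS.finditer(line):
        n_attr, text = m.group(1), m.group(2)
        if n_attr == "Unknown":
            proposals.append((text, previous))
        elif n_attr:
            # An explicit n="..." already names the source.
            previous = n_attr
        elif text.strip():
            # Assumption: the abbreviation is the first token of the text.
            previous = text.split()[0]
    return proposals

# Hypothetical line, for illustration only.
sample = '<ls>AV.Pariś. li, 3</ls>; <ls n="Unknown">lii, 19</ls>'
print(propose_unknown_fixes(sample))
```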
Shouldn't the last two "दृढ (or दृळ्ह॑)" be marked as or-group candidates?
There are plenty more of such "unmarked groups", separated out as diff. HWs in the whole data of mw.txt. https://github.com/sanskrit-lexicon/MWS/issues/132#issuecomment-1163134486
@funderburkjim
While you're on this MW work, would you mind generating the IAST version of mw.txt again [so that I can work better (or rather, faster) using it]? [The version I have is more than one year old (Apr 2021); a lot of updates have been made to the text during this period.]
These take into account the preceding comments by @Andhrabharati. About 170 lines changed in mw.txt. Change transaction details are in change_6.txt.
Note 1: The one remaining n="Unknown" was solved:
<L>81877<pc>433,1<k1>tattvaboDa
knowledge or understanding of truth, <ls n="Sarvad.">xii, 46</ls>
[cf. PWG, and MW tattvaprakASa]
Note 2: Shouldn't the last two "दृढ (or दृळ्ह॑)" be marked as or-group candidates?
They already are so marked in
<L>95073.9<pc>490,2<k1>df|a and <L>95074<pc>490,2<k1>dfQa
which have the 'or' markup: <info or="95074,dfQa;95073.9,df|a"/>
The `or` markup is not repeated for the '2a' subsidiary entries.
The iast version of revised mw.txt is temp_mw_6_iast.zip.
The unknown ls abbreviations file is revised and contains 170 items: abbrevlist_unknown.txt
Instances of the abbreviations with unknown tooltips are in ls_abbrev_instances_unknown1.txt and ls_abbrev_instances_unknown1_iast.txt, based on temp_mw_6_iast.txt.
We can discuss tooltips for the unknown literary source abbreviations under #135.
The best format for me would be via an edit of abbrevlist_unknown.txt, where each `Unknown reference` text is replaced by the appropriate tooltip text for the abbreviation.
The ls_abbrev_instances_unknown1_iast.txt file might be helpful in examining the cases.
Perhaps now we can consider this #136 closeable?
@funderburkjim
Wonderful updates! And, thanks for the IAST file.
About to finish resolving the unknown reference entities (just another 15 remaining). Will post the results in #135.
And you can close this issue now.
This issue is devoted to the continuation (from #134) of the cleanup of ls markup in mw.txt. Two fertile areas: