sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

Grassmanizing and/or groups #176

Closed funderburkjim closed 1 month ago

funderburkjim commented 1 month ago

The aim of this issue is to carry out the post-standardization plan mentioned in #175. This will make the handling of alternate headwords in MW very similar to their handling in the cdsl version of the GRA dictionary.

funderburkjim commented 1 month ago

lost (missing) headwords

In #175, it was mentioned that the and/or group standardization has the undesired side-effect of 'losing' some searchable headwords; such as 'karvarI'.

dumpgroups_div_missinghw.txt provides a list of these lost headwords and the groups they are in.
We'll need to develop coding in mw.txt that puts these headwords back in.

funderburkjim commented 1 month ago

Sample rewrite of groups

@Andhrabharati The readme at 'Sample rewrite of groups' gives an idea of how to proceed with the groups.

Comment?

Andhrabharati commented 1 month ago

Looking alright; but why to do this in the mw.txt?

  • systematically replace in mw.txt the two (or more in some cases) entries with only one entry (L=N1), whose metaline field is replaced by a comma-separated list of the k2 from L=N1 and L=N2. The '' field of metaline will also be part of the k2 subfields
  • remove the L=N2 entry.

Such duplication was thought of being done in the xml file, isn't it?

funderburkjim commented 1 month ago

Some advantages of having the metalines of the alternates in mw.txt:

  1. Control over L
  2. Equivalence (in terms of L) between mw.xml and mw.txt
  3. Ability to handle missing headwords by generating another alternate in mw.txt, e.g. for karvara:

<L>45528<pc>259,3<k1>karvara<k2>karvara<h>2<e>1
 <hom>2.</hom> <s>karvara</s> or <s>karbara</s>, ¦  ....
<LEND>
<L>45528.1<pc>259,3<k1>karbara<k2>karbara<e>1
{{Lbody=45528}}
<LEND>
<L>45528.2<pc>259,3<k1>karvarI<k2>karvarI<e>1
{{Lbody=45528}}
<LEND>
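
A minimal Python sketch of how such `{{Lbody=N}}` pointers could be resolved when reading mw.txt. The names `parse_entries` and `resolve_body` are hypothetical, and the sketch assumes every entry is a metaline starting with `<L>`, followed by body lines, closed by a `<LEND>` line. It is not the actual cdsl conversion code.

```python
import re

def parse_entries(text):
    """Return {L: (metaline, body_lines)} for each <L>...<LEND> block."""
    entries = {}
    meta, body = None, []
    for line in text.splitlines():
        if line.startswith("<L>"):
            meta, body = line, []
        elif line.startswith("<LEND>") and meta is not None:
            lnum = re.match(r"<L>([^<]+)", meta).group(1)
            entries[lnum] = (meta, body)
            meta = None
        elif meta is not None:
            body.append(line)
    return entries

def resolve_body(entries, lnum):
    """Follow a {{Lbody=N}} pointer (one level only) to the real body lines."""
    _meta, body = entries[lnum]
    m = re.fullmatch(r"\{\{Lbody=([^}]+)\}\}", body[0].strip()) if body else None
    return entries[m.group(1)][1] if m else body

SAMPLE = """<L>45528<pc>259,3<k1>karvara<k2>karvara<h>2<e>1
<hom>2.</hom> <s>karvara</s> or <s>karbara</s>, ¦  ....
<LEND>
<L>45528.1<pc>259,3<k1>karbara<k2>karbara<e>1
{{Lbody=45528}}
<LEND>
<L>45528.2<pc>259,3<k1>karvarI<k2>karvarI<e>1
{{Lbody=45528}}
<LEND>"""

entries = parse_entries(SAMPLE)
# The 'lost' headword karvarI now shares the body of L=45528.
print(resolve_body(entries, "45528.2") == resolve_body(entries, "45528"))
```

With this layout each alternate keeps its own fixed L, while the body text is stored only once.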

Andhrabharati commented 1 month ago

So, you are having some 'second' thoughts (on following the GRA process)!!

Pl. proceed as you feel convenient & better.

funderburkjim commented 1 month ago

such duplication was thought of being done in the xml file, isn't it?

Yes, it 'was thought of'. In Grassman, there are extra entries in gra.xml when compared to gra.txt.

If the Lbody idea for mw works, then we could do the same for GRA, and also for other dictionaries (such as VCP). There would then be no need for the rather complicated 'xxxhwextra.txt' files.

funderburkjim commented 1 month ago

Pl. proceed as you feel convenient & better

I'll give it a try!

Andhrabharati commented 1 month ago

Just like to remind you that this deviates from an important goal of the envisaged "standardizing", that

  • Text more closely resembles MW print.

funderburkjim commented 1 month ago

test version

I've done the conversion to 'lbody' form. Also added back in the missing headwords. Standard displays using this test version are uploaded at: https://sanskrit-lexicon.uni-koeln.de/work/mwtest_lbody/web/

Request others to review this test version. If it seems ok, I'll make this the standard cdsl version.

While others are reviewing, I'll try to recode GRA using the Lbody idea.

funderburkjim commented 1 month ago

Feedback, please! Otherwise, in a day or two I'll assume there is no objection and install the test version as the next cdsl production version of mw.

drdhaval2785 commented 1 month ago

I feel it is good, because the data now resides in one single source mw.txt instead of two places mw.txt and mw_hwextra.txt. It reduces technical cost of maintaining a separate file for extra words and allows uniform coding.

Andhrabharati commented 1 month ago

such duplication was thought of being done in the xml file, isn't it?

Yes, it 'was thought of'. In Grassman, there are extra entries in gra.xml when compared to gra.txt.

Feedback, please! Otherwise, in a day or two I'll assume there is no objection and install the test version as the next cdsl production version of mw.

While I do appreciate that getting a 'closer' "estimate" of the HW candidates through xml file is important, I think keeping the text file data close to the print matter is a far-more important action, to correlate them.

I would like to bring to the notice of the forum that the MW data still contains far many (running over few thousands!!) HW candidates that are yet to be brought under the "estimate" umbrella.

Adding some tagging to the text file data is quite understandable, but duplicating in any manner is to be done at a different stage/place.

And coming to the 'cost of technical maintenance' that @drdhaval2785 has mentioned above, I think it is of no concern at all.

Of course, it appears that practically I am the ONLY person "looking" at the CDSL text files, but I think my approach makes me get to the nuances of content (& intent) of various dictionaries more closer (than anyone so far); and I strongly think such "duplication" and "unnecessary" filling of various kinds of entries in the text file itself would make the navigating through the "actual" dictionary content a troubling effort.

As such, I strongly opine that ALL "programmatically" possible extensions should be done outside the text file.

Andhrabharati commented 1 month ago

It reduces technical cost of maintaining a separate file for extra words and allows uniform coding.

And it is to note here that all other CDSL dictionaries have this althw approach, other than MW; so shouldn't this process be applied to MW as well, in the name of 'uniform coding'?

But then, I think, there is no need AT ALL to have (and maintain) a separate XXX_hwextra.txt file; this can be arrived at programmatically from the comma-separated k2-field of the metalines and the "extra HWs" could be populated in the xml file directly (on-the-fly).

Andhrabharati commented 1 month ago

At the end of it, if I am NOT to "go through" the CDSL text files, LET IT BE SO!! [It not only saves my time, but also from getting cursed by others for pushing them into more "troubles".]

gasyoun commented 1 month ago

Feedback, please!

@funderburkjim where at https://sanskrit-lexicon.uni-koeln.de/work/mwtest_lbody/web/webtc/indexcaller.php to look for it? Or in list mode?

I would like to bring to the notice of the forum that the MW data still contains far many (running over few thousands!!) HW candidates that are yet to be brought under the "estimate" umbrella.

@Andhrabharati like... can you give an example or two, please?

I think my approach makes me get to the nuances of content (& intent) of various dictionaries more closer (than anyone so far)

Indeed.

But then, I think, there is no need AT ALL to have (and maintain) a separate XXX_hwextra.txt file; this can be arrived at programmatically from the comma-separated k2-field of the metalines and the "extra HWs" could be populated in the xml file directly (on-the-fly).

Only on-the-fly? How to verify it in that case?

At the end of it, if I am NOT to "go through" the CDSL text files, LET IT BE SO!

Let be what?

funderburkjim commented 1 month ago

@gasyoun I see that I gave a link only to the test basic display. For test versions of the 'basic' displays (including list display), you can start at this url: https://sanskrit-lexicon.uni-koeln.de/work/mwtest_lbody/web/

funderburkjim commented 1 month ago

Thanks for the feedback.
I'm working on a conversion process between the cdsl form ({{Lbody=X}}, described above) and the 'k2-comma-separated' form that AB prefers.

This conversion is fairly straightforward, with a couple of exceptions:

accent alternates.

For instance

<L>1405<pc>7,1<k1>aGnya<k2>a/-Gnya,a-Gnya/<e>2

This should not generate an extra entry, despite the comma in the k2 field. This is easily solved by changing the comma to some other unused character. I chose semicolon:

<L>1405<pc>7,1<k1>aGnya<k2>a/-Gnya;a-Gnya/<e>2

In AB form, the comma in k2 will continue to separate 'real' alternates (and extras like 'karvarI').
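
A rough Python sketch of the comma/semicolon convention just described. The helper names are hypothetical, and the rule used to derive k1 from k2 (dropping the accent marks `/`, `\`, `^` and the hyphen-like separators `-` and `—`) is an assumption inferred from the examples in this thread, not the rule used by the real conversion code.

```python
import re

def k1_from_k2(k2):
    """Assumed normalization: drop accent marks and hyphen-like separators."""
    return re.sub(r"[/\\^\-—]", "", k2)

def headwords_from_k2(k2_field):
    """Comma separates 'real' alternates; semicolon separates variants
    (accent or hyphenation) that all yield the same k1."""
    result = []
    for alternate in k2_field.split(","):
        variants = [v.strip() for v in alternate.split(";")]
        k1s = {k1_from_k2(v) for v in variants}
        if len(k1s) != 1:
            raise ValueError(f"semicolon variants differ in k1: {variants}")
        result.append((k1s.pop(), variants))
    return result

# One headword, two accent variants: no extra entry is generated.
print(headwords_from_k2("a/-Gnya;a-Gnya/"))
# Two 'real' alternates: two headwords.
print(headwords_from_k2("aMSa—BAgin,aMSa—BAj"))
```

Raising an error when semicolon variants disagree on k1 is one possible design choice; it would flag places where a comma was intended.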

artificial homonyms.

As discussed recently, a UI feature of the list display is the presence of centering arrows. This is described here. This UI feature is generated currently by the homonym field of the metaline.

<L>1493<pc>7,2<k1>aNgana<k2>aNgana<h>a<e>2
<s>aNgana</s> <hom>a</hom> ¦ <lex>n.</lex> walking, <ls>L.</ls><info lex="n"/>

My intention is to change this to

<L>1493<pc>7,2<k1>aNgana<k2>aNgana<e>2
<s>aNgana</s> ¦ <lex>n.</lex> walking, <ls>L.</ls><info lex="n"/><info halt="a"/>

Since recentering also occurs for 'real' homonyms, entries with real homonyms will also get the extra markup, e.g.

OLD:
<L>1<pc>1,1<k1>a<k2>a<h>1<e>1
<hom>1.</hom> <s>a</s> ¦ the first letter of the alphabet 
NEW:
<L>1<pc>1,1<k1>a<k2>a<h>1<e>1
<hom>1.</hom> <s>a</s> ¦ the first letter of the alphabet <info halt="1"/>
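
The two cases above could be sketched in Python roughly as follows. The function name `add_homonym_info` is hypothetical, the attribute name `halt` is taken from the examples above, and treating any non-numeric `<h>` value as an 'artificial' homonym is an assumption made for this sketch.

```python
import re

def add_homonym_info(metaline, body, attr="halt"):
    """Append <info attr="h"/> to the body; for artificial (non-numeric)
    homonyms also drop the <h> field and the <hom> markup."""
    m = re.search(r"<h>([^<]+)", metaline)
    if not m:
        return metaline, body
    hom = m.group(1)
    if not hom.isdigit():
        metaline = metaline.replace(f"<h>{hom}", "", 1)
        body = re.sub(rf"\s*<hom>{re.escape(hom)}\.?</hom>", "", body, count=1)
    return metaline, f'{body}<info {attr}="{hom}"/>'

# Artificial homonym (L=1493): <h>a and <hom>a</hom> are removed.
meta, body = add_homonym_info(
    "<L>1493<pc>7,2<k1>aNgana<k2>aNgana<h>a<e>2",
    '<s>aNgana</s> <hom>a</hom> ¦ <lex>n.</lex> walking, <ls>L.</ls><info lex="n"/>',
)
print(meta)
print(body)
```

For a real homonym such as L=1, the metaline and `<hom>1.</hom>` are left alone and only the info tag is appended.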

With these two preliminaries solved, it should be possible to convert between the cdsl form and an AB form.

I hope all this works out and proves to be satisfactory for AB, so he can feel comfortable with his supplement review of MW.

Will post further progress here.

Andhrabharati commented 1 month ago

For instance <L>1405<pc>7,1<k1>aGnya<k2>a/-Gnya,a-Gnya/<e>2 This should not generate an extra entry, despite the comma in k2 field.

I differ on this, as this is definitely a candidate of grouping!!

Also, the duality in the feminine here is "missed" in the digitisation-- image

And then, if the grouping wrt accent differences is to be ignored as suggested here, why not ignore the same wrt hyphenation (as at L-5676 & L-5677; L-5678 & L-5683 etc.) as well that have "identical" k1-entry (presuming that this is what prompted Jim in the above!)? Just because they are "separately" given in the print? I am sure, there are few given as grouped entries also by MW himself (though I could not locate them quickly).

Andhrabharati commented 1 month ago

I am sure, there are few given as grouped entries also by MW himself (though I could not locate them quickly).

Found 3 such entries, L-21529, L-29831 & L-206364.

funderburkjim commented 1 month ago

Progress

  1. test version 11 displays: https://sanskrit-lexicon.uni-koeln.de/work/mwtest_lbody_11/web/
  2. cdsl version 11: temp_mw_11.zip
  3. AB equivalent version 11: temp_mw_11_ab2.zip

@Andhrabharati There is code (python) which serves as a bijection between the cdsl and AB versions; thus the two forms are equivalent.

My hope is that you will be able to edit the ab2 version (e.g. for the supplement review). When you send the revised ab2 version back, I will programmatically convert it back to a revised cdsl version.

I have not yet promoted the current temp_mw_11 (along with changes to display code) to production.

funderburkjim commented 1 month ago

Notes for revising AB2 version

semicolon for k2 variants that yield the same k1

Please use ';' for accent variants (or hyphenation variants) in k2; i.e., if two k2 variants yield the same k1, they should be separated by ';'. There is even one entry (L=5412) where there are both kinds of variants. Both kinds (comma or semicolon) represent MW group variants. For the purpose of displays, two semicolon variants represent only one k1 variant.

{{L1,L2...}}

You will note that entries with more than one k2 comma-variant have an extra field, {{L1,L2}}; these are the cdsl L-nums for the variants. If you move such a grouped entry, then ideally you would change not only the L of the grouped entry, but also the L1,L2. This is because I prefer to use fixed L rather than dynamic L. However, it is fine with me if you ignore this adjustment. Since L1 should always be the same as L, I can detect that {{L1,L2}} needs revision and make the necessary change when I convert the ab2 form back to cdsl form.

Example from AB2
<L>24<pc>1,1<k1>aMSaBAgin<k2>aMSa—BAgin,aMSa—BAj<e>3
<s>aMSa—BAgin</s> or <s>aMSa—BAj</s>, ¦ <lex>mfn.</lex> one who 
has a share, an heir, co-heir.<info lex="m:f:n"/>{{24,25}}
<LEND>
The corresponding CDSL form:
<L>24<pc>1,1<k1>aMSaBAgin<k2>aMSa—BAgin<e>3
<s>aMSa—BAgin</s> or <s>aMSa—BAj</s>, ¦ <lex>mfn.</lex> one who 
has a share, an heir, co-heir.<info lex="m:f:n"/>
<LEND>
<L>25<pc>1,1<k1>aMSaBAj<k2>aMSa—BAj<e>3
{{Lbody=24}}
<LEND>
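
The AB2-to-cdsl direction of this example could be sketched in Python as below. This is only an illustration under stated assumptions, not the actual bijection code: `ab2_to_cdsl` and `k1_from_k2` are hypothetical names, the k1 derivation (stripping accents and hyphen-like separators from the first semicolon variant) is inferred from the examples in this thread, and the `<h>` field is not copied to the generated alternates.

```python
import re

def k1_from_k2(k2):
    # Assumption: k1 is k2 minus accent marks and hyphen-like separators.
    return re.sub(r"[/\\^\-—]", "", k2.split(";")[0])

def ab2_to_cdsl(metaline, body_lines):
    """Expand one AB2 entry into a list of cdsl (metaline, body_lines) pairs."""
    group = re.search(r"\{\{([\d.,\s]+)\}\}$", body_lines[-1])
    if group is None:
        return [(metaline, list(body_lines))]
    lnums = [n.strip() for n in group.group(1).split(",")]
    k2_field = re.search(r"<k2>([^<]*)", metaline).group(1)
    k2s = [k.strip() for k in k2_field.split(",")]
    if len(lnums) != len(k2s):
        raise ValueError("group list and k2 list differ in length")
    pc = re.search(r"<pc>([^<]*)", metaline).group(1)
    e = re.search(r"<e>([^<]*)", metaline).group(1)
    # Main entry: keep only the first k2 and drop the {{L1,L2}} field.
    main_meta = re.sub(r"<k2>[^<]*", lambda _: "<k2>" + k2s[0], metaline, count=1)
    main_body = list(body_lines[:-1]) + [body_lines[-1][:group.start()]]
    entries = [(main_meta, main_body)]
    # Alternates: body is just a pointer to the main entry.
    for lnum, k2 in zip(lnums[1:], k2s[1:]):
        meta = f"<L>{lnum}<pc>{pc}<k1>{k1_from_k2(k2)}<k2>{k2}<e>{e}"
        entries.append((meta, ["{{Lbody=" + lnums[0] + "}}"]))
    return entries

cdsl = ab2_to_cdsl(
    "<L>24<pc>1,1<k1>aMSaBAgin<k2>aMSa—BAgin,aMSa—BAj<e>3",
    ["<s>aMSa—BAgin</s> or <s>aMSa—BAj</s>, ¦ <lex>mfn.</lex> one who ",
     'has a share, an heir, co-heir.<info lex="m:f:n"/>{{24,25}}'],
)
for meta, body in cdsl:
    print(meta, *body, "<LEND>", sep="\n")
```

The inverse direction would gather the `{{Lbody=N}}` alternates back into the k2 list and the `{{L1,L2}}` field of entry N.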

artificial homs

These appear as <info hui="a"/> (for instance). In your work, just ignore (but leave) these. This field only plays a role in the List display. The UI feature for normal homonyms is handled by make_xml.py, which generates <info hui="1"/> (for instance) using the homonym information of the k2 field.

funderburkjim commented 1 month ago

accent display

I noticed an error in the display of accents in MW. With the 'show accent' option, the accents are displayed properly with IAST output, but wrongly with Devanagari output. No idea when this bug began or what in the display code is causing it. Will open a new issue.

funderburkjim commented 1 month ago

AP -> PA

temp_mw_12_ab2.zip

@Andhrabharati This version 12 contains the correction you mentioned in #174, and is in the AB format. There were 203 changes, slightly less than the number (224 or 222) that you mentioned.

I didn't think it useful to further upload the CDSL version 12 or the corresponding display package.

Andhrabharati commented 1 month ago

There were 203 changes, slightly less than the number (224 or 222) that you mentioned.

Though it's not of any importance, I would just like to inform here that the cdsl version as of 12th Aug 2024 had 224 instances; these got reduced to 203 after that date!

And you have wrongly changed L-130326 and L-132942 instances also, which were specifically mentioned to be correct as is (in my post). These two places should be reverted back. So, this would count to 201 changes at the end!

@funderburkjim

As it appears that you want me to work on a file version specifically prepared for my use (by having a 'bijection' code), I request you to make another change for my sake, i.e. to remove all slp1 tags from the body matter (let them be in the info tags at the end of the lines, which I would ignore).

This is to make the

  1. <s1 slp1="XXX">YYY</s1> as <s1>YYY</s1> and
  2. <ab n="YYY" slp1="XXX">ZZZ°</ab> as <s1 n="YYY">ZZZ°</s1> [these being Sanskrit words, should go with s1-tag]

in the ab2 version, whose slp1 forms can be "regenerated" in the cdsl version (with the bijection code).
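
The two rewrites listed above can be expressed as regular-expression substitutions. This is a minimal Python sketch; the function name `strip_slp1` is hypothetical, and it assumes the attributes always appear in exactly the order shown in the two patterns above.

```python
import re

def strip_slp1(line):
    """Remove slp1 attributes from the body text of one line."""
    # 1. <s1 slp1="XXX">YYY</s1>  ->  <s1>YYY</s1>
    line = re.sub(r'<s1 slp1="[^"]*">', "<s1>", line)
    # 2. <ab n="YYY" slp1="XXX">ZZZ°</ab>  ->  <s1 n="YYY">ZZZ°</s1>
    line = re.sub(r'<ab (n="[^"]*") slp1="[^"]*">(.*?)</ab>',
                  r"<s1 \1>\2</s1>", line)
    return line

print(strip_slp1('<s1 slp1="XXX">YYY</s1>'))
print(strip_slp1('<ab n="YYY" slp1="XXX">ZZZ°</ab>'))
# An ab tag without slp1 is left untouched.
print(strip_slp1('<ab n="Terminalia">T°</ab>'))
```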

If you do not need the slp1 matter (which I highly doubt!), it's just a matter of a few seconds for me to remove these slp1 strings (as I had already done in my revision file).

Andhrabharati commented 1 month ago

Noted that L-91601 HW has got no s-tagging; to change dADikA ¦ -> <s>dADikA</s> ¦

Andhrabharati commented 1 month ago

No idea when this bug began or what in the display code is causing it

Could it be from the days of #169 work by you, @funderburkjim ?

funderburkjim commented 1 month ago

last version for this issue?

temp_mw_14_ab2.zip has changes that you requested. See the readme starting at 08-21-2024.

The slp1 attribute for s1 tag was introduced long ago; at that time the display showed the text according to user 'output' preference - e.g. in Devanagari. But for several years, the displays have made no use of this slp1 attribute. Thus, it seems reasonable just to remove it from the s1 tag.

Similarly, the slp1 tag within the ab element was removed, and such an ab tag was changed to s1 tag; the display code and dtd were correspondingly revised.

I kept copies of all instances of the tags thus modified. Possibly of some future use, though unlikely.

We seem to be getting close to a satisfactory ab2 version for you. Let me know if this version 14 is satisfactory -- then I can fully install the cdsl-14 version and accompanying display changes.

Andhrabharati commented 1 month ago

We seem to be getting close to a satisfactory ab2 version for you.

Yes, this version 14 has all that I mentioned above.

I kept copies of all instances of the tags thus modified. Possibly of some future use, though unlikely.

I envisage getting more such strings in the process of my working; as such, retaining the present lot of tags is not sufficient; that's the reason why I had asked you to make some code to regenerate the slp1 strings from the tags.

But as per your latest statement, that is not going to be necessary at all; I am glad that the text now has no slp1 strings once converted to iast.

By the way, I see that while almost all the <info tags are at the ending body-line (i.e. preceding the <LEND> line), a few (~300) are either at the first body-line (i.e. after the meta-line) or sometimes at the middle lines. Noticed a few that occur before <div lines require various additional tags, like <info lex="inh"/> at L-450, <info lex="m"/> & <info lex="n"/> at L-693.

As I am thinking of moving all the <info tags to the trailing end of the <LEND> line [and at the end of my work, to push them back to the preceding line], these 300+ cases pose an issue. Any thoughts on changing (like adding new tags at every div-line or repositioning to the ending body-line) these?

Andhrabharati commented 1 month ago

(L-85374)

old: <ab n="Terminalia">T°</ab> and <ab n="Puṣa" slp1="puza">P°</ab>
new: <ab n="Terminalia">T°</ab> and <s1>Punar-vasu</s1>

It is not "Terminalia", but "Tiṣya" here and at the prev. entry, and both places are to be s1-tagged, not ab-tagged! And "Puṣa" at the next entry should be changed to "Punar-vasu". (My revision file had many such corrections.)

Andhrabharati commented 1 month ago

By the way, I see that while almost all the <info tags are at the ending body-line (i.e. preceding the <LEND> line), a few (~300) are either at the first body-line (i.e. after the meta-line) or sometimes at the middle lines.

These are the actual counts-- image

The same, when looking from a different perspective-- image

Andhrabharati commented 1 month ago

Probably,

  1. splitting these entries having multiple lex-categories into separate entries as done in case of all others elsewhere (with e-field having [ABCE] in meta-lines), and
  2. merging the div-lines having a single lex-category together as a single-liner (separated by the ';') as done in case of a couple of entries recently

is a proper way to go [with an "unwritten rule" that a single entry body matter should contain a single line of <info tag(s)].

funderburkjim commented 1 month ago

version 16

temp_mw_16_ab2.zip

infotag_attr.txt has counts of the various (13 currently) attributes mentioned in info tags.

Main change: put all of the info tags for an entry at the END of the LAST body line. This changed 325 entries. The moved tags are in file infotag_notend.txt.
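
A simple Python sketch of that main change, for the body lines of one entry. The name `move_info_to_end` is hypothetical, the sample lines in the usage are invented for illustration, and the sketch assumes cdsl-form bodies (no trailing `{{L1,L2}}` field after the info tags).

```python
import re

INFO_RE = re.compile(r"<info [^>]*/>")

def move_info_to_end(body_lines):
    """Collect every <info .../> tag of an entry, in document order,
    and append them all to the end of the last body line."""
    if not body_lines:
        return []
    tags = []
    stripped = []
    for line in body_lines:
        tags.extend(INFO_RE.findall(line))
        stripped.append(INFO_RE.sub("", line))
    stripped[-1] += "".join(tags)
    return stripped

# Hypothetical two-line body with the info tag on the first line.
before = ['<s>x</s> ¦ <lex>m.</lex> foo<info lex="m"/>',
          '<div n="to"/> bar']
print(move_info_to_end(before))
```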


Re L=85374 and 85375 tooltip correction: I did NOT change this. Will leave that, and similar, to you.


Re 'Probably, 1. 2.' Did not do anything here.
I could make an ab3 version so that for each entry the body lines are concatenated with tab separator (then I could convert tabs to newlines in the reversion from ab3 to cdsl version).

I wasn't sure if such an ab3 version is all you had in mind. Also, I definitely don't want the cdsl version to have huge one-line blobs of data for entries such as dhatus: Huge blobs would make the manual editing (for corrections) hard, at least with Emacs. So multiline entries are preference for cdsl version.


Incidentally, the only info tag attributes that play a role in the displays are

The lex, lexcat, and verb tags were used by the csl-inflect repo. Whenever someone improves that repo, then the information of these tags could be useful.

The phwchild and phwparent are also informational -- 'extra' entries were generated from some parenthetical comments. Whenever the supplement review is completed, we might reconsider these.

funderburkjim commented 1 month ago

the cdsl version as of 12th Aug 2024 had 224 instances; these got reduced to 203 after that date!

The difference is from the 'grouping' of alternate verb spellings.

funderburkjim commented 1 month ago

Could it be from the days of https://github.com/sanskrit-lexicon/MWS/issues/169 work by you

Yes, that is relevant. Thanks for mentioning!

Andhrabharati commented 1 month ago

Re 'Probably, 1. 2.' Did not do anything here. I could make an ab3 version so that for each entry the body lines are concatenated with tab separator (then I could convert tabs to newlines in the reversion from ab3 to cdsl version).

I wasn't sure if such an ab3 version is all you had in mind. Also, I definitely don't want the cdsl version to have huge one-line blobs of data for entries such as dhatus: Huge blobs would make the manual editing (for corrections) hard, at least with Emacs. So multiline entries are preference for cdsl version.

No, I didn't mean this at all; I was talking about those 300+ entries that do not have info tags at the ending line.

I see that the entries having <div tagged multi-lex lines now have just one lex-type at the <info tag (is it the first one always?) which is technically not correct; but, this is not a big thing to debate upon (as I do not "wish" to see the legitimacy of those <info tags).

last version for this issue?

We seem to be getting close to a satisfactory ab2 version for you. Let me know if this version 14 is satisfactory -- then I can fully install the cdsl-14 version and accompanying display changes.

Now, we can take this ver. 16 file as THE last version for the issue; and, you may install/promote the resp. files as "production files". [My further working might take some time; but that would be going into another issue.]

Andhrabharati commented 1 month ago

@funderburkjim

I have made few changes in the file data, in addition to moving the "<info(.*)" to the <LEND> line, namely

  1. moved "{{(.*)}}" also to the <LEND> line
  2. introduced a space after comma in k2-list as well as in the group-list (and also after the semi-colon in k2-list)

Example:

<L>21<pc>1,1<k1>aMSakalpanA<k2>aMSa—kalpanA,aMSa—prakalpanA,aMSa—pradAna<e>3
<s>aMSa—kalpanA</s>, <lex>f.</lex> or <s>aMSa—prakalpanA</s>, <lex>f.</lex> or <s>aMSa—pradAna</s>, <lex>n.</lex> ¦ allotment of a portion.{{21,22,23}}
<LEND>

to

<L>21<pc>1,1<k1>aṃśakalpanā<k2>aṃśa—kalpanā, aṃśa—prakalpanā, aṃśa—pradāna<e>3
<s>aṃśa—kalpanā</s>, <lex>f.</lex> or <s>aṃśa—prakalpanā</s>, <lex>f.</lex> or <s>aṃśa—pradāna</s>, <lex>n.</lex> ¦ allotment of a portion.
<LEND>{{21, 22, 23}}

  3. moved the group-list to the front, from the tail-end of the info-list

Example:

<L>24<pc>1,1<k1>aMSaBAgin<k2>aMSa—BAgin,aMSa—BAj<e>3
<s>aMSa—BAgin</s> or <s>aMSa—BAj</s>, ¦ <lex>mfn.</lex> one who has a share, an heir, co-heir.<info lex="m:f:n"/>{{24,25}}
<LEND>

to

<L>24<pc>1,1<k1>aṃśabhāgin<k2>aṃśa—bhāgin, aṃśa—bhāj<e>3
<s>aṃśa—bhāgin</s> or <s>aṃśa—bhāj</s>, ¦ <lex>mfn.</lex> one who has a share, an heir, co-heir.
<LEND>{{24, 25}}<info lex="m:f:n"/>

Hope this should be alright with you!!

Andhrabharati commented 1 month ago

You may recall that I had employed the space in the k2-lists in both pwk and GRA [this is my std. style]; and you had retained the same in their CDSL versions.

funderburkjim commented 1 month ago

I don't see any problem with your adjustments as mentioned in the previous two comments. However, I will need to change my 'inverse function' code. Please upload your revised version 16.

Andhrabharati commented 1 month ago

There isn't much difference in my version as of now; I had just changed the above with simple regex stuff.

And as I mentioned earlier, I would be pushing the {{ and <info lists to the line before the LEND line, once my work is done. You just need to account for the spaces that I had introduced in the k2 & group-lists; otherwise, I can discard those as well, when I pass on the revised file to you.

I am presently aligning your file with my revision file (wherein I had done quite a bit of work so far); and in the next phase, I won't be limiting to the suppl. matter integration, but also look at other parts of the main text.

funderburkjim commented 1 month ago

I just wanted your stated revisions for checking my inverse code. Do you have a revised version (LEND line etc.) from before the actual substantive changes?

Andhrabharati commented 1 month ago

No; I don't maintain the partial work versions!!

funderburkjim commented 1 month ago

sigh! I'll upload a temp_mw_ab3.txt file when it's ready - so you can confirm it meets your specs.

Andhrabharati commented 1 month ago

I have just recreated the file at my end for you-- temp_mw_16_ab3.zip

funderburkjim commented 1 month ago

Appreciated!

funderburkjim commented 1 month ago

minor LEND correction

In your temp_mw_16_ab3.txt, please change 3 LEND lines; these originated in my temp_mw_16.txt.

old: <LEND>d<info lex="m:f:n"/>
new: <LEND><info lex="m:f:n"/>
---
old: <LEND>1<info lex="m:f:n"/>
new: <LEND><info lex="m:f:n"/>
---
old: <LEND> <info lex="inh"/>
new: <LEND><info lex="inh"/>
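
Stray characters like these could be found mechanically. Below is a small Python sketch, assuming the ab3 layout discussed above in which a `<LEND>` line may carry an optional `{{L1, L2}}` group list followed by zero or more info tags and nothing else; `bad_lend_lines` is a hypothetical helper name.

```python
import re

# <LEND>, then optionally {{...}}, then any number of <info .../> tags.
LEND_OK = re.compile(r"<LEND>(\{\{[\d., ]+\}\})?(<info [^>]*/>)*")

def bad_lend_lines(lines):
    """Return (line number, line) for each malformed <LEND> line."""
    return [(i, line) for i, line in enumerate(lines, 1)
            if line.startswith("<LEND>") and not LEND_OK.fullmatch(line)]

sample = [
    '<LEND>d<info lex="m:f:n"/>',            # stray 'd'
    '<LEND><info lex="m:f:n"/>',
    '<LEND> <info lex="inh"/>',              # stray space
    '<LEND>{{24, 25}}<info lex="m:f:n"/>',
    '<LEND>',
]
for num, line in bad_lend_lines(sample):
    print(num, line)
```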

I've got the inverse functions working. These will come into play when I receive your revisions.

Meanwhile, I'll go ahead and install cdsl version along with display code changes.

funderburkjim commented 1 month ago

Installation of cdsl versions (version 17) complete. Closing this issue.

Andhrabharati commented 3 weeks ago

Sorry, @funderburkjim for posting my 'aligned' file here for your perusal, at this closed issue. [I thought this is the proper place to do so.]

CDSL (temp_mw_16_ab3_IAST) [matched].zip

The majority of differences (over 500) are due to splitting the text at semi-colons (whether for a lexical change or a change of meaning sense), which has been the practice in the MW.

In the recent corrections, Jim had adopted a different <div n="P"/> form, which I thought should be changed to match with the rest. I have marked these splits with '[]', leading to a change in the no. of lines. So this should be the first point of adjustment; after which the 'real' changes could be identified in my alignment process.

If you find this suitable, I shall continue the annexure work with this; else, I have framed a plan to do it in some other format.