sanskrit-lexicon / MWS

Monier Monier-Williams, Sir; A Sanskrit-English dictionary. Oxford, 1899

About 6800 missing ¦ (Broken Vertical bar) in the L-entries #132

Closed Andhrabharati closed 1 year ago

Andhrabharati commented 2 years ago

There are 287628 occurrences of <L> (and <LEND>), but only 280827 occurrences of ¦ in the latest mw.txt (taken from csl-orig/v02 folder).

The last mw_iast file I got from Jim in April 2021 has 287634 occurrences of <L> (and <LEND>) and 280833 of ¦. I recall some <L> entries getting deleted during the exercise done those days, so the difference of 6 therein is accounted for.

But the difference of 6801 between <L> and ¦ still seems to persist.

However, the mw_meta2 file (dated Jun 18, 2018) gives a count of 287433, as under-

¦ (\u00a6) 287433 := BROKEN BAR (Demarcates the headword part of an entry).

Requesting @funderburkjim to trace whence the 6801 difference started, and to give the reason.
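For anyone wishing to reproduce such counts, here is a minimal Python sketch. It assumes that each entry starts with a metaline beginning with `<L>` and ends with a `<LEND>` line; the sample text and the function name `count_markers` are made up for illustration and are not taken from the repository.

```python
from collections import Counter

BROKEN_BAR = "\u00a6"  # the broken vertical bar

def count_markers(text: str) -> Counter:
    """Count <L> metalines, <LEND> lines and broken bars in a text."""
    counts = Counter()
    for line in text.splitlines():
        if line.startswith("<L>"):
            counts["L"] += 1
        elif line.startswith("<LEND>"):
            counts["LEND"] += 1
        # A bar may sit anywhere on a line, so count every occurrence.
        counts["bar"] += line.count(BROKEN_BAR)
    return counts

# Hypothetical two-entry sample; the second entry lacks the bar.
sample = (
    "<L>1<pc>1,1<k1>a<k2>a\n"
    "<s>a</s> \u00a6 first entry\n"
    "<LEND>\n"
    "<L>2<pc>1,1<k1>b<k2>b\n"
    "<s>b</s> or <s>c</s> grouped entry without a bar\n"
    "<LEND>\n"
)
counts = count_markers(sample)
print(counts["L"], counts["LEND"], counts["bar"], counts["L"] - counts["bar"])
# prints: 2 2 1 1
```

On the real file, the text would be read with something like `open("mw.txt", encoding="utf-8").read()` (the path being whatever your local copy uses).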

Andhrabharati commented 2 years ago

A quick look inside those entries revealed that a majority (6797) of them are "grouped" entries-- [Total 'or-group' entries: 5258 & 'and-group' entries: 1539]

Tagged: 6627 [and: 1410 & or: 5217]

Untagged, with body portion starting with "See": 170 [and (or &): 129 & or (or ,): 41]

And, 3 entries with body portion starting with "See" [<L>102543, <L>108451, <L>135782] and one "sup" entry [<L>56162] complete the count to 6801.

Also seen that there are 1162 'or-group' entries & 163 'and-group' entries in the rest of entries having the ¦. So, it seems the "grouping session" had a break somehow!

Probably @funderburkjim might wish to correct these entries, by adding the ¦ appropriately and also tagging the untagged 170 entries.
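A breakdown of this kind could be sketched in Python roughly as follows. This is only an illustration: it assumes the group tags look like `<info or="..."/>` and `<info and="..."/>`, and the helper names `iter_entries` and `tally_missing_bar` as well as the sample entries are hypothetical.

```python
import re
from collections import Counter

BROKEN_BAR = "\u00a6"
# Assumed shape of the group tags: <info or="..."/> or <info and="..."/>.
GROUP_TAG = re.compile(r'<info (or|and)="')

def iter_entries(lines):
    """Yield (metaline, body_lines) for each <L> ... <LEND> block."""
    meta, body = None, []
    for line in lines:
        if line.startswith("<L>"):
            meta, body = line, []
        elif line.startswith("<LEND>"):
            if meta is not None:
                yield meta, body
            meta, body = None, []
        elif meta is not None:
            body.append(line)

def tally_missing_bar(lines):
    """Classify entries that contain no broken bar by their group tag."""
    tally = Counter()
    for meta, body in iter_entries(lines):
        text = "\n".join(body)
        if BROKEN_BAR in text:
            continue  # entry already has its bar
        match = GROUP_TAG.search(text)
        tally[match.group(1) if match else "untagged"] += 1
    return tally

# Hypothetical sample: one normal entry, one tagged group, one untagged "See".
sample = [
    "<L>1<pc>1,1<k1>a<k2>a",
    "<s>a</s> \u00a6 has a bar",
    "<LEND>",
    "<L>2<pc>1,1<k1>b<k2>b",
    '<s>b</s> or <s>c</s> no bar <info or="2,b;2.1,c"/>',
    "<LEND>",
    "<L>3<pc>1,1<k1>d<k2>d",
    "<s>d</s> See <s>e</s>",
    "<LEND>",
]
tally = tally_missing_bar(sample)
print(dict(tally))
# prints: {'or': 1, 'untagged': 1}
```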

Andhrabharati commented 2 years ago

Here is the smaller chunk lot to complete the "and/or" tagging-- L-entries to fill and-or tags.txt

Andhrabharati commented 2 years ago

Here is how I would've split the entries (header-block and body portion), for the tagged 'and-group' mentioned above-- tagged 'and-group' entries split with vertical bar.txt

Andhrabharati commented 2 years ago

Finished the 'or-group' splitting with vertical bar, and in the process some entries were changed to 'and-tagging', looking at the content; and 2 entries were to be untagged altogether!

Now the tagged group counts are-

Tagged: 6625 [and: 1451 & or: 5174],

as against the earlier counts of

Tagged: 6627 [and: 1410 & or: 5217]

With this experience, seems another look at the already "grouped" (or: 1162 & and: 163) entries is needed.

Also seen that there are some entries that need to be identified and marked as group entries; I will be checking for them next.

[Hope @funderburkjim would be willing to carry this output from my working to the CDSL file.]

Shall be posting my results, upon his response to this issue.

funderburkjim commented 2 years ago

@Andhrabharati I'm finishing up work to add link markup to Ramayana references in PWG. This should be done in a couple of days, and then I'll take a look at your work here on MW. From a first glance at the two files you provide above, I see no problem adding the broken vertical bar as you have placed it.

Will await your 'further look' revisions before proceeding to alter cdsl version of mw.txt.

gasyoun commented 2 years ago

add link markup to Ramayana references in PWG

Hurray, long live Ramayana, Russian Sanskrit dictionaries and Jim!

Working on https://samskrtam.ru/parallel-corpus/ramayana.html

Andhrabharati commented 2 years ago

With this experience, seems another look at the already "grouped" (or: 1162 & and: 163) entries is needed.

The re-look at the already "grouped" entries resulted in the counts as

or: 1079, orsl: 93 & and: 167

And few of these need some sort of correction in tagging etc.; probably this should be discussed in another (future) issue "grouped entries".

Andhrabharati commented 2 years ago

@funderburkjim

Pl. see the screenshot for PWG ls-tooltip 'error(?)'--

image

Andhrabharati commented 2 years ago

Also seen that there are some entries that need to be identified and marked as group entries; I will be checking for them next.

On a quick checking, 271 'or' & 22 'and' (probable) group-candidates are found.

There would surely be some more, to be found on checking at leisure!

funderburkjim commented 1 year ago

cleanup work finished

The work is further described in the issue132 directory. About 8800 lines of mw.txt were changed.

hom markup sidetrack

I got side-tracked to add homonym markup. E.g., 1. <s>X</s> was changed to <hom>1.</hom> <s>X</s> everywhere applicable.
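A substitution of this kind might look like the Python sketch below. It is not the script actually used for the change; the pattern is an assumption (digits plus a period, then a space, directly before an `<s>` element), and what counts as "applicable" in the real data may well be narrower.

```python
import re

# Assumption: a homonym number is one or more digits followed by a period,
# then a single space, directly before an <s>...</s> element. The lookbehind
# avoids matching inside a longer number or right after a closing tag.
HOM = re.compile(r"(?<![\w>])(\d+\.) (?=<s>)")

def add_hom_markup(line: str) -> str:
    """Wrap a bare homonym number in <hom>...</hom>."""
    return HOM.sub(r"<hom>\1</hom> ", line)

print(add_hom_markup("1. <s>X</s>"))
# prints: <hom>1.</hom> <s>X</s>
```

Running the function twice leaves the line unchanged, since an already wrapped number is followed by `</hom>` rather than a space.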

not done

I did not address some entries that need to be identified and marked as group entries as mentioned by @Andhrabharati.

Will see what further corrections @Andhrabharati suggests before closing this issue.

Andhrabharati commented 1 year ago

The primary point (the missing broken-bar) in this issue having been addressed, I would suggest this to be closed now.

And few of these need some sort of correction in tagging etc.; probably this should be discussed in another (future) issue "grouped entries".

As I had mentioned above, the rest (about the grouped-entries) could be dealt with in another issue.

@funderburkjim may look at the #137 issue next, before generating the IAST file again (for enabling me to resume further study).

funderburkjim commented 1 year ago

Yes, I'm just starting the 137 changes, and will regenerate the iast version of mw at the end of that.

Andhrabharati commented 1 year ago

@funderburkjim

just did a casual checking for <L> and ¦ counts, and got a difference of 3 (287616 vs. 287619).

then found 3 orphan ¦ instances [just dangling isolated] at lines (255594), (274957) and (280382).
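A check for such cases could be sketched as below, assuming that "orphan" means a line holding nothing but the bar (apart from whitespace); the function name `orphan_bar_lines` and the sample lines are hypothetical.

```python
BROKEN_BAR = "\u00a6"

def orphan_bar_lines(lines):
    """Return 1-based numbers of lines that hold only the broken bar."""
    return [n for n, line in enumerate(lines, start=1)
            if line.strip() == BROKEN_BAR]

# Hypothetical sample: lines 2 and 3 are isolated bars.
sample = ["<LEND>", "\u00a6", " \u00a6 ", "<s>a</s> \u00a6 text"]
print(orphan_bar_lines(sample))
# prints: [2, 3]
```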

Andhrabharati commented 1 year ago

There are 2 ¦ instances without a preceding space, and another 2 ¦ instances without a following space.
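Such spacing irregularities can be located with two small regular expressions, as in this Python sketch. The function name `bar_spacing_issues` and the sample lines are made up for illustration; a bar at the very start or end of a line is not flagged by these patterns.

```python
import re

NO_SPACE_BEFORE = re.compile(r"\S\u00a6")  # non-space character right before the bar
NO_SPACE_AFTER = re.compile(r"\u00a6\S")   # non-space character right after the bar

def bar_spacing_issues(lines):
    """Return (line_number, problem) pairs for bars lacking a neighbouring space."""
    issues = []
    for lineno, line in enumerate(lines, start=1):
        if NO_SPACE_BEFORE.search(line):
            issues.append((lineno, "no space before"))
        if NO_SPACE_AFTER.search(line):
            issues.append((lineno, "no space after"))
    return issues

# Hypothetical sample lines.
sample = [
    "<s>a</s> \u00a6 ok",
    "<s>b</s>\u00a6 tight before",
    "<s>c</s> \u00a6tight after",
]
print(bar_spacing_issues(sample))
# prints: [(2, 'no space before'), (3, 'no space after')]
```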

gasyoun commented 1 year ago

then found 3 orphan ¦ instances [just dangling isolated] at lines (255594), (274957) and (280382).

Who are you by education?

Andhrabharati commented 1 year ago

I am an (Electronics) Engineer.

Andhrabharati commented 1 year ago

@funderburkjim

All <info or/and="X"/> markup occurs on the headline (first line of entry after metaline) within the mw.txt digitization.

The <info ...> tags have been at the end of the line preceding a <LEND> line so far.

But now noticed that there are 217 instances where these tags are in the next line after the meta-line; these need to be shifted to the line preceding the resp. <LEND> lines (to have a consistency of style).

[I've used the regex <info (.*?)\n<[^L] to find these instances.]
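The same regex can be applied from Python, as in the sketch below. The regex is the one quoted above, unchanged; the function name and the sample entries (including the `<div n="to"/>` continuation line) are hypothetical.

```python
import re

# The regex quoted in the comment above, unchanged: an <info ...> tag whose
# line is followed by a line starting with "<" but not with "<L".
INFO_NOT_LAST = re.compile(r"<info (.*?)\n<[^L]")

def find_info_not_before_lend(text: str):
    """Return 1-based line numbers where the regex matches."""
    return [text.count("\n", 0, m.start()) + 1
            for m in INFO_NOT_LAST.finditer(text)]

# Hypothetical sample: the first entry has its info tag on the headline with
# a further line after it; the second has it on the line before <LEND>.
sample = (
    "<L>1<pc>1,1<k1>a<k2>a\n"
    '<s>a</s> or <s>b</s> \u00a6 text <info or="1,a;1.1,b"/>\n'
    '<div n="to"/> continuation\n'
    "<LEND>\n"
    "<L>2<pc>1,1<k1>c<k2>c\n"
    '<s>c</s> \u00a6 text <info lex="m"/>\n'
    "<LEND>\n"
)
print(find_info_not_before_lend(sample))
# prints: [2]
```

Note that the regex only catches cases where the following line begins with a tag other than `<L...`; an info tag followed by a line of plain text would not be reported.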

funderburkjim commented 1 year ago

orphan broken bar

These will be removed in #137.

funderburkjim commented 1 year ago

placement of <info

There is no definite rule where the various <info .../> markup should appear. These empty xml info elements contain various kinds of 'meta' information. The entry-relative location of these is as we view them to be 'convenient'.

I think it is convenient to have the <info or/and="X"/> tag on the headline, since then it is readily compared to the placement of the broken bar.

I'll look at the 217 instances; and will probably move the info elements to the lastline (line before <LEND>) except for the <info or/and="X"/> elements. Will do this in the #137 work.

Andhrabharati commented 1 year ago

My main intention is that all <info ...> tags could be in the same line (probably barring the sup and rev entries of the Annexure data); there are ~50 cases of <info [oa] tag on headline and the <info [wv] on another line, while all other (~150) cases of <info [oa] and the <info [wv] are on the same headline.

And then, noticed 2 instances of <info orsl="248495,sumnAya;248495.1,sumnaya"/> that are not in the headline.