Closed Andhrabharati closed 1 year ago
A quick look inside those entries revealed that a majority (6797) of them are "grouped" entries-- [Total 'or-group' entries: 5258 & 'and-group' entries: 1539]
Tagged: 6627 [and: 1410 & or: 5217] Untagged, with body portion starting with "See": 170 [and (or &): 129 & or (or ,): 41]
And, 3 entries with body portion starting with "See" [<L>102543, <L>108451, <L>135782] and one "sup" entry [<L>56162] complete the count to 6801.
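The tallies above can be cross-checked with a few lines of arithmetic. This is only a sketch that re-adds the numbers quoted in this comment; the variable names are mine and nothing here reads mw.txt.

```python
# Counts exactly as quoted in the comment above.
or_group, and_group = 5258, 1539        # grouped entries by kind
tagged_and, tagged_or = 1410, 5217      # grouped entries that carry a tag
untagged_and, untagged_or = 129, 41     # untagged, body starts with "See"
see_entries, sup_entries = 3, 1         # the remaining four entries

grouped = or_group + and_group
tagged = tagged_and + tagged_or
untagged = untagged_and + untagged_or
total = grouped + see_entries + sup_entries

# The grouped entries split cleanly into tagged and untagged ones.
assert grouped == tagged + untagged
print(grouped, tagged, untagged, total)
```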
Also seen that there are 1162 'or-group' entries & 163 'and-group' entries in the rest of entries having the ¦.
So, it seems the "grouping session" had a break somehow!
Probably @funderburkjim might wish to correct these entries, by adding the ¦ appropriately and also tagging the untagged 170 entries.
Here is the smaller chunk lot to complete the "and/or" tagging-- L-entries to fill and-or tags.txt
Here is how I would've split the entries (header-block and body portion), for the tagged 'and-group' mentioned above-- tagged 'and-group' entries split with vertical bar.txt
Finished the 'or-group' splitting with vertical bar, and in the process some entries were changed to 'and-tagging', looking at the content; and 2 entries were to be untagged altogether!
Now the tagged group counts are-
Tagged: 6625 [and: 1451 & or: 5174],
as against the earlier counts of
Tagged: 6627 [and: 1410 & or: 5217]
With this experience, seems another look at the already "grouped" (or: 1162 & and: 163) entries is needed.
-------------
Also seen that there are some entries that need to be identified and marked as group entries; I will be checking for them next.
[Hope @funderburkjim would be willing to carry this output from my working to the CDSL file.]
Shall be posting my results, upon his response to this issue.
@Andhrabharati I'm finishing up work to add link markup to Ramayana references in PWG. This should be done in a couple of days, and then I'll take a look at your work here on MW. From a first glance at the two files you provide above, I see no problem adding the broken vertical bar as you have placed it.
Will await your 'further look' revisions before proceeding to alter cdsl version of mw.txt.
add link markup to Ramayana references in PWG
Hurray, long live Ramayana, Russian Sanskrit dictionaries and Jim!
Working on https://samskrtam.ru/parallel-corpus/ramayana.html
With this experience, seems another look at the already "grouped" (or: 1162 & and: 163) entries is needed.
The re-look at the already "grouped" entries resulted in the counts as
or: 1079, orsl: 93 & and: 167
And few of these need some sort of correction in tagging etc.; probably this should be discussed in another (future) issue "grouped entries".
@funderburkjim
Pl. see the screenshot for PWG ls-tooltip 'error(?)'--
Also seen that there are some entries that need to be identified and marked as group entries; I will be checking for them next.
On a quick checking, 271 'or' & 22 'and' (probable) group-candidates are found.
There sure would be some more, to get by leisure checking!
The work is further described in the issue132 directory. About 8800 lines of mw.txt were changed.
<info or="X"/>
<info and="X"/>
<info or/and="X"/> markup occurs on the headline (first line of entry after metaline) within the mw.txt digitization.
<info or="L1,X1;L2,X2"/> occurs identically in the headlines of both entry L1 and L2.
I got side-tracked to add homonym markup. E.g., 1. <s>X</s> was changed to <hom>1.</hom> <s>X</s> everywhere applicable.
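For illustration only, a substitution of this kind could be sketched as below. The pattern and the function name `add_hom_markup` are assumptions made for this example, not the script actually used on mw.txt; the real change may have had to handle more contexts.

```python
import re

# Assumed pattern: a homonym number such as "1." followed by a space and an
# <s> element. The lookbehind skips digits that are part of a longer token
# or that already sit inside a tag.
HOM_RE = re.compile(r'(?<![\w>.])(\d+\.) (?=<s>)')

def add_hom_markup(line: str) -> str:
    """Wrap a leading homonym number in <hom>...</hom>."""
    return HOM_RE.sub(r'<hom>\1</hom> ', line)

print(add_hom_markup('1. <s>X</s>'))
```

Running the function a second time on its own output leaves the line unchanged, so the sketch is safe to re-apply.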
I did not address some entries that need to be identified and marked as group entries, as mentioned by @Andhrabharati.
Will see what further corrections @Andhrabharati suggests before closing this issue.
The primary point (the missing broken-bar) in this issue having been addressed, I would suggest this to be closed now.
And few of these need some sort of correction in tagging etc.; probably this should be discussed in another (future) issue "grouped entries".
As I had mentioned above, the rest (about the grouped-entries) could be dealt in another issue.
@funderburkjim may look at the #137 issue next, before generating the IAST file again (for enabling me to resume further study).
Yes, I'm just starting the 137 changes, and will regenerate the iast version of mw at the end of that.
@funderburkjim
Just did a casual checking for <L> and ¦ counts, and got a difference of 3 (287616 vs. 287619).
Then found 3 orphan ¦ instances [just dangling isolated] at lines (255594), (274957) and (280382).
There are 2 ¦ instances without a preceding space, and another 2 ¦ instances without a following space.
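A check of this sort can be sketched as follows. I am assuming here that an "orphan" ¦ means a line holding nothing but the bar; the function name `check_bars` and the report keys are made up for the example.

```python
BAR = "\u00a6"  # the broken vertical bar

def check_bars(lines):
    """Return 1-based line numbers of suspicious broken-bar usage."""
    report = {"isolated": [], "no_space_before": [], "no_space_after": []}
    for num, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Assumption: an orphan bar is a line containing only the bar.
        if text.strip() == BAR:
            report["isolated"].append(num)
            continue
        pos = text.find(BAR)
        while pos != -1:
            if pos > 0 and text[pos - 1] != " ":
                report["no_space_before"].append(num)
            # A bar at the very end of the line is not flagged.
            if pos + 1 < len(text) and text[pos + 1] != " ":
                report["no_space_after"].append(num)
            pos = text.find(BAR, pos + 1)
    return report
```

It could be called as `check_bars(open("mw.txt", encoding="utf-8"))`, with the path adjusted to wherever the file sits locally.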
then found 3 orphan ¦ instances [just dangling isolated] at lines (255594), (274957) and (280382).
Who are you by education?
I am an (Electronics) Engineer.
@funderburkjim
All <info or/and="X"/> markup occurs on the headline (first line of entry after metaline) within the mw.txt digitization.
The <info ...> tags have been at the end of the line preceding a <LEND> line so far.
But now noticed that there are 217 instances where these tags are in the next line after the meta-line; these need to be shifted to the line preceding the resp. <LEND> lines (to have a consistency of style).
[I've used the regex <info (.*?)\n<[^L] to find these instances.]
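For anyone wishing to repeat the search outside a text editor, the same regex can be run in Python roughly as below. The helper name `find_info_hits` is an assumption, and the sample lines in the usage note are made up rather than taken from mw.txt.

```python
import re

# The regex quoted above: an <info ...> run to the end of a line, where the
# next line starts with "<" followed by anything other than "L"
# (so not <L> or <LEND>).
INFO_RE = re.compile(r'<info (.*?)\n<[^L]')

def find_info_hits(text):
    """Return (line_number, captured_text) for every match in text."""
    hits = []
    for m in INFO_RE.finditer(text):
        line_number = text.count("\n", 0, m.start()) + 1
        hits.append((line_number, m.group(1)))
    return hits
```

Since `.` does not cross a newline by default, each hit stays within one line; the whole file is passed in as a single string, e.g. `find_info_hits(open("mw.txt", encoding="utf-8").read())`.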
These will be removed in #137.
There is no definite rule where the various <info .../> markup should appear. These empty xml info elements contain various kinds of 'meta' information. The entry-relative location of these is as we view them to be 'convenient'.
I think it is convenient to have the <info or/and="X"/> tag on the headline, since then it is readily compared to the placement of the broken bar.
I'll look at the 217 instances; and will probably move the info elements to the lastline (line before <LEND>) except for the <info or/and="X"/> elements. Will do this in the #137 work.
My main intention is that all <info ...> tags could be in the same line (probably barring the sup and rev entries of the Annexure data); there are ~50 cases of <info [oa] tag on headline and the <info [wv] on another line, while all other (~150) cases of <info [oa] and the <info [wv] are on the same headline.
And then, noticed 2 instances of <info orsl="248495,sumnAya;248495.1,sumnaya"/> that are not in the headline.
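To compare such tags across entries, the attribute value can be split into its (L-number, headword) pairs. The following is a minimal sketch assuming the value is always of the form "L1,X1;L2,X2" as described earlier in this thread; the function name `parse_group_info` is made up for the example.

```python
import re

# Matches the group tags seen in this thread: or, and, orsl.
GROUP_RE = re.compile(r'<info (orsl|or|and)="([^"]*)"/>')

def parse_group_info(line):
    """Return (kind, [(L, headword), ...]) for the first group tag, or None."""
    m = GROUP_RE.search(line)
    if m is None:
        return None
    # Each member is "L,headword"; split only on the first comma.
    members = [tuple(item.split(",", 1)) for item in m.group(2).split(";")]
    return m.group(1), members

print(parse_group_info('<info orsl="248495,sumnAya;248495.1,sumnaya"/>'))
```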
There are 287628 occurrences of <L> (and <LEND>), but only 280827 occurrences of ¦ in the latest mw.txt (taken from csl-orig/v02 folder).
The last mw_iast file I got from Jim in April 2021 has 287634 occurrences of <L> (and <LEND>) and 280833 of ¦.
I recall some <L> entries getting deleted during the exercise done those days, so the difference of 6 therein is accounted for.
But the difference of 6801 between <L> and ¦ seems continuing still.
However, the mw_meta2 file (dated Jun 18, 2018) gives a count of 287433, as under-
Request @funderburkjim to trace whence the 6801 difference started, and give-out the reason.
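One way to see which entries make up such a difference is to list every entry that has no ¦ between its <L> line and its <LEND> line. The sketch below assumes that each entry starts with a metaline beginning `<L>` followed by the L-number and ends with a line beginning `<LEND>`; the function name `entries_without_bar` is mine.

```python
BAR = "\u00a6"  # the broken vertical bar

def entries_without_bar(lines):
    """Return the L-numbers of entries whose lines contain no broken bar."""
    missing = []
    current = None      # L-number of the entry being read, if any
    has_bar = False
    for line in lines:
        if line.startswith("<LEND>"):
            if current is not None and not has_bar:
                missing.append(current)
            current = None
        elif line.startswith("<L>"):
            # Assumed metaline shape: <L>number<pc>...; keep the number only.
            current = line[3:].split("<", 1)[0].strip()
            has_bar = False
        elif current is not None and BAR in line:
            has_bar = True
    return missing
```

The length of the returned list, compared against the 6801 figure, would show whether all the bar-less entries are accounted for.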