MW72 missing '-' in Sanskrit at line beginning.

funderburkjim commented 7 years ago

@SergeA mentions in a comment to case 19 of batch 314 (see #322):

 NB! This is a frequent typo: in the beginning of the line hyphens are dropped. (But not always.)
E.g.
"saṃ-laksh, cl. 10. P. A. -lakshayati,
te, -yitum ..." (-te)
" saṃ-lańgh, cl. 1. P. A. -lańghati,
te ..." (-te)
"saṃ-rūsh, cl. 10. or Caus. -rūshayati,
roshayati ..." (-roshayati)

The current filter on which these batches are based does NOT catch these cases, although a few cases had this error coincidentally.

I'm not sure of what programmable pattern would be required to catch more of these cases of missing '-' at the beginning of a line.

One such pattern is {%te\W, i.e. 'te' at the beginning of an italic span (and therefore Sanskrit in MW72), followed by a non-word character. A quick search shows 183 matches, and a quick glance at the results suggests there are no false positives.

There may be other patterns, which might involve a prior line.

@SergeA Have you noticed any other patterns?

gasyoun commented 7 years ago

183 matches - And a quick glance at the results suggests there are no false positives.

Well done, a big catch. You meant {%te\W and {%ti\W, right?

funderburkjim commented 7 years ago

{%ti\W

Good suggestion - 32 of those are found.
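
For reference, a minimal Python sketch of how the two patterns {%te\W and {%ti\W could be searched for. This is not the actual filter program: the names MISSING_HYPHEN and find_candidates are illustrative, and the sample lines are made up in the style of mw72.txt.

```python
import re

# 'te' or 'ti' right after the italic opener '{%', followed by a
# non-word character (the patterns {%te\W and {%ti\W from this thread).
# Like the original patterns, this is not anchored to the line start.
MISSING_HYPHEN = re.compile(r"\{%(te|ti)(?=\W)")


def find_candidates(lines):
    """Return (line_number, line) pairs that match either pattern."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if MISSING_HYPHEN.search(line)
    ]


# Made-up sample lines (assumption: continuation lines start with '<>').
sample = [
    "<>{%te, -yitum,%} (gloss)",      # candidate: 'te' followed by ','
    "<>{%tena%} (gloss)",             # not a candidate: 'te' followed by 'n'
    "<>{%-ti, -unditum,%} (gloss)",   # not a candidate: hyphen already present
]
candidates = find_candidates(sample)
```

Running the same search over the whole file and printing each hit together with its preceding line would give the kind of listing needed for manual review.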

funderburkjim commented 7 years ago

We'll need to do some individual examination to be sure the missing '-' is not present at the end of the preceding line. For instance, this is a case where we would not want to change to {%-ti:

<P>.{#aByund#}¦ {%abhy-und (abhi-und),%} cl. 7. P. {%-unat-%}
<>{%ti, -unditum,%} to wet, bedew; flow over.

Probably a display showing the current and preceding line would suffice, and we could look for a -%} at the end of the prior line and, if found, presume no '-' is needed on the {%ti or {%te.
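
A minimal sketch of that check, using the abhy-und record quoted in this comment as test data. The function name classify and its return values ('fix', 'skip', None) are assumptions made for illustration, not part of any existing script.

```python
import re

START_TE_TI = re.compile(r"\{%(te|ti)(?=\W)")


def classify(prev_line, line):
    """Decide what to do with a line, given the line before it.

    Returns None if the line does not match {%te\\W or {%ti\\W,
    'skip' if the prior line already ends with '-%}' (the hyphen is
    there, just on the previous line), and 'fix' otherwise.
    """
    if not START_TE_TI.search(line):
        return None
    if prev_line.rstrip().endswith("-%}"):
        return "skip"
    return "fix"


prev = "<P>.{#aByund#}¦ {%abhy-und (abhi-und),%} cl. 7. P. {%-unat-%}"
cur = "<>{%ti, -unditum,%} to wet, bedew; flow over."
result = classify(prev, cur)
```

For the abhy-und record the result is 'skip', since the prior line ends in -%}.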

gasyoun commented 7 years ago

Probably a display showing the current and preceding line would suffice, and we could look for a -%} at the end of the prior line and, if found, presume no '-' is needed on the {%ti or {%te.

Or, if everything is put on one line, change the double hyphen to a single one.

SergeA commented 7 years ago

@SergeA Have you noticed any other patterns?

Also, MW72 gives a great many verb forms for prefixed verbs, replacing the prefix with a hyphen, as in

vi-bhram, cl. 1. 4. P. -bhramati,
bhrāmyati, -bhramitum

I suppose in the scheme

prefix-root ... ... .... -form1,
form2, -form3 ...

the "form2" should be "-form2" in 100% of cases. However, the preceding element "-form1," and the following "-form3" are not always present, in which case the probability of a lost hyphen is lower.
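
The scheme above can be sketched as a small scoring function. This is only an illustration working on plain text like the vi-bhram example, not on the mw72.txt markup; the name hyphen_evidence and the 0/1/2 score are assumptions.

```python
def hyphen_evidence(prev_line, line):
    """Count the hints that the first word of `line` lost its hyphen.

    One point if the previous line ends with a hyphenated form plus
    comma ("-form1,"), one point if the word after the first one starts
    with a hyphen ("-form3"). Returns 0 if the first word already has a
    hyphen or if neither hint is present.
    """
    words = line.split()
    if not words or words[0].startswith("-"):
        return 0
    prev_words = prev_line.split()
    before = (
        bool(prev_words)
        and prev_words[-1].startswith("-")
        and prev_words[-1].endswith(",")
    )
    after = len(words) > 1 and words[1].startswith("-")
    return int(before) + int(after)


score = hyphen_evidence(
    "vi-bhram, cl. 1. 4. P. -bhramati,",
    "bhrāmyati, -bhramitum",
)
```

A score of 2 corresponds to the full "-form1, form2, -form3" scheme; a score of 1 is the weaker case where one of the neighbours is missing.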

funderburkjim commented 7 years ago

I've addressed the simplest case, for ti and te. These have been autocorrected; I've directly examined only a small number of randomly selected cases, but feel fairly sure all these corrections are warranted.

In fact, all these corrections have been installed.
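
A rough sketch of what such an autocorrection might look like in Python, assuming the rule discussed earlier (insert '-' after {% before te/ti unless the prior line ends with -%}). The function name autocorrect is hypothetical and this is not necessarily how the installed corrections were generated.

```python
import re

START_TE_TI = re.compile(r"\{%(te|ti)(?=\W)")


def autocorrect(lines):
    """Return a copy of `lines` with {%te / {%ti changed to {%-te / {%-ti.

    A line is left alone when the previous line ends with '-%}', since
    the hyphen is then already present at the end of that line.
    """
    corrected = []
    prev = ""
    for line in lines:
        if not prev.rstrip().endswith("-%}"):
            line = START_TE_TI.sub(r"{%-\1", line)
        corrected.append(line)
        prev = line
    return corrected


fixed = autocorrect([
    "<P>.{#aByund#}¦ {%abhy-und (abhi-und),%} cl. 7. P. {%-unat-%}",
    "<>{%ti, -unditum,%} to wet, bedew; flow over.",
])
```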

There are 189 corrections. Here is the file, which shows each correction as well as the preceding line: filter.txt

funderburkjim commented 7 years ago

Autocorrections have also been generated for the 'prefix-root' cases mentioned above.

Two files have been prepared:

filterpv.txt: 1153 cases. These are probably cases where a '-' is needed, but they should be examined manually before the autocorrections are installed.
filterpv_no.txt: 447 cases. These are probably false positives (i.e., they don't need correction). However, they should be examined for any false negatives.

The corrections in filterpv.txt have not yet been installed. They need further examination before being installed.

The two files are in this gist

gasyoun commented 7 years ago

1153 cases

A lot, indeed.

SergeA commented 7 years ago

1153 cases
A lot, indeed.

A lot in number, but it is a very easy task. Most of them do not even need to be rechecked against the PDF and can be resolved on the fly from context, a few seconds for each case.

funderburkjim commented 7 years ago

it is a very easy task

Agreed. No need to consult scan usually.

If you do some checking, why don't you start 'at the top', and I'll start 'at the bottom' tomorrow.

If you find some false positives in filterpv.txt, just note them and add the bunch to a comment here. I'll do the same.

SergeA commented 7 years ago

If you find some False Positives in filterpv.txt, just note them and add the bunch to comment here.

https://docs.google.com/document/d/10Ivo95hD75xHVcnQ8RgJF9oKge3gq06e2cyzveQeqVI/edit?usp=sharing

Here are a few false positives from the bigger list. The other list has more complicated cases; I think it would be better viewed through the interface.

funderburkjim commented 7 years ago

I went through filterpv_no.txt and gathered 29 false negatives.

The records to be corrected are:

the filterpv.txt items, excluding the 31 false positives noted above
the 29 false negatives from filterpv_no.txt
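
In code terms, the final selection amounts to a set difference followed by a union. The sketch below uses hypothetical record keys and a hypothetical function name, records_to_correct; the real files have their own format.

```python
def records_to_correct(filterpv, false_positives, false_negatives):
    """Combine the lists as described above, keeping the original order.

    filterpv:         candidate records from filterpv.txt
    false_positives:  candidates found not to need a '-'
    false_negatives:  records from filterpv_no.txt that do need a '-'
    """
    rejected = set(false_positives)
    kept = [record for record in filterpv if record not in rejected]
    seen = set(kept)
    extra = [record for record in false_negatives if record not in seen]
    return kept + extra


# Hypothetical record keys, for illustration only.
final = records_to_correct(
    filterpv=["rec1", "rec2", "rec3"],
    false_positives=["rec2"],
    false_negatives=["rec9"],
)
```

If all 31 noted false positives are entries of filterpv.txt and none of the 29 false negatives duplicates an entry already kept, this would come to 1153 - 31 + 29 = 1151 records; that total is an inference, not a figure reported in this thread.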

funderburkjim commented 7 years ago

All the above now installed.

Time to close this issue.

One aside re MW72: There appear to be more verb forms in MW72 than in MW(99). Due to the simpler form of mw72.txt relative to mw.xml, it would likely be easier (though not easy) to harvest the verb forms from mw72 than it would be to harvest the forms from mw.xml.

The resulting list, from either source, could provide a useful digital reference (in addition to Whitney Roots) of verb forms.
One use of such a list would be in comparison to algorithmic computations of verb forms.

sanskrit-lexicon / CORRECTIONS

MW72 missing '-' in Sanskrit at line beginning. #326