PUI inconsistent diacritics

drdhaval2785 commented 3 years ago

See 'i' in saptami

<L>10550<pc>2-675<k1>mAGa<k2>mAGa<h>II
{%Māgha%} (II)¦ — (Pañcadaśi): a yugādi for śrāddha;
<div n="lb"/>(saptami) a manvantarādi for śrāddha.
<div n="P"/>M. 17. 4, 7.
<LEND>

And see ī in Viśokasaptamī

<L>13176<pc>3-268<k1>viSokasaptamI<k2>viSokasaptamI
{%Viśokasaptamī%}¦ — to be observed on the sixth day of the
<div n="lb"/>white half of Māgha month.
<div n="P"/>M. 74. 2; 75. 1-2.
<LEND>

This is a generic issue. Sometimes PUI would write with diacritics, sometimes without. Do we ever want to do some systematic changes or leave it at that?

drdhaval2785 commented 3 years ago

<L>11584<pc>3-068<k1>rAjasUyam<k2>rAjasUyam
{%Rājasūyam%}¦ — the gift of {%Brahmāṇḍa Purāṇa%} equal
<div n="lb"/>to the performance of 1000 sacrifices.<sup>1</sup> The fruits
<div n="lb"/>of this {%yajña%} are equal to fasting and praying to
<div n="lb"/>Viṣṇu on the akṣayatṛtīya day;<sup>2</sup> a plunge in the Prayāgā is
<div n="lb"/>equal to this {%yajña.%}<sup>3</sup> Sacrifice performed by Soma when
<div n="lb"/>Viṣṇu was Brahmā, Śiva, the protector, Atri, the hota,

Look at the last word. It should be hotā. Every diacritic (macron, dots) is badly messed up in this dictionary.

My opinion - Beyond salvage. Leave it.

gasyoun commented 3 years ago

every diacritic macron / dots are messed up badly in this dictionary.

That's good to know.

Do we ever want to do some systematic changes or leave it at that?

So you want to say that PUI is a source of millions of wrong headwords?

My opinion - Beyond salvage. Leave it.

That does not sound like a plan. Would not o_vs_O help here?

drdhaval2785 commented 3 years ago

So you want to say that PUI is a source of millions of wrong headwords?

I did not say that, but it amounts to the same thing. See raTasaptami as a headword: it should have I at the end.

Would not o_vs_O help here?

We can think about methods, if we want to go that way. Edit distance methods would give us fairly accurate substitutions. Question is - Are we taking that calculated risk?

One way would be:

Identify non-English words in the description (and presume them to be Sanskrit).
Do some kind of primitive stemming, such as removing the English suffixes 's' / 'es', if needed.
Check whether a known word exists within edit distance 1, where the only difference is a short / long vowel or a nasal.
Replace.
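The last two steps above (find a neighbour at edit distance 1, then replace) could be sketched roughly as follows. This is only an illustration, not an agreed implementation: it assumes words are already in SLP1, that a set of trusted spellings is available (called `lexicon` here, a hypothetical name), and that the groups in `CLASSES` are the confusions we are willing to accept. It skips the word-identification and stemming steps.

```python
# SLP1 letters treated as interchangeable in this sketch (an assumption):
# short/long vowel pairs, and the nasals as a single class.
CLASSES = ["aA", "iI", "uU", "fF", "xX", "NYRnmM"]
SWAPS = {c: [d for d in cls if d != c] for cls in CLASSES for c in cls}


def candidates(word, lexicon):
    """Lexicon entries that differ from `word` by exactly one substitution
    inside a vowel-length or nasal class (edit distance 1)."""
    found = []
    for i, ch in enumerate(word):
        for alt in SWAPS.get(ch, ()):
            variant = word[:i] + alt + word[i + 1:]
            if variant in lexicon and variant not in found:
                found.append(variant)
    return found


def correct(word, lexicon):
    """Replace only when the word is unknown and the candidate is unambiguous."""
    if word in lexicon:
        return word
    found = candidates(word, lexicon)
    return found[0] if len(found) == 1 else word


if __name__ == "__main__":
    lexicon = {"raTasaptamI", "hotA"}  # toy lexicon, for illustration only
    print(correct("raTasaptami", lexicon))
    print(correct("hota", lexicon))
```

Refusing to replace when more than one candidate exists is one way of limiting the "calculated risk"; ambiguous words could be written to a list for manual review instead.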

drdhaval2785 commented 3 years ago

Just to show the scale, from rAjasUyam entry, you can see two other cases of wrong diacritic.

akṣayatṛtīya -> akṣayatṛtīyā
Prayāgā -> Prayāga

drdhaval2785 commented 3 years ago

And in the mAGa entry, Pañcadaśi -> Pañcadaśī

gasyoun commented 3 years ago

you can see two other cases of wrong diacritic

All your cases deal with o_vs_O at the end of a word. That does not look too bad. What would really be bad is if the same issues occur at headword level as well.

funderburkjim commented 3 years ago

sanhw1 and hwnorm1c provide some perspective regarding headwords in PUI

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/sanhw1/sanhw1.txt
https://github.com/sanskrit-lexicon/hwnorm1/blob/master/sanhw1/hwnorm1c.txt

In sanhw1, there are 4823 PUI headwords that appear in no other dictionary (regex = ^[^:]+:PUI$). A similar search in hwnorm1c shows 4706 normalized headwords found only in PUI (regex = ^[^:]+:[^:]+:PUI$).

This is out of a total 17513 PUI headwords. So about 1/4 of PUI headwords are unique to PUI.
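For anyone who wants to reproduce these counts, a minimal sketch follows. It uses the two regexes quoted above. It assumes each sanhw1 line has the form `headword:DICT,DICT,...` and each hwnorm1c line the form `normalized:headword:DICT,DICT,...`, which is what the regexes imply. The helper name `unique_to_pui` and the sample sanhw1 lines are made up for illustration.

```python
import re

SANHW1_ONLY_PUI = re.compile(r"^[^:]+:PUI$")
HWNORM1C_ONLY_PUI = re.compile(r"^[^:]+:[^:]+:PUI$")


def unique_to_pui(lines, pattern):
    """Return the lines whose dictionary list is exactly 'PUI'."""
    return [ln.strip() for ln in lines if pattern.search(ln.strip())]


if __name__ == "__main__":
    # Toy input; for real counts, read the lines of sanhw1.txt instead.
    sample = ["gARapata:AP,MW,PW", "gARapatA:PUI", "mAGa:MW,PUI"]
    only_pui = unique_to_pui(sample, SANHW1_ONLY_PUI)
    print(len(only_pui), only_pui)

    # Share of unique headwords, using the figures reported above.
    print(round(4823 / 17513, 3))
```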

Some unknown fraction of these could be identified as the 'same' as headwords appearing in other dictionaries. Developing such a mapping from unique PUI headwords to more common spellings of other dictionaries seems like a reasonable goal.

Formal methods (as opposed to semantic) could probably yield many of the possible correspondences. Some of the techniques mentioned above (ending variations, edit distance) might apply.

There are probably quite a few unique compounds in PUI, such as vArzavratam

There are probably some systematic idiosyncrasies, such as

'i-ending' in PUI <-> 'in-ending' elsewhere, e.g. agamyagAmi (PUI) <-> agamyagAmin (PD): 500 or so 'i-ending' headwords unique to PUI.
'vAn-ending' in PUI <-> 'vat-ending' elsewhere: 30 'vAn-ending' headwords unique to PUI.
'am-ending' in PUI <-> 'a-ending' elsewhere, e.g. agamyAgamanam <-> agamyAgamana: 930 'am-ending' headwords unique to PUI.
'A-ending' in PUI <-> various endings elsewhere, e.g. aNgarAjA <-> aNgarAjan: 586 cases.

And probably several others.
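A rough sketch of how such ending correspondences could be tried mechanically is given here. The rule table only encodes the four patterns listed above. For the 'A-ending' case it assumes the single alternative 'an' taken from the example, although the list says the counterpart varies. `other_headwords` is a hypothetical set of headwords collected from the other dictionaries.

```python
# (PUI ending, ending elsewhere); the last rule is only one of several
# possible counterparts of an A-ending headword.
ENDING_RULES = [("vAn", "vat"), ("am", "a"), ("i", "in"), ("A", "an")]


def propose(pui_headword, other_headwords):
    """Suggest headwords from other dictionaries that match after
    swapping a PUI-style ending for the more common one."""
    suggestions = []
    for pui_end, other_end in ENDING_RULES:
        if pui_headword.endswith(pui_end):
            stem = pui_headword[: len(pui_headword) - len(pui_end)]
            candidate = stem + other_end
            if candidate in other_headwords and candidate not in suggestions:
                suggestions.append(candidate)
    return suggestions


if __name__ == "__main__":
    others = {"agamyagAmin", "agamyAgamana", "aNgarAjan"}
    for hw in ("agamyagAmi", "agamyAgamanam", "aNgarAjA"):
        print(hw, "->", propose(hw, others))
```

Only candidates that actually occur elsewhere are suggested, so a PUI headword with no counterpart (a genuinely unique compound, say) simply yields an empty list.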

funderburkjim commented 3 years ago

? o-O

Would you remind me of what this is ?

drdhaval2785 commented 3 years ago

https://github.com/sanskrit-lexicon/CORRECTIONS/issues/151

gasyoun commented 3 years ago

systematic idiosyncrasies

If I were a woman, I would fall in love with this man just because of this list. The 500+930+586 cases look like a newly discovered normalization issue. Your regexes are fast and advanced. But could the lists of words found be attached, please?

funderburkjim commented 3 years ago

the 'meta' nature of k1

The 'meta' nature of the 'k1' part of the 'metaline' is something we might exploit.

Consider these two entries in hwnorm1c:

gARapata:gARapata:AP,AP90,MW,MW72,PW,PWG,SHS,VCP,WIL,YAT
gARapatA:gARapatA:PUI

Here PUI has chosen to use the feminine form in its entry. So 'gARapata' in the other dictionaries is discussed under the gARapatA heading of PUI.

Here is the gARapatA entry as it appears in the digitization pui.txt:

<L>4136<pc>1-525<k1>gARapatA<k2>gARapatA
{%Gāṇapatā mantras%}¦ — sacred to Gaṇapati.
<div n="P"/>Br. IV. 38. 5.
<LEND>

Now suppose we changed just the k1 part of this entry to gARapata:

<L>4136<pc>1-525<k1>gARapata<k2>gARapatA
{%Gāṇapatā mantras%}¦ — sacred to Gaṇapati.
<div n="P"/>Br. IV. 38. 5.
<LEND>

This simple 'k1' change would:

not change the display of the body of this entry;
allow this entry in PUI to be discovered by the more common 'gARapata' spelling.
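As an illustration of how small the change is, here is a sketch that rewrites only the k1 field of a metaline and leaves k2 and everything else alone. The function name `set_k1` is invented for this example, and it assumes a metaline always starts with `<L>` and contains exactly one `<k1>` field, as in the records quoted in this thread.

```python
import re

K1_FIELD = re.compile(r"(<k1>)([^<]*)")


def set_k1(metaline, new_k1):
    """Return the metaline with its k1 value replaced; k2 is untouched."""
    if not metaline.startswith("<L>"):
        raise ValueError("not a metaline: %r" % metaline)
    return K1_FIELD.sub(lambda m: m.group(1) + new_k1, metaline, count=1)


if __name__ == "__main__":
    line = "<L>4136<pc>1-525<k1>gARapatA<k2>gARapatA"
    print(set_k1(line, "gARapata"))
```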

The same 'trick' could be applied in SKD. For example, SKD gives the headword for nominals in the nominative singular: त्वष्टा is the SKD headword, versus त्वष्टृ in MW, AP, etc. and त्वष्टर् in PW, PWG, CCS.

These differences among dictionaries could melt away by changing 'k1' to 'tvazwf' everywhere.

gasyoun commented 3 years ago

These differences among dictionaries could melt away by changing 'k1' to 'tvazwf' everywhere.

So it's a case for normalization, not yet implemented?

drdhaval2785 commented 3 years ago

I advise against going this way. The underlying data and key1 should match. For certain tasks we use only sanhw1.txt, so it is necessary to have the headwords as they appear in the dictionaries.

We can slightly modify the proposal. I propose that we also store the normalized headword in the meta line, under a new tag n.

<L>4136<pc>1-525<k1>gARapatA<k2>gARapatA<n>gARapata
{%Gāṇapatā mantras%}¦ — sacred to Gaṇapati.
<div n="P"/>Br. IV. 38. 5.
<LEND>
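To make the idea concrete, here is a sketch of how a reader of pui.txt could treat the extra field. It is only an illustration: the `<n>` tag is the proposal above and does not exist in the current data, and the names `parse_metaline` and `lookup_key` are invented here. When `<n>` is absent, the lookup key falls back to k1, so existing metalines keep working unchanged.

```python
import re

# Tags seen in the metalines quoted in this thread, plus the proposed <n>.
FIELD = re.compile(r"<(L|pc|k1|k2|h|n)>([^<]*)")


def parse_metaline(line):
    """Split a metaline into a dict of its tagged fields."""
    return {tag: value for tag, value in FIELD.findall(line)}


def lookup_key(fields):
    """Key used for cross-dictionary search: <n> if present, else k1."""
    return fields.get("n", fields["k1"])


if __name__ == "__main__":
    new_style = "<L>4136<pc>1-525<k1>gARapatA<k2>gARapatA<n>gARapata"
    old_style = "<L>4136<pc>1-525<k1>gARapatA<k2>gARapatA"
    print(lookup_key(parse_metaline(new_style)))
    print(lookup_key(parse_metaline(old_style)))
```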

funderburkjim commented 3 years ago

Adding an (optional) <n> field in the meta line would be a possibility.

But this would be a BIG change -- with impacts in many directions.

Perhaps we can consider it in depth sometime in the future.

I would rather spend time in the near term on the API and on some already identified issues with simple search.

gasyoun commented 3 years ago

near term on the API and on some already identified issues with simple search.

Full support, rest - after.

sanskrit-lexicon / csl-corrections

PUI inconsistent diacritics #45
