sanhw1.txt Sorting Issue (PD .a.)

gasyoun commented 9 years ago

In yesterday's update of sanhw1.txt there is a new sorting issue.

.anuSizwavasuDAsuraSaMsita.:PD
.anyasUkta.:PD
.anyonyajanmamaraRASanaBItaBIta.:PD

These do not seem to belong here; they sort here only because of the leading '.', similar to ':AP90,VCP', which should come after the vowels and before the consonants. Do you agree, @drdhaval2785?

:AP90,VCP
.anuSizwavasuDAsuraSaMsita.:PD
.anyasUkta.:PD
.anyonyajanmamaraRASanaBItaBIta.:PD
a:AP,AP90,BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MCI,MD,MW,MW72,PD,PE,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT
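The ordering above is what a plain string sort on the key would produce. A minimal sketch, assuming (this is not confirmed anywhere in the thread) that sanhw1.txt is ordered by the text before the first colon using ordinary code-point comparison; `key_of` is a hypothetical helper and the 'a' line is shortened for the example:

```python
# Sample lines in sanhw1.txt format: key, colon, comma-separated dictionary codes.
lines = [
    "a:AP,AP90",  # shortened; the real line lists many more dictionaries
    ".anyasUkta.:PD",
    ":AP90,VCP",
    ".anuSizwavasuDAsuraSaMsita.:PD",
]

def key_of(line):
    """Hypothetical helper: return the key, i.e. the text before the first colon."""
    return line.split(":", 1)[0]

# The empty key sorts first, then keys starting with '.', then 'a',
# because '.' has a lower code point than any letter.
sorted_lines = sorted(lines, key=key_of)
for line in sorted_lines:
    print(line)
```

Under that assumption the stray period, not the headword itself, decides where the three PD entries land.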

Another question, general rather than PD-specific: should we at least know how many homonyms lie behind a word? For example:

a:AP,AP90,BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,MCI,MD,MW,MW72,PD,PE,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT

There are 2-4 different 'a' entries in the dictionaries, and we count them as one. @funderburkjim what's your take?

drdhaval2785 commented 9 years ago

There are two distinct issues in this posting.

Issue 1 - addition of a '.' before some words, and a blank word. It is undoubtedly a mistake. We should correct it.

Issue 2 - homonyms. This is deep water. It is nearly impossible to group identical words by homonym mechanically. Let us see what @funderburkjim has to say in this regard.

funderburkjim commented 9 years ago

re the 3 PD headwords with a period.

This is an error in the 'key1' form of the headword, and I am amending hw1.py to discard these periods. Note: these cases were mentioned in file pdhw1_note.txt, part of the xml download. All the dictionaries have many unresolved headword issues mentioned in their Xhw1_note.txt file.

Why are those periods there in the first place? In the original HK version, these periods were vertical bars. Normally, in the HK version, vertical bars were used for daRqa. However, I think they are used in these three cases for another purpose, to indicate 'wide spacing' in the text. In the conversion to SLP1, vertical bars were changed to periods (SLP1 code for daRqa).

Incidentally, that ':AP90,VCP' line in sanhw1.txt is either some kind of program error (in sanhw1.py) or else the result of some bad data in ap90hw2.txt and vcphw2.txt.

gasyoun commented 9 years ago

Amending hw1.py does not sound like a good idea, as long as we still get them in the united .txt file. What do you mean by 'wide spacing' in the text? Is ':AP90,VCP' not the visarga?

funderburkjim commented 9 years ago

Interesting to see the question of homonyms arise, as I've been working on homonym correction in MW - when I get a free moment I'll post this work somewhere.

If homonyms were marked in all the dictionaries, then this would permit a more refined correspondence between headwords of different dictionaries.

funderburkjim commented 9 years ago

I think hw1.py is the appropriate place to remove detritus from key2 and construct key1. The periods are the detritus in these cases. There is no change to pd.txt (the periods are still there - look at a display for PD for anyasUkta and you'll see them).
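As a rough illustration of that step, here is a minimal sketch. It is not the actual hw1.py code, which is not shown in this thread; `key1_from_key2` is a hypothetical name, and the sketch assumes that a period never belongs in a key1 value:

```python
def key1_from_key2(key2):
    """Hypothetical helper: build key1 by dropping periods (SLP1 danda) from key2."""
    return key2.replace(".", "")

# The PD headwords from this issue, as they appear in key2.
for key2 in [".anuSizwavasuDAsuraSaMsita.", ".anyasUkta."]:
    print(key2, "->", key1_from_key2(key2))
```

Only the derived key changes; the source text keeps its periods, as described above for pd.txt.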

'Wide spacing' means that the scan shows 'A B C' instead of 'ABC'. In the anyasUkta example, there is space in the Devanagari, in contrast to no space in nearby headwords. I don't think this has any significance in these cases, but Thomas' Sanskrit typists sometimes put this vertical bar around a phrase which they deem to have been printed with wide spacing. Thus, I don't think the periods in the typing are necessarily errors in this case. The display of daRqas in this case is misleading, and could be corrected, but the correction is tedious and of small value, in my judgment.

funderburkjim commented 9 years ago

The colon in ':AP90,VCP' in sanhw1.txt is probably NOT a visarga, but simply the field-separator used in this file. It would separate an 'empty string' key1 value from the list of two dictionaries where this empty string occurs. In looking at ap90hw2.txt, I see two cases where there is an empty headword, after 'liMpAkaH' and after 'SulvaM'. From vcphw2.txt, there is one empty headword, after gaqqarikA. These are surely errors.
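Reading the colon as a field separator makes such entries easy to find mechanically. A minimal sketch, assuming each sanhw1.txt line has the form `key1:DICT1,DICT2,...`; `empty_key_entries` is a hypothetical helper, not part of sanhw1.py:

```python
def empty_key_entries(lines):
    """Return (line number, dictionary codes) for every line whose key1 is empty."""
    found = []
    for lineno, line in enumerate(lines, start=1):
        # Split at the first colon only: key1 on the left, dictionary list on the right.
        key1, _, dicts = line.rstrip("\n").partition(":")
        if key1 == "":
            found.append((lineno, dicts.split(",")))
    return found

sample = [":AP90,VCP", ".anyasUkta.:PD", "a:AP,AP90"]
result = empty_key_entries(sample)
print(result)
```

The dictionary codes returned for an empty key say which Xhw2.txt files to inspect for the bad headword.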

Here is the resolution of the errors:

AP90, empty headword after liMpAkaH is at line 156692 of ap90.txt
<P>.{#--kaM#}¦ A citron or lime.
The error here is coding this as a headword, rather than just the sub-form liMpAkaM.
Correction:
<>{@{#--kaM#}@} A citron or lime.

AP90, empty headword after SulvaM, from line 177167 of ap90.txt:
<P>.{#--jaM#}¦ brass.
Similar error: this is not a separate headword, but indicates SulvajaM.

VCP empty headword after gaqqarikA, line 186777 of vcp.txt
Here the current vcp.txt has 
186780 old <HI>{@ @}¦ pu0 gaquka + pfzo0 . 1 BfNgAre jalapAtraBede Sabdara0
This is due to an error in a correction! The headword should be, I think, gaqquka
186780 new <HI>{@gaqquka @}¦ pu0 gaquka + pfzo0 . 1 BfNgAre jalapAtraBede Sabdara0

@drdhaval2785 , @Shalu411 Do you agree with gaqquka ?

The three changes above have been made.

sanhw1.txt has been remade.

drdhaval2785 commented 9 years ago

I agree with gaqquka.

Both versions of the word are in circulation, with little difference between them. The majority go with 'u', so I agree with 'u'.

gaqquka:BUR,MW,MW72,PW,PWG,SCH,SHS,WIL,YAT
gaqqUka:MW,SHS,WIL,YAT

drdhaval2785 commented 9 years ago

Homonyms in #99. Closing this issue

sanskrit-lexicon / CORRECTIONS

sanhw1.txt Sorting Issue (PD .a.) #96