sanskrit-lexicon / alternateheadwords

Prepare list of alternate headwords for all Cologne dictionaries

SKD alternate headwords #9

Closed drdhaval2785 closed 7 years ago

drdhaval2785 commented 7 years ago

Total 345 to examine.

drdhaval2785 commented 7 years ago

Because this code focused on finding alternate headwords, it did little to optimize headword finding itself. As a result, puM got attached to the headword in some cases, as in the entries below. I am correcting such cases manually.

9:ba(va)qavAkftaH puM:vaqavAkftaH puM:bavavAkftaH puM
0:vi(bi)mbawaH puM:bimbawaH puM:vibiawaH puM
0:vi(bi)mbozWaH tri:bimbozWaH tri:vibiozWaH tri
0:vi(bi)mbOzWaH tri:bimbOzWaH tri:vibiOzWaH tri
0:vf(bf)hatpAdaH puM:bfhatpAdaH puM:vfbftpAdaH puM
0:vf(bf)hadBAnuH puM:bfhadBAnuH puM:vfbfdBAnuH puM

drdhaval2785 commented 7 years ago

More than one alternate headword:
3:bA(vA)spaH (zpaH) puM:vAspaH:bAvAaH:255563:255569
vAspaH, vAzpaH, bAzpaH

drdhaval2785 commented 7 years ago

Print error. The bracket should have 'ba'. [image: capture]

drdhaval2785 commented 7 years ago

0:va(va)ndI:vandI:vavaI:353543:353547
Typo error. Should be va(ba)ndI. This already covers the next entry, so the next entry is removed:
0:va(ba)ndI [n]:bandI n:vabaI n

0:va(va)rhaM:varhaM:vavaaM:358573:358585
0:va(va)rhI:varhI:vavaI:358680:358690
0:va(va)lkalaH:valkalaH:vavaalaH:358815:358823
0:sinI(vA)vAlI:sivAvAlI:sinIvAlI:455163:455173

drdhaval2785 commented 7 years ago

@funderburkjim Ready to install.

gasyoun commented 7 years ago

puM got attached to headword

Strange, anyway.

drdhaval2785 commented 7 years ago

@funderburkjim https://github.com/sanskrit-lexicon/alternateheadwords/blob/master/data/SKD/skdahw3.txt is the file to incorporate.

funderburkjim commented 7 years ago

@drdhaval2785 I believe that you discovered some typo/print errors that need correction in SKD.

But, I was a bit confused on which were such errors.

Would you summarize just the corrections that need to be made, so I'll know what corrections to install for SKD?

funderburkjim commented 7 years ago

Also, I seem to remember seeing some error in context of preverb for PWG that you noticed, but have not been able to find the example among issue comments. Do you remember where this was ?

funderburkjim commented 7 years ago

After an initial misreading of the skdahw3 file, here is my understanding of the lines in the file. Please check me on this.

Example:

3:kube(ve)raH:kuveraH:kubeveH:71062:71072

Field description:

funderburkjim commented 7 years ago

Strategy 1

Strategy for integration of alternate headwords.

Rearrange things so that skdhw2.txt is recomputed differently. For convenience, I'll call the old skdhw2.txt by a different name, skdhw2a.txt. It will be computed exactly as before.

But there will be an extra step, which will merge skdahw3.txt and skdhw2a.txt into the new skdhw2.txt.

In the kubera example above, use the line range '71062,71072' to identify the matching line in skdhw2a:

2-144:kuba:71058,71061
2-144:kuberaH:71062,71072   <<< old LINE# =  L=8094
2-144:kubjaH:71073,71105

Next, insert a new line after the matched line. The section of new skdhw2.txt will look like

2-144:kuba:71058,71061
2-144:kuberaH:71062,71072   <<<  LINE# =  L=8094
2-144:kuveraH:71062,71072   <<<  new line  (think of as L = 8094.01)
2-144:kubjaH:71073,71105
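
The merge step above can be sketched roughly as follows. This is only an illustration, not the installed code: the function name merge_alternates is hypothetical, and the sketch assumes that in an skdahw3.txt line the third colon-separated field is the alternate headword and the last two fields are the start and end of the line range, as the kubera example suggests.

```python
def merge_alternates(hw2a_lines, ahw3_lines):
    """Insert one hw2 line per alternate headword right after its parent line."""
    # Map 'start,end' line range -> list of alternate headwords.
    alternates = {}
    for line in ahw3_lines:
        fields = line.rstrip("\n").split(":")
        # Assumed layout: field 3 is the alternate, last two fields are the range.
        alt, start, end = fields[2], fields[-2], fields[-1]
        alternates.setdefault(f"{start},{end}", []).append(alt)

    merged = []
    for line in hw2a_lines:
        line = line.rstrip("\n")
        merged.append(line)
        page, _key1, line_range = line.split(":")
        # Same page and line range as the parent, different key1.
        for alt in alternates.get(line_range, []):
            merged.append(f"{page}:{alt}:{line_range}")
    return merged
```

Matching on the line range rather than on the headword spelling keeps the merge independent of how key1 is written in either file.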

Next, we'll have a change in the xml file.
We still have the 'old' record with L = 8094 and key1 = kuberaH.

But also there will be a 'new' record, with L = 8094.01 and key1 = kuveraH. key2, and all other details of the new record will be the same as for kuberaH.

In this SKD example set based on skdahw3.txt, there will be fewer than 10 alternate headwords for any case. However, in the PWG preverb example, there will often be more than 10 but fewer than 100 alternate headwords. This detail is important in the construction of the Sqlite database, where currently the assumed format of 'L' (lnum) is DECIMAL(10,2) UNIQUE. If we encounter a case of alternate headwords with 100+ alternates, this sqlite declaration will need to be changed.
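
A minimal sketch of the numbering rule implied here, assuming the k-th alternate of a headword with L-number N is written as N followed by a two-digit fraction (8094.01, 8094.02, ...). The function name alt_lnum is hypothetical.

```python
def alt_lnum(parent_l, k):
    """Return the L-number string for the k-th alternate of headword parent_l."""
    # Two decimal digits leave room for alternates 01..99 only.
    if not 1 <= k <= 99:
        raise ValueError(f"alternate index {k} does not fit in two decimal digits")
    return f"{parent_l}.{k:02d}"
```

For example, alt_lnum(8094, 1) gives '8094.01'. Checking the limit in the generating code may be worthwhile in any case: as far as is known, SQLite gives a DECIMAL(10,2) column NUMERIC affinity and does not itself enforce the two decimal places.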

The scheme just described will in the course of events flow the alternate headwords into sanhw1,2 and into hwnorm1c. And displays will respond to the alternate headwords.

One drawback to this scheme which comes to mind is that the xml (and thus, the displays) will have no markup indicating that 'kuveraH' is an alternate headword of 'kuberaH'.

I'll wait for comments before proceeding to implementation.

drdhaval2785 commented 7 years ago

I agree with proposed methodology of integration.

gasyoun commented 7 years ago

If we encounter a case of alternate headwords with 100+ alternates, this sqlite declaration will need to be changed.

Even for such popular dhatus as nI, which take many prefixes, none will reach 100 alternates; that's my hypothesis.

The scheme just described will in the course of events flow the alternate headwords into sanhw1,2 and into hwnorm1c. And displays will respond to the alternate headwords.

Yahoo!

Both parts are really important. As for the displays, the last missing piece of the puzzle will be some prerecorded rules: for example, if you search for kar, you could also enter it as kR, whichever form the given dictionary uses.

One drawback to this scheme which comes to mind is that the xml (and thus, the displays) will have no markup indicating that 'kuveraH' is an alternate headword of 'kuberaH'.

Let's add a tag, why not? Or just publish kube(ve)raH key2?

funderburkjim commented 7 years ago

Strategy 2

There's one thing objectionable with Strategy 1: the programmatic location where L is assigned.

Currently, MW has L already assigned. But for the other dictionaries, the 'L' for a headword is defined implicitly, as the line number of the headword in the Xhw2.txt list of headwords; this is the current assumption.

Several other programs use this simple implicit rule to recompute the L-number when it is needed. For example:

Now if we complicate the L-number generation algorithm (as per Strategy 1), and if we leave the form of Xhw2.txt unchanged, then we'll have to repeat the L-number generation computation in all of the programs that use the L-value from xhw2, such as those listed above. This is what we would need to do if strategy 1 is followed.

However, these other programs really shouldn't have to know how L is constructed, they just need to know the end result, the value of L for a given record of Xhw2.

Thus, the main point of Strategy 2 is to put the constructed L-code as an additional field in Xhw2.txt. This would make our sample skdhw2.txt look like (assuming we put computed L at the end):

2-144:kuba:71058,71061:8093
2-144:kuberaH:71062,71072:8094  <<<  LINE# =  L=8094
2-144:kuveraH:71062,71072:8094.01  <<<  new line  (think of as L = 8094.01)
2-144:kubjaH:71073,71105:8095

Now, the other programs mentioned above will still need to be changed, but, once strategy 2 is implemented for all the dictionaries, those other programs will actually be conceptually simpler, as they will only read the L value from Xhw2.txt, rather than compute the L number from Xhw2.txt.
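
The difference for those other programs can be sketched as below. Both function names are hypothetical; the explicit variant assumes the four-field layout page:key1:range:L shown in the sample above.

```python
def l_numbers_implicit(hw2_lines):
    """Current rule: L is the 1-based line number of the record in Xhw2.txt."""
    return [(str(i), line.split(":")[1])
            for i, line in enumerate(hw2_lines, start=1)]

def l_numbers_explicit(hw2_lines):
    """Strategy 2: L is simply read from the extra last field."""
    result = []
    for line in hw2_lines:
        _page, key1, _line_range, lnum = line.strip().split(":")
        result.append((lnum, key1))
    return result
```

With the explicit form, a reader of Xhw2.txt never needs to know that 8094.01 was derived from 8094.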

gasyoun commented 7 years ago

Thus, the main point of Strategy 2 is to put the constructed L-code as an additional field in Xhw2.txt. This would make our sample skdhw2.txt look like (assuming we put computed L at the end):

Makes sense.

other programs will actually be conceptually simpler, as they will only read the L value from Xhw2.txt, rather than compute the L number from Xhw2.txt.

Sure, no need to complicate and that fix should not be tough. Well thought, as usual, Jim!

funderburkjim commented 7 years ago

Strategy 2a

Strategy 2 seems ok for Xhw2.txt.

However, I'm bothered by the X.xml part. Strategy 2 would have the xml records for alternate headwords be constructed identically to the xml records for non-alternate headwords. In fact, the only way that we can tell that a record of Xhw2.txt is an alternate headword is from the form of the L-number in the Xhw2.txt record: an alternate headword record would have a non-integer L-number, while a regular (non-alternate) headword record would have an integer L-number.

Thus, implicitly, we are saying that records of Xhw2.txt are of only two types - regular and alternate.

Maybe this is ok, maybe there are not going to be any other record types for Xhw2 as time goes by.

However, the record-repetition in X.xml seems bothersome. I'm thinking of the case of STC, where there may be many dozens of embedded headwords, or of PWG where there may be numerous extra preverb headwords. In these two cases, the extra headwords are NOT ALTERNATE SPELLINGS of the original headword, but rather completely distinct headwords which the given dictionary happens to mention within the body of another headword. [More to come]

funderburkjim commented 7 years ago

Strategy 3

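The previous comments make me wonder whether we need to add still another field to the records of Xhw2. This field would be a code indicating the relation of the headword present in the record to the underlying entry specified by the line-range of the record. We might have this extra code take values:

n = Normal. We could simplify Xhw2 by allowing this to be optional
alt = alternate headword, as in our skd case
imp = implied, for the preverbs from pwg, the sub-compounds of stc, maybe everything else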

So our skdhw2 records might look like

2-144:kuba:71058,71061:8093
2-144:kuberaH:71062,71072:8094  <<<  LINE# =  L=8094
2-144:kuveraH:71062,71072:8094.01:alt  <<<  new line  (think of as L = 8094.01)
2-144:kubjaH:71073,71105:8095

OR , if we decide to make the code for 'normal' records explicit:

2-144:kuba:71058,71061:8093:n
2-144:kuberaH:71062,71072:8094:n  <<<  LINE# =  L=8094
2-144:kuveraH:71062,71072:8094.01:alt  <<<  new line  (think of as L = 8094.01)
2-144:kubjaH:71073,71105:8095:n
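
Either variant can be read by the same parser if a missing code is taken to mean 'n'. A rough sketch, with the names Hw2Record and parse_hw2 being hypothetical and the field order assumed to be page:key1:start,end:L[:code] as in the samples above:

```python
from typing import NamedTuple

class Hw2Record(NamedTuple):
    page: str
    key1: str
    start: int
    end: int
    lnum: str
    code: str

def parse_hw2(line):
    """Parse one Xhw2 record; an absent type code is treated as 'n' (normal)."""
    fields = line.strip().split(":")
    page, key1, line_range, lnum = fields[:4]
    code = fields[4] if len(fields) > 4 else "n"
    start, end = (int(x) for x in line_range.split(","))
    return Hw2Record(page, key1, start, end, lnum, code)
```

Keeping lnum as a string avoids any floating-point rounding of values such as 8094.01.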

Now as to X.xml, we can have different forms depending on the record type code. We could add an attribute to the <h> element, <h n="CODE">, where the value CODE of the n attribute is as above ('n', 'alt'). We could have this attribute to be optional; with the understanding that its absence would indicate a 'normal' record.

Then for the 'alt' type, we could have a different form of the xml record in X.xml. Instead of the usual form for the <body>, we could introduce an attribute to the body element, of the form <body ref="[L-number of the 'parent']">.

e.g. for our skd example:

The record of skd.xml for kuberaH would be as currently:

<H1><h><key1>kuberaH</key1><key2>kube(ve)raH</key2></h>
<body><HI/><s>kube(ve)raH, puM, (kumbatIti . kuba i ki AcCAdane</s><lb/>
  ...
</body>
<tail><L>8094</L><pc>2-144</pc></tail></H1>

and the record for the alternate headword kuveraH:
<H1><h n="alt"><key1>kuveraH</key1><key2>kube(ve)raH</key2></h>
<body ref="8094">
</body>
<tail><L>8094.01</L><pc>2-144</pc></tail></H1>
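
Generating such an 'alt' record could look roughly like this. The helper name alt_record is hypothetical, and the layout simply mirrors the kuveraH record above.

```python
from xml.sax.saxutils import escape, quoteattr

def alt_record(key1, key2, parent_l, lnum, pc):
    """Build the X.xml record for an alternate headword with an empty body."""
    return (
        f'<H1><h n="alt"><key1>{escape(key1)}</key1>'
        f'<key2>{escape(key2)}</key2></h>\n'
        # The body carries only a reference to the parent record's L-number.
        f'<body ref={quoteattr(parent_l)}>\n'
        f'</body>\n'
        f'<tail><L>{escape(lnum)}</L><pc>{escape(pc)}</pc></tail></H1>'
    )
```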

Such a change to the xml structure would be more informative, and would avoid probably confusing repetitions of the contents of the <body> element within X.xml.

It would also allow filtering of X.xml for the alternates.

One drawback to this is that the display programs would have to be changed to handle the new record type (e.g., by using the value of the body ref attribute to retrieve the parent record for the body of the display when the user query is kuveraH).
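
The lookup a display program would need might be sketched as follows, using Python's standard xml.etree.ElementTree. The function names index_by_l and body_for_display are hypothetical, and the records are assumed to be available as one string per H1 element.

```python
import xml.etree.ElementTree as ET

def index_by_l(records):
    """Map L-number string -> parsed <H1> element."""
    index = {}
    for text in records:
        elem = ET.fromstring(text)
        index[elem.findtext("tail/L")] = elem
    return index

def body_for_display(index, lnum):
    """Return the <body> to show, following body/@ref for 'alt' records."""
    body = index[lnum].find("body")
    parent_l = body.get("ref")
    if parent_l is not None:
        # Alternate headword: display the parent's body instead.
        body = index[parent_l].find("body")
    return body
```

Normal records have no ref attribute, so they pass through unchanged and only the 'alt' case costs one extra lookup.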

Is strategy 3 getting too complicated?

gasyoun commented 7 years ago

form of the L-number in the Xhw2.txt record, namely that an alternate headword record would be a non-integer while a regular (non-alternate) headword record would have its L-number to be an integer.

That's good enough.

STC, where there may be many dozens of embedded headwords, or of PWG where there may be numerous extra preverb headwords. In these two cases, the extra headwords are NOT ALTERNATE SPELLINGS

Agree, that's the first big test.

n = Normal. We could simplify Xhw2 by allowing this to be optional
alt = alternate headword, as in our skd case
imp = implied, for the preverbs from pwg, the sub-compounds of stc, maybe everything else

I would have not just n, but say d, t or dt. Some dictionaries by default have headwords in the printed book in d[evanagari], others in t[ransliteration], and mixed cases dt. If I want to see it as it was (let's say Jim makes such a mode), then just 'normal' would not be enough. :n is like nochange, too similar; that's not great.

the value CODE of the n attribute is as above ('n', 'alt'). We could have this attribute to be optional; with the understanding that its absence would indicate a 'normal' record.

Agree.

One drawback to this is that the display programs would have to be changed to handle the new record type (e.g., by using the value of the body ref attribute to retrieve the parent record for the body of the display when the user query is kuveraH).

That's a minor issue, compared to all the pluses.

Is strategy 3 getting too complicated?

I do not think so. Nothing extra here.

funderburkjim commented 7 years ago

I would have not just n, but say d, t or dt

Those extra distinctions are dictionary-level meta-data (i.e. apply to all entries in a particular dictionary); this implies (to me) that they should not be part of the record metadata. If we get to the point of unifying our description of dictionaries so that all dictionaries are special cases of a general form, then that 'd,t,dt' type of meta-information might be applicable.

As to 'n' meaning no-change -- any of these codes have only contextual meaning. The 'n'=nochange context is for correction standard forms, a different context than that of the xml-file entries. Thus, ok to use 'n=normal' in the xml-file context.

gasyoun commented 7 years ago

Thus, ok to use 'n=normal' in the xml-file context.

So be it.

funderburkjim commented 7 years ago

The alternate headwords for SKD have now been installed. Hurray!

The alternates are now represented in

Hope @drdhaval2785 and @gasyoun will do some exploring (the skdahw3.txt will be useful as the source of the alternates).

When others agree that this approach seems ok, this approach can be applied to the other dictionaries for which Dhaval has constructed the alternates, and also for the PWG preverbs.

funderburkjim commented 7 years ago

Interesting technical note: The development was done on the local htdocs copy of the SKD orig, pywork, and web directories (downloaded from S3 copies of Cologne directories).

Then, the changed files (about 17, in various directories) on the local machine were uploaded to the corresponding locations of the Cologne server. This automation was possible by having a python program write an upload script.

The newest technical feature, to me, of this upload script was that it uses the pscp (Putty scp) command-line program. pscp.exe is like the Unix scp (secure copy) program, but with the huge automation advantage of accepting a '-pw PASSWORD' parameter on the command line (scp doesn't have this password feature).
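
A python program that writes such an upload script might look roughly like the sketch below. The function name, the file path, the host and the directory layout are placeholders rather than the actual Cologne values; the only thing taken from the description above is the 'pscp -pw PASSWORD source target' form.

```python
def pscp_lines(changed_files, local_root, remote, password):
    """Return one pscp command line per changed file.

    remote has the form 'user@host:/remote/root' (placeholder in this sketch).
    """
    lines = []
    for rel in changed_files:
        # Same relative path on both sides, so directories line up.
        lines.append(
            f'pscp -pw {password} "{local_root}/{rel}" "{remote}/{rel}"'
        )
    return lines
```

The returned lines can then be written to a batch file and run once per installation.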

gasyoun commented 7 years ago

Hurray for Jim! Long live Jim!

Putty scp) command-line program

Yeah, and it's much safer than FTP as well.

kuvera:AP,BEN,BHS,BOP,BUR,MW,MW72,PWG,SHS,STC,VCP,WIL,YAT
kuveraH:SKD
kuveraka:PWG,SHS,VCP,WIL,YAT
kuverakaH:SKD
kuveranalinI:PWG
kuveravana:PWG
kuveravallaBa:PWG

I did some exploring and wanted to know if there should be some markup (maybe *) of these words in sanhw1.txt, for example?

aMSa:aMSa:BEN,BHS,BOP,BUR,CAE,CCS,GRA,GST,IEG,INM,MD,MW,MW72,PD,PE,PUI,PW,PWG,SCH,SHS,SKD,STC,VCP,WIL,YAT;aMSaH:AP,AP90,SKD
aMSaka:aMSaka:GST,MW,MW72,PD,PW,PWG,SCH,SHS,VCP,WIL,YAT;aMSakaM:SKD;aMSakaH:AP,AP90,SKD

ahh

In hwnorm1c.txt why is aMSakaM in same line as aMSaka, did not notice it before, is it the way it should be? If ok, ignore my question.

funderburkjim commented 7 years ago

Yes, aMSakaM should be in same line as aMSaka, in hwnorm1c.

The format of hwnorm1c is rather complex. Let's take the aMSaka line as example:

aMSaka:aMSaka:GST,MW,MW72,PD,PW,PWG,SCH,SHS,VCP,WIL,YAT;aMSakaM:SKD;aMSakaH:AP,AP90,SKD

aMSaka   the normalized spelling
  a list of (non-normalized) spellings whose normalization is aMSaka
  The first non-normalized spelling here just happens to be the same as the normalized spelling.
  aMSaka:GST,MW,MW72,PD,PW,PWG,SCH,SHS,VCP,WIL,YAT;    all these dictionaries have this spelling
  aMSakaM:SKD;     all these (just one) dictionaries have the aMSakaM spelling
  aMSakaH:AP,AP90,SKD   all these dictionaries have the aMSakaH spelling.
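
Read this way, a hwnorm1c line can be split mechanically. A small sketch, assuming the layout is exactly as in the aMSaka example (normalized spelling, then ';'-separated groups of spelling:DICT,DICT,...); the function name parse_hwnorm1c is hypothetical.

```python
def parse_hwnorm1c(line):
    """Split one hwnorm1c line into (normalized spelling, {spelling: [dict codes]})."""
    norm, rest = line.strip().split(":", 1)
    spellings = {}
    for group in rest.split(";"):
        spelling, dicts = group.split(":")
        spellings[spelling] = dicts.split(",")
    return norm, spellings
```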

The rules currently used for normalization of hwnorm1c are specified in the normalize_key function of the hwnorm1c.py program.

Here are the comments from that normalize_key function; they provide a good idea of normalized spelling:

 #1. normalize so that M is used rather than homorganic nasal
 #2. normalize so that 'rxx' is 'rx' (similarly, fxx is fx)
 #3. ending 'aM' is 'a' (Apte)
 #4. ending 'aH' is 'a' (Apte)
 #4a. ending 'uH' is 'u' (Apte)
 #4b. ending 'iH' is 'i' (Apte)
 #5. 'ttr' is 'tr' (pattra v. patra)
 #6. ending 'ant' is 'at'
 #7. 'cC' is 'C'
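
To make the effect of these rules concrete, here is a rough sketch covering only rules 3 to 7. It is not the code of hwnorm1c.py: the function name is hypothetical, the order of application is an assumption, and rules 1 and 2 are left out.

```python
import re

def normalize_key_sketch(key):
    """Apply a subset (rules 3-7) of the normalization rules listed above."""
    # Rules 3, 4, 4a, 4b: endings aM, aH, uH, iH lose their final letter.
    key = re.sub(r"(aM|aH|uH|iH)$", lambda m: m.group(1)[0], key)
    # Rule 5: 'ttr' is 'tr' (pattra v. patra).
    key = key.replace("ttr", "tr")
    # Rule 6: ending 'ant' is 'at'.
    key = re.sub(r"ant$", "at", key)
    # Rule 7: 'cC' is 'C'.
    key = key.replace("cC", "C")
    return key
```

Under this sketch both aMSakaM and aMSakaH come out as aMSaka, which is why they share a line in hwnorm1c.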

gasyoun commented 7 years ago

aMSaka:aMSaka:GST,MW,MW72,PD,PW,PWG,SCH,SHS,VCP,WIL,YAT;aMSakaM:SKD;aMSakaH:AP,AP90,SKD

What if we count all the dictionaries, that relate to aMSaka? If we ignore for some purpose of stats the M in aMSakaM:SKD and just say - normalized aMSaka in 15 dictionaries.

drdhaval2785 commented 7 years ago

I tested DOtakOSeyaM and CatraM. Not viewable either in basic or advanced view. kuveraH seems integrated.

funderburkjim commented 7 years ago

Re: DOtakOSeyaM and CatraM

Both of these DO come up in SKD Basic. e.g., [image]

funderburkjim commented 7 years ago

normalized aMSaka in 15 dictionaries.

This looks right, inferrable from hwnorm1c by counting the distinct dictionaries occurring in the various spellings.
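
A sketch of that count, again assuming the line layout of the aMSaka example; the function name dictionary_counts is hypothetical. It returns both the number of distinct dictionaries and the number of (spelling, dictionary) pairs, since the two can differ.

```python
def dictionary_counts(line):
    """Return (distinct dictionaries, spelling-dictionary pairs) for a hwnorm1c line."""
    _norm, rest = line.strip().split(":", 1)
    mentions = [
        code
        for group in rest.split(";")
        for code in group.split(":")[1].split(",")
    ]
    return len(set(mentions)), len(mentions)
```

For the aMSaka line this gives 14 distinct dictionaries and 15 pairs, because SKD appears under both aMSakaM and aMSakaH; which of the two figures is wanted depends on the purpose of the statistic.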

funderburkjim commented 7 years ago

some markup (maybe *) of these words in sanhw1.txt,

I would prefer not to add markup to sanhw1, since it is used by other programs (such as the program that constructs hwnorm1c.txt).

Another possibility might be to construct a sanhw1_extra.txt file that would contain all the 'extra' headwords from various dictionaries. While this file could be exactly like sanhw1 except for including only extra headwords, it might be better to have a different format; for instance, it might be desired to know, for an extra headword, what is the 'parent' headword.

gasyoun commented 7 years ago

extra headword, what is the 'parent' headword.

Agree.

drdhaval2785 commented 7 years ago

I loved the display that X is alternate of Y. Very well thought.

drdhaval2785 commented 7 years ago

The approach seems stabilized now. Let us explore other dictionaries.

gasyoun commented 7 years ago

Let us explore other dictionaries.

Apte and SKD done? Vacaspatyam left?

drdhaval2785 commented 7 years ago

Time to close?