AP v. AP90 headwords, part 2

funderburkjim commented 7 years ago

We've corrected several hundred headword spellings in AP and AP90 in issues #332 and #334.

The comparison program has been rerun and I've started going through the cases where the program marks the comparison with a question mark. There are about 1000 of these.

At the moment, here's my procedure for comparing.

In a browser, open two copies of the list display, one for AP and one for AP90.
Open the ap90_ap_hw2_short.txt file in a text editor.
Open a scratch text file to keep results.
Go through the ap90_ap_hw2_short file, searching for '?'.
- Examine the local neighborhood of the current '?' line. Usually you will see some obvious explanation, involving a misspelling in either AP, AP90 or occasionally both. Sometimes, there will be no spelling error, but the ? is caused by a word that legitimately appears in only one of the dictionaries.
- When you find a spelling error, make a note of it.
When you've done a session, post the error notes to a comment in this issue.

So far, I've worked through line 2930 (?1?apaTAsa...) and have come up with 31 corrections.

The next comment has the first batch of corrections. Anyone who helps can use this simple format. Sticking to this format will allow me to write a simple program to parse the corrections and autogenerate most of the change transactions. Note, I'm keeping the AP corrections separate from the AP90.

funderburkjim commented 7 years ago

Batch 1 - lines 1-2930

Jan 25, 2017. Extra AP90 corrections
L = 235, aGa -> aG type=p, missing virama
L = 539, atigraha -> atigrah, type=t
L = 613, ativrahmacaryaM -> atibrahmacaryaM
L = 655, ati -> ati-lomaSa or ati-romaSa
L = 749, atf -> attf, type=t
L = 1341, anApta -> anAptf, type=p,cf. ap90
L = 2472, aprarizwiH -> aparizwiH
L = 2629, apaTAsaH-> apahrAsaH
==================
Jan 25, 2017. Extra AP corrections
L=280, aNka -> aNk, type=p, missing virama
L=428, ajahalliNgama -> ajahalliNgam , type=p, missing virama
L=584, atikAnta -> atikrAnta, type=t  (also others)
L = 587, atikamaRam -> atikramaRam, type=t  (also others)
L = 588, atikamaRIya -> atikramaRIya
L = 589, atikudDa -> atikrudDa
L = 590, atikUra -> atikrUra
L = 816, atyahita -> atyayita, type=p 
L = 822, atala -> atula, type=p, missing vowel diacritic
L = 1048, aDikam -> aDikram
L = 1436, anAkanda-> anAkranda
L = 1437, anAkAnta -> anAkrAnta
L = 1681, anukIH -> anukrIH
L = 1693, anukakaca -> anukrakaca
L = 1696, anukam -> anukram
L = 2416, anyAddakz -> anyAdfkz
L = 2474, anvArohaRama -> anvArohaRam, type=p, missing virama
L = 2502, apakalaNkakaH -> apakalaNkaH , type=p? cf mw, ap90
L = 2530, apakoSaH -> apakroSaH
L = 2715, aparikama -> aparikrama
L = 2906, apahnAsaH -> apahrAsaH
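
A minimal sketch of how correction lines in the format above might be parsed. The names `Correction` and `parse_correction` are hypothetical and this is not the program actually used; it only assumes the `L = <number>, <old> -> <new>` shape seen in the batch, with an optional `type=` tag and free-text note.

```python
import re
from typing import NamedTuple, Optional


class Correction(NamedTuple):
    lnum: int             # line number given after "L ="
    old: str              # current (suspect) spelling
    new: str              # proposed spelling
    ctype: Optional[str]  # letter after "type=", if any (e.g. 'p' or 't')
    note: str             # whatever follows the new spelling


# Tolerates "L=280", "L = 280" and a missing space before "->".
_LINE = re.compile(r"^L\s*=\s*(\d+)\s*,\s*(\S+?)\s*->\s*(\S+)\s*(.*)$")
_TYPE = re.compile(r"type=(\w)")


def parse_correction(line):
    """Return a Correction, or None for separators and other non-matching lines."""
    m = _LINE.match(line.strip())
    if m is None:
        return None
    lnum, old, new, rest = m.groups()
    new = new.rstrip(",")  # "anAptf," -> "anAptf"
    t = _TYPE.search(rest)
    note = rest.lstrip(", ").strip()
    return Correction(int(lnum), old, new, t.group(1) if t else None, note)


print(parse_correction("L = 235, aGa -> aG type=p, missing virama"))
```

Lines such as `ati -> ati-lomaSa or ati-romaSa` parse with the alternative left in the note, so they would still need manual review.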

gasyoun commented 7 years ago

?2?hotvan NO-AP90 hotvan

means there is no hotvan in AP90, right?

In a browser, open two copies of the list display, one for AP and one for AP90

We can't have the yellow box here in page scans to ease word search, right?

?1?hotrIya,hotriya hotrIya hotriya

hotriya:AP,CAE,CCS,GRA,MD,MW,PUI,PW,PWG hotrIya:AP90,BEN,CAE,MW,MW72,PW,PWG,SHS,VCP,WIL,YAT hotrIyaM:SKD

Belonging to an oblation same meaning. GRA has it from Rigveda. Both valid as per PW,PWG, but for Apte I would go for only one, because meanings identical.

funderburkjim commented 7 years ago

no hotvan in AP90, right?

Yes.

We can't have the yellow box here in page ...

I'm not sure what 'yellow box' means. But, it has something to do with developing a UI.

I'm ambivalent about this UI development. If it takes me 8-10 hours to develop a helpful UI, is it worth doing? Will there be enough other participation to make the development time and effort cost effective? Another point of view might be that whatever the immediate benefit of UI development in a particular case, I should do it because otherwise there will definitely be almost no participation by others.

This is a quandary.

gasyoun commented 7 years ago

UI development in a particular case, I should do it because otherwise there will definitely be almost no participation by others.

Yes, that's obvious. And with @SergeA for sure. But in this case I hardly understand why the old code, the UI already developed can't be modified. It's too different, right?

funderburkjim commented 7 years ago

that's obvious

I guess this is not yet obvious to me. I'm so used to using ad-hoc methods. Perhaps I need to change my mindset to one of almost always thinking of UI as an essential component of problem solution. The benefit is that UI enables contribution and engagement by others, and this engagement has numerous unexpected benefits.

In terms of an appropriate UI for this case...

The difference here, it seems to me, is that there are several entries that need to be examined together to understand the situation. Take the first '?' example.

?1?aGa,aGana aGa aGana

The Python comparison process generated this example, but you can't understand what is going on by looking just at this line. You have to see, in this case, the two prior lines:

aG NO-AP90 aG
aGa aGa aGa
?1?aGa,aGana aGa aGana

Now, even before looking at dictionaries, it seems clear that 'aG' is a verb in AP. Why is it not in AP90? Then, we see that 'aGa' (presumably some adjective) occurs in both AP90 and AP. Then, we see, on the third line, that there is a 2nd 'aGa' in AP90, which is paired with word 'aGana' in AP.

So now, we can speculate that maybe that first aGa in AP90 (the one in 2nd line) maybe really should be an 'aG' -- we know that sometimes virAmas are missed, either in digitization or print.

So now we are ready to look at dictionary entries. We need to look at 'aGa' in AP90 and see if the first one really should be spelled 'aG' - and we find this to be so (the print is missing a virAma). Then, we can double check that this corrected 'aG' in ap90 corresponds in sense to the already present 'aG' in AP. It does.

So we generate the correction

L = 235, aGa -> aG type=p, missing virama

Anyway, that's the process that seems to be relevant. And the other cases so far examined by me are somewhat similar.
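
A small sketch of the "look at the neighborhood" step, assuming the merged file is plain text with one case per line and that '?' cases start with a question mark, as in the examples quoted in this thread. The function name `question_cases` and the default of two lines of context are assumptions, not part of the existing comparison program.

```python
def question_cases(lines, before=2, after=0):
    """Yield (index, context) for every line that starts with '?'.

    context is the slice of lines from `before` lines earlier
    through `after` lines later, including the '?' line itself.
    """
    for i, line in enumerate(lines):
        if line.startswith("?"):
            start = max(0, i - before)
            yield i, lines[start:i + after + 1]


sample = [
    "aG NO-AP90 aG",
    "aGa aGa aGa",
    "?1?aGa,aGana aGa aGana",
]
for index, context in question_cases(sample):
    print(index, context)
```

Something like this could feed a display that shows each case together with its neighbors, instead of requiring a search through the text file.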

But I'm still not sure what the UI should look like for a 'case'.

gasyoun commented 7 years ago

I guess this is not yet obvious to me. I'm so used to using ad-hoc methods. Perhaps I need to change my mindset to one of almost always thinking of UI as an essential component of problem solution. The benefit is that UI enables contribution and engagement by others, and this engagement has numerous unexpected benefits.

I mean in most cases where I can live without UI, @SergeA can't. He was waiting a few years since I convinced you to try the first UI and now you see, after @Shalu411 is gone and @drdhaval2785 frozen, that UI makes a difference. But in this AP vs. AP90, PWG vs. PW, MW vs. MW72 - and UI made for one will work for at least 2 different pairs - the most important dictionaries and hundreds of misspelled headwords.

You have to see, in this case, the two prior lines:

Sure, but an HTML page with clickable links would make more sense than copy-pasting hundreds of times.

we know that sometimes virAmas are missed, either in digitization or print.

So we can generate a sublist of possible cases where the only difference might have been a dropped virama?

But I'm still not sure what the UI should look like for a 'case'.

Even if it would be just a list of relevant entries from sanhw1, like

hotriya:AP,CAE,CCS,GRA,MD,MW,PUI,PW,PWG hotrI:AP,AP90,MW,MW72,SKD hotrIya:AP90,BEN,CAE,MW,MW72,PW,PWG,SHS,VCP,WIL,YAT hotrIyaM:SKD

and words with links (an HTML page is a primitive UI as well, a GUI), instead of pure txt, that would speed things up, and there would be no need to have 2 windows open initially; at least I would not use them

hEmavatI hEmavatI hEmavatI ?1?hEyaNgavInam,hEyaNgavam hEyaMgavInaM hEyaMgavam ?2?hEraRyavAsas hEraRyavAsas NO-AP ?2a?hEraRya NO-AP90 hEraRya ?3?hEraRyaka NO-AP90 hEraRyakaH hErika hErikaH hErikaH

drdhaval2785 commented 7 years ago

Hi @funderburkjim, I am back from my slumber. Will be willing to work on some pending issue. Can you give me some coding work of useful nature, so that I can contribute. I would not get too much of time to jump into correction submission as of now, but tools I can create.

drdhaval2785 commented 7 years ago

Based on my experience with my paper on normalizing headwords, there is one tip specifically for AP90.

AP90 has a tendency to use M instead of m at the end. If M is converted to m before comparison, some false positives may be weeded out.

I didn't go through the file, but Marcis' comment has one example. There may be other similar cases. I guess in the present case we accounted for the N-M comparison, but not for the terminal M-m comparison.

?1?hEyaNgavInam,hEyaNgavam hEyaMgavInaM hEyaMgavam
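
A rough sketch of the normalization suggested here, assuming SLP1 spellings. It maps a word-final M to m and an M before a stop to the nasal of that stop's class. The function name `normalize` is hypothetical, and the rules actually used to build hwnorm1c may differ.

```python
# SLP1 stops grouped by class, each with its class nasal.
CLASS_NASAL = {}
for stops, nasal in (("kKgG", "N"), ("cCjJ", "Y"), ("wWqQ", "R"),
                     ("tTdD", "n"), ("pPbB", "m")):
    for stop in stops:
        CLASS_NASAL[stop] = nasal


def normalize(headword):
    """Normalize anusvara (M) for comparison purposes only."""
    out = []
    for i, ch in enumerate(headword):
        nxt = headword[i + 1] if i + 1 < len(headword) else ""
        if ch == "M":
            if nxt == "":
                ch = "m"              # terminal M -> m
            elif nxt in CLASS_NASAL:
                ch = CLASS_NASAL[nxt]  # M before a stop -> class nasal
        out.append(ch)
    return "".join(out)


print(normalize("hEyaMgavInaM"))  # hEyaNgavInam
```

Applied to both columns before comparison, this would make the AP90 spelling hEyaMgavInaM agree with the key hEyaNgavInam.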

funderburkjim commented 7 years ago

hEyaNgavInam

Good idea. I'll look into that.

Regarding 'tools' - I'm not exactly sure what this covers. I consider the displays, esp. the apidev displays to be tools. It might be useful to have a display based on the hwnorm1c data. This would require (a) building a database (sqlite file) and (b) a search suggestion function (php) for this database that would take into account spelling normalization. This much would probably be a fairly self-contained task, which could then be the front end of a Cologne display for multidictionary lookup.

If this sounds interesting to you, I'll think in more detail what ingredients (and prototypes) you might need to construct this.

funderburkjim commented 7 years ago

Batch 1: probable cases of missing virAma in AP spelling

16 cases. Status: DONE @SergeA deva-UI

SLP1 UI

@gasyoun 's comments got me to thinking more about how to leverage the existing UI (that used in #332, #334) in the remaining AP/AP90 cases.

There are some subsets (filters) of the remaining cases that can be examined in the previous way.

As a start, this batch deals with the cases where the AP spelling is the same as the AP90 spelling, but with an extra 'a' at the end. There are 16 cases and I suspect that most of them are cases where there is a missing virAma in the AP spelling.

There are also a few other filters that might be analyzable similarly.
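
As an illustration of such a filter, here is a sketch that picks out pairs where the AP spelling equals the AP90 spelling plus a given suffix. The function name and the `(ap90, ap)` tuple layout are assumptions, and the sample pairs are built from corrections already noted in this thread purely for illustration.

```python
def ap_has_extra_suffix(pairs, suffix="a"):
    """Return the (ap90, ap) pairs where ap == ap90 + suffix."""
    return [(ap90, ap) for ap90, ap in pairs if ap == ap90 + suffix]


pairs = [
    ("aNk", "aNka"),                  # AP has an extra final 'a'
    ("ajahalliNgam", "ajahalliNgama"),
    ("atikrAnta", "atikAnta"),        # different kind of discrepancy
]
print(ap_has_extra_suffix(pairs))
```

Swapping the two columns gives the mirror-image filter (extra 'a' in AP90), and passing a different suffix covers other ending differences.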

funderburkjim commented 7 years ago

Batch 2: probable cases of missing virAma in AP90 spelling

There are 20 of these. Status: done by @SergeA (29 Jan 2017)

deva url UI

slp1 url UI

funderburkjim commented 7 years ago

Batch 3: probable cases of missing 'r' in AP spelling

31 cases. done by @SergeA (29 Jan 2017)

UI Devanagari UI SLP1

In #334, we noticed many digitization spelling errors where one of the ligatures for 'kr' in the AP dictionary had been misinterpreted as 'k'.

There are still some of these remaining that were not caught in #334. Perhaps the present list of 31 will get all (or most) such cases.
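
A sketch of a test for this pattern, assuming the AP90 spelling is taken as the reference: the AP spelling is flagged when it can be obtained from the AP90 spelling by deleting a single 'r'. The name `missing_r` is hypothetical.

```python
def missing_r(ap90, ap):
    """True if deleting exactly one 'r' from ap90 yields ap."""
    if len(ap90) != len(ap) + 1:
        return False
    return any(
        ch == "r" and ap90[:i] + ap90[i + 1:] == ap
        for i, ch in enumerate(ap90)
    )


print(missing_r("atikrAnta", "atikAnta"))   # True
print(missing_r("apahrAsaH", "apahnAsaH"))  # False: a substitution, not a dropped 'r'
```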

funderburkjim commented 7 years ago

Batch 4 possible cases of AP spelling error

23 cases. Status DONE @funderburkjim 01/30/2017.

NOTE: many of these are hard to decide.

UI Devanagari UI SLP1

In these cases, the merging of the headwords of AP90 and AP

- paired two differently spelled words
- The edit distance between the two is exactly 1
- The AP spelling occurs ONLY in AP
- The AP90 spelling occurs in some other dictionary than AP90

Thus, there is probable cause to think that the AP spelling might be wrong.

funderburkjim commented 7 years ago

Batch 5 possible cases of AP OR AP90 spelling error

36 cases. Status DONE @funderburkjim 01/30/2017

NOTE: Almost all of these were actually AP90 errors.

UI Devanagari UI SLP1

In these cases, the merging of the headwords of AP90 and AP

- paired two differently spelled words
- The edit distance between the two is exactly 1
- The AP spelling occurs ONLY in AP
- The AP90 spelling occurs ONLY in AP90

Thus, there is some possibility that one or the other dictionaries has a spelling error - no a-priori evidence to favor either one. But the pairing suggests that we should look at these cases.

funderburkjim commented 7 years ago

Batch 6: possible cases of AP90 spelling error

5 cases. Status DONE @funderburkjim 01/30/2017.

UI Devanagari

UI SLP1

In these cases, the merging of the headwords of AP90 and AP

- paired two differently spelled words
- The edit distance between the two is exactly 1
- The AP90 spelling occurs ONLY in AP90
- The AP spelling occurs in some other dictionary than AP90

Thus, there is probable cause to think that the AP90 spelling might be wrong.
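
The selection rules of batches 4, 5 and 6 can be sketched as one classifier. This is an assumption-laden illustration, not the code that generated the batches: `where` is a hypothetical mapping from spelling to the set of dictionaries containing it (as in the sanhw1-style listings quoted earlier), and the returned labels are made up.

```python
def edit_distance_is_one(a, b):
    """True when a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substituted character
    return a[i:] == b[i + 1:]          # one extra character in b


def classify(ap90, ap, where):
    """Label a paired (ap90, ap) case, or return None if no rule applies."""
    if not edit_distance_is_one(ap90, ap):
        return None
    ap_only = where.get(ap, set()).issubset({"AP"})
    ap90_only = where.get(ap90, set()).issubset({"AP90"})
    if ap_only and not ap90_only:
        return "AP suspect"       # batch 4 style
    if ap_only and ap90_only:
        return "either suspect"   # batch 5 style
    if ap90_only:
        return "AP90 suspect"     # batch 6 style
    return None                   # both spellings attested elsewhere


where = {
    "hotriya": {"AP", "CAE", "CCS", "GRA", "MD", "MW", "PUI", "PW", "PWG"},
    "hotrIya": {"AP90", "BEN", "CAE", "MW", "MW72", "PW", "PWG", "SHS",
                "VCP", "WIL", "YAT"},
}
print(classify("hotrIya", "hotriya", where))  # None
```

The hotriya/hotrIya pair from earlier in the thread falls outside all three batches, since both spellings are attested in other dictionaries.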

funderburkjim commented 7 years ago

The 6 batches above are ready.

If you tackle any of them, just make a note in the comments (maybe change the status from TODO to DONE and add your user name).

If any remain to be done on Monday, I'll do them then.

Also, if anyone has ideas of other specific filters that might be programmed, do mention.

gasyoun commented 7 years ago

@funderburkjim very interesting. Experimenting with non-sandhi headwords

Rd

zawKaRda:SCH KaRda:IEG caRdeSvarapperuvilE:IEG amAvAsyaSARdilyAyana:VEI aBizekamaRdapa:IEG aRdika:IEG

nq

kUpadanqa:MW OCR error danqa:PE OCR error

St ST

sw sW

0

Rn nR

aRnimittaka:PD aRnirUpita:PD zaRnavatiSrAdDanirRaya:ACC zaRnavatiSrAdDaprayoga:ACC sanRI:STC suvaRnakadalI:SKD

RY YR

aYRit:PD is false positive, because a grammar term (a-ñṇit) (Gr.) (pratyaya) other than those marked by the indicatory letters ñ and ṇ

nj - 14 results, but all seem non-changable

ganjwar:IEG

Ys

anaYsa:PD anaYsamAsa:PD anaYsamAsagrahaRa:PD anaYsamAsatva:PD naYsamAsa:ACC,MW naYsUtrArTavAda:ACC,MW

SergeA commented 7 years ago

The situation with k/kr in AP1957 is very sad. Many and many words in the examples are misspelled. Please, count the number of "kr" in ap90 and ap1957. The difference will give the approximate number of erroneous cases. I think on the basis of ap90 it is possible to make a "kr"-word list and then search those words with misspelled "k" in ap1957. But if there will be too many such cases it makes sense to do a special UI with supposed "kr"-correction and two buttons - accept or reject.
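
A possible sketch of this idea, assuming two plain word lists are available (one extracted from ap90, one from ap1957). For each ap90 word containing 'kr', it drops the 'r' and checks whether that variant occurs in ap1957 while being absent from ap90. The function name and the list inputs are hypothetical.

```python
def kr_candidates(ap90_words, ap_words):
    """Return (suspect_ap_word, ap90_word) pairs for probable k/kr errors."""
    ap90_set, ap_set = set(ap90_words), set(ap_words)
    found = []
    for word in sorted(ap90_set):
        pos = word.find("kr")
        while pos != -1:
            variant = word[:pos + 1] + word[pos + 2:]  # drop the 'r' after 'k'
            # Suspect only if AP has the k-form, lacks the kr-form,
            # and the k-form is not itself a word in AP90.
            if variant in ap_set and variant not in ap90_set and word not in ap_set:
                found.append((variant, word))
            pos = word.find("kr", pos + 1)
    return found


ap90 = ["atikrAnta", "atikramaRam", "akza"]
ap = ["atikAnta", "atikramaRam", "akza"]
print(kr_candidates(ap90, ap))  # [('atikAnta', 'atikrAnta')]
```

Each pair could then be presented in a UI with accept and reject buttons, as proposed above.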

funderburkjim commented 7 years ago

Under Batch 5 #6 udGfzwam (AP) v. udDfzwam(AP90) -

AP is right (it is G not D).

But both dictionaries have numerous headwords near this where a 'udG' is misspelled as 'udD' [The misspelling can in part be confirmed by alphabetical ordering.]

For AP:

udgrIva  ok
udDaH  -> udGaH   
udDanaH -> udGanaH
udGAtin ok
udGaw ok
udDawitam -> udGawitam
udDAwaH -> udGAwaH
udDAwakaH wrong
udDAwana wrong
udDAwita wrong
udDawwakaH wrong
udDawwanam wrong
udDawwita wrong
udDasam wrong
udDAtaH wrong
udGAtin wrong
udDuz wrong
udGuzwa ok
udDozaH wrong
udGfz ok
udDarzaRam wrong
udGfzwam ok
udDoRa wrong
uddaMSaH ok ---  now we're into udd...

For AP90:

udgrIva
udDaH wrong
udDanaH wrong
udGAtin ok
udGaw ok
udDawitaM wrong
udDAwaH wrong
udDAwakaH wrong
udDAwana wrong
udDAwita wrong
udDawwakaH wrong
udDawwanaM wrong
udDawwita wrong
udDasaM wrong
udDAtaH wrong
udDuz wrong
udGuzwa ok 
udDozaH wrong
udDfz wrong
udDarzaRaM wrong
udDfzwaM wrong
uddaMSaH ok  now we're into udd words.
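
The alphabetical-ordering check can be sketched as follows, assuming SLP1 spellings and the usual Sanskrit letter order. A headword is flagged when it sorts after some later headword in the list. This is only a hint generator: the names are hypothetical, and real dictionary ordering (anusvara, visarga, inflected endings) is subtler than this key.

```python
SLP1_ORDER = "aAiIuUfFxXeEoOMHkKgGNcCjJYwWqQRtTdDnpPbBmyrlvSzsh"
RANK = {ch: i for i, ch in enumerate(SLP1_ORDER)}


def slp1_key(word):
    """Sort key following SLP1 letter order; unknown characters sort last."""
    return [RANK.get(ch, len(RANK)) for ch in word]


def out_of_order(headwords):
    """Return headwords that sort after at least one later headword."""
    flagged = []
    smallest_later = None
    for hw in reversed(headwords):
        key = slp1_key(hw)
        if smallest_later is not None and key > smallest_later:
            flagged.append(hw)
        else:
            smallest_later = key
    flagged.reverse()
    return flagged


excerpt = ["udgrIva", "udDaH", "udGaw", "udGuzwa", "uddaMSaH"]
print(out_of_order(excerpt))  # ['udDaH']
```

In this excerpt udDaH is flagged because D sorts after both G and d, so a genuine udD... word could not precede udGaw or uddaMSaH.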

funderburkjim commented 7 years ago

Batches 4-6 finished.

Ready to begin install process.

funderburkjim commented 7 years ago

Batches 7,8

status: DONE (@funderburkjim 01/30/2017)

For convenience in installing, made UIs for the 'udD' cases mentioned above.

ap slp1 and ap90 slp1

funderburkjim commented 7 years ago

@SergeA noticed, in regard to the spelling of arTApaya in ap90 and arTApay in AP:

there is no unique standard for the nominal verb bases :( AP90 gives like MW AP1957 gives like PWG

This kind of correspondence can help refine the correspondences of hwnorm1.
The hard part may be to find filters 'like' these. We have an existing filter for the MW cases in this MWvlex file. Namely, search for <vlex>Nom.</vlex>. This should provide a good starting point for finding most nominal verbs in other dictionaries.

funderburkjim commented 7 years ago

One reason for apparently duplicate cases.

For instance SuBa -> SuB. The spelling 'SuBa' is seen on more than one line of the entry.

In most versions of the program that generates cases, this situation causes a different case for each line containing the string of letters ('SuBa'). In a few versions of the program, where we're focused on headwords, only the first line (the one with the headword) generates a case.

funderburkjim commented 7 years ago

Corrections re batches 1-8 now installed.

funderburkjim commented 7 years ago

Batch 9. likely AP errors

status: DONE 02/02/2017 @funderburkjim

26 cases

slp1 UI and Deva UI

~~In these cases, the AP90 spelling ends in 'a', and the AP spelling is the same, but with an ending anusvara. Examination of a few cases leads to the suspicion that the AP anusvara is in error, as the entry is an adjective.~~

As Dhaval points out below, the description of this batch is wrong.

It is the AP90 spelling which has the ending 'M': AP90 = AP+'M'. I still suspect that mostly AP is wrong, and that the reason will typically be that the AP entry is NOT an adjective (the text is not marked as a.).

funderburkjim commented 7 years ago

Batches 10,11 likely AP errors

status: Batch 10 done, 02/02/2017 @SergeA and @funderburkjim Batch 11 done, 02/02/2017 @funderburkjim 26 cases

batch 10: 31 cases: slp1 UI and Deva UI

batch 11: 23 cases: slp1 UI and Deva UI

These are some randomly chosen cases that I suspect are AP spelling errors. They include the [thankfully small] number of additional 'k/kr' errors in AP headwords, a few 'J/jY' errors, and various others that caught my eye.

There are several 'duplicate' cases -- I thought these were removed, but apparently not :( . Luckily, it's easy in the UI to call the duplicates 'no change' and move to the next case.

However, a few apparent 'duplicate' cases occur because there are two entries in AP with the same headword spelling in our digitization. It's possible that one of these spellings is right and the other one wrong.

drdhaval2785 commented 7 years ago

Batch 9. likely AP errors

In these cases, the AP90 spelling ends in 'a', and the AP spelling is the same, but with an ending anusvara. Examination of a few cases leads to the suspicion that the AP anusvara is in error, as the entry is an adjective.

Examination reveals that the logic was not properly translated into code. The output has AP90 having M at end and AP not having M at end. This seems to be default behaviour.

You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/sanskrit-lexicon/CORRECTIONS/issues/335#issuecomment-276783020, or mute the thread https://github.com/notifications/unsubscribe-auth/AFfQ_KrcnfJtG24Pe1U6hEG1osLdxTk8ks5rYPSAgaJpZM4LuSPm .

On 2 Feb 2017 02:39, "funderburkjim" notifications@github.com wrote:

Batch 9. likely AP errors

status: TODO

26 cases

slp1 UI http://www.sanskrit-lexicon.uni-koeln.de/scans/APScan/2014/pywork/correctionwork/issue-335b/205/update.php and Deva UI http://www.sanskrit-lexicon.uni-koeln.de/scans/APScan/2014/pywork/correctionwork/issue-335b/205/update.php?input=deva

In these cases, the AP90 spelling ends in 'a', and the AP spelling is the same, but with an ending anusvara. Examination of a few cases leads to the suspicion that the AP anusvara is in error, as the entry is an adjective.

funderburkjim commented 7 years ago

logic was not properly translated into code.

Right, there is discrepancy between code and description.

Note revision to description in comment above.

funderburkjim commented 7 years ago

@drdhaval2785 Did you notice my reply above to your 'work on tools' request?

Thought you might not have seen it since no reply from you. There are other possibilities besides the hwnorm1 idea.

drdhaval2785 commented 7 years ago

Yes I did. I didnt know how much I will be able to contribute. So was thinking how to draft an answer. For me, tools mean non-HTML stuff. I would not be able to contribute towards UI of any sort.

funderburkjim commented 7 years ago

@drdhaval2785

the hwnorm1 idea

There are some aspects of this which definitely would involve UI, but some parts would not.
The parts that would not involve UI might include:

Review the logic of the current hwnorm1c construction (i.e., the rules for normalizing spelling). I made these up some time ago, but it would be good to have someone else review the ideas.
Improve the normalization by taking into account some dictionary-specific headword spelling conventions. For instance, SKD almost always uses nom. sg. form for nouns, so 'pitA' for 'pitf', for instance. Clearly SKD's pitA should be considered the same as pitf in other dictionaries. But how to do this to avoid false positives? I'm not sure.
There's also the question of how to properly associate roots in the different dictionaries. For instance, our digitization of WIL has 'gama' for the root, but there is also a m. noun 'gama' in WIL. We should associate the WIL verb entry 'gama' with the usual 'gam' of other dictionaries, but associate the m. noun 'gama' of WIL with the usual 'gama' of other dictionaries. How to do this?
There are many other such relations among headword spellings of specific dictionaries. Not all the relations have to be resolved at once.
The UI part has interesting aspects, such as how to present a multi-dictionary display. Or how to have a suite of multi-dictionary displays? But all these questions will only be as good as the underlying headword correspondences present in hwnorm1c.txt (or some later enhanced version)

pwg literary sources

We have made the requisite data files for supporting links to the names of literary sources for PW.

I think we have the requisite information needed to do the similar for PWG. Once this infrastructure is available, then I can do the relatively easy part of enhancing the displays of PWG to add links.

drdhaval2785 commented 7 years ago

On 2 Feb 2017 9:15 a.m., "funderburkjim" notifications@github.com wrote:

@drdhaval2785 https://github.com/drdhaval2785 the hwnorm1 idea

There are some aspects of this which definitely would involve UI, but some parts would not. The parts that would not involve UI might include:

review the logic of the current hwnorm1c construction (i.e., the rules for normalizing spelling). I made these up some time, but it would be good to have someone else review the ideas.

Will do so.

Improve the normalization by taking into account some dictionary-specific headword spelling conventions. For instance, SKD almost always uses nom. sg. form for nouns, so 'pitA' for 'pitf', for instance. Clearly SKD's pitA should be considered the same as pitf in other dictionaries. But how to do this to avoid false positives? I'm not sure.

https://github.com/sanskrit-lexicon/hwnorm1/blob/master/normalization.pdf may help in this regards. It covers all 33 dictionaries on points mentioned. Jim, what I would like to hear from your side is - what other places do dictionaries differ in conventions. You note the places, I will do comprehensive research. I intend to make version 2 of this paper comprehensive. This will take care of dictionary specific tweaks.

There's also the question of how to properly associate roots in the different dictionaries. For instance, our digitization of WIL has 'gama' for the root, but there is also a m. noun 'gama' in WIL. We should associate the WIL verb entry 'gama' with the usual 'gam' of other dictionaries, but associate the m. noun 'gama' of WIL with the usual 'gama' of other dictionaries. How to do this?

The only way I see to do so would be to separate the entries in different dictionaries on basis of meaning and not on headwords. Then only lexical and semantic similarities may be tagged properly.

pwg literary sources

We have made the requisite data files for supporting links to the names of literary sources for PW.

I think we have the requisite information needed to do the similar for PWG. Once this infrastructure is available, then I can do the relatively easy part of enhancing the displays of PWG to add links.

This seems interesting. So we go through the whole process which we did for PW or do some smart work? I guess many would be common in PW and PWG. Let us make final PW literary source list as our starting list of literary sources of PWG. Whatever is missing or new can be altered accordingly.

— You are receiving this because you were mentioned.

Reply to this email directly, view it on GitHub https://github.com/sanskrit-lexicon/CORRECTIONS/issues/335#issuecomment-276860645, or mute the thread https://github.com/notifications/unsubscribe-auth/AFfQ_H1cJxJTTKusyZOknYZbJgcTKhdIks5rYVFwgaJpZM4LuSPm .

funderburkjim commented 7 years ago

normalization of dictionary structure

This is another idea prompted primarily by @drdhaval2785 's inquiry.

@fxru did the first step in gathering current information about the DTDs of the various dictionaries. But I haven't had time to build on this work. However, I think an important deficiency of the current state of the Cologne digitizations is lack of uniformity in certain details:

of coding of words. I'm thinking primarily about the AS (letter-number) v. IAST Unicode representation of the various IASTs of the dictionaries. Some recent work has addressed this question for a few of the dictionaries, but similar work needs to be done for all the dictionaries. The task for a particular dictionary is straightforward: construct a transcoding xml file and apply it. But this has to be done carefully, with attention both to the dictionary's actual IAST conventions and to current IAST standards. This also should be applied to MW.
of obsolete markup. There are some details of markup which seem obsolete (e.g. {|...|} to indicate 'wide spacing'). This occurs in several dictionaries and seems to serve no useful purpose. Obsolete markup needs to be identified and removed. (There is also obsolete markup in MW, e.g. the <c> element, and perhaps some others.)
of inconsistent markup - notably the markup identifying text in other languages. Probably there should be one standard markup (e.g., a <lang n="Greek">) which should be used to replace all the various ad-hoc markup. This also applies to MW.
of inconsistent representation of entries in the digitization. Currently, each dictionary identifies the headwords in its own idiosyncratic way. While the xml form of the dictionaries provides uniformity, I think we need to do the same thing at an earlier stage of the dictionary process, so that the digitizations themselves (not just the xml derivative of the digitization) have a uniform structure.

This work of providing uniformity among the dictionaries will be of immense value to further work on the dictionaries, both by us and by others in the future. It will allow, even more than now, the development of tools to parse the dictionaries for various purposes of analysis, and will provide a foundation for the enhancement of the dictionaries by adding markup.

funderburkjim commented 7 years ago

@drdhaval2785 I'll start by reviewing the status of the PWG literary source data. Will post in a separate issue under PWG.

gasyoun commented 7 years ago

suspicion that the AP anusvara is in error, as the entry is an adjective.

Well done.

SKD's pitA should be considered the same as pitf in other dictionaries.

Yeah, and I guess it has been left undocumented by Dhaval, or am I wrong?

associate roots in the different dictionaries

I would go listwise. First we extract all the known lists of roots from each dictionary. I have some concordances, let's sit down together and see how to automate. Too many manual operations are involved with dhatus.

I think we have the requisite information needed to do the similar for PWG. Once this infrastructure is available, then I can do the relatively easy part of enhancing the displays of PWG to add links.

Yeah, the research is almost over (for now) and could be implemented as it is.

I intend to make version 2 of this paper comprehensive. This will take care of dictionary specific tweaks.

Now that I call a good morning to start my day with.

But I haven't had time to build on this work.

Did he abandon it or finish it?

I think we need to do the same thing at an earlier stage of the dictionary process, so that the digitizations themselves (not just the xml derivate of the digitization), has a uniform structure.

Should we really care much about it? Sure it would be good, but is it a priority and of practical value?

funderburkjim commented 7 years ago

Batches 9-11 are ready for installation.

@SergeA Thanks!

funderburkjim commented 7 years ago

The corrections of batches 9-11 have now been installed.

After rerunning the program that merges AP and AP90 headwords, there are now about 700 that are marked with question marks. A quick examination suggests that there are probably still quite a few misspellings. Since this issue is getting rather extended, I'm closing it and opening another to handle further cases.

The ap90_ap_hw2_short.txt file has been revised.

I'll develop UIs for more corrections next week.

@SergeA There are about 230 of the [?] cases that match 2 words, one from AP and one from AP90. These cases seem like the most fertile ground for examination.

I was thinking about doing all these in 10 batches or so, with corresponding batches and cases, one for AP and one for AP90. This is because there is no obvious way to guess whether the error for a given case is in the AP spelling or in the AP90 spelling. Then, the procedure would be to open up and work at the same time on two batches: batch1-AP and batch1-AP90, batch2-AP and batch2-AP90, etc.

Does this sound like a reasonably efficient approach ?

funderburkjim commented 7 years ago

@SergeA Here is a crude computation relating to the 'k/kr' issue .

'kr' occurs in 2801 lines (out of 267898 lines) in AP.txt (1.05%)

'kr' occurs in 2167 lines (out of 199968 lines) in AP90.txt (1.08%).

The lower percentage might be taken as (crude) evidence that there are 0.03% k/kr errors remaining in AP.txt, or about 80 lines. If this computation is not completely bogus, that's a fairly small number, so the worst of this problem with AP.txt is behind us.
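
The arithmetic behind this estimate can be sketched as follows. The function names are hypothetical; the counts are the ones quoted above. `count_kr` also reports occurrences in addition to lines, in case one wants to compare both measures.

```python
def count_kr(lines, needle="kr"):
    """Return (number of lines containing needle, total occurrences of needle)."""
    n_lines = sum(1 for line in lines if needle in line)
    n_occurrences = sum(line.count(needle) for line in lines)
    return n_lines, n_occurrences


def estimated_missing(hits_ref, total_ref, hits_target, total_target):
    """Lines the target would gain if it had the reference's hit rate."""
    return (hits_ref / total_ref - hits_target / total_target) * total_target


# Reference: AP90.txt; target: AP.txt (figures from the comment above).
estimate = estimated_missing(2167, 199968, 2801, 267898)
print(round(estimate))
```

With the rounded rates (1.08% minus 1.05%) the estimate is about 80 lines; with unrounded rates it comes out nearer 100. Either way it is a crude figure of the same order of magnitude.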

gasyoun commented 7 years ago

0.03% k/kr errors remaining in AP.txt, or about 80 lines

Adorable stats. Yeah, the approach is what we can only dream of.

SergeA commented 7 years ago

Then, procedure would be to open up and work at the same time on two batches: batch1-AP and batch1-AP90, batch2-AP and batch2-AP90 , etc. Does this sound like a reasonably efficient approach ?

Maybe yes. But I see here one problem - both batches will use the same scan tab. And we need to keep open both scans simultaneously. Is it possible to separate them as scan_tab_1 and scan_tab_2?

there are 0.03% k/kr errors remaining in AP.txt, or about 80 lines

Is it right to count lines and not occurrences? Looks too good to be true. I was afraid there were thousands of them.
