Request of SLP1 text of all dictionaries

drdhaval2785 commented 10 years ago

Jim, if we can have the SLP1 text of all dictionaries in their respective repositories, we would be in a better position to play around with them and pick out the errors. E.g. I could access MW and PWG in SLP1. This way we could point out 91 possible errors in issue #2. If similar lists for the other dictionaries are also provided, we may find still more errors by comparing patterns.

gasyoun commented 10 years ago

If you mean a headword-only list, I can make any; see https://github.com/gasyoun/SanskritLexicography/tree/master/HeadwordLists for samples.

drdhaval2785 commented 10 years ago

A headword-only list should suffice for cleaning up the headwords first. Cleaning the entries is a bit more tedious. So right now, yes: only headword lists of all dictionaries from the Cologne site.

funderburkjim commented 10 years ago

An slp1 form of the headwords (for all dictionaries EXCEPT MW) is in file Xhw2.txt, as part of the Xxml download for the dictionary. For instance, for PW, go to download page for PW, and download pwxml.zip. In that download are several files, including:

pwhw2.txt the headwords in slp1 (a colon-separated file - headword in middle, as I recall)

pw.xml This has the Sanskrit Devanagari words in <s>X</s> elements, with X being in slp1.

Same for other dictionaries.
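
A minimal sketch of pulling the SLP1 strings out of the <s>X</s> elements, in case it helps. The sample record below is invented for illustration (real pw.xml lines carry more markup), and the helper name slp1_strings is hypothetical, not part of the download:

```python
import re

# Matches the contents of <s>...</s>, where the xml keeps Sanskrit text in SLP1.
S_ELEMENT = re.compile(r"<s>(.*?)</s>")


def slp1_strings(xml_line):
    """Return every SLP1 string found inside <s>...</s> on one line."""
    return S_ELEMENT.findall(xml_line)


# Hypothetical record, made up for this example only.
sample = "<H1><key1>agni</key1><body><s>agni</s> m. Feuer, vgl. <s>vahni</s></body></H1>"
print(slp1_strings(sample))
```

Running this over every line of X.xml would give the raw material for pattern comparisons, assuming each <s> element opens and closes on the same line.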

Is this enough to work from?

drdhaval2785 commented 10 years ago

@funderburkjim - I would not waste time in doing what @gasyoun is good at. @gasyoun - Please provide the list of slp1 of all possible dictionaries at https://github.com/gasyoun/SanskritLexicography/tree/master/HeadwordLists

gasyoun commented 10 years ago

@drdhaval2785 "list of slp1" - sure, let me master the batch part with my VBEE scripts, just in two weeks I'll be there. Doing manually one by one is no fun part.

drdhaval2785 commented 9 years ago

@gasyoun Now your PhD is over - I guess these two weeks time is over.

drdhaval2785 commented 9 years ago

@gasyoun Is this on the todo list? Otherwise we close the issue.

gasyoun commented 9 years ago

@drdhaval2785 reread all the comments and still can't get it - you want the full text of the dictionaries in SLP1 instead of just headwords? My converters can do harm, because the tags are different and could get lost.

funderburkjim commented 9 years ago

Ever since this issue was raised in Oct 2014, I have made it an objective to convert the base form of each digitization from HK to SLP1. By the base form of a digitization for dictionary X I mean X.txt. (e.g., pwg.txt, md.txt, etc.) This task has been done for 21 of 36 of the dictionaries (see below).

Let me explain a little more, using PWG as an example.

The original digitization of PWG exists as a text file named pwg_orig.txt. This is the file as obtained from Thomas. It has at least two features which make it hard to work with:
- It uses an old encoding for 8-bit ascii characters, called CP1252 (code-page 1252).
- Devanagari is coded using the Harvard-Kyoto (HK) transliteration.

The derived files are:
- pwg_orig_utf8.txt converts the cp1252 encoding of extended ascii to the current standard utf8 encoding. This is a comparatively straightforward conversion, and there is an inverse conversion in order to validate that no information loss occurs.
- pwg_orig_utf8_slp1.txt Here the coding of Devanagari is converted from HK to SLP1. This is actually what I call the 'base form' of the digitization. All the corrections we make are 'installed' starting with this version.
- pwg.txt is the current corrected form of the dictionary. It is also in the utf8 encoding, and Devanagari is coded as SLP1.

The construction of the slp1 base version (pwg_orig_utf8_slp1.txt) is surprisingly tricky. The reason is that there are various minor oddities in the HK coding. One especially tricky part is the use of the period punctuation mark in coding of text which is Devanagari. The period in 'standard' HK and SLP1 is used to represent the daRqa. However, this period also commonly occurs in the bilingual dictionaries as English (or German, etc.) punctuation. In some digitizations, Thomas has used a vertical bar in Devanagari to represent the daRqa, and a period to represent non-Sanskrit punctuation. But usually there are inconsistencies in the use of the period in text marked as Devanagari, and this question has to be addressed. It is a challenge and makes the construction of the slp1 form non-trivial, tedious and non-enjoyable. That's my excuse for why some of the dictionaries with HK-coded Devanagari have not been converted to SLP1 yet.

Here's a list where the Devanagari IS converted to SLP1 in the base form and there is an x_orig_utf8_slp1.txt form of the dictionary:

ACC,AP90,AP,BEN,BOR,BUR,CAE,CCS,GST,MCI,MD,MW72,PWG,PW,SCH,SHS,SKD,WIL,YAT

Here's the list where there is no x_orig_utf8_slp1.txt form:

MW,VCP  have slp1 coding. See below.
BHS,GRA,SNP,STC,VEI  have no Devanagari. All Sanskrit is in AS (Anglicized Sanskrit)

These 4 dictionaries have a small amount of HK coded Devanagari, and are a secondary TODO list.
IEG,INM (Devanagari only in preface), PE (26 instances), PUI (3 instances)

These dictionaries have substantial HK coded Devanagari.  They form the main TODO list.
AE,BOP,KRM,MWE,PD,PGN

MW Devanagari is already in SLP1 form. There are files mw_orig.txt and mw_orig_utf8.txt. mw_orig.txt is the form of MW1899 that Thomas provided way back in 2006 when Peter and I first became involved in the Cologne Sanskrit-Lexicon project that Thomas began in the 1990s. The current reference form for MW is mw.xml.
VCP The base form has a different file name: vcp0.txt, which is utf8 and has text in SLP1.

Incidentally, the x.xml files for all these dictionaries have Devanagari coded as SLP1.

gasyoun commented 9 years ago

Let me repeat. I'm afraid to ask questions to Jim, because when he starts to answer it's a new entry in a to-be-published Encyclopædia, if not a chapter. I hope you understand now, Dhaval, why I can't do all the tricks and could only add more mess. Jim is scientific from alpha to omega.

1) 21 of 36 since Oct 2014 means we might hope for full SLP1isation by early 2015. Half done in six months, as a background task. This is crucial at least in the part that is connected with the headwords, although none of the deeper issues comes out at this level.
2) call CP1252 -> called CP1252; danqa -> danda.
3) "inverse conversion in order to validate that no information loss occurs" - what script does it? Did not get pwg.txt how it's different from pwg_orig_utf8_slp1.txt. I was thinking it is based on the .xml file.
4) "Thomas has used a vertical bar in Devanagari to represent the danqa, and a period to represent non-Sanskrit punctuation" - should we keep this practice in the future? What would lessen our pain?
5) I hope that the list of dictionaries with no x_orig_utf8_slp1.txt will be worked through in the order described, so we get MW, VCP, AE and PD in a year or so.
6) "base form has a different file name: vcp0.txt" - should we unify, before it's too late?

funderburkjim commented 9 years ago

re: what script does inverse conversion? The script cp1252_to_utf8.py converts from cp1252 to utf8. The script utf8_to_cp1252.py does the inverse conversion, from utf8 to cp1252.

These scripts are part of the xml downloads for each dictionary.
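
For readers without the download at hand, here is a minimal sketch of the round-trip idea. It is not the actual content of cp1252_to_utf8.py or utf8_to_cp1252.py; it works on byte strings rather than files, and round_trip_ok is a hypothetical helper name:

```python
def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode CP1252 bytes as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


def utf8_to_cp1252(raw: bytes) -> bytes:
    """Inverse conversion: re-encode UTF-8 bytes as CP1252."""
    return raw.decode("utf-8").encode("cp1252")


def round_trip_ok(original: bytes) -> bool:
    """True when converting to UTF-8 and back reproduces the original bytes."""
    return utf8_to_cp1252(cp1252_to_utf8(original)) == original


# Sample text with an extended-ascii character, encoded the old way.
sample = "Wörterbuch".encode("cp1252")
print(round_trip_ok(sample))
# Note: a handful of byte values are undefined in Python's cp1252 codec;
# decoding them raises UnicodeDecodeError, which a real script must handle.
```

If the round trip reproduces the original file byte for byte, the utf8 version has lost no information.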

funderburkjim commented 9 years ago

re 'Did not get pwg.txt how it's different from pwg_orig_utf8_slp1.txt.'

pwg.txt contains corrections; when you or anyone submits a correction for pwg, this correction gets installed into pwg.txt; however, pwg_orig_utf8_slp1.txt does not get these corrections.

You could think of pwg.txt as the 'latest version' of pwg_orig_utf8_slp1.txt.

pwg.xml is created from pwg.txt (by script make_xml.py).

funderburkjim commented 9 years ago

Re Thomas' use (within coding of Devanagari) of vertical bar for danda, and period for English punctuation.

This convention is true in many dictionaries, but not all, as I recall.

We shouldn't keep this in coding of Devanagari. Since we have decided to use SLP1 as the coding system for Devanagari, we should follow the SLP1 conventions. In SLP1, the period represents danda.

But this then leaves open the question of how to represent, in SLP1, a 'true' period. The answer I've used is to take the true periods out of SLP1 - the true period is not Devanagari, so should not be included as part of a section of text identified as coding Devanagari.

For instance, suppose we see in a dictionary the English sentence: The word for dog is श्वन्. Thomas would typically code this as The word for dog is {#zvan.#} (note period inside {##}). If, in conversion to SLP1, this was coded as The word for dog is {#Svan.#} and if this were then converted back to Devanagari, we would see The word for dog is श्वन्।, which disagrees with the original sentence because the period of the original sentence has been treated as a danda. The solution is to have the SLP1 conversion of the sentence be The word for dog is {#Svan#}. (i.e., to move the period outside of the scope of the {##} Devanagari delimiters).

This is the approach taken in converting from Thomas' HK coding (e.g. pwg_orig_utf8.txt) to an SLP1 coding (pwg_orig_utf8_slp1.txt). This task is accomplished by a script called 'transcode.py' which is in the convertwork directory of the xml download for pwg.
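
The move itself can be sketched with a regular expression. This is only an illustration of the dog example above, not the logic of transcode.py: it blindly treats every period just before the closing #} as English punctuation, whereas deciding which periods are really dandas is the hard part and is not attempted here. The function name is hypothetical.

```python
import re

# A {#...#} span whose last character before the closing delimiter is a period.
TRAILING_PERIOD = re.compile(r"\{#([^#]*?)\.#\}")


def move_trailing_period(text):
    """Move a final period from inside {#...#} to just after the closing #}."""
    return TRAILING_PERIOD.sub(r"{#\1#}.", text)


before = "The word for dog is {#Svan.#}"
after = move_trailing_period(before)
print(after)
```

Periods elsewhere inside the span are left alone, so they would still be read as dandas.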

funderburkjim commented 9 years ago

Re: vcp0 - Yes, I probably should change this file name for the sake of uniformity.

Re MW: MW(1899) is the odd man out. The base form is mw.xml. There is not likely to be a mw_orig_utf8_slp1.txt. Devanagari is coded as SLP1 in mw.xml.

gasyoun commented 9 years ago

"true periods out of SLP1 - the true period is not Devanagari, so should not be included as part of a section of text identified as coding Devanagari." Oh, so that's where the fun starts. But I understand the concerns and agree. Some RegEx magic in your python scripts will bring Thomas' idea to a standard that will be usable in both directions.

funderburkjim commented 9 years ago

Changed name of vcp0.txt to vcp_orig_utf8_slp1.txt, so the name of this base form is consistent with others.
Constructed SLP1 base form for mwe (mwe_orig_utf8_slp1.txt).

funderburkjim commented 9 years ago

Constructed SLP1 base form for 'ae'.

gasyoun commented 9 years ago

AE different because of the non-pratipadika forms or what?

funderburkjim commented 9 years ago

@gasyoun The conversion of Devanagari coding in AE from HK to SLP1 only pertains to entries, since the headwords are English. For all of the dictionaries, the conversion to SLP1 applies not just to the headwords, but to all of the Devanagari coded originally as HK. So the non-pratipadika forms (as in AP) are not an issue. The reason it is complicated usually has to do with rather 'trivial' issues, like use of non-standard HK (such as n~ instead of the usual HK J for palatal nasal), and the much trickier issue of 'English' periods in Devanagari.
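
To make the n~ point concrete, here is a minimal sketch of an HK to SLP1 letter mapping with longest-match-first lookup. It is not the conversion code used at Cologne. The table covers only the letters that differ between the two schemes, treats n~ as an extra spelling of the palatal nasal, and ignores real ambiguities (for example a followed by i in hiatus would be read as the diphthong ai):

```python
# HK spellings that differ from SLP1; everything else is copied unchanged.
HK_TO_SLP1 = {
    "lRR": "X", "RR": "F", "lR": "x", "ai": "E", "au": "O",
    "kh": "K", "gh": "G", "ch": "C", "jh": "J", "Th": "W", "Dh": "Q",
    "th": "T", "dh": "D", "ph": "P", "bh": "B",
    "n~": "Y",  # non-standard HK spelling of the palatal nasal
    "R": "f", "G": "N", "J": "Y", "T": "w", "D": "q", "N": "R",
    "z": "S", "S": "z",
}


def hk_to_slp1(text):
    """Transcode an HK string to SLP1, trying the longest spelling first."""
    out = []
    i = 0
    while i < len(text):
        for size in (3, 2, 1):
            chunk = text[i:i + size]
            if len(chunk) == size and chunk in HK_TO_SLP1:
                out.append(HK_TO_SLP1[chunk])
                i += size
                break
        else:
            # No table entry: the letter is the same in both schemes.
            out.append(text[i])
            i += 1
    return "".join(out)


print(hk_to_slp1("zvan"), hk_to_slp1("daNDa"), hk_to_slp1("san~jaya"))
```

The standard and non-standard spellings of the palatal nasal end up as the same SLP1 letter, which is the kind of normalization the conversion has to do.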

funderburkjim commented 9 years ago

The Devanagari in the base form of PD has now been converted to SLP1. Only three more dictionaries have significant conversions to SLP1 : BOP,KRM,PGN. I'll aim to do those soon.

gasyoun commented 9 years ago

"n~ instead of the usual HK J" and "'English' periods in Devanagari" are not trivial at all, because you never know ahead what's before you. So actually what you do is not just conversion, it's cleanup and better markup.

funderburkjim commented 9 years ago

BOP now converted to SLP1. Similar issues with n~ and danda/period resolved.

gasyoun commented 9 years ago

KRM, PGN left, hurray!

funderburkjim commented 9 years ago

KRM now converted to SLP1. Similar issues with n~ and danda/period resolved.

Considerable work would be required to improve the markup of KRM, so that its displays may more closely correspond to the printed page. Here are some issues. (The headwords are roots in DAtupAWa form, so for instance 'gamx'):

- In the scan, the footnotes are mentioned as a superscript in the body of an entry and the text of the footnotes appears at the bottom of a page. In the digitization, the footnote text appears within the body of an entry at its place of mention. This is one factor that obscures a comparison between the display and the scans.
- In the scans, the body of the entry often has a tabular form. But the current markup does not permit a reconstruction of this tabular form in a display.

Such a task requires input of a Sanskrit Scholar, who understands the nature of the information in this text.

gasyoun commented 9 years ago

Not a Sanskrit scholar, but someone who understands layout coding. It'll have to be delayed for better times, which will take years to reach us, I guess. KRM markup is of 25th priority, I would propose.

funderburkjim commented 9 years ago

PGN now converted to SLP1. Similar issues with n~ and danda/period resolved.

In PGN, Devanagari text only occurs in material that is not, currently, part of the pgn.xml (and thus not part of the displays). This material is (probably) present in the Chapter Footnotes of PGN.

It is something of a 'force' to represent the digitization of PGN as a dictionary like the 'real' dictionaries MW, PWG, etc. This observation likely applies to several of the other so-called 'specialized' dictionaries of the Cologne Sanskrit-Lexicon.

This completes the primary SLP1-ization of the dictionaries (the primary TODO list mentioned in the comment of March 21).

There only remains the secondary 'TODO' list in this SLP1-fest.

gasyoun commented 9 years ago

Is there a Chapter Footnotes file of PGN, or is it the non-OCRed part? So IEG, INM, PE, PUI are left. Is there something I can help with?

drdhaval2785 commented 9 years ago

I would term this job super quick, Jim. Pity that I couldn't be actively involved. Actually I set the house on fire and then ran away. Satisfying indeed, the way Jim responds. Once SLP1 for all dicts are available we would have more candidates for comparison of faultfinder.

funderburkjim commented 9 years ago

Conversion to SLP1 completed for the secondary TODO list: IEG, INM, PE, PUI. This completes the conversion to SLP1 for all 36 dictionaries. To recap, dictionaries STC,GRA,SNP,BHS,VEI have no 'x_orig_utf8_slp1.txt' form since they have no Devanagari. There is also no mw_orig_utf8_slp1.txt, since the base form for MW (1899) is mw.xml. For each of the other 30 dictionaries, there is an x_orig_utf8_slp1.txt base form.

funderburkjim commented 9 years ago

Regarding "Once SLP1 for all dicts are available we would have more candidates for comparison of faultfinder":

Actually, the headwords for all the dictionaries have ALWAYS been in SLP1 form (except for the three English-Sanskrit dictionaries, of course). Recall that, if X is one of these dictionaries, then Xhw2.txt consists of the headwords in SLP1. This was true even before this conversion work. The conversion work dealt with the Devanagari text in X.txt, as Devanagari in X.txt was, before SLP1 conversion, still represented in the HK form that Thomas' original digitizations provided.

Admittedly this was confusing. At least this one confusion is now removed in the digitizations.

At any rate, I definitely agree with the sentiment that we should finish the headword checking process via faultfinder for those dictionaries whose headword-differences generated by faultfinder have not yet been examined. These dictionaries are listed in the 'faultfinder TODO(1)' section of issue 90. The dictionary in this list with the largest set of faultfinder candidates is PD. Finishing this task that Dhaval began will be an important milestone in our correction process.

drdhaval2785 commented 9 years ago

Thanks for clarifying the matter. I was under the wrong impression.

gasyoun commented 9 years ago

You'll have a chance to get back, Dhaval. There are still some tiny issues left.

drdhaval2785 commented 8 years ago

@funderburkjim It seems that the sanhw1.txt file has not been updated in the last three months. I see a lot of changes pouring in and being installed. Time to give a new sanhw1.txt file to the world.

funderburkjim commented 8 years ago

sanhw1.txt revised, as mentioned elsewhere. Currently, it is awkward to revise sanhw1.txt on Github (run a script at Cologne, download to local Github CORRECTIONS repository, sync to Github.)

That's my excuse for irregular revisions.

gasyoun commented 8 years ago

@drdhaval2785 there was an update a week ago. Not sure what you meant.

drdhaval2785 commented 8 years ago

Great. Now we have sanhw1.txt and sanhw2.txt mostly updated. Let's close the issue.

sanskrit-lexicon / CORRECTIONS

Request of SLP1 text of all dictionaries #7