sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

Not found mw.txt in downloads #156

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/downloads/mwtxt.zip doesn't have mw.txt. All other dicts have it.

In fact, no download has mw.txt. Shouldn't it survive somewhere?

gasyoun commented 8 years ago

MW is different than all others and should not have one.

funderburkjim commented 8 years ago

Yes, Marcis has it right.

Note that there IS a 'mwtxt.zip' download in the MW downloads page.

The mw_orig.txt file therein was originally called MONIER.ALL, and is the version of Thomas's digitization that we started with in 2006 or thereabouts.

Since that mw_orig.txt is in the now hard-to-use encoding CP1252, I created a version in the now popular utf-8 encoding, and called that mw_orig_utf8.txt.
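A re-encoding of this kind can be sketched in a few lines of Python. This is only an illustration of the CP1252 to utf-8 step, not the script actually used at Cologne; the function name `reencode_cp1252_to_utf8` is hypothetical.

```python
from pathlib import Path


def reencode_cp1252_to_utf8(src: Path, dst: Path) -> None:
    """Read src as CP1252 text and write the same text to dst as UTF-8.

    Hypothetical helper for illustration only. Bytes that CP1252 leaves
    undefined (such as 0x81) make Python raise UnicodeDecodeError, which
    is a useful signal that the input is not clean CP1252.
    """
    # newline="" leaves the original line endings untouched on both sides.
    with open(src, "r", encoding="cp1252", newline="") as fin, \
         open(dst, "w", encoding="utf-8", newline="") as fout:
        for line in fin:
            fout.write(line)
```

Under these assumptions, `reencode_cp1252_to_utf8(Path("mw_orig.txt"), Path("mw_orig_utf8.txt"))` would produce the utf-8 copy from the original file.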

This original version is presented just for its possible historical interest.

The current situation for MW is that the xml version is not derived from an 'mw.txt'. Rather, the xml version is THE base version for MW.

One other no-doubt confusing detail is that there are TWO xml versions. Currently, the BASE version is monier.xml, which is a copy of a MYSQL table at Cologne.

The other version (mw.xml) differs only in one detail, namely in the placement of the SLP1 accent diacritics. In monier.xml these diacritics (there are three: / = udatta, \ = anudatta, ^ = svarita) are placed BEFORE the vowel being accented, and in mw.xml they are placed AFTER that vowel.

A simple program (referenced in redo_xml.sh) constructs mw.xml from monier.xml.
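The accent move itself can be sketched as a single regular-expression substitution. This is a minimal sketch, not the program referenced in redo_xml.sh: the function name and the vowel set are assumptions, and the function expects a bare SLP1 string rather than a whole XML line (applied to markup, the `/` in a closing tag such as `</ab>` would be mistaken for an accent).

```python
import re

# SLP1 vowel letters (assumed set; adjust if the data uses others).
SLP1_VOWELS = "aAiIuUfFxXeEoO"

# The three accent marks named above: / udatta, \ anudatta, ^ svarita,
# each captured together with the vowel that follows it.
ACCENT_BEFORE_VOWEL = re.compile(r"([/\\^])([" + SLP1_VOWELS + r"])")


def accent_after_vowel(slp1: str) -> str:
    """Move each accent mark from before its vowel to after it.

    Hypothetical helper: converts the monier.xml placement (accent, vowel)
    to the mw.xml placement (vowel, accent) within one SLP1 string.
    """
    return ACCENT_BEFORE_VOWEL.sub(r"\2\1", slp1)
```

For example, the monier.xml-style string `dev/a` would become `deva/` in the mw.xml style, and a string without accent marks is returned unchanged.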

The current display programs (those in MWScan/2014/) assume the accent placement form of mw.xml.

Some of the older display programs assume the accent placement form of monier.xml.

Theoretically, there is no need to maintain the MYSQL monier table (and monier.xml), but I have decided to keep it. There is a useful 'online' update of the MYSQL table (http://www.sanskrit-lexicon.uni-koeln.de/mwupdate/index.html) that I often use to make random corrections to the MYSQL file. But, this is now non-essential, and the update process could take a form similar to that used for pwg, pw, and the other dictionaries.

drdhaval2785 commented 8 years ago

I got the point. My limited issue is that this is the only dictionary for which I cannot autogenerate the corrections to be made, as I can for the other dictionaries. If you can tell me how you create upd files for correction submissions for MW, I would be obliged.

Request 2 - Can you please provide the AP and PD downloads by mail for correction research, since they are restricted? Currently I am not able to generate upd files for them.

Request 3 - ccs.txt also seems to be a bit different from the other .txt files, in the sense that the digitization spans multiple files. Therefore, my modification of your generate.py doesn't work well on the ccs.txt files. Can you suggest any clarification or method?

gasyoun commented 8 years ago

So mw.xml should be used and the others forgotten. AP and PD provided once again, Dhaval.

funderburkjim commented 8 years ago

@drdhaval2785 Re AP, PD -

You should be able to modify, in an obvious way, the pw_init.sh script of #143, for these dictionaries. If this isn't what you're looking for, please request further.

funderburkjim commented 8 years ago

Regarding ccs: Looking at ccs.txt, I see that the main peculiarity is that it essentially has one word per line of the file.

The version of generate.py that takes <path-to-ccs.txt> and <path-to-ccshw2.txt> as its first two command-line parameters should work fine for making routine changes to headword spelling, in the same way as for the other typical dictionaries.

Maybe I'm missing the point of your question - explain further, if so.

funderburkjim commented 8 years ago

Regarding generate.py for mw, here is a dropbox link of a variation of generate.py that I used as part of the processing of the recent MW corrections for #131. The readme.txt file is a quite detailed account of the steps taken. Whenever I regenerate the s3 copy of pywork for mw, this will be included; it was created after the pywork.zip was most recently constructed.

Here, generate.py works by using monier.xml in place of mw.txt and mwhw2.txt, since these two files don't exist for mw.

The creation of mwlog.txt is a separate step, which is required to interface with the MYSQL version of monier.xml at Cologne. From your point of view, you should probably just ignore it. In terms of functionality, this step could be replaced by writing a variant of the standard 'updateByLine.py' that the other dictionaries employ. This variant could read (a) a copy of the old monier.xml, and (b) the mwupd.txt created by generate.py, and write a new monier.xml as output. Then the steps would be (a) redo_xml.sh, (b) ../mwaux/mwkeys/redo.sh.
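The core of such a variant might look roughly like the sketch below. It is not the real updateByLine.py: the function name is hypothetical, and the update-record format (pairs of `<lnum> old <text>` and `<lnum> new <text>` lines, with `;` starting a comment) is an assumption that may not match the actual mwupd.txt layout.

```python
def apply_updates(old_lines, upd_lines):
    """Return a copy of old_lines with the old/new update pairs applied.

    ASSUMED record format (hypothetical, check against a real mwupd.txt):
        <lnum> old <text>
        <lnum> new <text>
    where <lnum> is a 1-based line number in the old file and lines
    beginning with ';' are comments.
    """
    new_lines = list(old_lines)
    # Drop blank lines and comments, keep the update records in order.
    records = [u.rstrip("\n") for u in upd_lines
               if u.strip() and not u.startswith(";")]
    if len(records) % 2:
        raise ValueError("updates must come in old/new pairs")
    for old_rec, new_rec in zip(records[0::2], records[1::2]):
        lnum_old, kind_old, text_old = old_rec.split(" ", 2)
        lnum_new, kind_new, text_new = new_rec.split(" ", 2)
        if (kind_old, kind_new) != ("old", "new") or lnum_old != lnum_new:
            raise ValueError(f"malformed pair: {old_rec!r} / {new_rec!r}")
        idx = int(lnum_old) - 1
        # Refuse to change a line whose current text is not the expected one.
        if new_lines[idx].rstrip("\n") != text_old:
            raise ValueError(f"line {lnum_old} does not match the 'old' text")
        new_lines[idx] = text_new + "\n"
    return new_lines
```

The check against the 'old' text is a design choice: it makes the update fail loudly if the update file was generated from a different copy of monier.xml than the one being changed.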

funderburkjim commented 8 years ago

I assume this issue is now solved, so will close.

gasyoun commented 8 years ago

I did not get why the Dropbox link is not synced to GitHub. @drdhaval2785 "Then the steps would be (a) redo_xml.sh, (b) ../mwaux/mwkeys/redo.sh." - can you repeat the steps?