sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

Trial 'o vs O' composite file corrections, 1-10 #157

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

I have taken entries 201 to 210 from http://drdhaval2785.github.io/o_vs_O/output3/composite2a.html for testing purposes.

Input

UPDATED in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548 and subsequent alteration in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issue-116719290.

After examination my change.txt file is

PUI:kizkinDaguhA:kizkinDAguhA:n:Different place, on Kailash.
SCH:kadAcitkatva:kAdAcitkatva:n:a
SCH:kAlikulAmfta:kAlIkulAmfta:t:ACC is more authentic when it comes to names of works. There are other works around which also start with kAlI. kAlikula is not defensible.
CCS:jalOkOvaseka:jalOkovaseka:t:
ACC:jAtisaMkarya:jAtisAMkarya:t:
PWG:ekAkikeSarin:ekAkikesarin:n:a
MW:dakziRApraYc:dakziRAprAYc:t:There is a caret above.
MW:bastABivASin:bastABivAsin:n:w
MW:antaScaRqAla:antaScARqAla:n:MW has repeatedly taken caRqAla as valid form.
MW:amBaHsyAmAka:amBaHSyAmAka:t:There definitely is a mark above 's'.

Program

The program is at https://gist.github.com/drdhaval2785/13e1211f6f333bb2cb31. Run prepareupd.sh

Expected output

  1. They are stored in cologne/pw/pywork/correctionwork/correction-issue-155/upd folder.
  2. For each dictionary we generate three files. (1) DICTCODEupd.txt for copy pasting to manualByLine02.txt. (2) DICTCODEupd.tsv, a tab separated file having the same fields. (3) DICTCODEnochange.txt - file storing no change cases.
  3. There are three composite files also (1) allchangeupd.txt (2) allchangeupd.tsv and (3) allnochange.txt files which have entries of all dictionaries.

Presumptions for the code

  1. All dictionaries are placed in the cologne directory, e.g. cologne/pw.
  2. Every dictionary (e.g. cologne/pw) has the following subdirectories: (1) DICTCODEtxt (pwtxt), (2) DICTCODEweb1 (pwweb1) and (3) DICTCODExml (pwxml).
  3. The DICTCODEtxt folder has a DICTCODE.txt file (e.g. cologne/pw/pwtxt has pw.txt).
  4. The DICTCODExml/xml folder has a DICTCODEhw2.txt file (e.g. cologne/pw/pwxml/xml has pwhw2.txt).
  5. The current code, i.e. generate.py, change.txt and prepareupd.sh, is placed in the cologne/pw/pywork/correctionwork/correction-issue-155 folder.
  6. All these directories are downloadable from the Cologne dictionary download page.
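
The directory layout assumed above can be checked before running the program. The following is only a sketch: the function name dictionary_paths is hypothetical, and lowercasing the dictionary code is an assumption based on the lowercase paths in the examples (cologne/pw/pwtxt/pw.txt).

```python
from pathlib import Path


def dictionary_paths(code, root="cologne"):
    """Build the paths the presumptions expect for one dictionary code.

    Hypothetical helper; lowercasing the code is an assumption taken from
    the example paths such as cologne/pw/pwtxt/pw.txt.
    """
    code = code.lower()
    base = Path(root) / code
    return {
        "txt": base / f"{code}txt" / f"{code}.txt",
        "hw2": base / f"{code}xml" / "xml" / f"{code}hw2.txt",
        "web1": base / f"{code}web1",
    }


paths = dictionary_paths("PW")
# List whichever expected files or folders are absent on this machine.
missing = [name for name, path in paths.items() if not path.exists()]
```

An empty missing list would mean the layout matches the presumptions for that dictionary.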

drdhaval2785 commented 8 years ago

Output is (UPDATED in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548 and subsequent alteration in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issue-116719290)

Change file

; ACC, Issue 155, Case 1, User Dhaval, 2015-11-16
; jAtisaMkarya -> jAtisAMkarya # typo # 
19137 old <HI>{#jAtisaMkarya#}¦ on mixed castes, by Çivala1la Sukula. Oudh III, 16.
19137 new <HI>{#jAtisAMkarya#}¦ on mixed castes, by Çivala1la Sukula. Oudh III, 16.
; SCH, Issue 155, Case 1, User Dhaval, 2015-11-16
; kAlikulAmfta -> kAlIkulAmfta # typo # ACC is more authentic when it comes to names of works. There are other works around which also start with kAlI. kAlikula is not defensible.
10540 old .{#kAlikulAmfta#}100{#Ka1likula1mr2ta#}¦ n. Titel eines Werkes , Opp. Cat. 1. [Schµ10688] €2
10540 new .{#kAlIkulAmfta#}100{#Ka1likula1mr2ta#}¦ n. Titel eines Werkes , Opp. Cat. 1. [Schµ10688] €2
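
The old/new records above could be assembled roughly as follows. This is a sketch of the record layout only, not the actual generate.py; the function name change_block and its parameters are hypothetical, and it uses a literal str.replace on every occurrence of the headword.

```python
def change_block(dictcode, issue, case, user, date, lnum, old_line, hw, hw1,
                 kind="typo", note=""):
    """Return one change-file block: two comment lines, then old and new records.

    Hypothetical helper mirroring the sample output in this thread.
    """
    new_line = old_line.replace(hw, hw1)  # literal replacement, all occurrences
    return "\n".join([
        f"; {dictcode}, Issue {issue}, Case {case}, User {user}, {date}",
        f"; {hw} -> {hw1} # {kind} # {note}",
        f"{lnum} old {old_line}",
        f"{lnum} new {new_line}",
    ])


acc_line = "<HI>{#jAtisaMkarya#}¦ on mixed castes, by Çivala1la Sukula. Oudh III, 16."
block = change_block("ACC", 155, 1, "Dhaval", "2015-11-16", 19137,
                     acc_line, "jAtisaMkarya", "jAtisAMkarya")
print(block)
```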

Note that we don't have mw.txt. Therefore, no output is generated for the MW entries.

No change cases

PUI:kizkinDaguhA:kizkinDAguhA:Different place, on Kailash.
PWG:ekAkikeSarin:ekAkikesarin:alternate words
SCH:kadAcitkatva:kAdAcitkatva:alternate words

drdhaval2785 commented 8 years ago

@funderburkjim please have a look and check whether the output is suitable for your updates.

drdhaval2785 commented 8 years ago

This is for testing purposes. No need to install it right now. Once we finalize the format, installation may be done.

A shorter version of the change file could also be pasted into GitHub for tracking issues.

Only the images would need to be supplied later on.

gasyoun commented 8 years ago

Images on github?

funderburkjim commented 8 years ago

Applying the 3-code suggestion mentioned in #154 might alter the sample to:

PUI:kizkinDaguhA:kizkinDAguhA:n:Different place, on Kailash.
SCH:kadAcitkatva:kAdAcitkatva:n:fehlerhaft fur kAdAcitkatva.
SCH:kAlikulAmfta:kAlIkulAmfta:p:ACC is more authentic when it comes to names of works. There are other works around which also start with kAlI. kAlikula is not defensible.
CCS:jalOkOvaseka:jalOkovaseka:o:
ACC:jAtisaMkarya:jAtisAMkarya:o:
PWG:ekAkikeSarin:ekAkikesarin:n:keSarin and kesarin both are alternate words.
MW:dakziRApraYc:dakziRAprAYc:o:There is a caret above.
MW:bastABivASin:bastABivAsin:n:vAsin is wrong reading for vASin. See MW.
MW:antaScaRqAla:antaScARqAla:n:MW has repeatedly taken caRqAla as valid form.
MW:amBaHsyAmAka:amBaHSyAmAka:o:There definitely is a mark above 's'.

Other than that minor adjustment (to just use 'p','o', or 'n' for the 'category code' , which @drdhaval2785 terms 'errorcode' or 'nochangecode'), I think the format is fine.

I hadn't thought of the dictionary code, but I can see its utility to the reviewer when, as in your example, the review may involve corrections to various dictionaries.

In actual processing, I would separate such a file according to dictionary, and then process the updates for each dictionary separately.

Including the 'n' code is useful, as then it would permit automation of updates to the global nochange.txt file (#153), so that file could be maintained as part of the update installation process.
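
A sketch of that separation, assuming the records have already been split into five fields (dictionary, old headword, new headword, category code, note). The function name partition is hypothetical; the only rule taken from this thread is that 'n' means no change.

```python
from collections import defaultdict


def partition(records):
    """Split records by dictionary into changes and no-change cases.

    Hypothetical helper: code 'n' goes to the no-change bucket,
    every other code goes to the change bucket.
    """
    changes = defaultdict(list)
    nochange = defaultdict(list)
    for dictcode, old, new, code, note in records:
        target = nochange if code == "n" else changes
        target[dictcode].append((old, new, code, note))
    return changes, nochange


records = [
    ("PUI", "kizkinDaguhA", "kizkinDAguhA", "n", "Different place, on Kailash."),
    ("CCS", "jalOkOvaseka", "jalOkovaseka", "o", ""),
    ("MW", "antaScaRqAla", "antaScARqAla", "n", "MW has repeatedly taken caRqAla as valid form."),
    ("MW", "amBaHsyAmAka", "amBaHSyAmAka", "o", "There definitely is a mark above 's'."),
]
changes, nochange = partition(records)
```

The nochange mapping is what would feed an automated update of the global nochange.txt file.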

A small note to the reviewer who goes to the trouble of creating this file, is that he should avoid using a colon character in the last 'note' field, since that colon character is used as a field separator.
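
A parser can be made more forgiving of a stray colon in the note by splitting at most four times, so that everything after the fourth colon stays in the note field. This is only a sketch with a hypothetical function name, and it assumes the first four fields themselves never contain a colon.

```python
def parse_change_line(line):
    """Split 'DICT:old:new:code:note' into five fields.

    Hypothetical helper. maxsplit=4 keeps any further colons inside the
    note; it assumes the first four fields contain no colon.
    """
    parts = line.rstrip("\n").split(":", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 colon-separated fields: {line!r}")
    return tuple(parts)


# An empty note still yields five fields.
print(parse_change_line("CCS:jalOkOvaseka:jalOkovaseka:o:"))
# The note here is invented to show a colon surviving inside it.
print(parse_change_line("ACC:jAtisaMkarya:jAtisAMkarya:o:see also: other entry"))
```

Avoiding the colon in notes remains the safer convention, since other tools may split the line without a limit.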

A note to the update installer is that the constructed change file (the one with the old/new records) needs to be examined manually, since sometimes there will be additional changes that a program can't reasonably handle. The main case I'm thinking of is the dictionaries where there is a 'key2' field , and that key2 field has some kind of markup (such as hyphens or accents) which will foil the regex replacement logic that works for key1.

Regarding no 'mw.txt' (and no mwhw2.txt), I've recently written a version of generate.py that helps the automation for mw also. It generates the xupd.txt and xupd.tsv based on monier.xml.

funderburkjim commented 8 years ago

Once we reach consensus on exactly which category codes to use, I think this file format would be good; and I'll make a version of generate.py that assumes an input file in this format.

funderburkjim commented 8 years ago

However, I don't think we should require a reviewer (corrector) to make such a file if constructing it seems burdensome to the reviewer.

I'm quite willing and glad to take the corrections in the looser form of issue comments that has been used heretofore.

drdhaval2785 commented 8 years ago

The main case I'm thinking of is the dictionaries where there is a 'key2' field , and that key2 field has some kind of markup (such as hyphens or accents) which will foil the regex replacement logic that works for key1.

I guess by regex replacement logic you mean this line in generate.py: new = re.sub(hw,hw1,old)

I was about to raise an issue that the program throws an error when the headword contains regex special characters. (This was in the context of the Abbreviations submission: '(' or '*', which occur in the abbreviations, raised a regex error.)

In my opinion new = old.replace(hw,hw1) would be a better candidate, since all we need is literal character replacement and regex is far more than that. It would achieve what we want without the regex problems.

In terms of php equivalents str.replace() = str_replace and re.sub = preg_replace. I prefer str.replace for our usage.
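
A minimal illustration of the difference, using an invented abbreviation-style string rather than real dictionary data. An unbalanced '(' in the pattern makes re.sub raise re.error, while str.replace treats the same text literally.

```python
import re

# Invented sample text; not taken from any dictionary file.
old = "abbr. (q.v. see above"
hw, hw1 = "(q.v.", "q.v."

try:
    re.sub(hw, hw1, old)      # '(' opens a group that is never closed
    regex_failed = False
except re.error:
    regex_failed = True

new = old.replace(hw, hw1)    # literal replacement, no pattern syntax involved
print(regex_failed, new)
```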

drdhaval2785 commented 8 years ago

@funderburkjim

Once we reach consensus on exactly which category codes to use, I think this file format would be good; and I'll make a version of generate.py that assumes an input file in this format.

I have modified the proposed format to use only 'p', 't' and 'n', for print error, typo and no change respectively. All other codes are moved into the notes (where they can be used optionally). So the file format is as you wanted. I have tried to share some of your burden. The code at https://gist.github.com/drdhaval2785/13e1211f6f333bb2cb31 generates the files you need: one set per dictionary, plus master files for seeing all corrections and no-change cases in one place. A detailed description is under the Expected output heading of this issue. Please give it a try and let me know. If some changes are needed, let me try them first. If I fail, you may carry on. This way I will get some feel for updating the code you work with.

gasyoun commented 8 years ago

The category codes are ready to be carved in marble, I think. As for "I'm quite willing and glad to take the corrections in the looser form of issue comments that has been used heretofore": I think it's a bad idea for you to do the cleanup if there are tech-savvy people who can do most of it themselves. I'm very happy to see Dhaval's proactiveness in a field that is very important to me.

drdhaval2785 commented 8 years ago

As https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154 format has attained finality, this trial version ends.

gasyoun commented 8 years ago

Hurray!

funderburkjim commented 8 years ago

@drdhaval2785 Glad you are learning how things work. Apparently, the idea of making s3 copies of the pywork, web, and org directories was a good way to let you get 'hands on' experience with the system. (Agree?)

If so, perhaps I should institute some regular routine on revising those s3 copies. Any suggestion on frequency?

drdhaval2785 commented 8 years ago

Apparently, the idea of making s3 copies of the pywork, web, and org directories was a good way to let you get 'hands on' experience with the system. (Agree?)

Absolutely yes.

Any suggestion on frequency?

15th and 30th of every month

drdhaval2785 commented 8 years ago

@funderburkjim Any thoughts on https://github.com/sanskrit-lexicon/CORRECTIONS/issues/157#issuecomment-156905980?

gasyoun commented 8 years ago

Twice a month indeed makes sense. If the server goes down, at least two of us should have a full copy.

funderburkjim commented 8 years ago

@drdhaval2785 Re: the regex errors with '(' or '*' in the submission.

If you want to change s1 = 'an (extra left paren' to s2 = 'an extra left paren' in Python:

s2 = re.sub(r'\(','',s1)   # use '\(' to mean a literal left-paren rather than usual regex meaning of '('
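
When the text to replace comes from a file and cannot be escaped by hand, re.escape can do the escaping. The wrapper below is a hypothetical sketch, not part of generate.py; the lambda is there so that backslashes in the replacement text are also taken literally.

```python
import re


def literal_sub(hw, hw1, old):
    """Replace hw with hw1 in old, treating both as plain text.

    Hypothetical helper: re.escape neutralises regex metacharacters in hw,
    and the function-style replacement avoids backslash processing in hw1.
    """
    return re.sub(re.escape(hw), lambda match: hw1, old)


print(literal_sub("(", "", "an (extra left paren"))
print(literal_sub("a*b", "ab", "x a*b y"))
```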

funderburkjim commented 8 years ago

@drdhaval2785 Regarding use of string replacement instead of regex replacement.

Probably would work exactly the same in our usage.

It would not solve the problem of how to autoadjust key2. Note the 4 examples given in readme.txt for benfey that were manual adjustments in benupd_edit.txt. No obvious way to do these adjustments by program. Easier to do manually. At least that's how it seems to me.
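
A program could at least flag the records that probably need such a manual look. The sketch below is a heuristic only: the record text is invented, the function name is hypothetical, and it covers hyphen markup but not accents. It does not try to adjust key2 itself.

```python
def key2_needs_review(old, hw):
    """Heuristic flag for records with a hyphen-marked copy of the headword.

    Hypothetical helper: returns True when removing hyphens reveals more
    occurrences of hw than a literal search finds, which suggests a
    key2-style form that a literal replacement will miss.
    """
    return old.replace("-", "").count(hw) > old.count(hw)


# Invented record: plain headword followed by a hyphenated second form.
old = "kAlikula kAli-kula n. invented sample entry"
hw, hw1 = "kAlikula", "kAlIkula"

new = old.replace(hw, hw1)           # only the plain form is changed
flag = key2_needs_review(old, hw)    # the hyphenated form is reported
print(new, flag)
```

Flagged records would still be edited by hand, in line with the manual adjustments described for benupd_edit.txt.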