Closed drdhaval2785 closed 8 years ago
Output is (UPDATED in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548 and subsequent alteration in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issue-116719290)
Change file
; ACC, Issue 155, Case 1, User Dhaval, 2015-11-16
; jAtisaMkarya -> jAtisAMkarya # typo #
19137 old <HI>{#jAtisaMkarya#}¦ on mixed castes, by Çivala1la Sukula. Oudh III, 16.
19137 new <HI>{#jAtisAMkarya#}¦ on mixed castes, by Çivala1la Sukula. Oudh III, 16.
; SCH, Issue 155, Case 1, User Dhaval, 2015-11-16
; kAlikulAmfta -> kAlIkulAmfta # typo # ACC is more authentic when it comes to names of works. There are other works around which also start with kAlI. kAlikula is not defensible.
10540 old .{#kAlikulAmfta#}100{#Ka1likula1mr2ta#}¦ n. Titel eines Werkes , Opp. Cat. 1. [Schµ10688] €2
10540 new .{#kAlIkulAmfta#}100{#Ka1likula1mr2ta#}¦ n. Titel eines Werkes , Opp. Cat. 1. [Schµ10688] €2
Note that we don't have mw.txt. Therefore, no output is generated for the MW entries.
No change cases
PUI:kizkinDaguhA:kizkinDAguhA:Different place, on Kailash.
PWG:ekAkikeSarin:ekAkikesarin:alternate words
SCH:kadAcitkatva:kAdAcitkatva:alternate words
@funderburkjim please check whether the output is suitable for your updating process.
This is for testing purposes. No need to install it right now. Once we finalize the format, installation may be done.
A shorter version of the change file could also be pasted into GitHub for issue tracking.
Only the images would need to be supplied later on.
Images on github?
Applying the 3-code suggestion mentioned in #154 might alter the sample to:
PUI:kizkinDaguhA:kizkinDAguhA:n:Different place, on Kailash.
SCH:kadAcitkatva:kAdAcitkatva:n:fehlerhaft fur kAdAcitkatva.
SCH:kAlikulAmfta:kAlIkulAmfta:p:ACC is more authentic when it comes to names of works. There are other works around which also start with kAlI. kAlikula is not defensible.
CCS:jalOkOvaseka:jalOkovaseka:o:
ACC:jAtisaMkarya:jAtisAMkarya:o:
PWG:ekAkikeSarin:ekAkikesarin:n:keSarin and kesarin both are alternate words.
MW:dakziRApraYc:dakziRAprAYc:o:There is a caret above.
MW:bastABivASin:bastABivAsin:n:vAsin is wrong reading for vASin. See MW.
MW:antaScaRqAla:antaScARqAla:n:MW has repeatedly taken caRqAla as valid form.
MW:amBaHsyAmAka:amBaHSyAmAka:o:There definitely is a mark above 's'.
Other than that minor adjustment (to just use 'p', 'o', or 'n' for the 'category code', which @drdhaval2785 terms 'errorcode' or 'nochangecode'), I think the format is fine.
I hadn't thought of the dictionary code, but I can see its utility to the reviewer when, as in your example, the review may involve corrections to various dictionaries.
In actual processing, I would separate such a file according to dictionary, and then process the updates for each dictionary separately.
Including the 'n' code is useful, as then it would permit automation of updates to the global nochange.txt file (#153), so that file could be maintained as part of the update installation process.
A small note to the reviewer who goes to the trouble of creating this file, is that he should avoid using a colon character in the last 'note' field, since that colon character is used as a field separator.
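The colon-separated format described above can be read with a few lines of Python. This is only a sketch under stated assumptions, not the behaviour of generate.py: the function names `parse_correction` and `group_by_dictionary` are hypothetical, and the field order (dictionary, old headword, new headword, category code, note) is taken from the sample lines in this thread.

```python
from collections import defaultdict

def parse_correction(line):
    """Split 'DICT:old:new:code:note' into its five fields (hypothetical helper)."""
    # maxsplit=4 stops splitting after the category code, so a stray colon
    # inside the trailing note field does not produce extra fields.
    parts = line.rstrip("\n").split(":", 4)
    if len(parts) != 5:
        raise ValueError("expected 5 colon-separated fields: %r" % line)
    dictcode, old, new, code, note = parts
    return {"dict": dictcode, "old": old, "new": new, "code": code, "note": note}

def group_by_dictionary(lines):
    """Group parsed records by dictionary code, skipping blank lines."""
    grouped = defaultdict(list)
    for line in lines:
        if line.strip():
            rec = parse_correction(line)
            grouped[rec["dict"]].append(rec)
    return dict(grouped)

# Sample lines copied from the thread.
sample = [
    "PUI:kizkinDaguhA:kizkinDAguhA:n:Different place, on Kailash.",
    "SCH:kadAcitkatva:kAdAcitkatva:n:fehlerhaft fur kAdAcitkatva.",
    "ACC:jAtisaMkarya:jAtisAMkarya:o:",
]
grouped = group_by_dictionary(sample)
nochange = [r for recs in grouped.values() for r in recs if r["code"] == "n"]
```

Splitting with a maximum of four splits makes the parser tolerant of a colon in the note, but avoiding colons in that field remains the safer convention for any other tool that reads the file.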
A note to the update installer is that the constructed change file (the one with the old/new records) needs to be examined manually, since sometimes there will be additional changes that a program can't reasonably handle. The main case I'm thinking of is the dictionaries where there is a 'key2' field , and that key2 field has some kind of markup (such as hyphens or accents) which will foil the regex replacement logic that works for key1.
Regarding no 'mw.txt' (and no mwhw2.txt), I've recently written a version of generate.py that helps the automation for mw also. It generates the xupd.txt and xupd.tsv based on monier.xml.
Once we reach consensus on exactly which category codes to use, I think this file format would be good; and I'll make a version of generate.py that assumes an input file in this format.
However, I don't think we should make it mandatory for a reviewer (corrector) to make such a file, if constructing it seems burdensome to the reviewer.
I'm quite willing and glad to take the corrections in the looser form of issue comments that has been used heretofore.
The main case I'm thinking of is the dictionaries where there is a 'key2' field , and that key2 field has some kind of markup (such as hyphens or accents) which will foil the
regex replacement logic that works for key1.
I guess by regex replacement logic you mean this line in generate.py
new = re.sub(hw,hw1,old)
I was about to raise an issue that the program throws an error when the headword contains regex special characters. (This came up with the Abbreviations submission: '(' and '*', which occur in abbreviations, raised a regex error.)
In my opinion
new = old.replace(hw,hw1)
would be a good candidate for plain character replacement; regex is not needed here.
It would achieve what we want without regex problems.
In terms of PHP equivalents, str.replace() = str_replace and re.sub = preg_replace.
I prefer str.replace for our usage.
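A small illustration of the difference, as a sketch only: the record and headwords below are made up for the example and do not come from any dictionary file.

```python
import re

# Hypothetical record and headwords containing a regex metacharacter.
old = "<HI>{#a(b#}¦ sample entry"
hw, hw1 = "a(b", "a(c"

# Plain string replacement treats '(' as an ordinary character.
new_plain = old.replace(hw, hw1)

# re.sub compiles hw as a pattern; the unbalanced '(' is a syntax error.
try:
    re.sub(hw, hw1, old)
    regex_failed = False
except re.error:
    regex_failed = True

print(new_plain)
print("re.sub raised an error:", regex_failed)
```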
@funderburkjim
Once we reach consensus on exactly which category codes to use, I think this file format would be good; and I'll make a version of generate.py that assumes an input file in this format.
I have modified the proposed format to only 'p','t' and 'n' for print error, typo and no change respectively. All other codes are shifted to notes (can be used optionally). So, the file format is as you desire.
I have tried to share your burden. Code at https://gist.github.com/drdhaval2785/13e1211f6f333bb2cb31 would generate the files you need. They will be dictionary-wise, plus one master file for seeing all corrections and no change cases in one place. Detailed description is under the Expected Outcome heading of the issue.
Please give it a try and let me know.
If there are some changes needed, let me try.
If I fail, you may carry on. This way I would get some feeling of updating the codes you work with.
Category codes are ready for marble, I think. As for "I'm quite willing and glad to take the corrections in the looser form of issue comments that has been used heretofore": I think it's a bad idea to do the cleanup if there are tech-savvy people who can do most of it themselves. I'm very happy to see Dhaval's proactiveness in a field that is very important to me.
As https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154 format has attained finality, this trial version ends.
Hurray!
@drdhaval2785 Glad you are learning how things work. Apparently, the idea of making s3 copies of the pywork, web, and org directories was a good way to let you get 'hands on' experience with the system. (Agree?)
If so, perhaps I should institute some regular routine on revising those s3 copies. Any suggestion on frequency?
Apparently, the idea of making s3 copies of the pywork, web, and org directories was a good way to let you get 'hands on' experience with the system. (Agree?)
Absolutely yes.
Any suggestion on frequency?
15th and 30th of every month
@funderburkjim Any thoughts on https://github.com/sanskrit-lexicon/CORRECTIONS/issues/157#issuecomment-156905980?
Bi-weekly indeed makes sense. If the server goes down, there should be at least two of us who should have it all.
@drdhaval2785 re errors with submission. '(' or '*'
If you want to change s1 = 'an (extra left paren' to s2 = 'an extra left paren' in Python:
s2 = re.sub(r'\(','',s1) # use '\(' to mean a literal left-paren rather than usual regex meaning of '('
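If regex replacement is kept, the escaping can also be done programmatically with `re.escape`. The following is a sketch, not a change to generate.py; the headwords `hw` and `hw1` and the record `old` are hypothetical.

```python
import re

s1 = "an (extra left paren"
# Manual escaping of one known character, as in the line above.
s2 = re.sub(r"\(", "", s1)

# re.escape builds a pattern that matches an arbitrary headword literally.
hw, hw1 = "a*b(", "a*c("          # hypothetical headwords
old = "key a*b( rest"             # hypothetical record
# A function as the replacement avoids backslash processing in hw1.
new = re.sub(re.escape(hw), lambda m: hw1, old)

print(s2)
print(new)
```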
@drdhaval2785 Regarding use of string replacement instead of regex replacement.
Probably would work exactly the same in our usage.
It would not solve the problem of how to autoadjust key2. Note the 4 examples given in readme.txt for benfey that were manual adjustments in benupd_edit.txt. No obvious way to do these adjustments by program. Easier to do manually. At least that's how it seems to me.
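A program can at least flag the records that probably need a manual key2 adjustment. The sketch below assumes a hypothetical record layout with `<k1>` and `<k2>` markers and assumes hyphens are the only key2 markup; the real files may differ, and the function name `apply_headword_fix` is made up for illustration.

```python
def apply_headword_fix(record, hw, hw1):
    """Replace hw by hw1 and report whether the record needs manual review."""
    new_record = record.replace(hw, hw1)
    # Remove the assumed key2 markup (hyphens) and check whether the old
    # headword is still hiding in the record in marked-up form.
    stripped = new_record.replace("-", "")
    needs_review = hw in stripped
    return new_record, needs_review

# Hypothetical records: the second has a hyphenated key2 that replace() misses.
rec_plain = "<k1>jAtisaMkarya"
rec_key2 = "<k1>jAtisaMkarya<k2>jAti-saMkarya"

fixed_plain, review_plain = apply_headword_fix(rec_plain, "jAtisaMkarya", "jAtisAMkarya")
fixed_key2, review_key2 = apply_headword_fix(rec_key2, "jAtisaMkarya", "jAtisAMkarya")
```

The check is deliberately conservative: it also flags a record when the old headword is a substring of the new one, which only costs an extra manual look.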
I have taken entries 201 to 210 from http://drdhaval2785.github.io/o_vs_O/output3/composite2a.html for test purpose.
Input
UPDATED in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548 and subsequent alteration in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issue-116719290.
After examination my change.txt file is
Program
The program is at https://gist.github.com/drdhaval2785/13e1211f6f333bb2cb31. Run
prepareupd.sh
Expected output
Presumptions for the code