Closed: drdhaval2785 closed this issue 8 years ago
@funderburkjim, would you comment on whether the format is amenable to consistent mechanical handling, and whether any additional field would be necessary? @gasyoun and @zaaf2, please comment on whether you are ready to take the additional pains of writing a few small letters in a txt file to ease life for Jim.
So 'no change' words are in the same file? Shouldn't we have NO CHANGE in a separate file? Otherwise it could become rather big and messy. I am not sure what the difference is between an ocr error and a digitization error. digitization error = markup? I'm ready to take additional pains to lessen the burden of Jim's work, so he has more time for valuable contributions. Thanks for the clarification, Dhaval.
@drdhaval2785 Am willing to give a try to using a text file as you suggest.
The next batch you do, give this a try and make the txt file, and I'll see what's involved in using it. I will defer comments on the suggested file format until I see a live example.
One small suggestion on file format.
Let lines beginning with a semicolon be regarded as comments, not a 'no change'. Such a 'comment' line would be for file readability, but would be ignored by programs which process the file.
Your format already has a nochange code option
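A minimal sketch of how a processing script might honor the comment convention, assuming the file is read line by line. The function name iter_records is hypothetical and is not part of generate.py.

```python
def iter_records(lines):
    """Yield data lines, skipping blank lines and ';' comment lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        # A line whose first character is ';' is a comment for human readers only.
        if not line.strip() or line.startswith(";"):
            continue
        yield line


sample = [
    "; batch submitted for pw\n",
    "pw:kesarin:keSarin:n:a\n",
    "\n",
]
for record in iter_records(sample):
    print(record)
```

Only the data line survives; the comment line and the blank line are dropped before any field is parsed.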
I think a smaller number of 'errorcodes' and 'nochange' codes would suffice for my processing.
In terms of processing the records, just three are adequate:
The 'd - digitization error' seems identical to 'o - ocr error'.
The 'l - lexicographer error' and 'm - miscellaneous error' are refinements of 'p - print error'.
And, as you already note, 's - separate words', 'a - alternate words', and 'w - wrong reading' are refinements of 'n - no change'.
These refinements could be included in the 'note' field.
@funderburkjim The format has been revised once again based on your suggestions. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issue-116719290 Please have a look and give the go-ahead.
Looks fine to me: dictcode:currenthw:correcthw:errorcode:note (with p,t,n as errorcode), and other codes as abbreviations in the 'note' field.
I like the idea of 'keeping both words' in those marked 'nochange'. But there arises the question of where to identify the dictionaries pertaining to the 2nd form of the headword. dictcode identifies the dictionary of currenthw, but correcthw comes from one or more other dictionaries. To make a nochange.txt record for 'correcthw', we would (ideally at least) need one or more dictionary codes for correcthw.
One solution would be to use an * in nochange.txt for the dictionary code of correcthw, meaning that it is correct in any dictionary where it occurs. This has the virtue of not adding more complication to the nice 5-field form dictcode:currenthw:correcthw:errorcode:note .
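A rough sketch of how the * convention could be checked, assuming nochange.txt has already been loaded into a set of (dictcode, headword) pairs. The function name is_marked_nochange and the in-memory layout are assumptions made for illustration only.

```python
def is_marked_nochange(dictcode, headword, nochange_entries):
    """Return True if headword is marked 'no change' for dictcode.

    nochange_entries is a set of (dictcode, headword) pairs. A dictcode
    of '*' stands for "any dictionary where the headword occurs".
    """
    return ((dictcode, headword) in nochange_entries
            or ("*", headword) in nochange_entries)


# currenthw is tied to its own dictionary; correcthw uses the wildcard.
entries = {("pw", "kesarin"), ("*", "keSarin")}
print(is_marked_nochange("pw", "kesarin", entries))
print(is_marked_nochange("mw", "keSarin", entries))
print(is_marked_nochange("mw", "kesarin", entries))
```

The wildcard keeps the record at five fields: no extra list of dictionary codes has to be attached to correcthw.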
I guess we won't have to bother about dictcode of correcthw in case of nochange. The program for generating nochange.txt is designed in such a way that it reads only the hw. Dictdata is brought from sanhw1.txt or sanhw2.txt files.
Let's call all decisions final.
Yes. The format is final from my side.
Agreed from my side.
Finalizing this format now.
@funderburkjim How would you like the L-numbers to be added to the form? e.g.
dictcode:currenthw:correcthw:errorcode:note#lnum
Please note that the last entry, note#lnum, is optional. It can be left blank.
If, with some methodology, we are unable to generate 'lnum' automatically, we can leave this field out.
@gasyoun What is your take?
Our online correction submission form has L number. Now we have sanhw2.txt with L numbers. So, in practice it should be possible to generate the txt file with L numbers.
Only our code for upd.txt file needs to be modified a tiny bit.
Yes, L-numbers are needed when there are several similarly spelled words that differ only in L. Otherwise several of my own submissions would not make sense. It would be surplus in most cases, but 1 out of 10 needs it, so we need to leave a place for L. But only if it is copy-pasted automatically, otherwise it is a burden. And I would not make it a required field, only a possible one, because the rest I can handle by hand, while the L I have to check.
dictcode:currenthw[,lnum]:correcthw:errorcode:note
is what Jim proposes, and I agree. Made the change in the original proposal. [,lnum] is optional.
note optional as well. Even errorcode - optional? When unsure.
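A minimal sketch of parsing the agreed form dictcode:currenthw[,lnum]:correcthw:errorcode:note, with lnum, errorcode and note all treated as optional. The names Correction and parse_correction are hypothetical, the L-number in the example is made up, and the sketch assumes that SLP1 headwords contain neither ',' nor ':'.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Correction:
    dictcode: str
    currenthw: str
    lnum: Optional[str]  # kept as text; None when the submitter omitted it
    correcthw: str
    errorcode: str       # '' when left blank
    note: str            # '' when left blank


def parse_correction(line):
    """Parse one dictcode:currenthw[,lnum]:correcthw:errorcode:note record."""
    # Split at most 4 times so that a free-style note may itself contain ':'.
    parts = line.rstrip("\r\n").split(":", 4)
    if len(parts) < 3:
        raise ValueError("need at least dictcode:currenthw:correcthw: %r" % line)
    parts += [""] * (5 - len(parts))  # pad a missing errorcode and/or note
    dictcode, hw_field, correcthw, errorcode, note = parts
    # The optional L-number rides on the headword field after a comma.
    currenthw, sep, lnum = hw_field.partition(",")
    return Correction(dictcode, currenthw, lnum if sep else None,
                      correcthw, errorcode, note)


print(parse_correction("mw:kesarin,12345:keSarin:p:"))  # L-number is invented
print(parse_correction("pw:kesarin:keSarin::"))
```

Keeping lnum as text avoids guessing at the numeric form of L-numbers, and a record without it parses exactly as before.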
@gasyoun raised the possibility of mechanization at https://github.com/sanskrit-lexicon/CORRECTIONS/issues/138#issuecomment-156001333. @funderburkjim raised a concern at https://github.com/sanskrit-lexicon/CORRECTIONS/issues/138#issuecomment-156241704 that corrections are submitted on github in various formats, so he cannot use generate.py uniformly to generate the pwupd.txt and pwupd.tsv files mechanically, and effort is therefore duplicated.
If we can give him a corrected txt file in a fixed format, we would be able to help him in a big way. All he has to do is: python generate.py, then pwupd.txt to manualByLine02.txt
My suggested plan
dictcode:currenthw:correcthw::
python generate.py to generate pwupd.txt and pwupd.tsv files from change.txt.
$ python updateByLine.py ../orig/pw2.txt manualByLine02.txt ../orig/pw.txt
sh redo_hw.sh
sh redo_xml.sh
Suggested format
Updated in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548
e.g.
Arguments
dictcode
Preferably lowercase only, for ease of typing.
"acc","cae","ae","ap90","ap","ben","bhs","bop","bor","bur","ccs","gra","gst","ieg","inm","krm","mci","md","mw72","mw","mwe","pd","pe","pgn","pui","pwg","pw","sch","shs","skd","snp","stc","vcp","vei","wil","yat"
currenthw
Current headword in SLP1 transliteration
lnum
L-number of the headword. It is optional. When you want to submit it, write it like currenthw,lnum . Needed mostly in MW, where there are many homonyms and different L-s for different senses of the same word.
correcthw
Correct form in SLP1 transliteration
errorcode
p - print error
t - typo
n - no change
Note - Don't worry about currenthw and correcthw in case of no change. We will take care programmatically that currenthw is not converted to correcthw. It makes sense to keep both words in nochange.txt, because we have examined both words and come to the conclusion that neither requires change.
note
Updated in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548
The note may be written in free style, whatever you want. But in our experience with correction submissions, the following notes recur, so we have created short forms for them.
a - alternate words - subset of nochange
w - wrong reading - subset of nochange
l - lexicographer error - subset of print error / typo
s - separate words - subset of nochange
c - convention error - subset of print error
m - multiple headwords - subset of nochange.
g - print smudge - subset of print error
How to use these short forms -
pw:kesarin:keSarin:n:a
If you want to write some detailed note -
pw:kesarin:keSarin:n:Both words are alternate to each other.
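A rough sketch of how a script could treat the two examples above: the errorcode decides whether a record belongs with the changes or with nochange.txt, and a one-letter note is expanded from the short-form table while a free-style note is left alone. The names NOTE_ABBREVIATIONS, expand_note and route are hypothetical, and the sketch assumes all five fields are present on the line.

```python
# Short forms for the note field, as listed in the proposal.
NOTE_ABBREVIATIONS = {
    "a": "alternate words",
    "w": "wrong reading",
    "l": "lexicographer error",
    "s": "separate words",
    "c": "convention error",
    "m": "multiple headwords",
    "g": "print smudge",
}


def expand_note(note):
    """Expand a one-letter short form; leave free-style notes untouched."""
    return NOTE_ABBREVIATIONS.get(note.strip(), note)


def route(line):
    """Return ('change' or 'nochange', expanded note) for one record."""
    # Split at most 4 times so a free-style note may contain ':'.
    dictcode, currenthw, correcthw, errorcode, note = line.split(":", 4)
    # 'n' means both headwords were examined and neither is to be altered.
    target = "nochange" if errorcode == "n" else "change"
    return target, expand_note(note)


print(route("pw:kesarin:keSarin:n:a"))
print(route("pw:kesarin:keSarin:n:Both words are alternate to each other."))
```

Because only p, t and n are errorcodes, the routing stays a single comparison, and the finer distinctions live in the note field.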