sanskrit-lexicon / CORRECTIONS

Correction history for Cologne Sanskrit Lexicon

Suggested format for correction submission #154

Closed drdhaval2785 closed 8 years ago

drdhaval2785 commented 8 years ago

@gasyoun raised the possibility of mechanization at https://github.com/sanskrit-lexicon/CORRECTIONS/issues/138#issuecomment-156001333. @funderburkjim raised a concern at https://github.com/sanskrit-lexicon/CORRECTIONS/issues/138#issuecomment-156241704: corrections are submitted on GitHub in various formats, so he cannot use generate.py uniformly to generate the pwupd.txt and pwupd.tsv files mechanically, and this leads to duplication of effort.

If we can give him a corrected txt file in a fixed format, we would help him in a big way. All he would then have to do is:

  1. python generate.py
  2. Append the data of pwupd.txt to manualByLine02.txt
  3. Update the lists by shell script.

    My suggested plan

  1. I provide a .txt file along with the .html file we are working with right now.
  2. That .txt file would have lines of the form dictcode:currenthw:correcthw::.
  3. When we submit an error on GitHub, we record the error code and the note in our local copy of the .txt file.
  4. After submitting the corrections, we also hand the .txt file over to Jim.
  5. Jim works with the following logic: (a) if the line starts with ';', add it to nochange.txt; (b) else if the line doesn't have an 'errorcode' or 'nochangecode', add it to nodecision.txt; (c) else add it to change.txt.
  6. Run python generate.py to generate the pwupd.txt and pwupd.tsv files from change.txt.
  7. Next, run python updateByLine.py ../orig/pw2.txt manualByLine02.txt ../orig/pw.txt
  8. sh redo_hw.sh
  9. sh redo_xml.sh
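
The sorting rule in the plan above (semicolon lines, lines without a code, everything else) could be sketched in Python roughly as follows. This is only an illustration, not the actual generate.py: the function names classify and split_submissions are hypothetical, and routing lines with errorcode n to nochange is an assumption, since the plan does not spell that case out.

```python
def classify(line):
    """Return 'nochange', 'nodecision' or 'change' for one submission line."""
    text = line.strip()
    # (a) a leading semicolon marks the line as 'no change'
    if text.startswith(';'):
        return 'nochange'
    # dictcode:currenthw[,lnum]:correcthw:errorcode:note
    fields = text.split(':', 4)
    errorcode = fields[3].strip() if len(fields) >= 4 else ''
    # Assumption: an explicit 'n' code is also a 'no change' decision
    if errorcode == 'n':
        return 'nochange'
    # (c) a print error or typo is a real change
    if errorcode in ('p', 't'):
        return 'change'
    # (b) no usable code: nobody has decided yet
    return 'nodecision'


def split_submissions(lines):
    """Group non-empty lines by the bucket that classify() assigns."""
    buckets = {'nochange': [], 'nodecision': [], 'change': []}
    for line in lines:
        if line.strip():
            buckets[classify(line)].append(line.rstrip('\n'))
    return buckets
```

Each bucket would then be written to nochange.txt, nodecision.txt or change.txt respectively.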

    Suggested format

Updated in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548

dictcode:currenthw[,lnum]:correcthw:errorcode:note

e.g.

bop:mfgalAYcana,6332:mfgalAYCana:p:Maybe a print smudge
or
bop:mfgalAYcana:mfgalAYCana:p:Maybe a print smudge
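
For illustration, a line in this format could be split into its fields as sketched below. The names Correction and parse_correction are hypothetical and not part of the existing scripts. The sketch assumes that only the note may contain further colons, and it keeps lnum as a string in case L-numbers are not plain integers.

```python
from collections import namedtuple

# Hypothetical record type for one line of the submission file
Correction = namedtuple(
    'Correction', 'dictcode currenthw lnum correcthw errorcode note')


def parse_correction(line):
    """Parse dictcode:currenthw[,lnum]:correcthw:errorcode:note."""
    # Split at most 4 times so that colons inside the note survive
    parts = line.rstrip('\n').split(':', 4)
    if len(parts) < 3:
        raise ValueError('expected at least dictcode:currenthw:correcthw')
    # Treat missing trailing fields (errorcode, note) as empty
    parts += [''] * (5 - len(parts))
    dictcode, hw_field, correcthw, errorcode, note = parts
    # The optional L-number follows the headword after a comma
    currenthw, _, lnum = hw_field.partition(',')
    return Correction(dictcode, currenthw, lnum or None,
                      correcthw, errorcode, note)
```

With the first example above, parse_correction would give currenthw 'mfgalAYcana' and lnum '6332'; with the second, lnum is None.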

Arguments

dictcode

Preferably lowercase only, for ease of typing.

"acc","cae","ae","ap90","ap","ben","bhs","bop","bor","bur","ccs","gra","gst","ieg","inm","krm","mci","md","mw72","mw","mwe","pd","pe","pgn","pui","pwg","pw","sch","shs","skd","snp","stc","vcp","vei","wil","yat"

currenthw

Current headword in SLP1 transliteration

lnum

L-number of the headword. It is optional. When you want to submit it, write it as currenthw,lnum. It is needed mostly in MW, where there are many homonyms and different L-numbers for different senses of the same word.

correcthw

Correct form of the headword in SLP1 transliteration

errorcode

p - print error
t - typo
n - no change

Note - Don't worry about currenthw and correcthw in case of no change. We will take care programmatically that currenthw is not converted to correcthw. It makes sense to keep both words in nochange.txt, because we have examined both and concluded that neither requires a change.

note

Updated in response to https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issuecomment-156867548

The note may be written in free style, whatever you want. But in our experience with correction submissions, the following notes recur, so we have created short forms for them:

a - alternate words - subset of nochange
w - wrong reading - subset of nochange
l - lexicographer error - subset of print error / typo
s - separate words - subset of nochange
c - convention error - subset of print error
m - multiple headwords - subset of nochange
g - print smudge - subset of print error

How to use these short forms - pw:kesarin:keSarin:n:a

If you want to write some detailed note - pw:kesarin:keSarin:n:Both words are alternate to each other.
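
As a rough sketch, a script could expand the short forms listed above while leaving free-style notes untouched. The names NOTE_ABBREVIATIONS and expand_note are hypothetical; only the letter-to-meaning pairs come from the list in this post.

```python
# Short forms for recurring notes, as listed in the proposal
NOTE_ABBREVIATIONS = {
    'a': 'alternate words',
    'w': 'wrong reading',
    'l': 'lexicographer error',
    's': 'separate words',
    'c': 'convention error',
    'm': 'multiple headwords',
    'g': 'print smudge',
}


def expand_note(note):
    """Expand a one-letter short form; return any other note unchanged."""
    return NOTE_ABBREVIATIONS.get(note.strip(), note)
```

For pw:kesarin:keSarin:n:a the note 'a' would expand to 'alternate words', while a detailed sentence passes through as written.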

drdhaval2785 commented 8 years ago

@funderburkjim Would you like to comment on whether the format is amenable to consistent mechanical handling, and whether any additional field would be necessary? @gasyoun and @zaaf2 - Please comment on whether you are ready to take the additional pains of writing a few small letters in a txt file to ease life for Jim.

gasyoun commented 8 years ago

So no change words are in the same file? Should not we have NO CHANGE in a new file? Otherwise it could become rather big and messy. Not sure what is the difference between ocr error and digitization error. digitization error = markup? I'm ready to take additional pain to lessen the burden of Jim's work, so he has more time for valuable contributions. Thanks for such a clarification, Dhaval.

funderburkjim commented 8 years ago

@drdhaval2785 Am willing to give a try to using a text file as you suggest.

The next batch you do, give this a try and make the txt file, and I'll see what's involved in using it. I will defer comments on the suggested file format until I see a live example.

funderburkjim commented 8 years ago

One small suggestion on file format.

Let lines beginning with a semicolon be regarded as comments, not a 'no change'. Such a 'comment' line would be for file readability, but would be ignored by programs which process the file.

Your format already has a nochange code option
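
A minimal Python sketch of this comment convention, assuming a processing script reads the file line by line (read_records is a hypothetical name, not an existing function in the repository):

```python
def read_records(lines):
    """Yield data lines, skipping blank lines and ';' comment lines."""
    for line in lines:
        text = line.strip()
        # Lines starting with a semicolon exist only for human readers
        if not text or text.startswith(';'):
            continue
        yield text
```

An explicit 'no change' decision would then be carried only by the n code in the errorcode field.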

funderburkjim commented 8 years ago

I think a smaller number of 'errorcodes' and 'nochange' codes would suffice for my processing.

In terms of processing the records, just three are adequate:

The 'd - digitization error' seems identical to 'o - ocr error'.

The 'l - lexicographer error' and 'm - miscellaneous error' are refinements of 'p - print error'.

And, as you already note, 's - separate words', 'a - alternate words', and 'w - wrong reading' are refinements of 'n - no change'.

These refinements could be included in the 'note' field.

drdhaval2785 commented 8 years ago

@funderburkjim The format is revised once again depending on your suggestions. https://github.com/sanskrit-lexicon/CORRECTIONS/issues/154#issue-116719290 Please have a look and give a go ahead.

funderburkjim commented 8 years ago

Looks fine to me: dictcode:currenthw:correcthw:errorcode:note (with p,t,n as errorcode), and other codes as abbreviations in the 'note' field.

I like the idea of 'keeping both words' in those marked 'nochange'. But this raises the question of how to identify the dictionaries pertaining to the second form of the headword. dictcode identifies the dictionary of currenthw, but correcthw comes from one or more other dictionaries. To make a nochange.txt record for 'correcthw', we would (ideally at least) need one or more dictionary codes for correcthw.

One solution would be to use in nochange.txt an * for the dictionary code of correcthw, meaning that it is correct in any dictionary where it occurs. This has the virtue of not adding more complication to the nice 5-field form dictcode:currenthw:correcthw:errorcode:note .

drdhaval2785 commented 8 years ago

I guess we won't have to bother about dictcode of correcthw in case of nochange. The program for generating nochange.txt is designed in such a way that it reads only the hw. Dictdata is brought from sanhw1.txt or sanhw2.txt files.

gasyoun commented 8 years ago

Let's call all decisions final.

drdhaval2785 commented 8 years ago

Yes. The format is final from my side.

funderburkjim commented 8 years ago

Agreed from my side.

drdhaval2785 commented 8 years ago

Finalizing this format now.

drdhaval2785 commented 8 years ago

@funderburkjim How would you like the L-numbers to be added to the format? e.g. dictcode:currenthw:correcthw:errorcode:note#lnum

Please see - The last entry note#lnum is optional. It can be left blank. In case of some methodology, if we are unable to generate 'lnum' automatically, we can leave this field out.

@gasyoun What is your take?

Our online correction submission form has L number. Now we have sanhw2.txt with L numbers. So, in practice it should be possible to generate the txt file with L numbers.

Only our code for upd.txt file needs to be modified a tiny bit.

gasyoun commented 8 years ago

Yes, if there are several similarly spelled words where only the L differs, they are needed. Otherwise several of my own submissions would not make sense. It would be surplus in most cases, but 1 out of 10 needs it, so we need to leave a place for L. But only if it is copy-pasted automatically, otherwise it is a burden. And I would not make it a required field, only an optional one. Because the rest I can handle by hand, but the L I have to check.

drdhaval2785 commented 8 years ago

dictcode:currenthw[,lnum]:correcthw:errorcode:note

Is what Jim proposes and I agree. Made the change in the original proposal. [,lnum] is optional.

gasyoun commented 8 years ago

note optional as well. Even errorcode - optional? When unsure.