2-gram vs MW, part 1 - Githubissues

drdhaval2785 commented 8 years ago

Examine html

The txt file is here. It may be taken as the base for making corrections in the standard convention.

Total 309 entries to be examined.

I encourage @gasyoun to examine and submit corrections in standard convention.

@funderburkjim UPDATE on 1.1.2016 - After submissions in a total of 9 parts, here is the standard-format file for processing: https://github.com/sanskrit-lexicon/CORRECTIONS/blob/master/ngram/output/corrections/allvsMW_2_corrected.txt Best of luck.

gasyoun commented 8 years ago

mw72:aulapoi:aulapi:t

aulapi

ap90:rOzhi:rOziha:t

rau

sch:vivvokinI:n:reference article

@drdhaval2785 :n: - in no change cases just the source word is enough. The 2nd word should be optional, no?

gasyoun commented 8 years ago

rAmeSvaraaDvarasuDAmaRi ACC

space ignored in key1. No real mistakes, several similar cases.

How to sort such cases out @funderburkjim ? Is the hiatus list used to weed out false positives, @drdhaval2785 ?

gasyoun commented 8 years ago

shs:zwFh:n:indian outdated orthography

str

ccs:saMorahAra:saMprahAra:t

ccs:[L=26627] [p= 480-1]:[L=26627] [p= 480-2]:t

sampra

gasyoun commented 8 years ago

@drdhaval2785 If I copy-paste from http://sanskrit-lexicon.github.io/CORRECTIONS/ngram/output/html/allvsMW_2.html I get "hippocrEtus PE", which is useless in every way. PE:hippocrEtus:n or PE:hippocrEtus:hippocrEtus:t would make more sense and would save time.

drdhaval2785 commented 8 years ago

Why do you need to copy-paste from the HTML? I make an extra copy of the .txt file. Keep the HTML and txt open side by side. Examine the PDFs from the HTML, make changes to the TXT file, and submit in bunches of 5. That way, at the end, I also have a full correction submission file to hand over to Jim.

gasyoun commented 8 years ago

Oh, did not open txt before.

ieg:agronomoi,191:agronomoi:n:oi
pui:ajimHa,239:ajimHa:n:Ha,mH
skd:atwaNa,706:atwaNa:n:tw
inm:aditeHputra,170:aditeHputra:n:eH

Looks nice. The only thing I would change is to make the abbreviations capital, like IEG instead of ieg.

drdhaval2785 commented 8 years ago

The dicts were capital before. They were made lowercase in the format for ease of typing in manual submissions; otherwise, pressing Shift for three letters is cumbersome.

gasyoun commented 8 years ago

sch:akLpta,173:akLpta:n:Lp,kL kLp (only L dhatu) +
ieg:agronomoi,191:agronomoi:n:oi Greek +
pui:ajimHa,239:ajimHa:n:Ha,mH +
skd:atwaNa,706:atwaNa:n:tw atano

gasyoun commented 8 years ago

inm:aditeHputra,170:aditeHputra:n:eH (two different words, aditeH+putra, why not use key2)?
inm:aditeHsuta,171:aditeHsuta:n:eH (two different words)
skd:adwaNa,774:adwaNa:n:dw

adtane

@drdhaval2785 SKD remains a mystery for me. The Na in adwaNa does not seem to belong to the word, similar as in other cases.

gasyoun commented 8 years ago

acc:adButacaritaISvaraBAzita,301:adButacaritaISvaraBAzita:n:aI (two different words)
pwg:aDyArUWa,2144:aDyArUWa:n:UW +
sch:anavakLpta,2216:anavakLpta:n:Lp,kL +
pui:anuHlAda,472:anuHlAda:n:Hl +
pw:anuzwupkArmIRa,4904:anuzwupkArmIRa:n:pk +

gasyoun commented 8 years ago

ccs:aviQUs,2102:avidvaMs:t:iQ,QU +

avi

drdhaval2785 commented 8 years ago

@drdhaval2785 SKD remains a mystery for me. The Na in adwaNa does not seem to belong to the word, similar as in other cases.

That will remain a mystery to you unless you read Appendix III (pages 95-100) of Vopadeva's Kavikalpadruma, which explains the it-marker system of Vopadeva's grammar.

For 'N' read capture

For 'a' read capture

gasyoun commented 8 years ago

So let's cut off the anubandhas?

drdhaval2785 commented 8 years ago

cut off the anubandhas?

Include in the verb study. Not that simple. Needs a research tag.

gasyoun commented 8 years ago

First cut off, then research :neckbeard:

drdhaval2785 commented 8 years ago

The discussion here has gone haywire. So let's install these submissions and continue submissions in some other issue.

funderburkjim commented 8 years ago

@drdhaval2785 As I read it, you have been so kind as to prepare allvsMW_2_corrected.txt,

which aggregates all the corrections from this issue and all those mentioned above (thru #226).

Thus, I need only work from allvsMW_2_corrected to cover all these separate issues.

Just wanted you to confirm this, before I start installing tomorrow.

drdhaval2785 commented 8 years ago

@funderburkjim You guessed it right. I also cross-checked that there is no entry where

- errorcode is 'n' and words are different.
- words are the same and errorcode is not 'n'.

So the majority of possible errors have been taken care of. You may install it in one go.
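
The two cross-checks can be sketched in Python. This is only a minimal sketch, not the script actually used; it assumes each line follows the dict:oldword,number:newword:errorcode:ngrams layout seen in the examples in this thread, and the function names are hypothetical.

```python
def parse_correction(line):
    """Split one correction line into (dict, old, new, code).

    Assumes the colon-separated layout seen in this thread, e.g.
    ieg:agronomoi,191:agronomoi:n:oi
    The ',191' suffix on the old word is treated as optional.
    """
    fields = line.strip().split(":")
    dictcode, old, new, code = fields[0], fields[1], fields[2], fields[3]
    old = old.split(",")[0]  # drop the optional ',number' suffix
    return dictcode, old, new, code


def inconsistent(lines):
    """Return the lines that fail either cross-check:
    code 'n' with different words, or identical words with a code other than 'n'."""
    bad = []
    for line in lines:
        _, old, new, code = parse_correction(line)
        if (code == "n") != (old == new):
            bad.append(line)
    return bad
```

An empty result from inconsistent() would correspond to the "no such entry" outcome reported here.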

gasyoun commented 8 years ago

Dhaval is a wonder-man. He invents a method (with or without hints). He uses it. He submits in a ready to go format. It's only a matter of pressing Enter. Am I wrong, Jim?

funderburkjim commented 8 years ago

It's more than pressing 'Enter', but the standard form of submission considerably simplifies and makes routine the installation, and I appreciate that Dhaval has so prepared these standard form corrections.

Of course, I also see part of my installation task as being a diligent gatekeeper, and thus I examine each submission.

gasyoun commented 8 years ago

How much time does a non-Enter acceptance of a submission take?

funderburkjim commented 8 years ago

Re ap90:rOzhi:rOziha:t I think rOzhi is correct. Notice the virAma under the 'z'

Can't find rOzhi in any other dictionary, and can't find rOziha in any dictionary.

Withdraw this analysis. Agree with @drdhaval2785 in #224. rOhiz . MW confirms.

funderburkjim commented 8 years ago

Re How to sort out such cases as rAmeSvaraaDvarasuDAmaRi ACC

Consulting a list of words with a space in key2 would be a step in this direction.

Since key1 also drops out avagraha ', such a list might include those with a single quote in key2.

To make such lists completely reliable might take more work than expected, since what is taken as 'key2' might be rather complicated for some dictionaries. And, in some cases (recall recent discussion of VEI) what is currently saved in the 'key2' field of X.xml might not be the best choice for key2.
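
As a rough sketch of how such a list might be built: scan (key1, key2) pairs and keep the key1 values whose key2 contains a space or a single quote. The input shape, the function name, and the sample key2 value (including where the space falls) are assumptions for illustration; how key2 is really stored differs by dictionary, as noted above.

```python
def keys_with_dropped_chars(pairs, dropped=(" ", "'")):
    """Return the key1 values whose key2 contains a character that key1 drops.

    `pairs` is an iterable of (key1, key2) tuples; this input shape is an
    assumption for illustration, not the layout of any X.xml file.
    """
    return {key1 for key1, key2 in pairs if any(ch in key2 for ch in dropped)}


# Hypothetical sample data; the position of the space in key2 is a guess.
sample = [
    ("rAmeSvaraaDvarasuDAmaRi", "rAmeSvara aDvarasuDAmaRi"),
    ("aulapi", "aulapi"),
]
known_false_positives = keys_with_dropped_chars(sample)
```

Headwords in the resulting set could then be skipped when n-gram anomalies are reported.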

funderburkjim commented 8 years ago

Re :n: - in no change cases just the source word is enough. The 2nd word should be optional, no?

You could make the second word 'empty':

:headword::n: blah blah

This would save you time, and not require a code rewrite. Since for ':n:' cases nothing is done with the data except posting to file 'corrections_nochange.txt', it doesn't matter what is in that third field, except that the field is there.
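
To illustrate why an empty field needs no code rewrite: a plain split on ':' still yields a third field, which is simply an empty string. A minimal Python sketch, assuming the colon-separated format used in this thread (the sample line and its comment text are made up, and this is not the actual installation code):

```python
line = "sch:vivvokinI::n:some comment"  # third field left empty
fields = line.split(":")

dictcode, old, new, code = fields[:4]
print(len(fields))  # 5 fields, the same count as a fully filled line
print(repr(new))    # '' (empty, but the field is present)
print(code)         # n
```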

funderburkjim commented 8 years ago

Re make abbreviations capital, like IEG instead of ieg. Agree with Dhaval, keep lower case. Programs assume lower case, I think.

The only reason capitals were used at all was that capital letters appear in the directory names at Cologne (like scans/IEGScan/....). It would have been better to have lower case throughout, but hard to change now.
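
For a program that needs to go from a lowercase code to a Cologne directory name, a sketch might look like this. It assumes the pattern suggested by the directory names that appear in this thread (IEGScan, MW72Scan, MDScan), namely the uppercased code followed by 'Scan'; whether every dictionary follows that pattern is not confirmed here, and the function name is hypothetical.

```python
def scan_dir(dictcode):
    """Map a lowercase dictionary code to its presumed scans/ directory name."""
    return dictcode.upper() + "Scan"


print(scan_dir("ieg"))   # IEGScan
print(scan_dir("mw72"))  # MW72Scan
```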

funderburkjim commented 8 years ago

Re: pwg:aDyArUWa,2144:aDyArUQa:t:UW I think it should be a print error.

Compare rUQa:

Also, compare glyphs for UW and UQa:

funderburkjim commented 8 years ago

Finished analysis of the corrections in this issue.

funderburkjim commented 8 years ago

Finished all analyses.

127 no-changes added to corrections_nochange.txt

Beginning installation

60 changes in 20 dictionaries.

funderburkjim commented 8 years ago

Corrections installed.

funderburkjim commented 8 years ago

Here are some behind-the-scenes details regarding this marathon of installation of changes.

At the current level of automation, it takes about 15-20 minutes per dictionary for installation. So 5-6 hours for these 20 dictionaries.
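
The estimate can be checked with a few lines of Python, using only the figures quoted above:

```python
dictionaries = 20
low_min, high_min = 15, 20  # minutes per dictionary, from the estimate above

low_hours = dictionaries * low_min / 60
high_hours = dictionaries * high_min / 60
print(f"{low_hours:.1f} to {high_hours:.1f} hours")  # 5.0 to 6.7 hours
```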

The automation uses partial templating. In this case, 'partial' means that some base template files are used, but that the templates require manual adjustment in various places.

The first step for a given dictionary involves copying some files from a base model; here is that step for Wilson dictionary.

n/2014/pywork/correctionwork/
cp -r /afs/rrz.uni-koeln.de/vol/www/projekt/sanskrit-lexicon/http/docs/scans/MW72Scan/2014/pywork/correctionwork/issue-189 .
cd issue-189/
rm mw72*
rm prev_change.txt
rm pw_readme.txt

Edit readme.txt:
 mw72 -> wil
 etc.
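
This copy-and-rename step could in principle be scripted. The following Python sketch is an illustration only, not part of the existing tooling: the function name is hypothetical, it handles only the mw72* files, prev_change.txt and the simple code replacement in readme.txt, and the uppercase replacement (MW72 -> WIL) is an assumption. The "etc." adjustments would still be manual.

```python
import shutil
from pathlib import Path


def start_from_template(template_dir, work_dir, old_code, new_code):
    """Copy a previous issue's work directory and retarget it to a new dictionary.

    Mirrors the manual steps above: copy the directory, delete files that
    belong to the old dictionary, and rewrite the code inside readme.txt.
    """
    work_dir = Path(work_dir)
    shutil.copytree(template_dir, work_dir)  # work_dir must not exist yet

    # Remove leftovers from the template dictionary (rm mw72*, rm prev_change.txt).
    leftovers = list(work_dir.glob(old_code + "*")) + [work_dir / "prev_change.txt"]
    for leftover in leftovers:
        if leftover.is_file():
            leftover.unlink()

    # Replace the old code in readme.txt (mw72 -> wil; MW72 -> WIL is assumed).
    readme = work_dir / "readme.txt"
    text = readme.read_text()
    text = text.replace(old_code, new_code).replace(old_code.upper(), new_code.upper())
    readme.write_text(text)
    return work_dir
```

A call such as start_from_template(template_path, "issue-189", "mw72", "wil") would correspond to the commands shown above, with template_path standing for the MW72 issue-189 directory.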

Then, the readme.txt file is edited and instructions therein are followed. Here is that file for latest Wilson update:


;Corrections to WIL
; Ref https://github.com/sanskrit-lexicon/CORRECTIONS/issues/189
This is in directory pywork/correctionwork/issue-189/.

Input file is change.txt
step 1. Generate wilupd.txt, wilupd.tsv, and wilnochange.txt
sh prepareupd.sh wil dhaval 189

step 1a. Make manual adjustments to wilupd.txt:
 cp wilupd.txt wilupd_edit.txt  # corrections in a copy
 1 revisions needed

step 2. Install corrections using wilupd_edit.txt
 - By examination of pywork/update.sh, the last line is
python updateByLine.py ../orig/wil1.txt manualByLine1.txt ../orig/wil.txt 

 - So, we append wilupd_edit.txt to the end of file manualByLine1.txt
   cd ../../ 
   cp manualByLine1.txt prev_manualByLine1.txt
   cat prev_manualByLine1.txt correctionwork/issue-189/wilupd_edit.txt > manualByLine1.txt
 - Then, in pywork directory, issue that last command of update.sh, as shown
   above

 - Then, create the headwords file, wilhw2.txt
 - Then, recreate (a) wil.xml, (b) ../web/sqlite/wil.sqlite
sh redo_hw.sh
sh redo_xml.sh

The rest of the steps update downloads and documentation
step3.  
  web/webtc directory:
  edit web/webtc/download.html, and change as-of-date at bottom of file
  remake downloads:
    cd to downloads directory
    sh redo_all.sh

If needed, initialize make_sync.sh and update_sync.sh:
- 1. Copy prototypes from MD:
 cp ../../../MDScan/2014/pywork/make_sync.sh .
 cp ../../../MDScan/2014/pywork/update_sync.sh .
- 2. Edit update_sync.sh so consistent with above:
   Edit make_sync.sh for wil

prepare new sync update file.
  sh make_sync.sh

step4a: copy wilupd.tsv to php/correction_response/
 From pywork directory:
cp correctionwork/issue-189/wilupd.tsv ../../../../php/correction_response/

step4b: append wilupd.tsv to end of cfr.tsv
 cd ../../../../php/correction_response/
 cp cfr.tsv cfr-prev.tsv
 cat cfr-prev.tsv wilupd.tsv > cfr.tsv
 rm wilupd.tsv 

step5a.  On local machine, open Github application, then open local
 CORRECTIONS repository in Explorer
step5b. Open GitBash terminal,
 cd Documents/GitHub/CORRECTIONS/
 sh redo_cfr.sh

step6a. edit history.txt, and write note of the changes
step6b. update dictionaries/WIL/wil_printchange.txt using cfr.tsv
    This is done by reformatting the 'print error' records of wilupd.txt
    NOTE: 1 cases added

step6c. update corrections_nochange.txt from wilnochange.txt
   NOTE: 0 cases added

step7.  prepare new sanhw1.txt and sanhw2.txt
step7a. On Cologne server, change to scans/awork/sanhw1
  (Assuming still in php/correction_response):
  cd ../../scans/awork/sanhw1/
  sh redo_update.sh
step7b. In local CORRECTIONS repository,
  sh redo_sanhw12.sh

step8. sync with GitHub
step8a. Create commit
step8b. 'Sync'

step9. Make 'installation complete' note in #189.

step10.  Update s3 backup of wil
step10a. Assuming in php/correction_response:
  cd  ../../scans/awork/virtualenv/aws/
step10b. Be sure the redo_all.sh of above is finished.
  Make the script, execute it, and deactivate
  python make_copy_environ.py wil
  source s3bk_wil.sh
  rm s3bk_wil.sh

As you can see, installation actually involves 3 systems: Cologne, GitHub (via local repository), and AWS s3.
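
Steps 2 and 4b of the readme share one pattern: keep a copy of the current file, then append the new records to it. A minimal Python sketch of that pattern follows; the function name is hypothetical and this is not part of the existing scripts.

```python
import shutil
from pathlib import Path


def append_with_backup(target, addition, backup):
    """Save `target` as `backup`, then append the contents of `addition` to `target`.

    Same effect as:  cp target backup; cat backup addition > target
    """
    target, addition, backup = Path(target), Path(addition), Path(backup)
    shutil.copyfile(target, backup)
    with target.open("ab") as out:
        out.write(addition.read_bytes())


# Step 4b, for example, would correspond to:
# append_with_backup("cfr.tsv", "wilupd.tsv", "cfr-prev.tsv")
```

Keeping the backup under a fixed name, as the readme does, means only the most recent previous version is retained.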

gasyoun commented 8 years ago

It hurts. It's a sorrow path you walk, Jim.

sanskrit-lexicon / CORRECTIONS

2-gram vs MW, part 1 #189