crefmatch - Githubissues

funderburkjim commented 8 years ago

There is now a program to match literary source abbreviations from two sources:

the 500 or so records from the bibliographies in the different volumes of PW, as represented in the file pw_ls/pwbib/pwbib1.txt
the 2700 or so records from the file sortedcrefs.txt under pw_dhaval.

Some summary statistics are computed, notably:

502 records from pwbib1.txt
2711 records from ../pw_dhaval/abbrvwork/abbrvoutput/sortedcrefs.txt
359 matching abbreviations  (71%)
73275 total abbreviation instances from crefs
57092 of these accounted for by matching abbreviations  (78%)

The program is crefmatch.py.
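For readers who want the arithmetic behind these figures, here is a minimal sketch, not the actual crefmatch.py. It assumes the bibliography has been reduced to a list of abbreviation strings and the crefs to a mapping from abbreviation to instance count; the function name `match_stats` and the toy abbreviations are hypothetical. The percentages are taken the same way the numbers above suggest: 359 of 502 bibliography records is about 71%, and 57092 of 73275 instances is about 78%.

```python
def match_stats(bib_abbrevs, cref_counts):
    """Summarize how well the bibliography abbreviations cover the crefs.

    bib_abbrevs: list of abbreviation strings (assumed input shape).
    cref_counts: dict mapping abbreviation -> number of instances (assumed).
    """
    matched = set(bib_abbrevs) & set(cref_counts)
    total_instances = sum(cref_counts.values())
    matched_instances = sum(cref_counts[a] for a in matched)
    return {
        "bib_records": len(bib_abbrevs),
        "cref_records": len(cref_counts),
        "matching": len(matched),
        # Share of bibliography records that also occur among the crefs.
        "matching_pct": round(100.0 * len(matched) / len(bib_abbrevs)),
        "instances": total_instances,
        "matched_instances": matched_instances,
        # Share of all cref instances covered by matching abbreviations.
        "matched_instances_pct": round(100.0 * matched_instances / total_instances),
    }


# Toy data, invented for illustration only.
bib = ["AAA.", "BBB.", "CCC.", "DDD."]
crefs = {"AAA.": 10, "BBB.": 30, "DDD.": 5, "EEE.": 5}
stats = match_stats(bib, crefs)
```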

A good enhancement of the program might be to generate an output file in which the two sources are 'merged' and sorted in some useful order. It may be that such a listing will suggest many obvious corrections.
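One possible shape for such a merged listing, as a rough sketch only: the function name `merged_listing`, the row layout, and the sort key are all assumptions, and the toy abbreviations are invented. Sorting on a lowercased key with the trailing period dropped is one guess at a "useful order", since it puts near-miss spellings next to each other.

```python
def merged_listing(bib_abbrevs, cref_counts):
    """Return rows (abbrev, in_bib, cref_count) covering both sources."""
    bib = set(bib_abbrevs)
    rows = []
    for abbrev in bib | set(cref_counts):
        rows.append((abbrev, abbrev in bib, cref_counts.get(abbrev, 0)))
    # Normalized key first so near-identical spellings end up adjacent;
    # the raw abbreviation breaks ties to keep the order deterministic.
    rows.sort(key=lambda row: (row[0].lower().rstrip("."), row[0]))
    return rows


# Toy data, invented for illustration only.
rows = merged_listing(["AAA.", "Ccc."], {"Aaa": 4, "BBB.": 2, "Ccc.": 7})
for abbrev, in_bib, count in rows:
    print("%-8s bib=%-5s crefs=%d" % (abbrev, in_bib, count))
```

In this toy run "AAA." (bibliography only) and "Aaa" (crefs only) land on adjacent lines, which is the kind of pairing that could point to a correction.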

gasyoun commented 8 years ago

Amazing! The matching rate is probably only 71% because of some simple mismatches. A side-by-side table would indeed help. But @drdhaval2785 has been thinking of hibernation mode lately.

drdhaval2785 commented 8 years ago

In commit 55e0ae65aaa0bf6f353829d29bfc82bfa0348d0e I made a slight modification to crefmatch.py (without Jim's approval).

Now we have three files

https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pwbib/crefbibintersect.txt - literary resources seen in both 'cref' and 'pwbib1'. (359)
https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pwbib/bibminuscref.txt - literary resources seen in 'pwbib1' but absent from 'cref' (130)
https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pwbib/crefminusbib.txt - literary resources seen in 'cref' but absent from 'pwbib1' (2352)
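
The three-way split amounts to one set intersection and two set differences. The sketch below is an illustration under assumptions, not the committed code: the function name `write_partition` is hypothetical, and it assumes one abbreviation per output line, which may not match the real file layout. Only the three output file names are taken from the list above.

```python
import io
import os
import tempfile


def write_partition(bib_abbrevs, cref_abbrevs, outdir):
    """Write the intersection and both differences, one key per line."""
    bib, cref = set(bib_abbrevs), set(cref_abbrevs)
    parts = {
        "crefbibintersect.txt": bib & cref,  # seen in both sources
        "bibminuscref.txt": bib - cref,      # in pwbib1 only
        "crefminusbib.txt": cref - bib,      # in cref only
    }
    for filename, keys in parts.items():
        path = os.path.join(outdir, filename)
        # Explicit UTF-8 so non-ASCII keys are written without errors.
        with io.open(path, "w", encoding="utf-8") as f:
            for key in sorted(keys):
                f.write(key + u"\n")
    return {name: len(keys) for name, keys in parts.items()}


# Toy data, invented for illustration only.
outdir = tempfile.mkdtemp()
counts = write_partition([u"AAA.", u"BBB."], [u"BBB.", u"CCC.", u"DDD."], outdir)
```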

drdhaval2785 commented 8 years ago

Creating some study of pw.xml based on these three files.

Will post the results tomorrow.

gasyoun commented 8 years ago

I see dirt in bibminuscref.txt. How should it be marked for cleaning?

PAN4K4AT.OHNEnähereAngabe - OHNE nähere Angabe (German for "without further specification") is the text that needs to be deleted.

funderburkjim commented 8 years ago

@drdhaval2785
Thanks for warning about syncing local repository. Did sync with PWK as first step today, and it grabbed your changes.

Made a couple of adjustments to your modifications of crefmatch.py:

You had three 'print key,' statements. At first, the program failed on these with 'UnicodeEncodeError'. Changed them to print key.encode('utf-8'), <message> to solve this.
Then, commented out these three print stmts, since the key is already being written to an appropriate file, per your modifications.
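
For anyone hitting the same error, a small sketch of the likely mechanism. This is an assumption about the cause, not a trace of the actual run: in Python 2, printing a unicode key to an output stream that only accepts ASCII makes the implicit encoding step fail, while encoding to UTF-8 by hand first avoids it. The helper `write_key` and the sample key are hypothetical, and the code is written to run under Python 2 or 3.

```python
import io


def write_key(stream, key, encoding):
    """Encode key explicitly and write the bytes plus a newline."""
    stream.write(key.encode(encoding) + b"\n")


key = u"n\u00e4here"  # hypothetical key containing a non-ASCII character
buf = io.BytesIO()    # stands in for a byte-oriented output stream

try:
    write_key(buf, key, "ascii")  # mimics the implicit ASCII encoding
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False              # the failure reported above

write_key(buf, key, "utf-8")      # explicit UTF-8 encoding succeeds
```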

funderburkjim commented 8 years ago

@gasyoun Adjusted pwbib0.txt per your note. There is now a slight variance between the digitization and the text for that record (the '==' was moved to just after the abbreviation). I thought this variance was acceptable.

Then reran (sh redo.sh in pwbib).

Now PAN4K4AT is one of the matching abbreviations.

drdhaval2785 commented 8 years ago

"A good enhancement of the program might be to generate an output file in which the two sources are 'merged' and sorted in some useful order."

Please see the following three files, notably the first one.

https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pwbib/diffstudy/bibminuscref.xml - Entries of pw.xml found in pwbib1.txt but not found in sortedcrefs.txt
https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pwbib/diffstudy/crefminusbib.xml - Entries of pw.xml found in sortedcrefs.txt but not found in pwbib1.txt
https://github.com/sanskrit-lexicon/PWK/blob/master/pw_ls/pwbib/diffstudy/crefbibintersect.xml - Entries of pw.xml found in both sortedcrefs.txt and pwbib1.txt

gasyoun commented 8 years ago

XMLs are hard to navigate. Can't we have pure HTML, please?

funderburkjim commented 8 years ago

Think this one has served its purpose, and can be closed.

sanskrit-lexicon / PWK

crefmatch #17