Closed funderburkjim closed 8 years ago
Amazing! The match rate is only 71%, probably because of some simple mismatches. A side-by-side table would indeed help. But @drdhaval2785 has been thinking of hibernation mode lately.
Commit 55e0ae65aaa0bf6f353829d29bfc82bfa0348d0e: I have made a slight modification to crefmatch.py (without Jim's approval).
Now we have three files
Creating a study of pw.xml based on these three files.
Will post the results tomorrow.
I see dirt in bibminuscref.txt
How to mark it for cleaning?
@drdhaval2785
Thanks for warning about syncing local repository. Did sync with PWK as first step today, and it grabbed your changes.
Made a couple of adjustments to your modifications of crefmatch.py:
print key.encode('utf-8'), <message>
to solve this
@gasyoun Adjusted pwbib0.txt per your note. There is now a slight variance between the digitization and the text for that record (the '==' was moved to just after the abbreviation). I thought this variance was acceptable.
Then reran. (sh redo.sh in pwbib).
Now PAN4K4AT is one of the matching abbreviations.
Please see the following three files, notably the first one.
XMLs are hard to navigate. Can't we have pure HTML, please?
Think this one has served its purpose, and can be closed.
There is now a program to match literary source abbreviations from two sources:
Some summary statistics are computed, notably:
The program is crefmatch.py.
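For readers who have not opened the program, the general idea of matching two abbreviation lists and reporting a match rate can be sketched as below. This is not the code of crefmatch.py: the function name `match_abbreviations`, the sample abbreviations, the exact-string comparison, and the choice of the first list as the denominator for the percentage are all assumptions made for illustration, and the sketch is written for Python 3.

```python
def match_abbreviations(bib_abbrevs, cref_abbrevs):
    """Compare two collections of abbreviations by exact string equality.

    Hypothetical helper for illustration; not taken from crefmatch.py.
    """
    bib = set(bib_abbrevs)
    cref = set(cref_abbrevs)
    matched = bib & cref
    return {
        "matched": sorted(matched),
        "bib_only": sorted(bib - cref),
        "cref_only": sorted(cref - bib),
        # Assumption: the rate is measured against the first (bib) list.
        "percent_matched": 100.0 * len(matched) / len(bib) if bib else 0.0,
    }


# Made-up sample data, not taken from pwbib0.txt or pw.xml.
stats = match_abbreviations(["AV.", "MBH.", "R."], ["AV.", "MBH.", "RV."])
print("matched: %d, bib only: %d, cref only: %d" % (
    len(stats["matched"]), len(stats["bib_only"]), len(stats["cref_only"])))
print("percent matched: %.1f" % stats["percent_matched"])
```

The two "only" lists are where simple mismatches, such as a missing final period, would show up.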
A good enhancement of the program might be to generate an output file in which the two sources are 'merged' and sorted in some useful order. It may be that such a listing will suggest many obvious corrections.
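One possible shape for such a merged listing is sketched here, in Python 3. Everything in it is an assumption: the function name `merge_sources`, the status labels, the case-insensitive sort order, and the sample data are hypothetical, and the actual file formats of the two sources are not modelled.

```python
def merge_sources(bib_abbrevs, cref_abbrevs):
    """Return (abbreviation, status) rows for the union of both sources.

    Hypothetical helper for illustration; rows are sorted case-insensitively
    so that near-identical spellings end up next to each other.
    """
    bib = set(bib_abbrevs)
    cref = set(cref_abbrevs)
    rows = []
    for abbrev in sorted(bib | cref, key=lambda a: (a.lower(), a)):
        if abbrev in bib and abbrev in cref:
            status = "both"
        elif abbrev in bib:
            status = "bib only"
        else:
            status = "cref only"
        rows.append((abbrev, status))
    return rows


# Made-up sample data; the trailing-period difference is deliberate.
rows = merge_sources(["AV.", "PAN4K4AT.", "R."], ["AV.", "PAN4K4AT", "RV."])
for abbrev, status in rows:
    print("%s\t%s" % (abbrev, status))
```

In a listing like this, a "bib only" row directly next to a "cref only" row with almost the same spelling is a likely candidate for a correction.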