PW bib new work - Githubissues

funderburkjim commented 8 years ago

This issue documents some work which may be of use in connection with #56. The work is done in the pwbib_new_work directory of this repository.

The redo.sh script computes the various results.

mergebibnew.txt: This file merges pwbib1 (the PW bibliography references) and pwbib_new (the references needed to resolve literary source references appearing in pw.txt, the digitization of the dictionary).

The records are sorted by abbreviation (ignoring Anglicized-Sanskrit numbers and capitalization). If some of the unknown abbreviations of pwbib_new are spelling variants of references already in the bibliography, this ordering should place them next to each other and so make the variants easier to spot. The handful of duplicates in the bibliography (which occur in different volumes of the text) will also be identified.

properrefs1.txt: There is a list properrefs.txt in pw_dhaval/abbrvwork/abbrvout/, which contains about 72000 instances of references, along with the headword in which each reference occurs. Our previous work has made it possible to match each of these proper references with a particular abbreviation now appearing in mergebibnew. The properrefs1 file makes this match explicit by adding to each instance an additional field, which is the matching abbreviation of mergebibnew.

bibnew_disp1.txt: This text file is an enhancement to mergebibnew which takes properrefs1 into account. In addition to slightly reformatting the display of the fields of the mergebibnew line items, it shows the number of cases (the count) from properrefs1 that match the abbreviation of the line item. Also, for the 'new' references, a list of up to 10 of the corresponding headword instances is shown. This may prove useful in tracking down some of the unknown references.

bibnew_disp2.txt: This file integrates one other source of information, the bibliographic entries for Monier-Williams (MW). These are interspersed among the mergebibnew records according to the MW spelling of the abbreviation. Because PW and MW follow somewhat different spelling conventions, some of these MW entries do not land in the ideal place in the file.
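
The sort order described for mergebibnew can be illustrated with a small sketch. This is not the code used by redo.sh. It assumes that the "Anglicized-Sanskrit numbers" are the digits that follow letters in the transliteration (as in Arjunasama1gama), and the function names are hypothetical.

```python
import re


def abbrev_sort_key(abbrev):
    """Hypothetical sort key: drop digits, then fold case.

    Assumption: the Anglicized-Sanskrit numbers are plain ASCII digits.
    """
    return re.sub(r"\d", "", abbrev).lower()


def merge_sorted(*abbrev_lists):
    """Merge several lists of abbreviations and sort them by the key above."""
    merged = [a for lst in abbrev_lists for a in lst]
    # sorted() is stable, so entries with equal keys keep their input order.
    return sorted(merged, key=abbrev_sort_key)


def adjacent_key_collisions(sorted_abbrevs):
    """Return neighbouring pairs whose keys are equal (possible variants)."""
    pairs = []
    for prev, cur in zip(sorted_abbrevs, sorted_abbrevs[1:]):
        if abbrev_sort_key(prev) == abbrev_sort_key(cur):
            pairs.append((prev, cur))
    return pairs


# Abbreviations taken from this thread, split arbitrarily into two lists.
merged = merge_sorted(["Arg4", "ANUPADAS"], ["ARG4", "ANUPADA"])
collisions = adjacent_key_collisions(merged)
```

Sorting on such a key puts spelling variants like 'Arg4' and 'ARG4' on adjacent lines, which is the effect the ordering is meant to have.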

There is a lot of data to examine here. Again, the main focus is to provide clues to what the 'title' should be for the 'new' PW bibliographic entries that have been uncovered by our matching regimen. Preliminary examination of bibnew_disp2 suggests that some fraction of the unknown titles may be readily inferred.
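
The count and the headword sample shown in bibnew_disp1 could be computed along the following lines. This is only a sketch: the real field layout of properrefs1.txt is not described here, so the input is assumed to be already parsed into (headword, matching abbreviation) pairs, and the function name is hypothetical.

```python
from collections import defaultdict


def summarize_refs(instances, max_headwords=10):
    """Count instances per abbreviation and keep the first few headwords.

    `instances` is an iterable of (headword, abbrev) pairs. This layout is
    an assumption, not the documented format of properrefs1.txt.
    """
    counts = defaultdict(int)
    samples = defaultdict(list)
    for headword, abbrev in instances:
        counts[abbrev] += 1
        if len(samples[abbrev]) < max_headwords:
            samples[abbrev].append(headword)
    return {abbrev: (counts[abbrev], samples[abbrev]) for abbrev in counts}


# Placeholder headwords (hw0, hw1, ...) stand in for real dictionary entries.
demo = [("hw%d" % i, "ARG4") for i in range(12)] + [("hw_x", "ANUPADA")]
summary = summarize_refs(demo)
```

Capping the sample at 10 headwords keeps each display line short while still giving a few places in the dictionary to look up an unknown abbreviation.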

Also, some of the unknown abbreviations may be suggestive to those of us familiar with the Sanskrit corpus.

gasyoun commented 8 years ago

New horizons, too wide, too bright. Let's get back to correction of headwords :walking:

funderburkjim commented 8 years ago

bibnew_disp2_edit.txt is intended to be a file that we edit to 'fill in the blanks' of the 'new' literary source references of PW.

Currently there are 287 of these, identified by the string title= in the file.

Only one of these is determined at the moment (at line 47 of the file).
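
A quick way to track progress on the file is to count the lines containing title= and how many of them have been filled in. The sketch below assumes that the title text, when known, follows title= on the same line. That layout is a guess, and the function name is hypothetical.

```python
def count_new_entries(lines, marker="title="):
    """Return (total, filled) for lines that contain `marker`.

    Assumption: an entry is 'filled' when non-blank text follows the marker
    on the same line. The actual file layout may differ.
    """
    total = 0
    filled = 0
    for line in lines:
        pos = line.find(marker)
        if pos == -1:
            continue
        total += 1
        if line[pos + len(marker):].strip():
            filled += 1
    return total, filled


# Made-up sample lines, only to show the expected shape of the input.
sample = [
    "ARG4 title=Arjunasama1gama",
    "ANUPADA title=",
    "a line without the marker",
]
result = count_new_entries(sample)

# Possible use on the real file:
# with open("bibnew_disp2_edit.txt", encoding="utf-8") as f:
#     print(count_new_entries(f))
```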

funderburkjim commented 8 years ago

A thesis by Jachertz (in pdf and digitized forms). Note - I had problems viewing the pdf via the browser. It displays properly with Adobe Reader.

Beginning at line 405 of the digitization, there is a list of works, probably collated from (pw = PWK), (PW=PWG), and maybe MW (?).

A very brief look suggests that some of our missing cases are likely mentioned here.

For instance, at line 563 of the digitization there is <p><b>Ar4g.</b> s. Arjunasama1gama, which provides a resolution for 'ARG4' at line 92 of bibnew_disp2_edit.txt.

This example also illustrates some of the problems of doing this collation: the Jachertz digitization misspells the abbreviation as 'Ar4g' instead of the pdf's 'Arg4'.
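
One way to work around misplaced digits of this kind is to compare abbreviations on a looser key that ignores where the digits sit. The following is a sketch of that idea, not part of the existing scripts. The function names are hypothetical, and it assumes the abbreviations are plain ASCII.

```python
import re


def loose_key(abbrev):
    """Key that ignores case, punctuation and the position of digits.

    Assumption: abbreviations are ASCII letters and digits, possibly with a
    trailing period.
    """
    letters = re.sub(r"[^A-Za-z]", "", abbrev).lower()
    digits = "".join(sorted(re.findall(r"\d", abbrev)))
    return letters, digits


def candidates(unknown, listed):
    """Return the listed abbreviations whose loose key equals that of `unknown`."""
    key = loose_key(unknown)
    return [a for a in listed if loose_key(a) == key]


# 'ARG4' is the unresolved abbreviation; 'Ar4g.' is the Jachertz spelling.
found = candidates("ARG4", ["Ar4g.", "Arjunasama1gama"])
```

Such a loose match only proposes candidates. Each one still needs to be checked against the pdf, since two distinct abbreviations could share the same letters and digits.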

gasyoun commented 8 years ago

> This example also illustrates some of the problems of doing this collation, for the Jachertz digitization mis-spells the abbreviation as 'Ar4g' instead of the pdf's 'Arg4'.

@funderburkjim So there are 287 cases to be checked in Jachertz?

funderburkjim commented 8 years ago

@gasyoun The 287 cases are the unresolved literary source abbreviations in PWK.

One of the sources we might use to resolve these cases is Jachertz.

The references in MW may also help us to resolve the PWK unknowns. That is why the MW references are interspersed in bibnew_disp2_edit.txt. For instance, the two unresolved cases 'ANUPADA' and 'ANUPADAS' are likely to be 'anupada-sūtra', which appears as an MW literary source.
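
The 'ANUPADA' example suggests a simple heuristic: treat an unknown abbreviation as a possible prefix of an MW title once case, hyphens and diacritics are ignored. The sketch below illustrates this. It is an assumption about how candidates might be proposed, not an existing tool, and the function names are hypothetical.

```python
import unicodedata


def fold(text):
    """Lower-case `text`, keeping letters only and dropping diacritics."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = [
        ch for ch in decomposed
        if ch.isalpha() and not unicodedata.combining(ch)
    ]
    return "".join(kept).lower()


def prefix_candidates(unknown, titles):
    """Return titles whose folded form starts with the folded abbreviation."""
    key = fold(unknown)
    if not key:
        return []
    return [t for t in titles if fold(t).startswith(key)]


# Titles taken from this thread.
titles = ["anupada-sūtra", "Arjunasama1gama"]
hits_a = prefix_candidates("ANUPADA", titles)
hits_b = prefix_candidates("ANUPADAS", titles)
```

A prefix match is weak evidence by itself, so any hit is a suggestion to verify rather than a resolution.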

gasyoun commented 8 years ago

Thanks. Just today I had an argument. One man told me that there are no unknown abbreviations in PW(G-K). I had to laugh.

funderburkjim commented 8 years ago

The man can easily prove us wrong by filling in the blanks in bibnew_disp2_edit.txt :)

sanskrit-lexicon / PWK

PW bib new work #61