Reorganize repository, PWG Bibliography

funderburkjim commented 8 years ago

In preparation for the pwg bibliography, I reorganized the top level of the repository to have two folders:

Russianwords - contains images and a text file pertaining to the Russian words in PWG
misc -
- subfolder 'convertwork' which describes the conversion of PWG to SLP1 transliteration
- PWG-accents.pdf which documents the coding of accents in PWG digitization

funderburkjim commented 8 years ago

Made a 'pwg_ls' folder at the top level of repository. This will hold the work pertaining to the pwg bibliography. The intention is to have a close formal relation between pwg_ls and the directory 'pw_ls' of the PWK repository.

funderburkjim commented 8 years ago

Made stub files and folders under pwg_ls:

pwg_dhaval - for extracting actual references from pwg.xml. NOTE: pwg.xml should NOT be kept in the repository but in a separate directory parallel to the top level of this repository.
pwgbib
- digitization
- pwgbib1_utf8.txt is the utf8-encoded digitization of the literary source abbreviations from volume 1 of PWG.
- readme.md contains descriptions of files, programs, and where corresponding images may be found.
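As a rough illustration of what "extracting actual references from pwg.xml" could involve, here is a minimal sketch. It assumes the references are the text of `<ls>` elements (which is how they are described later in this thread); the sample string and the name `extract_ls` are hypothetical and not taken from the repository.

```python
import re
from collections import Counter

# Matches the text inside <ls>...</ls>, also allowing attributes on the tag.
LS_RE = re.compile(r"<ls(?:\s[^>]*)?>(.*?)</ls>")

def extract_ls(xml_text):
    """Return the text of every <ls> element, in document order."""
    return LS_RE.findall(xml_text)

# Hypothetical fragment; the real pwg.xml is kept outside the repository.
sample = "<ls>P. 3, 1, 134.</ls> text <ls>SUŚR.</ls> more <ls>P. 3, 1, 134.</ls>"
refs = extract_ls(sample)
counts = Counter(refs)  # how often each distinct reference text occurs
```

A regular expression is used here only to keep the sketch short; a real pass over the file might prefer an XML parser.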

funderburkjim commented 8 years ago

It probably would be a good idea for someone (@gasyoun ?) to proofread the digitization (pwgbib1_utf8.txt) right at the start.

gasyoun commented 8 years ago

@funderburkjim 600 of non-IAST is hard. I can proofread, but only if it is in at least pseudo-IAST, please.

funderburkjim commented 7 years ago

@gasyoun I've spent today trying to prepare a good IAST version of pwgbib1 for you to work with. It is here.

My aim was to convert to 'modern' IAST, which of course differs in several details from the scheme in the printed PWG bibliography.

Details on the conversion are in this readme.md; it is best to look at the 'raw' file.

Thanks for help in proof-reading.

funderburkjim commented 7 years ago

Digitization of bibliography from vols 2,3 is now available here in IAST.

For links to corresponding images, see the readme.md.

gasyoun commented 7 years ago

https://github.com/sanskrit-lexicon/PWG/blob/master/pwg_ls/pwgbib/digitization/pwgbib23_roman.txt

Proofread pwgbib23_roman, fixed several minor mistakes (mostly conversion errors). Ready.

funderburkjim commented 7 years ago

@gasyoun Glad you caught these. Things like 'English' and 'Patanjali' were errors where the transcoding was misapplied. There's no good way to catch such cases other than by human intervention, such as you applied. I'll be on the lookout for similar misapplications in the other parts of the bibliography.

funderburkjim commented 7 years ago

Thomas found a few bibliographic entries in volume 4.

pwgbib4_roman has these in IAST.

Did a first proofread of this.

No scan image available at this moment.

gasyoun commented 7 years ago

Bharatiya-UpasargarthaChandrika-P1-1976.pdf

About MBh. and R. quote verification:

upasarga

drdhaval2785 commented 7 years ago

@funderburkjim,

There is a lot of material spread across this issue. Time to write a summary and organize something in a single comment / file?

gasyoun commented 7 years ago

There is a lot of material spread in this issue.

Not that much, actually :1st_place_medal:

funderburkjim commented 7 years ago

@drdhaval2785

Here's a first take on what is required to make the pwg literary source links possible.

What we are aiming for is to duplicate for PWG the final step of the displayprep directory for PW. Once we have a sortbib.txt for PWG, the Cologne server display logic should have what it needs.
- specifically, we need the analogue of sortbib.txt in this directory.
- the readme in the above PWK directory has a good description of the format of sortbib
The pwg_ls/pwgbib/digitization directory in this PWG repository has the work that has been done on the digitized bibliographies for PWG. They are arranged as:
- pwgbib1 (volume 1 of PWG)
- pwgbib23 (volumes 2,3)
- pwgbib4 (volume 4)
The readme in this PWG digitization directory has links to the scans from which the digitizations were prepared.
For each of these, there are several forms:
- pwgbib1_orig.txt (similar for volumes 2,3 and vol. 4) This is the digitization from Thomas.
- _utf8.txt - conversion from the cp1252 encoding from Thomas to UTF-8 encoding
- _roman.txt - conversion from the AS transliteration of IAST to Unicode IAST
I think the three X_roman.txt files are the only relevant ones.
The task thus resolves into writing one or more programs to parse the X_roman.txt files, and construct a file like sortbib.txt from these inputs.

There may be other issues that arise, but this looks like a reasonable summary of steps.
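The two conversion steps named above (cp1252 to UTF-8, then AS to Unicode IAST) can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the mapping lists just the two AS codes that are visible in lines quoted elsewhere in this thread (R2 and C2); the full table is not reproduced here.

```python
# Assumed AS -> IAST pairs; only the two visible in this thread are listed.
AS_TO_IAST = {"R2": "Ṛ", "C2": "Ś"}

def cp1252_to_utf8(raw):
    """Decode cp1252 bytes and re-encode them as UTF-8 bytes."""
    return raw.decode("cp1252").encode("utf-8")

def as_to_iast(text):
    """Replace each AS letter+digit code with its IAST character."""
    for code, char in AS_TO_IAST.items():
        text = text.replace(code, char)
    return text

utf8_bytes = cp1252_to_utf8(b"K\xf6niglichen")
print(utf8_bytes.decode("utf-8"))
print(as_to_iast("SUC2R."))
```

A blind string replacement like this is the kind of step that can be misapplied to ordinary words, so proofreading of the result remains necessary.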

Note: The extensive work we did in correlating the actual PWK literary source references to the printed references could also be done for the PWG references. However, it seems better to defer this work, which will likely have a similar degree of complexity for PWG as it had for PWK. Let's focus on making use of the digitized printed references for PWG, as described above.

funderburkjim commented 7 years ago

@drdhaval2785 If you work on this, I suggest you add a displayprep directory and do the construction of sortbib.txt for PWG therein.

gasyoun commented 7 years ago

degree of complexity for PWG as it had for PWK. Let's focus on making use of the digitized printed references for PWG, as described above.

Indeed. So it's more about coding than real research at this stage.

So Thomas has done nothing for vols. 5, 6, 7?

NYĀYAMĀLĀV. = NYĀYAMĀLĀVISTARA, nach Anführungen bei MUIR, Sans- [Page1219-1b+ 21] krit Texts.

For reference purposes [Page1219-1b+ 21] has no meaning, I guess.
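If the page markers are to be dropped for reference purposes, one possible cleanup is sketched below. It assumes every marker has the bracketed form `[Page...]` seen above, and that a hyphen directly before a marker is a word break that should be rejoined; both are assumptions, and `strip_page_marks` is a hypothetical name.

```python
import re

PAGE_MARK = r"\[Page[^\]]*\]"

def strip_page_marks(text):
    # Assumption: a hyphen right before the marker is a word that was
    # split across the page break, so the two halves are joined.
    text = re.sub(r"-\s*" + PAGE_MARK + r"\s*", "", text)
    # Any other marker is replaced by a single space.
    text = re.sub(r"\s*" + PAGE_MARK + r"\s*", " ", text)
    return text.strip()

entry = ("NYĀYAMĀLĀV. = NYĀYAMĀLĀVISTARA, nach Anführungen bei MUIR, "
         "Sans- [Page1219-1b+ 21] krit Texts.")
print(strip_page_marks(entry))
```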

funderburkjim commented 7 years ago

@drdhaval2785 Are you planning to work on this ? If not, I think I may tackle it next week.

funderburkjim commented 7 years ago

@drdhaval2785 Since you haven't commented, I'm assuming that you are involved with other things, like stardict.

So, I'll begin working to get literary source links for PWG.

drdhaval2785 commented 7 years ago

I am sorry to have kept this unanswered. Please go ahead. I will not be able to handle it now.

funderburkjim commented 7 years ago

pwgbib14 contains the digitization of the literary source textual material for PWG. It has been formatted with the aim of making correspondences to the actual literary source instances within the pwg.xml digitization.

The readme in the digitization directory describes the details of the coding of pwgbib14 (at the bottom of the readme).

The <HI code="xxx"> lines indicate the different entries; there are 426 of these.

By contrast there are on the order of 9000 different actual proper reference forms in pwg.xml.

So, the next task is to make a good first approximation to matching actual forms to the codes in pwgbib14.
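To make the entry structure concrete, here is a minimal sketch of reading the `<HI code="xxx">` lines into a lookup table. It assumes each entry sits on a single line and uses a sample line from an entry quoted elsewhere in this thread; `parse_entries` is a hypothetical name, and the readme remains the authority on the actual coding.

```python
import re

HI_RE = re.compile(r'<HI code="([^"]*)">(.*)')

def parse_entries(lines):
    """Map each code to its entry text; <lb> line-break marks are kept."""
    entries = {}
    for line in lines:
        m = HI_RE.search(line)
        if m:
            entries[m.group(1)] = m.group(2)
    return entries

sample = [
    '<HI code="Z. d. d. m. G.">Z. d. d. m. G. = Zeitschrift der Deutschen '
    'morgenländischen Gesell-<lb>schaft. Leipzig.',
    'a line without an entry marker',
]
entries = parse_entries(sample)
```

The keys of such a table are the codes that actual reference forms from pwg.xml would need to be matched against.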

funderburkjim commented 7 years ago

That first approximation is now available in the PWG displays. Check it out!

Here is a brief summary of the approach taken:

pwgls.txt is constructed from the latest version of pwg.xml. It contains a mapping from the text of <ls> elements to pwg bibliography records. There is a lot of approximation going on here.
- The code is in the abbrvwork directory of this repository.
- The preparatory programs (as described in the readme) construct intermediate files in the abbrvoutput directory; due to file size concerns, this abbrvoutput directory is in the .gitignore for the repository.
- Some summary statistics indicating scope of coverage:
  - abbrvlist -- 414,000 instances of <ls> in pwg.xml
  - properrefs - 344,000 instances that are 'properly formed'. 70,000 cases are excluded because the <ls> text begins with numbers, parentheses, etc.
  - cleanrefs - 9,300 These are the unique instances of properrefs AFTER CLEANING. Cleaning means typically changing something like P. 3, 1, 134. to P.
  - Now the cleanrefs are matched against the 422 entries from pwgbib14 (digitization of bibliography sections of the text)
    - about 4700 of the 9300 cleanrefs are successfully matched, accounting for 310,000 of the 344,000 properrefs instances.
    - about 4600 of the cleanrefs remain unmatched, accounting for 34,000 of the properrefs instances.
pwg.xml modification: using pwgls.txt, the construction of pwg.xml is altered to add an 'n' attribute to those <ls> elements that are matched, e.g. <ls n="1.230">P. 3, 1, 134.</ls> indicates that this ls-element refers to the 230th entry of the volume 1 PWG Bibliography, namely PĀṆINI'S acht Bücher grammatischer Regeln (GILD. Bibl. 244).
An html version of a display of the pwg bibliography is prepared. For each of the entries, there is an anchor (e.g. <a name="1.230"> for Panini).
The display code (disp.php) for pwg is modified to construct a hyperlink for displayed items. For instance if you use a display for PWG and request the information for deva,

If you now click on the first link, to P. 3, 1, 134. , then a window pops up for the PWG bibliography, scrolled to the right spot:
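The clean, match, and annotate steps summarized above can be sketched roughly as follows. This is not the code in the abbrvwork directory: the cleaning rule here only drops a trailing numeric locator, the bibliography excerpt holds just the one entry cited above, and the function names are hypothetical.

```python
import re

# Hypothetical excerpt of the bibliography: cleaned abbreviation -> entry id.
BIB = {"P.": "1.230"}

def clean_ref(ls_text):
    """Drop the trailing numeric locator, e.g. 'P. 3, 1, 134.' -> 'P.'."""
    return re.sub(r"\s*\d[\d,.\s]*$", "", ls_text)

def annotate(xml_text):
    """Add n="..." to every plain <ls> element whose cleaned text matches."""
    def repl(m):
        entry_id = BIB.get(clean_ref(m.group(1)))
        if entry_id is None:
            return m.group(0)  # unmatched: leave the element unchanged
        return '<ls n="%s">%s</ls>' % (entry_id, m.group(1))
    return re.sub(r"<ls>(.*?)</ls>", repl, xml_text)

print(annotate("<ls>P. 3, 1, 134.</ls> und <ls>Sp. 12.</ls>"))
```

Unmatched elements are left untouched, which mirrors the situation described above where a share of the cleaned references has no bibliography entry yet.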

funderburkjim commented 7 years ago

The main work remaining to be done is to improve the coverage of the matching. This will involve making corrections to PWG, which the work just described has not addressed. I'll mention this in a new issue so we'll remember it is on our todo list.

I think this particular issue can be safely closed.

gasyoun commented 7 years ago

P. 3, 1, 134.

Linking to Panini (the real book reference, not just the abbreviation) is as simple as https://github.com/sanskrit-lexicon/Cologne/issues/93; @drdhaval2785 will agree. There are some books where the linking is easy. @juhnowski if you ask me, I would think about such corpora things first; UI comes next, because it's a long story and there is no urgency.

After reading the newest documentation I can only say - if God would have forgotten how he created the Earth, Jim would write a summary on that as well. After reading it, one could redo the whole thing again and again.

The links did not work in Chrome for me; nothing opened. And my AdBlock kept silent.

(screenshot: 2017-03-07_17-51-05)

gasyoun commented 7 years ago

I do not understand how to help. What exactly and in what file to do. I opened matchcrefs:

12@No Match@Comm.)@Comm.)
1@No Match@BALLANTYNE:@BALLANTYNE:

Is Comm.) supposed to be not equal to Comm.), or what? Or does it mean that all parentheses should be removed before matching, so that additional cleanup is needed?

1@No Match@Comm.) BṚH.@Comm.) BR2H.

I can hardly see what can be done here. The only thing I can think of is that Comm.) may belong with an abbreviation before it, not after it, and that the connection as it stands is coincidental.

1@Match ~2 1.033 BENF. Chr.@BENFEY verbessert hat). SUŚR.@BENFEY verbessert hat). SUC2R.
1@Match ~2 1.033 BENF. Chr.@BENFEY annimmt) DAŚAK. in BENF. Chr.@BENFEY annimmt) DAC2AK. in BENF. Chr.
1@No Match@Auge SUŚR.@Auge SUC2R.
1@No Match@Ausg.@Ausg.
Ausgabe = Edition

verbessert hat and annimmt are not part of the abbreviation; they are just German text.

funderburkjim commented 7 years ago

Regarding PWG links not working in Chrome.

I'm also using Chrome, and the links work fine. I also have an ad blocker (UBlock Origin). Maybe open developer tools and see if there is any reason given.

funderburkjim commented 7 years ago

Regarding 'Comm.,.' in matchcrefs. This may be referring to an un-named commentary on BṚH..
How to handle this is unclear. Most obvious would be to just link to BṚH. ĀR. UP. in pwgbib.txt. If so, then one solution would be to make a correction to pwg digitization to change the scope of the ls tag. For example from <ls>Comm.) BṚH.</ls> to Comm.) <ls>BṚH.</ls>.
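The proposed change of tag scope can be illustrated with a small substitution. This is a sketch only: it handles exactly the 'Comm.) ' prefix from the example, and `rescope_comm` is a hypothetical name rather than anything in the repository.

```python
import re

def rescope_comm(xml_text):
    """Move a leading 'Comm.) ' from inside the <ls> tag to just before it."""
    return re.sub(r"<ls>(Comm\.\)\s*)", r"\1<ls>", xml_text)

print(rescope_comm("<ls>Comm.) BṚH.</ls>"))
```

After such a change the remaining ls text is just the abbreviation, so it can be matched like any other reference.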

While the details of the change you could leave to me, you could help by indicating what the link should point to.

The BALLANTYNE instance is different. The only mention of this author in pwgbib is as editor of PAT. YOGAŚ. So, it might be that that should be the link.

In terms of priorities, I would use the 'count' field as a guide. For instance, it would make sense to find solutions for those 35 'No Match' cases where there are 100+ instances.

If you actively want to work on these, maybe you should have the ancillary datafiles in the abbrvoutput directory that I have thus far excluded from Git coverage. For instance, the abbrvlist file has every <ls> instance, in dictionary order, and includes the headword and L-number; thus, with this one could examine the context of the ls-element. Please advise if you need this now. Total size of abbrvoutput directory is 32MB.

gasyoun commented 7 years ago

one solution would be to make a correction to pwg digitization to change the scope of the ls tag.

I would go for it.

The BALLANTYNE instance is different. The only mention of this author in pwgbib is as editor of PAT. YOGAŚ. . SO, it might be that that should be the link.

Makes sense.

it would make sense to find solutions for those 35 'No Match' cases where there are 100+ instances.

Can you order them in order of priority, please?

Total size of abbrvoutput directory is 32MB.

Please share it.

funderburkjim commented 7 years ago

abbrvoutput directory now uploaded.

Can you order them in order of priority?

Sure: see the discussion in #22 re matchrefs, and the regex therein. Edit the matchcrefs file locally, select the lines with the regex. There are 35 selected lines of No Matches. Order these 35 by size of first field (count). There are a few with 1000+ --- these are first to examine and resolve.
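For anyone who prefers a script to editing the file by hand, the selection and ordering could be sketched like this. It assumes the '@'-separated layout of the matchcrefs lines quoted earlier in this thread (count, status, then the reference forms) and uses a plain status test as a stand-in for the regex from #22, which is not reproduced here. The function names are hypothetical.

```python
def parse_line(line):
    """Split one record into (count, status, remaining fields)."""
    count, status, *rest = line.split("@")
    return int(count), status, rest

def prioritized_no_matches(lines, min_count=100):
    """Return 'No Match' records with count >= min_count, largest first."""
    records = [parse_line(l) for l in lines if l.strip()]
    hits = [r for r in records if r[1] == "No Match" and r[0] >= min_count]
    return sorted(hits, key=lambda r: r[0], reverse=True)

# Records quoted earlier in this thread; their counts are small, so a low
# threshold is used for the demonstration.
sample = [
    "1@No Match@BALLANTYNE:@BALLANTYNE:",
    "12@No Match@Comm.)@Comm.)",
    "1@Match ~2 1.033 BENF. Chr.@BENFEY verbessert hat). SUŚR.@BENFEY verbessert hat). SUC2R.",
]
for record in prioritized_no_matches(sample, min_count=1):
    print(record)
```

With the default threshold of 100, only the high-count cases would be listed, with the 1000+ ones at the top.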

gasyoun commented 7 years ago

Jim, I'm too dumb. I do not get it.

185 No Match Sp. Sp.

What's wrong with Sp.? Sp. stands for Sprüche, a well-known book, so what to do with it? How to make it match?

(screenshot: nomatch)

funderburkjim commented 7 years ago

There may be nothing wrong with Sp. The problem likely is that the PWG bibliography pwgbib.txt does not have this reference which you recognized.

The solution in such a case would be to generate a synthetic new entry for pwgbib.

It is also possible that some items missing from the pwg bibliography are present in the pwk bibliography (e.g., in sortbib).

gasyoun commented 7 years ago

Ok, so 1) checked pwgbib.txt (none) 2) checked sortbib.txt (none).

What do I do next? I know that Sp. = https://www.worldcat.org/title/indische-spruche-sanskrit-und-deutsch-herausgegeben-von-o-bohtlingk/oclc/557531710&referer=brief_results is meant. Where and how to note it? I will start one by one, but I still do not get how to work in batch mode.

gasyoun commented 7 years ago

Different case, where I see a match.

pwgbib.txt has none:
1.314 <HI code="Verz. d. B. H.">Verz. d. B. H. = WEBER'S Verzeichniss der Berliner Sanskrit-Hand-<lb>schriften. Bildet den ersten Band von: Die Handschriften-Verzeich-<lb>nisse der Königlichen Bibliothek, herausgegeben von dem König-<lb>lichen Oberbibliothekar, Geheimen Regierungsrath Dr. PERTZ. Berlin<lb>1853. 8º.
1.315 <HI code="Verz. d. Kopenh. H.">Verz. d. Kopenh. H. = WESTERGAARD'S Verzeichniss der Kopenhagener<lb>Sanskrit-Handschriften in: Codices orientales bibliothecae regiae<lb>Havniensis yussu et auspiciis regis Daniae augustissimi Christiani<lb>octavi enumerati et descripti. Pars prior, codices indicos continens.<lb>Havniae 1846. 4º.
1.316 <HI code="Verz. d. Pet. H.">Verz. d. Pet. H. = BÖHTLINGK'S Verzeichniss der Petersburger Sanskrit-<lb>Handschriften in: DORN, das Asiatische Museum der Kais. Akad. der<lb>Wiss. St. Petersburg 1846. S. 720. fgg.

sortbib.txt has 1 Verz.d.Oxf.H 1291 AUFRECHT, Verzeichniss der Oxforder Handschriften.

Only Verz.d.Oxf.H has no dot at the end (matchrefs has it like Verz.d.Oxf.H.), I guess that matters?
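If the trailing period (or the spacing) does get in the way of an exact string comparison, one possible workaround is to normalize both sides before comparing. This is only a sketch; whether the matching program does anything like this is not stated in this thread, and `normalize_abbrev` is a hypothetical name.

```python
def normalize_abbrev(abbrev):
    """Remove all whitespace and a single trailing period."""
    compact = "".join(abbrev.split())
    return compact[:-1] if compact.endswith(".") else compact

# The two spellings from sortbib.txt and from the match file compare equal.
print(normalize_abbrev("Verz.d.Oxf.H.") == normalize_abbrev("Verz.d.Oxf.H"))
print(normalize_abbrev("Verz. d. B. H."))
```

Normalizing in this way would also align spaced forms such as those in pwgbib.txt with the unspaced forms in sortbib.txt.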

gasyoun commented 7 years ago

S. and Z.

S. (Seite) = page. Z. (Zeile) = line. They are not real entities, but should be added as sub-entities and marked as such. Jim, should we add a new level to reference entries, such as the page and line where the quote occurs?

pwgbib.txt has none:
1.334 <HI code="Z. d. d. m. G.">Z. d. d. m. G. = Zeitschrift der Deutschen morgenländischen Gesell-<lb>schaft. Leipzig.
1.335 <HI code="Z. f. d. K. d. M.">Z. f. d. K. d. M. = Zeitschrift für die Kunde des Morgenlandes. Göt-<lb>tingen (Bd. I--III) und Bonn (Bd. IV--VII).
1.336 <HI code="Z. f. d. W. d. Spr.">Z. f. d. W. d. Spr. = Zeitschrift für die Wissenschaft der Sprache.<lb>Herausgegeben von Dr. A. HOEFER.
1.337 <HI code="Z. f. vgl. Spr.">Z. f. vgl. Spr. = Zeitschrift für vergleichende Sprachforschung auf dem<lb>Gebiete des Deutschen, Griechischen und Lateinischen herausgege-<lb>ben von Dr. THEODOR AUFRECHT und Dr. ADALBERT KUHN. Berlin.

sortbib.txt has both S. (Seite) and Z. (Zeile)

funderburkjim commented 7 years ago

@gasyoun This issue is closed. Let's move this discussion to #22.

sanskrit-lexicon / PWG

Reorganize repository, PWG Bibliography #20