Editing changes to pwbib0.txt

funderburkjim commented 8 years ago

pwbib0.txt starts as a copy of pwbib_utf8.txt.

A few editing changes are made to facilitate further programmatic manipulation.

funderburkjim commented 8 years ago

Relevant bibliographic lines are identifiable because they start with one of two markers:

'+.' indicates some checking has been done
'.' checking not done

Other lines in the file can be ignored by programmatic parsers.

Most relevant lines end with the volume in which the entry appears, in the format '(vol. N)', where N is 1 to 6.
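A minimal sketch of the two checks just described (is a line relevant, and which volume does it end with). The function names and the regular expression are illustrative assumptions, not taken from pwbib_parse0.py; the optional space in the pattern is there only so that a spelling like '(vol.1)' is also accepted.

```python
import re

# Assumed pattern: the volume indicator closes the line, with N in 1..6.
# "\s?" also tolerates the "(vol.1)" spelling without a space.
VOLUME_RE = re.compile(r"\(vol\.\s?([1-6])\)\s*$")


def is_relevant(line):
    """A bibliographic line starts with '+.' (checked) or '.' (not checked)."""
    return line.startswith("+.") or line.startswith(".")


def volume_of(line):
    """Return the volume number as an int, or None if no indicator is found."""
    match = VOLUME_RE.search(line)
    return int(match.group(1)) if match else None


sample = "+.JOGAT.UP. == JOGATATTVOPANISHAD in der Bibl. ind. (GELDNER und ROTH) (vol. 1)"
print(is_relevant(sample), volume_of(sample))
```

Lines for which `volume_of` returns None are the candidates for the manual volume-indicator edits.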

A few relevant lines have slight variants of the volume indicators. These are changed, as follows:

84 old +.DEVATA7DHJ.BRA7HM. == DAIV. BR. vol. 1 <NF>
84 new +.DEVATA7DHJ.BRA7HM. == DAIV. BR. <NF> (vol. 1)

95 old +.G4AIM.BHA7R. == G4AIMINI'S BHA7RATA, nach Citaten in Ind. St. vol. 1 NF
95 new +.G4AIM.BHA7R. == G4AIMINI'S BHA7RATA, nach Citaten in Ind. St. <NF> (vol. 1)

96 old +.GAL. == GALANO's Wörterbuch, Abschrift von WEBER;vgl. Monatsbericht der Kön. Pr. Akad. der Wissensch. 1876, S. 801. fgg. (CAPPELLER).
96 new +.GAL. == GALANO's Wörterbuch, Abschrift von WEBER;vgl. Monatsbericht der Kön. Pr. Akad. der Wissensch. 1876, S. 801. fgg. (CAPPELLER). (vol.1)

134 old +.JOGAT.UP. == JOGATATTVOPANISHAD in der Bibl. ind. (GELDNER und ROTH)
134 new +.JOGAT.UP. == JOGATATTVOPANISHAD in der Bibl. ind. (GELDNER und ROTH) (vol. 1)

funderburkjim commented 8 years ago

Only two cases (lines 84, 95) have the <NF> markup.

@thomasincambodia What does this 'NF' mean?

funderburkjim commented 8 years ago

A program determines that the relevant records have a regular structure that is easily parsed.

python pwbib_parse0.py pwbib0.txt

Lines 1-8, and lines 511-538 are 'not relevant' lines (line 538 is the last line of file).

Thus, there are 502 bibliographic entries.

The program parses these into four fields:

checked (line starts with +.)
abbreviation : in AS transliteration, at least for abbreviation of Sanskrit works
title : the expansion of the name of the work, as it appears in scans. This also contains some AS transliteration
volume number (1-6)
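The four fields above could be extracted roughly as follows. This is a sketch under assumptions: `BibEntry`, `parse_entry` and the regular expression are hypothetical names, and the actual logic of pwbib_parse0.py may differ.

```python
import re
from typing import NamedTuple, Optional


class BibEntry(NamedTuple):
    checked: bool           # True when the line starts with '+.'
    abbreviation: str       # left of ' == ', still in AS transliteration
    title: str              # right of ' == '
    volume: Optional[int]   # 1-6, or None if no indicator was found


# Assumed shape of the trailing volume indicator.
VOLUME_RE = re.compile(r"\s*\(vol\.\s?([1-6])\)\s*$")


def parse_entry(line):
    """Split one relevant line into the four fields."""
    checked = line.startswith("+.")
    body = line[2:] if checked else line[1:]   # drop the '+.' or '.' marker
    match = VOLUME_RE.search(body)
    volume = int(match.group(1)) if match else None
    if match:
        body = body[:match.start()]            # drop the volume indicator
    abbreviation, _, title = body.partition(" == ")
    return BibEntry(checked, abbreviation.strip(), title.strip(), volume)


print(parse_entry("+.KA7D. == KA7DAMBARI, Calcutta Sam~vat 1919 (KERN). (vol. 1)"))
```

If a line has no ' == ', this sketch leaves the whole body in `abbreviation` and the title empty.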

funderburkjim commented 8 years ago

Next steps might be :

compare with the list in pw_dhaval
generate a form where AS transliteration is converted to Unicode.

I plan to work on these soon.

gasyoun commented 8 years ago

Not sure what the comparison with pw_dhaval should bring up, but the Unicode form I'm sure I will appreciate. Let me know if any signs are still undeciphered.

funderburkjim commented 8 years ago

There is an irregularity in 82 of the records. This is an irregularity of the text (for volume 1, http://www.sanskrit-lexicon.uni-koeln.de/scans/PWScan/index.php?sfx=jpg&vol=1, similarly for volumes 2-6. The lists of works are part of the preface in each case.)

The 'usual' form of the records is xxx == yyy , where xxx is usually the abbreviation, and yyy is the expansion. Thus, the == helps to distinguish the two parts of abbreviation and title.

For the 82 'irregular' lines, there is no == in the text.

For these cases, I have assumed that the abbreviation consists of that part of the line up to the first space character. This permits the programs to proceed.
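The fallback rule can be sketched like this. The function name is hypothetical, the returned '==' / 'xx' marker simply records which rule was applied, and the irregular sample line is invented for illustration rather than copied from pwbib0.txt.

```python
def split_abbreviation(body):
    """Return (marker, abbreviation, title) for an entry body.

    marker is '==' when the text contains the separator, and 'xx' when the
    abbreviation had to be guessed as everything up to the first space.
    """
    if " == " in body:
        abbreviation, _, title = body.partition(" == ")
        return "==", abbreviation.strip(), title.strip()
    abbreviation, _, title = body.strip().partition(" ")
    return "xx", abbreviation, title.strip()


print(split_abbreviation("GAL. == GALANO's Wörterbuch, Abschrift von WEBER"))
# Invented irregular line (no '==' in the text):
print(split_abbreviation("XYZ. EXAMPLE TITLE without separator"))
```

Abbreviations that themselves contain a space would be cut short by the fallback, which is one reason these lines need manual review.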

It will be necessary to manually adjust at least these cases before a final form is obtained.

Some aid in identifying what needs adjustment will come when we try to match the abbreviations in pwbib to those seen in the text in pw_dhaval.

funderburkjim commented 8 years ago

The printed text has some lower and upper case 'm' characters with a tilde above.

The digitization codes these as m~ and M~ sometimes, and sometimes it codes it as M5 or m5.

I could not find a Unicode representation for these m with tilde. So replaced them with a 'dot above' : Ṁ and ṁ.

This was accomplished by modifying 'as_roman.xml' for this, from the form it has in pw_dhaval.

funderburkjim commented 8 years ago

Conversion of pwbib0 to unicode is done by:

python pwbib1.py pwbib0.txt pwbib1.txt

The output file is separated into 7 tab-delimited fields:

abbreviation (in AS transliteration)
3-digit sequence number
'+' or '-' indicating that some checking has been done, or not done
'==' or 'xx' to indicate that the text has the '==' or not, as described above
volume number (1 to 6)
abbreviation (in Unicode)
title (in Unicode)
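For anyone reading pwbib1.txt programmatically, the seven-field layout might be handled as below. The field names and both helper functions are assumptions made for this sketch; only the order and meaning of the fields come from the list above.

```python
# Hypothetical names for the 7 tab-delimited fields, in file order.
FIELDS = ("abbrev_as", "seq", "checked", "separator",
          "volume", "abbrev_unicode", "title_unicode")


def format_record(abbrev_as, seq, checked, has_separator, volume,
                  abbrev_unicode, title_unicode):
    """Build one output line with 7 tab-delimited fields."""
    return "\t".join([
        abbrev_as,
        "%03d" % seq,                      # 3-digit sequence number
        "+" if checked else "-",           # checking done / not done
        "==" if has_separator else "xx",   # text has '==' or not
        str(volume),
        abbrev_unicode,
        title_unicode,
    ])


def parse_record(line):
    """Split one output line back into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


row = format_record("VARA7H.BR2H.", 281, True, True, 1,
                    "VARÂH.BṚH.", "VARÂHAMIHIRA'S BṚHAG7G7ÂTAKA (KERN).")
print(parse_record(row)["title_unicode"])
```

Checking the field count on read is a cheap way to catch titles that accidentally contain a tab.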

Someone needs to do some proof-reading of this, to see if there are any systematic adjustments that can be made by the transcoding to unicode, or for any other reason.

Before starting with manual adjustments, let's see if any systematic adjustments are warranted.

gasyoun commented 8 years ago

I could not find a Unicode representation for these - sure, because there is none. No m with tilde, and that is where the unicode game goes wrong. I would use http://www.fileformat.info/info/unicode/char/1d6f/index.htm (https://en.wikipedia.org/wiki/Tilde no good) and make it look about in our font - best solution I can think of. Just replacing with different symbol - good idea? Where is the pwbib1.txt file? All I have is http://drdhaval2785.github.io/pw/abbrv/display.html

funderburkjim commented 8 years ago

re 'where is the pwbib1.txt file?' In case you haven't already found it, here it is.

funderburkjim commented 8 years ago

Regarding 'u1d6f' : I don't think this should be used, since the tilde is not above the small m, but at least according to the fileformat link, appears in the middle.

Since the function of the m-tilde in the pwk titles appears to be as an anusvara, the 'dot above' is not a bad solution. In effect the m-dot IS a special symbol. I think it works fine.

gasyoun commented 8 years ago

the m-dot IS a special symbol - indeed, but one that is in conflict, because becomes visually equal with other dictionary standards. tilde is not above the small m - will be in our font. In unicode we will have to cheat. It can't have all of the AS magic anyway. pwbib1.txt needs an IAST column. PWK transliteration was outdated even when the dictionary was printed. Agree, Jim?

funderburkjim commented 8 years ago

Reg 'pwbib1.txt needs an IAST column' The pwbib1.txt DOES have an IAST column (the last two columns, in fact, one for abbreviation and one for text.)

maltenth commented 8 years ago

@funderburkjim
NF means 'Not Found' and usually refers to an unsuccessful attempt to verify a source

gasyoun commented 8 years ago

@thomasincambodia it's a good feeling to see the father of Cologne Dictionaries here. I welcome you - hope we can continue the work you have started years ago. There are many unanswered questions regarding the files, so your help and advice would benefit all of us.

If I get it right, @thomasincambodia created C:\SANSKRIT\BOEHTLIN\PWBIB.ALL on 03.11.03. The main list consists of 502 entries, but there is an additional list (23 cases) at the end. It means that these entries are somehow bad. Bad because of 1) a mistake: C2A7M5KH. statt [instead of] C2A7N5KH., 2) a person instead of a book: BHA7RADVA7G4A ?Person, 3) detective cases: Ind. für [for] Ind.St.?. Full list:

Nicht o. falsch aufgelistete Einträge [Entries not listed or wrongly listed]

AC2VAV. VIKRAM. AMR2TAN.UP. und AMR2TAB.UP. = AMR2T.UP. A7NANDAL. ANUKRAM. AV.PRA7T. BHA7RADVA7G4A ?Person BHA7GURI ?Person MA7DHAVI7JADHA7T. MA7DHAVI7JADHA7TUVR2TTI BHAR.NA7T2JAC2. sollte BHA7R.NA7T2JAC2. sein. Im pw mal so mal so BR2HASPATI Person? C2ABDAR. C2A7M5KH. statt C2A7N5KH. C2AM5K. statt C2AN5K. SA7DHANAM GOVIND. HULTZSCH BHAR. HA7RI7TA WHITNEY,Ind. Ind. für Ind.St.? K4AKRAD. = K4AKR.

There are literally issues in every line.

Spr. 252 - == 1 Spr. Indische Sprüche, herausg. von O. BÖHTLINGK. 2te Aufl. Von 7614 an in Melanges asiatiques, T. VIII, S. 217. fgg. Ebendaselbst S. 203. fgg. stehen die durch "zu Spr. " bezeichneten Varianten.

Spr. and Spr. II are two totally different 3-volume works. Combining them makes no sense, because the numbers of the Sprüche they refer to are totally different. This is overoptimisation.

VARA7H.BR2H. 281 + == 1 VARÂH.BṚH. VARÂHAMIHIRA'S BṚHAG7G7ÂTAKA (KERN).

Can we eliminate AS rudiments (like G7G7) from last column as well?

VS. PRA7T. 299 - == 1 VS. PRÂT. PRÂTIÇÂKHJA zu VS in Ind. St. 4. MITÂ??SHABÂ. Bei zwei Zahlen ist der VJAVAHÂRÂDHJÂJA, Calcutta 1829, gemeint, bei zwei Zahlen mit folgendem {%a%}oder{%b%} nebst Angabe der Zeile -- das vollständige Werk in 4to.

PRÂTIÇÂKHJA is not IAST, Jim. It does not even have a name. It's dead. Same as Mueller's transliteration was by 1911. Rudolf von Roth used it in 1846 in his Zur Litteratur und Geschichte des Weda and Die älteste Wedengrammatik oder die Prâtiçâkhja Sûtren. It should be Prātiçākhya (and still would be in France and Russia), but for the sake of unification and MW standard, let's have it Prātiśākhya. That is the way it's quoted in Taittirīya-Prātiśākhya by P. Scharf included. PRÂTIÇÂKHJA is not IAST. Prātiśākhya is IAST. We do not use Â nowadays, we use Ā. And that makes a difference. oudl.osmania.ac.in/bitstream/handle/OUDL/11279/214181_Priyadarsika_A_Sanskrti_Drama_By_Harsha_Vol_X.pdf the scan I could not download.

For the same reason

BHÂVAPRAḰÂÇA, Calcutta 1875 und Hdschr. (ROTH). is not identical with Bhāvaprakāśa, Calcutta 1875 und Hdschr. (ROTH). Because the 2nd one you can google. The first will give no results at all. And as we are gathering data about the sources, finding metadata on them will not hurt. That is why I'm for IAST. We have none here, Jim.

Much of the missing data is easy to regain

PRIJADARÇIKÂ, Calcutta 18?? (CAPPELLER). is PRIJADARÇIKÂ, Calcutta 1876 (CAPPELLER).

funderburkjim commented 8 years ago

re 'PRÂTIÇÂKHJA is not IAST, Jim. It does not even have a name'

Right, I agree it's hard to work with.

Find me a list of the changes that need to be made to convert it to more useful IAST, and I can create a specialized transcoding file to generate a version with more useful IAST.

@gasyoun You can help by suggesting what to convert. e.g. in the PRÂTIÇÂKHJA example, there is at least one change Ç -> Ś. Probably also 'J' -> Y . There also may be some 'AS' forms (letter-number) still remaining that the as_roman transcoder file left unchanged, and the proper mapping for these would be needed.

So provide a list in the form: 'old new' (or 'old -> new') .
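As an illustration of how such an 'old new' list could drive a transcoding pass, here is a hedged sketch. It does not reproduce the as_roman.xml transcoder format; `load_rules` and `make_transcoder` are hypothetical helpers, and the three rules shown are only the ones already mentioned in this thread.

```python
import re

# Rules in the requested form, one per line: 'old new' or 'old -> new'.
RULES_TEXT = """
Â -> Ā
Ç -> Ś
J Y
"""


def load_rules(text):
    """Parse 'old new' / 'old -> new' lines into a dict."""
    rules = {}
    for line in text.splitlines():
        parts = line.replace("->", " ").split()
        if not parts:
            continue                      # skip blank lines
        if len(parts) != 2:
            raise ValueError("cannot parse rule: %r" % line)
        rules[parts[0]] = parts[1]
    return rules


def make_transcoder(rules):
    """Return a function applying all rules in one pass, longest 'old' first."""
    keys = sorted(rules, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(key) for key in keys))
    return lambda text: pattern.sub(lambda m: rules[m.group(0)], text)


to_iast = make_transcoder(load_rules(RULES_TEXT))
print(to_iast("PRÂTIÇÂKHJA"))
```

A blanket J -> Y rule would also rewrite German text and personal names in the titles (for example JOLLY), so a real conversion would need to restrict where it applies.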

gasyoun commented 8 years ago

==== Corrections ====

NIDÂNAS (ÛTRA) (A. WEBER). -> NIDÂNA (SÛTRA) (A. WEBER).
DATTAKAÇ (IROMAṆI), Calcutta 1867 (JOLLY). -> DATTAKA (ÇIROMAṆI), Calcutta 1867 (JOLLY).

ÇṀKARAVIG7AJA in der Bibl. ind. - ÇṀK is fishy, some converting issue.
ANUKRAMAṆIḰ zu ṚV -> ANUKRAMAṆIKÂ zu ṚV
GIT. -> GÎT.
GITAGOVINA, Ausg -> GÎTAGOVINDA, Ausg

ÇRIP. -> ÇRÎP.
ÇRIPATI. -> ÇRÎPATI.

NIL. -> NÎL
NILAK. mit einer Zahl -> NÎLAK. mit einer Zahl
NILAK. -> NÎLAK.
NILAR. Up. -> NÎLAR. Up.

==== Converting ====

Â Ā Û Ū ; ÂÇVALÂJAN'S GṚHJASÛTRA;Ausg. von STENZLER. Î Ī Ç Ś J Y Ḱ C N7 N5 Ń Ñ SH Ṣ G7 G4 J M7 M5 M̃ ; ÂNADAGIRI, Glossator zu ÇAM7KARÂḰÂRJA'S Comm. zu BṚH. ÂR Up. in der Bibi. ind. (KERN). ṁ m̃; ǴAIMINI'S Mimâṁsâdarçana in der Bibl. ind. â ā ç ś

TÂRAN7THA'S G7ÂTAKAM

ā ī ū ṛ ṝ ḷ ḻ ṅ ñ ṭ ḍ ṇ ś ṣ ḥ ṁ ṃ Ā Ī Ū Ṛ Ṝ Ḻ Ṅ Ñ Ṭ Ḍ Ṇ Ś Ṣ Ḥ Ṁ

funderburkjim commented 8 years ago

I see you somehow got a capital M with tilde. Do you know what this is in Hexadecimal?
Is some 'unicode combining character' involved?

I'm doubtful that this should be used. Like the Cedilla, it is not part of modern representation of Sanskrit, right?

gasyoun commented 8 years ago

I see you somehow got a capital M with tilde. - combined unicode magic. Do you know what this is in Hexadecimal? - it is not, it's a combination of two codes. Like the Cedilla, it is not part of modern representation of Sanskrit, right? - right. Our task is to be true to the book. 2nd level - make it easy to use / convert.

funderburkjim commented 8 years ago

Re CORRECTIONS from above: These changes made to pwbib0.txt, after consulting scans.

ÇṀKARAVIG7AJA . -> ÇAṀKARAVIG7AJA per vol. 4. This is a typo, corrected in pwbib0.
ANUKRAMAN2IK4 -> ANUKRAMAN2IK4A7 per vol. 4. Typo.
GIT.GITAGOVINA, Ausg -> GÎT. GÎTAGOVINDA, Ausg per vol 1. Typos
ÇRIP. ÇRIPATI. -> ÇRÎP. ÇRÎPATI. pwbib0 is coded correctly (I7). The transcoding file as_roman.xml is in error here. It says to use \u00d4 for I7 (capital letter i with circumflex), but the correct code is \u00ce. as_roman.xml so corrected.

NOTE TO SELF: \u00d4 is supposed to be capital letter o with circumflex. But, the print (in pwbib1.txt) shows as plain capital I. Why?
NIL. -> NÎL typo. Print smudgy.
NILAK. mit einer Zahl -> NÎLAK. mit einer Zahl typo. Print smudgy.
NILAK. -> NÎLAK. typo. Print smudgy. Also NILAKAN2T2HA -> NI7LAKAN2T2HA
NILAR. Up. -> NÎLAR. Up. typo. Print smudgy. Also NILARDROPANISHAD -> NI7LARDROPANISHAD

The first two corrections were NOT made: NIDÂNAS (ÛTRA) (A. WEBER). -> NIDÂNA (SÛTRA) (A. WEBER). DATTAKAÇ (IROMAṆI), Calcutta 1867 (JOLLY). -> DATTAKA (ÇIROMAṆI), Calcutta 1867 (JOLLY).

The reason is that the spelling agrees with scans (in vol. 6 and vol. 5)
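The \u00d4 versus \u00ce mix-up noted in the I7 correction above can be confirmed from Python's standard unicodedata module; `describe` is just a small helper written for this check.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


print(describe("\u00d4"))  # U+00D4 LATIN CAPITAL LETTER O WITH CIRCUMFLEX
print(describe("\u00ce"))  # U+00CE LATIN CAPITAL LETTER I WITH CIRCUMFLEX
```

Running the same helper over every replacement value in a transcoder file is a quick way to spot similar slips.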

funderburkjim commented 8 years ago

Made some changes to as_roman.xml. The objective of these changes was to represent in unicode the scan type. So pwbib1.txt is supposed to be the 'authentic' (same as scan) unicode version.

A separate transcoder (as_iast.xml) will be developed to generate from pwbib0.txt a pwbib1_iast.txt.

Here are the improvements made to as_roman.xml

m5 -> m̃ and M5 -> M̃
- these use the unicode combining tilde (\u0303). So, m̃ = \u006d\u0303, and M̃ = \u004d\u0303
- Further corrected pwbib0.txt to use a consistent 'AS' notation, namely m5 and M5 .

3 instances of m~ (changed to m5) 
   94:+.G4AIM. == G4AIMINI'S Mima7m~sa7darc2ana in der Bibl. ind. (vol. 1)
    138:+.KA7D. == KA7DAMBARI, Calcutta Sam~vat 1919 (KERN). (vol. 1)
    342:.HARIV. LANGL. == LANGLOIS' Uebersetzung des Harivam~c2a. (vol. 2)

2 instances of M7 changed to M5:
  19:+.A7NANDAG. == A7NADAGIRI, Glossator zu C2AM7KARA7K4A7RJA'S Comm. zu BR2H. A7R Up. in der Bibi. ind. (KERN). (vol. 1)
    273:.TATTVAS. == TATTVASAM7SA, Mirzapore 1850(ROTH). (vol. 1)
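The m5 / M5 mapping to a base letter plus combining tilde can be inspected with the standard library. The sketch below shows that NFC normalization leaves m̃ as two code points (Unicode has no precomposed m with tilde), whereas n plus the same combining tilde composes to a single ñ.

```python
import unicodedata

M_TILDE_LOWER = "m\u0303"   # m + U+0303 COMBINING TILDE
M_TILDE_UPPER = "M\u0303"   # M + U+0303 COMBINING TILDE


def code_points(text):
    """List the code points of a string, e.g. 'U+006D U+0303'."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


print(code_points(M_TILDE_LOWER))                                # U+006D U+0303
print(code_points(unicodedata.normalize("NFC", M_TILDE_LOWER)))  # U+006D U+0303
print(code_points(unicodedata.normalize("NFC", "n\u0303")))      # U+00F1
```

A practical consequence is that string lengths and simple character-by-character comparisons see m̃ as two characters.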

There are 43 lines in pwbib0 where an N7 (or G7) is used to code a scan N (or G) with acute accent. There are 73 lines where an N4, G4, or K4 is used to code the same letter+acute; most of these are K4's. This multiplicity of coding is not useful, as far as I can tell. Since I think that the usual convention of @thomasincambodia for coding an acute accent used the '4', I have changed the N7 to N4 (12 instances), and G7 to G4 (50 instances); this change was made in pwbib0.txt.
as_roman.xml already associates X4 (X=G,N,K) with the appropriate codes for 'latin capital letter X with acute'.
In remaking pwbib1, 4 cases of unknown codes were noticed: These were typos, corrected as:

 C2RA2UTSU7TRA -> C2RAUTASU7TRA
SA7HITJADARPAN2A2 -> SA7HITJADARPAN2A
C2RA7DDH7AK -> C2RA7DDHAK
Vgl. Melanges asiatiques tires du Bulletin de l'Academie Imp7eriale des Sciences de St. Petersbourg. ->
Vgl. Mélanges asiatiques tirés du Bulletin de l'Académie Impériale des Sciences de St. Pétersbourg.
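The AS-code normalizations described in this comment (m~ to m5, M7 to M5, N7 to N4, G7 to G4) were made by editing pwbib0.txt. If they had to be repeated, something like the following could do it; the function is a hypothetical sketch, and it assumes that every M7, N7 and G7 in the file is one of these alternate codings, which should be verified before a bulk change.

```python
import re


def normalize_as_codes(line):
    """Bring alternate AS codings to one convention (assumed safe globally)."""
    line = line.replace("m~", "m5").replace("M~", "M5")   # tilde written literally
    line = line.replace("M7", "M5")                       # alternate m-tilde code
    line = re.sub(r"([NG])7", r"\g<1>4", line)            # acute accent: 7 -> 4
    return line


print(normalize_as_codes("+.G4AIM. == G4AIMINI'S Mima7m~sa7darc2ana in der Bibl. ind. (vol. 1)"))
print(normalize_as_codes(".TATTVAS. == TATTVASAM7SA, Mirzapore 1850(ROTH). (vol. 1)"))
```

Vowel codes such as A7, I7 and U7 are left untouched, because only M, N and G followed by 7 are rewritten.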

With these various corrections, a new version of pwbib1.txt is now available.

Next step will be to make pwbib1_iast.txt, as mentioned above.

funderburkjim commented 8 years ago

@gasyoun I decided to change as_roman.xml to implement the conversions you suggested above. pwbib1.txt is now constructed with these adjustments.

Note two minor changes to your suggested conversions
There is no pwbib1_iast.txt. There is just pwbib1.txt.
Do some spot checking of pwbib1.txt and let me know of either
- missed conversions
- wrong conversions

When pwbib1 looks ok, I think this issue can be closed.

gasyoun commented 8 years ago

Time to close?

Andhrabharati commented 3 years ago

The printed text has some lower and upper case 'm' characters with a tilde above.

The digitization codes these as m~ and M~ sometimes, and sometimes it codes it as M5 or m5.

I could not find a Unicode representation for these m with tilde. So replaced them with a 'dot above' : Ṁ and ṁ.

I could not find a Unicode representation for these - sure, because there is none. No m with tilde, and that is where the unicode game goes wrong.

@funderburkjim, @gasyoun,

Here are the unicode representations for the above characters, M̃ (E1B7) & m̃ (E5B7). [I am just browsing through PW issues now, before taking up the biblio query (for which I was asked if I could help)]

Andhrabharati commented 3 years ago

I had commented hurriedly. Now seen down the line, that you guys got the unicode characters already.

Andhrabharati commented 2 years ago

@funderburkjim Can this issue be closed now?

sanskrit-lexicon / PWK

Editing changes to pwbib0.txt #14