sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

RegEx 5, 4, 7. 6, 41, 2. 7, 53, 3. 8, 1, 1. 3. 10, 2, 13. #35

Closed: gasyoun closed this issue 10 years ago

gasyoun commented 10 years ago

Instead of Helvetica one could use Old Standard (https://www.dropbox.com/sh/v93vyxe6qlv2djg/AAAU0JlKsE7jh-irNggaAzTsa), a font almost identical to the original Petersburg dictionary font because it is based on it, and Charter Indologique Capital (https://www.dropbox.com/sh/rujslxf4xe16hst/AAD9CuFYB72rIBouPMFNqoxRa) for text references.

When I see a long "train" of citations, I can hardly tell 1) which reference belongs to which quotation and 2) where one ends and the next begins. Simple cases like पुण्यं प्राणान्धारयति MBH. 1, 6056. यावच्च मे धरिष्यन्ति प्राणा देहे N. 5, 31. are great, but consider मेमं प्रा॒णो हा᳠सी॒न्मो अ᳠पा॒नः AV. 2, 28, 3. 3, 15, 7. प्राण, व्यान, चक्षुस् 5, 4, 7. 6, 41, 2. 7, 53, 3. 8, 1, 1. 3. 10, 2, 13.

I hope the 2nd quote also refers to AV., otherwise I'm lost. Since the list of books is not endless, I would definitely try some RegEx to color "5, 4, 7. 6, 41, 2. 7, 53, 3. 8, 1, 1. 3. 10, 2, 13." and to set the 1st number in the same font size as AV. and each following number smaller than the previous one (2nd = 10 pt, 3rd = 9 pt, on rare occasions a 4th). Exactly as in the book. The book is easy to read because of that. The digital version is very hard to read because it lacks this, especially the Grosses Petersburger Wörterbuch. [image: prana]
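The RegEx idea above could be sketched roughly as follows. This is only an illustration, not the display program used on the site: the percentages in `SIZES`, the pattern `GROUP`, and the function names are assumptions, and the pattern would need to be applied only inside text already known to be a literary reference, since it would otherwise also match ordinary numbers followed by a period.

```python
import re

# Hypothetical size steps: the first number keeps the size of the source
# abbreviation, later positions shrink. These values are placeholders.
SIZES = [None, "90%", "80%", "70%"]

# One citation group: comma-separated numbers closed by a period,
# e.g. "5, 4, 7." or a lone "3."
GROUP = re.compile(r"\d+(?:,\s*\d+)*\.")

def grade_group(match):
    """Wrap the 2nd, 3rd, ... number of one group in a sized span."""
    numbers = re.findall(r"\d+", match.group(0))
    parts = []
    for position, number in enumerate(numbers):
        size = SIZES[min(position, len(SIZES) - 1)]
        if size is None:
            parts.append(number)
        else:
            parts.append(f'<span style="font-size:{size}">{number}</span>')
    return ", ".join(parts) + "."

def grade_citation(text):
    """Apply the grading to every citation group found in text."""
    return GROUP.sub(grade_group, text)

print(grade_citation("AV. 2, 28, 3. 3, 15, 7."))
```

Each group restarts at full size after a period, so the position inside a group, not the position in the whole train, decides the size.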

Here is a sample that I made in Word, printed to PDF, and converted to PNG to upload here: [images: prana-2_1, prana-2_2]

funderburkjim commented 10 years ago

Regarding the 'bricks' circled in your image for prARa (slp1) - This is an issue with browser display of the svarita accent.
[image] This is how it looks in Firefox (version 28.0). Chrome looks similar.
In Firefox, the default font for browser Devanagari shows as Mangal.

In the Accent Help page http://www.sanskrit-lexicon.uni-koeln.de/scans/PWGScan/2013/web/webtc1/help/accents.html, mention is made of the need for Praja font. When that was written several months ago, it was necessary to install a particular non-free font (Praja) to see the accents in Firefox - I have that font on my system, and it may be that the browser is using it - I am not sure.

Due partly to the spotty display in browsers of accents, I today added an option in the pwg displays to ignore the accents.

funderburkjim commented 10 years ago

A modification of the pwg display for literary sources was made along the lines suggested. Here's a sample: [image]

How is that?

gasyoun commented 10 years ago

It's a nice addition. Still I think it's time to leave Praja behind and get to online font display solutions. And nothing was said about AV. 2, 28, 3. 3, 15, 7. kind of things, though that is the main topic of the thread.

funderburkjim commented 10 years ago

Did you not see that the numbers in AV. 2, 28, 3. 3, 15, 7 are graded in size, which directly addresses the main topic of the thread?

funderburkjim commented 10 years ago

The picture has nothing to do with Praja font. The only point of Praja font is that it displays Devanagari accents. The picture was made with the 'Hide Accents' option, so accents are not shown.

gasyoun commented 10 years ago

"Did you not see that the numbers in AV. 2, 28, 3. 3, 15, 7 are graded in size" - I saw and showed a sample in doc, exported to PDF and attached here to show the grading. Can we try to reproduce it, please?

funderburkjim commented 10 years ago

It IS implemented! I made a program change in response to this issue. Note in my image, which is a screenshot of PWG2013 after the implementation, how the sizes go down in AK,2,8,2,88. Here's another example, using 'deva':

[image]

I can tinker with the relative sizes and/or other style adjustments if you think it is needed, but the basic implementation regarding size gradations seems ok to my eye. Do you agree?

gasyoun commented 10 years ago

I fully agree! I see it now. Let me ask a few more mixed questions (all based on the entry "para").

1) WARREN, KÂLASAM5K. 374 = 24{Ç}{?}. Is 5K an anglicized Sanskrit rudiment? What meaning does {} have in {Ç}{?}? Is {?} related to {Ç}, an undeciphered part of the book, or text from the book itself?

2) [Page05.1581] is wanting a color different than the main text and a bibliographical tooltip (in case of [Page04.0481] it's "lost" inside "quote" markup). If I want to quote a dictionary entry, I need to quote it according to a standard. Page05.1581 is good for MySQL but bad for academic papers. What if we add a JS: if I copy the article the exact bibliographical data gets added after the text as per example the Harvard standard. Similarly with प॑र [L=42310] [p= 4-0479]. L=42310 is great for the database, but can I search for a word based on it? Instead of p= 4-0479 I would want to have an alternative, bibliography based output. I could prepare a sample of how I think it should look, if others agree that such functionality is needed more than other things.

3) Headword (पर) left in Mangal intentionally?

4) MBH quotes are non-usual. In "MBH.3,13386. 15534. 13,602. 3797. 14,2783." 15534 would have to be same size as 13386.

5) Böhtlingk used a transliteration that was outdated even for his time. A tooltip could partly solve this as well: when hovering over RÂǴAN. im ÇKDR., we could show at least the 3 letters as they are written nowadays in IAST. It would help with search as well.

6) "Ind. St.3,235,a." has no spacing, in similar cases in other sources there is spacing based on the book between numbers.

7) "RAGH. 3,-. 7, 38. 17, 59." check for "-" in book, see in several places, need approval.

8) In "ÇAT. BR. 5, 1, 1, 13." second 1 should be smaller, no? Similar in "AK. 2, 6, 2, 49. 3, 6, 3, 26."
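For point 5, a tooltip could be fed by a small conversion table. The sketch below is an assumption about how that might be done, not something that exists in the displays: the table `OLD_TO_IAST` lists only a handful of capital-letter correspondences (Ç to Ś, Ǵ to J, Ḱ to C, J to Y, SH to Ṣ, and the circumflex long vowels), and `to_iast` is a hypothetical name.

```python
import re
import unicodedata

def _nfc(text):
    # Normalise so that composed and decomposed accents compare equal.
    return unicodedata.normalize("NFC", text)

# Assumed correspondences, capital letters only (as in source abbreviations).
OLD_TO_IAST = {_nfc(old): _nfc(new) for old, new in {
    "SH": "Ṣ",
    "Ç": "Ś",
    "Ǵ": "J",
    "Ḱ": "C",
    "J": "Y",
    "Â": "Ā",
    "Î": "Ī",
    "Û": "Ū",
}.items()}

# Longest keys first so "SH" is tried before single letters. A single pass
# keeps a freshly produced "J" (from Ǵ) from being turned into "Y".
_PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(OLD_TO_IAST, key=len, reverse=True))
)

def to_iast(abbreviation):
    """Rewrite an old-style abbreviation with the assumed IAST letters."""
    text = _nfc(abbreviation)
    return _PATTERN.sub(lambda m: OLD_TO_IAST[m.group(0)], text)

print(to_iast("RÂǴAN."))
print(to_iast("BHÂVAPRAKÂÇA"))
```

A table-driven approach keeps the original spelling in the data and only changes what the tooltip shows, so searches could accept either form.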

funderburkjim commented 10 years ago

You have raised several good points in this and the last few issue comments of today and yesterday. It will be several days before I can attend to responses. In the meantime, here's one side question: Have you already developed a literary source abbreviation correspondence for the abbreviations in PWG? Such a table would be needed for giving online tool-tips for AV., etc. in PWG.

gasyoun commented 10 years ago

Yes, the topics are wide and mixed. Some are general, some relate only to a few entries. Some are for me, some for you. As per "literary source abbreviation correspondence for the abbreviations in PWG": we have some at least. In 3 months we might have more, though even that would be soon and may not be possible. I opened another entry to see whether anything new turns up, and it does indeed.

Entry "dhAtu" (noting while on plane to Siberia): 9) सहो॒भरिः᳠ ṚV. 5, 44, 3. shows with a double danda at the end - must be the notorious "brick" issue. When selected - gone. A mystery for me still. There are several ghost issues in the devanagari font, must be the vedic small things.

10) Not all numbers should be grey, because numbers can occur inside the text as well: genannten 7 noch den 3 Dhâtu hervorgegangenen 5 Dhâtu von 10 Dhâtu. In other, similar cases there is no markup issue at all (everything fine, unmarked): nur 4 Elemente die 5 Sinnesorgane wahrgenommenen 6 Eigenschaften.

11) Lexicogrr. ? Same in the book. "rr" does not make much sense to me. But after some googling I see it in "Toupii emendationes in Suidam et Hesychium, et alios Lexicogrr. Graecos.", so it's ok. So it's Greek.

12) "vgl. u. 6." - link to "— 6)"?

13) Ausg. des Lot. de la b. l. 511. fgg. , not only "l. 511. fgg." bibliography "Ausg. des Lot. de la b." as well

14) "in's Sanskrit" -> "in’s Sanskrit" typographical OCR error. Same in Çâkjamuni's. So one could try to autoreplace "any letter + 's" http://webdesignledger.com/tips/common-typography-mistakes-apostrophes-versus-quotation-marks

15) der Erde, - der Gebirge - is "-" the typographically correct choice? SUÇR. 1, 44,-. 88, 5. http://www.chicagomanualofstyle.org/qanda/data/faq/topics/HyphensEnDashesEmDashes.html

16) "Z. 15 lies": "lies" is a part of the meta-language, and I would not say that it deserves a place in the quote markup. "Z. 15" refers to the exact page (= line 15), so ideally it would link internally to where it belongs. German words that are used as meta-language should be filtered out from the rest of the markup or given markup in their own right. So in "(in dieser Bed. m. n. nach VIÇVA bei UǴǴVAL. zu UṆÂDIS. 1, 70)", bei and zu should be left out. Instead of one unknown source "VIÇVA bei UǴǴVAL. zu UṆÂDIS. 1, 70" we get 2 that we can try to trace, count, whatever.

17) Among the books quoted, some depend on the exact edition (mostly unavailable as a scanned pdf), some are "eternal". The Rigveda is good for experimenting (ṚV. 8, 39, 9.), the Mahabharata is almost useless (MBH. 13, 3231.): one could google and hope that part of the text matches, but otherwise the Calcutta edition that is quoted has not been scanned. Shalu, what do you think about it?
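The autoreplace proposed in point 14 ("any letter + 's") might be sketched like this. It is a minimal example under the assumption that only a straight apostrophe directly between a letter and a word-final "s" should be touched; `STRAIGHT_S` and `fix_apostrophes` are hypothetical names, and real data would need a review pass before any bulk replacement.

```python
import re

# A straight apostrophe that follows a letter and precedes a word-final "s",
# as in "in's" or "Çâkjamuni's". [^\W\d_] means "any Unicode letter".
STRAIGHT_S = re.compile(r"(?<=[^\W\d_])'(?=s\b)")

def fix_apostrophes(text):
    """Replace the matched straight apostrophes with U+2019."""
    return STRAIGHT_S.sub("\u2019", text)

print(fix_apostrophes("in's Sanskrit"))
```

Quotation marks written with the same character are left alone, because an opening quote has no letter before it and a closing quote is not followed by "s".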

Shalu411 commented 10 years ago

Namaste. Regarding the "svara" or accent issue, it is good that you gave an option to show or hide it. Thanks a lot. For dictionary purposes svara hardly matters when one is just interested in the meaning and occurrence of the word in different contexts. For many non-Vedic Sanskrit people svara is of little use. Sometimes I feel that svara gets in the way while reading a text, unless I am looking for it specifically.

Regarding the numbers indicating references, it is really helpful now. I was confused myself when I looked at them for the first time. It's a wonderful idea invented by the author. There is much to learn from different dictionaries, different standards, different styles of expression and presentation.

funderburkjim commented 10 years ago

Regarding "1) WARREN, KÂLASAM5K. 374 = 24{Ç}{?}. "

For reference, here is the image from the scan: [image]

Thomas has provided no explanation of this {Ç}{?} coding. The {Ç} coding occurs 600+ times. Generally, the {?} coding is used for 'text unreadable or uncodable'. Based on examining a couple of instances, my suspicion is that {Ç}{?} codes a 'meter' or 'metrical pattern'. The inference would be that those three apostrophes are interpreted as a metrical pattern. I could adjust the displays to write '[metrical pattern]' for such occurrences, if it is thought to be likely accurate.

As one other data point of {Ç}{?}, see atiSayana (slp1): The digitization has

(4 Mal {Ç}{?}, {Ç}{?})

The scan is: [image]
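To examine more than a couple of instances, the brace codes could be tallied across the digitization. The following is a small sketch under the assumption that the text is available as plain lines; `count_codes` is a hypothetical helper and the two sample lines are simply the instances quoted in this thread.

```python
import re
from collections import Counter

# Anything written as {X} in the digitization, e.g. {Ç} or {?}.
BRACE_CODE = re.compile(r"\{([^{}]+)\}")

def count_codes(lines):
    """Count how often each {..} code occurs in the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(BRACE_CODE.findall(line))
    return counts

sample = [
    "WARREN, KÂLASAM5K. 374 = 24{Ç}{?}.",
    "(4 Mal {Ç}{?}, {Ç}{?})",
]
print(count_codes(sample))
```

Printing the matching lines next to the counts would give a list that can be compared against the scans.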

funderburkjim commented 10 years ago

Regarding '2)[Page05.1581] is wanting color...'.

[PageVV.PPPP] is notation used in digitization to indicate a page-break in the scan. VV is the volume (01,...,07), PPPP is page within Volume.
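Since the marker has a fixed shape, turning it into something citable is mostly a parsing step. The sketch below assumes exactly the [PageVV.PPPP] form described above; `PageRef`, `parse_page_break`, and the wording produced by `describe` are illustrative choices, not an agreed citation format.

```python
import re
from typing import NamedTuple, Optional

class PageRef(NamedTuple):
    volume: int
    page: int

# Two digits for the volume, four for the page within the volume.
PAGE_BREAK = re.compile(r"\[Page(\d{2})\.(\d{4})\]")

def parse_page_break(marker: str) -> Optional[PageRef]:
    """Return volume and page for a [PageVV.PPPP] marker, else None."""
    match = PAGE_BREAK.fullmatch(marker.strip())
    if match is None:
        return None
    return PageRef(volume=int(match.group(1)), page=int(match.group(2)))

def describe(ref: PageRef) -> str:
    # Hypothetical human-readable label for a tooltip or citation.
    return f"PWG, vol. {ref.volume}, p. {ref.page}"

ref = parse_page_break("[Page05.1581]")
print(describe(ref))
```

Keeping the parsed volume and page separate from the label leaves room for a different bibliographic style later.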

In case of Page04.0481, the page break occurs within a literary source, and was improperly displayed as if it were a part of the literary source. Based on this example, a slight adjustment to webtcc display was made; take a look at the adjustment. Further adjustment can be attempted if required.

funderburkjim commented 10 years ago

Regarding 3) Headword (पर) left in Mangal intentionally?

No. Change was made in webtcc so this headword now uses same font-family as other devanagari.

funderburkjim commented 10 years ago

Regarding: 4) MBH quotes are non-usual. In "MBH.3,13386. 15534. 13,602. 3797. 14,2783." 15534 would have to be same size as 13386.

I agree the scan shows size of 15534 = size of 13386. I do not see how to program this.

funderburkjim commented 10 years ago

Regarding "2)...L=42310 is great for the database, but can I search for a word based on it?"

L=42310 is for identifying which database record is being displayed. One use case is when specifying correction. Currently, the displays do not allow search on this. It has no visible correlate in the scans. It is a kind of 'metadata'.

funderburkjim commented 10 years ago

Regarding "2) ... Instead of p= 4-0479 I would want to have an alternative, bibliography based output".

Please do provide an example of what you have in mind.

Regarding "if I copy the article the exact bibliographical data gets added after the text as per example the Harvard standard"

What is the 'Harvard standard' ?

funderburkjim commented 10 years ago

Regarding '6) "Ind. St.3,235,a." has no spacing, in similar cases in other sources there is spacing based on the book between numbers.'

Here is the digitization coding:

¯{¤Ind.…St.3,235,a.¤}

The lack of space between 'St.' and '3' can be called a digitization error, solved by a correction submission. Also, 'Ind' and 'St' are not in all caps (in agreement with the text). Is this whole thing really a literary source?

Contrast the Ind. St. case with

¯{¤KULL.…zu…M.…8,…2.¤}

To make further refinements in the display for literary sources may require adjusting Thomas' digitization to some more elaborate form. This was the case with MW. It is difficult to make a myriad of minor adjustments in the display program; I think we are running into the limits of feasibility of such adjustments. For instance, maybe 'Ind. St.' is an abbreviation for a literary source in the form of two words. In other words, when it is desired to make fine distinctions or to make such things as tooltips be correct in all cases, then, in my experience, it is necessary to add markup. The current markup of PWG can be used for many cases, but is difficult to use and interpret for all cases.
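To make the difficulty concrete, here is a rough sketch of what can be read out of the existing coding without extra markup. It rests on two assumptions that are not confirmed in this thread: that "…" stands for a space inside a literary-source span, and that the German connectors "zu" and "bei" separate one source from the next. The names `LS_SPAN`, `CONNECTORS`, and `literary_sources` are hypothetical.

```python
import re
from typing import List

# A literary-source span as it appears in the two samples above.
LS_SPAN = re.compile(r"¯\{¤(.*?)¤\}")

# Assumed meta-language words that join two sources.
CONNECTORS = {"zu", "bei"}

def literary_sources(coded: str) -> List[List[str]]:
    """Split every coded span into the sources it seems to contain."""
    results = []
    for body in LS_SPAN.findall(coded):
        tokens = body.replace("…", " ").split()
        parts, current = [], []
        for token in tokens:
            if token in CONNECTORS:
                if current:
                    parts.append(" ".join(current))
                current = []
            else:
                current.append(token)
        if current:
            parts.append(" ".join(current))
        results.append(parts)
    return results

print(literary_sources("¯{¤KULL.…zu…M.…8,…2.¤}"))
print(literary_sources("¯{¤Ind.…St.3,235,a.¤}"))
```

The second sample comes back as a single undivided string, which illustrates why a two-word abbreviation such as 'Ind. St.' cannot be told apart from a source plus its numbers without a table of abbreviations or added markup.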

funderburkjim commented 10 years ago

Regarding "8) In "ÇAT. BR. 5, 1, 1, 13." second 1 should be smaller, no? Similar in "AK. 2, 6, 2, 49. 3, 6, 3, 26."

Here is the generated code for display. The first '1' is 80% of parent, the second '1' is 70%. The font sizes computed from this css are : 13.636px for first, 12.727px for second. Perhaps these two are so close as to be indistinguishable. Another possibility might be to use 'px' directly in the css, rather than '%'.

<span class="ls">Çat. Br. 5,
<span style="font-size:90%"> 1</span>,  <!-- the first -->
<span style="font-size:80%"> 1</span>,  <!-- the second -->
<span style="font-size:70%"> 13</span>. 3,
<span style="font-size:90%"> 11</span>. 1,
<span style="font-size:90%"> 9</span>,
<span style="font-size:80%"> 3</span>,
<span style="font-size:70%"> 10</span>. 9,
<span style="font-size:90%"> 1</span>,
<span style="font-size:80%"> 1</span>,
<span style="font-size:70%"> 29</span>. 14,
<span style="font-size:90%"> 9</span>,
<span style="font-size:80%"> 4</span>,
<span style="font-size:70%"> 11</span>.
</span>
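
The 'px' alternative could be worked out in advance so that neighbouring levels differ by a clearly visible amount. This is only a sketch: the 16px base size, the 2px step, and the 9px floor are assumptions chosen for illustration, not values taken from the webtcc display, and `graded_px` and `css_for` are hypothetical helper names.

```python
def graded_px(base_px, levels=4, step_px=2.0, floor_px=9.0):
    """Return one font size in px per position, shrinking by a fixed step."""
    return [max(floor_px, base_px - step_px * i) for i in range(levels)]

def css_for(sizes):
    """Format each size as an inline CSS declaration."""
    return [f"font-size:{size:g}px" for size in sizes]

sizes = graded_px(16.0)
print(css_for(sizes))
```

A fixed step in px keeps the difference between two adjacent positions constant, whereas percentages of the same parent can end up less than a pixel apart.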
funderburkjim commented 10 years ago

Regarding 7) "RAGH. 3,-. 7, 38. 17, 59." check for "-" in book, see in several places, need approval.

This '-' was an obscure bug in the display program, now fixed in webtcc. GOOD CATCH!

funderburkjim commented 10 years ago

Regarding 9) सहो॒भरिः᳠ ṚV. 5, 44, 3.

Yep. That box is due to the font not representing the svarita accent. Here is an image with the Siddhanta font, which shows accents: [image]

That little 'hook' is representing svarita accent.

gasyoun commented 10 years ago

Finally I can comment on all the topics. I'm happy that the work is going faster than I expected.

1) As per "three apostrophes are interpreted as a metrical pattern": it does sound fishy. It looks more like a book size to me. But "4 Mal {Ç}{?}, {Ç}{?}" is 100% a metrical pattern, and in Old Standard you can even show it as in the book. A list of all {Ç} occurrences would help to decide. Is it possible to collect all the corresponding .tiff images in one .pdf as well?

2) "Page04.0481" is much better now. As it's metadata, I would make it orange or bold to show that it's not part of the book, and give all metadata some similar formatting. I would love to have an option not to see metadata (including L=42310) on the page. And if it's there, I would want to see it in a contrasting color. As per the bibliography (around 460 books), the work is in progress:

  1. BHÂNUD. = BHÂNUDÎKSHITA, ein Commentator des AMARAKOSHA; nach Anführungen im ÇKDR.
  2. BHAR. = BHARATA, ein Autor über Schauspielkunst, verschieden vom Scholiasten zum AK. (dieser nach Anführungen im ÇKDR.).
  3. BHARTṚ. = BHARTṚHARI, Ausg. von BOHLEN (GILD. Bibl. 156), mit Berücksichtigung des Werkchens: Variae lectiones ad Bohlenii editionem Bhartriharis sententiarum pertinentes, e codicibus extractae per A. SCHIEFNER et A. WEBER. Berolini 1850.
  4. *BHÂSHÂP. = BHÂSHAPARIḰḰHEDA nach der Ausgabe von RÖER in der Bibliotheca indica.
  5. *BHAṬṬ. = BHAṬṬIKÂVJA, ed. Calc. (GILD. Bibl. 137).
  6. BHÂVAPR. = BHÂVAPRAKÂÇA, ein medic. Wörterbuch; nach Anführungen im ÇKDR.
  7. BHAV. P. = BHAVISHJAPURÂNA.
  8. BHÛRIPR. = BHÛRIPRAJOGA, ein Wörterbuch; nach Anführungen im ÇKDR.

Example of a Harvard Bibliography: http://community.ucreative.ac.uk/article/27216/Example-of-a-Harvard-Bibliography

3) Mangal dead. Finally!

4) What if, to achieve "scan shows size of 15534 = size of 13386", we count the digits? If two numbers follow each other (13386, 15534), both with a suspicious length of exactly 5, let's leave them formatted alike.

6) "Is this whole thing really a literary source?" Yes, it is. There is a huge number of unlisted sources that I found while analyzing the data in Excel. Ind. St. = Weber, Albrecht [Hrsg.] Indische Studien : Zeitschrift für die Kunde des indischen Alterthums / ... hrsg. von Albrecht Weber. - Bd. 1. - Berlin : Dümmler, 1850. - IV, IV, 484 S. http://digital.indologica.de/?q=node/1679 So yes, "'Ind. St.' is an abbreviation for a literary source in the form of two words". As per "it is necessary to add markup": what did you use in MW in such cases? Thomas already has 5 different markups; it's a mess.

8) To my knowledge as a webmaster, "use 'px' directly in the css, rather than '%'" should do the trick.

9) I see it and still do not get it. I have never seen such an accent in this or any other book. What we could do is implement accents PWG-wise in Santipur; that would be the best solution, I guess. Otherwise it's a pseudo-scientific Devanagari.
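
The digit-count idea from point 4 could be tried out along these lines. This is a sketch of the heuristic only, generalised from "exactly 5" to "same number of digits as the last number of the preceding group", which is an assumption; `levels_for` is a hypothetical name and level 0 stands for the largest size.

```python
import re

def levels_for(citation_numbers):
    """Return (number, level) pairs for a run such as '3,13386. 15534.'"""
    result = []
    previous = None  # (digit count, level) of the last number in the prior group
    groups = (g.strip() for g in citation_numbers.split("."))
    for group in filter(None, groups):
        numbers = re.findall(r"\d+", group)
        for position, number in enumerate(numbers):
            level = position
            # A lone number as long as its predecessor inherits that level.
            if (len(numbers) == 1 and previous is not None
                    and len(number) == previous[0]):
                level = previous[1]
            result.append((number, level))
        if numbers:
            previous = (len(result[-1][0]), result[-1][1])
    return result

for number, level in levels_for("3,13386. 15534. 13,602. 3797. 14,2783."):
    print(number, level)
```

With this rule 15534 gets the same level as 13386, but 3797 after 602 is not caught because the digit counts differ, so the heuristic would cover only part of the MBH. cases.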