Russian in PWG, a few more

funderburkjim commented 6 years ago

Most of the Russian script appearing in PWG has been provided.

Here are a few more that need filling in

Case 1. hw = Ahanas

Case 2. hw = dIrGa

Case 3. hw = dIrGa

Case 4. hw = dIrGa

Case 5. hw = kAkud

Case 6. hw = dIrGa

all the dIrGa's are on page 3-0654

gasyoun commented 6 years ago

1) набитый 2) duplicate? 3) съдръгиѫти сѧ - not Russian, but Old Church Slavic (the only difference is that the output font should differ) 4) судорога 5) небо 6) дергать

SergeA commented 6 years ago

~~2. duplicate?~~ ~~3. съдръгиѫти сѧ~~

съдръгати сѧ
съдръгнѫти сѧ

I'd like to review the spelling of other OCS words already written in PWG.

not Russian, but Old Church Slavic (the only difference is that the output font should differ)

Yes, OCS and Russian words should be marked up separately. OCS is usually printed with a special extrabold uncial cyrillic font.

funderburkjim commented 6 years ago

I'm taking Gasyoun's texts with SergeA's revisions on cases 2,3.

Regarding the distinction between OCS and Russian:

I'm not sure what OCS (Old Church Slavic) is. Is it an ancient language somehow related to Russian? Does it use the same (or almost the same) alphabet?
The current markup syntax used for MW is <lang n="X" script="Y">text</lang>. This was developed in discussions such as here. Based on the mw.dtd, an example would be <lang n="Arabic" script="Arabic"> or <lang n="Persian" script="Arabic">.
This markup may provide enough parameters to describe the current OCS-Russian distinction.
Except for the MW example with Arabic, I've used the <lang> markup more casually, e.g. <lang n="greek">X</lang>. Also, there is probably some inconsistency in the exact spelling of the attribute values (e.g., maybe I've used <lang n="arabic"> (small 'a') elsewhere).
Not all foreign language scripts are currently marked with the lang tag. For instance, Greek is marked in a different (idiosyncratic) way in MW. There may also be some 'naked' (unmarked) Russian text in MW.
The main utility of the <lang> tag is to provide a uniform way to identify (for filtering) text written in parts of the Unicode range other than Devanagari and the modern English, German, French, or Latin that serves as the base language of a dictionary.
At the moment, the markup for the above 6 cases is <lang n="russian">X</lang>. Currently, there are 82 cases in PWG with this markup.
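The filtering use of the <lang> tag described above can be sketched in Python. This is only a minimal sketch: it assumes the <lang n="X">text</lang> form appears literally in the dictionary text, and the sample string and function names are hypothetical, not taken from pwg.xml.

```python
import re
from collections import Counter

# Matches <lang n="X">text</lang>, with an optional script="Y" attribute.
LANG_RE = re.compile(r'<lang n="([^"]+)"(?: script="[^"]+")?>(.*?)</lang>', re.DOTALL)

def lang_spans(xml_text):
    """Return (language, text) pairs for every <lang> element found."""
    return LANG_RE.findall(xml_text)

def count_by_language(xml_text):
    """Count <lang> elements per value of the n attribute (case-sensitive)."""
    return Counter(lang for lang, _ in lang_spans(xml_text))

# Hypothetical sample, only to show the shape of the input.
sample = (
    'russ. <lang n="russian">судорога</lang>; '
    '<lang n="greek">X</lang>; '
    '<lang n="Arabic" script="Arabic">Y</lang>'
)
print(count_by_language(sample))
```

Because the count is case-sensitive, spellings such as "greek" and "Greek" would show up as separate keys, which is one quick way to spot inconsistent attribute values.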

@SergeA -- Do you want to review those 82 cases? If so, what do you need from me for that review?

If it seems important to revise the markup to distinguish the OCS cases, then it probably makes sense to do this after a review of the 82 cases.

gasyoun commented 6 years ago

Do you want to review those 82 cases?

All. Just a list will do.

Is it an ancient language somehow related to Russian? Does it use the same (or almost the same) alphabet?

Yap, most letters are the same.

SergeA commented 6 years ago

Do you want to review those 82 cases?

Yes.

If so, what do you need from me for that review?

You know, I'll be happy if you provide a UI for the task. Or perhaps a list with (headword / Russian word / link to the online article / link to the scan image). Is it easy to generate such a list? I do not know the technical side. I do not know if it is possible to give a link to the exact Cologne article. Often it'd be very handy to provide in the discussion a direct link to the MW meaning etc.

funderburkjim commented 6 years ago

@SergeA Give this russian02.html a try.

Note 1: I developed this for a desktop screen (1920 x 1080), and have only used Chrome in testing.

Note 2: The (russian) that appears in the display is just temporary for this program, to make it easier to find the russian words in some of the long texts. When we've finished this review, I'll remove '(russian)' from the display.

SergeA commented 6 years ago

Thank you, it's workable ok. I think I'll finish the check quickly. (Perhaps better shorten the list field, for lesser scrolling to the dic field below and back again. )

SergeA commented 6 years ago

Need correction:

08 aNgAra ѫглъ = ѫгль
11 aBi овъ = объ
21 kftvas краты = кратъ

(№20 краты and №21 кратъ)

58 piSaNga красити = красьнъ
59 piSaNga красити = красный

(№56 краса ... №57 красити ... №58 красьнъ ... №59 красный)

61 barh ФРАГ = Greek language!!! ΦΡΑΓ
65 marka мръкати, мръкатиѧти = мръкати, мръкнѫти
76 lal лелеять = лелѣять

Remove question mark, tag as LS:

38 tala Буслаева, Опытъ истор. гр. русскаго языка
77 varzASAwI Минаевъ, Пратимокша-сутра

Language = OCS:

01 aMhati ѫз-ъкъ
02 aMhu къ
03 aMhu ѫзъкъ
04 aMhu къ
05 aMhu льгъкъ
06 aMhu сладъкъ
07 agni огнь
08 aNgAra ѫглъ = ѫгль
09 apa оу
10 apa оу
11 aBi овъ = объ
12 AtmakIya свои
14 Urmi влъна
18 kfte дѣлѩ, дѣльма
19 kfte дѣло
20 kftvas краты
21 kftvas краты = кратъ 
22 giri гора
24 grIvA гривьна
25 Gar горѣти
26 Gar грѣти
27 Gar горькъ
30 Gar грѣхъ
31 Gar х
33 car чародѣи, чаровати, очаровати
34 Cikkana чьханиѥ
35 jamBa зѫбъ
39 talpa постелѩ
40 tAyu таити
41 tAyu тать
43 darh дръжати
44 dIrGa длъгъ
46 dIrGa съдръгати сѧ
47 dIrGa съдръгнѫти сѧ
49 duhitar дъшти
50 duhitar дъштере
52 nIqa гнѣздо
53 nud ноудити, нѫдити
54 nud ноужда, нѫжда
55 paNkti пѧть
56 piSaNga краса
57 piSaNga красити
58 piSaNga красити = красьнъ
63 Baga богъ
64 manAk мьний
65 marka мръкати, мръкатиѧти = мръкати, мръкнѫти
66 marka мракъ
67 mAMsa мѧсо
68 mUrC мразъ
69 mUrC мразити сѧ
70 meza мѣхъ
78 varzizWa врьхъ
79 vIDra ведро
80 Sruz слоухо
81 Sruz слоухъ
82 Sruz слышати
83 srAma хромъ
84 himA зима

Language = RUS:

13 Ahanas набитый
15 kawAha (Китай)
16 kAkud небо
17 kAlapUga черный народъ
23 grIvA грива
28 Gar жаръ
29 Gar жара
32 caturaNga ладiя
36 tala тло
37 tala дотла
38 tala Буслаева, Опытъ истор. гр. русскаго языка
42 daRqa столбы
45 dIrGa дергать
48 dIrGa судорога
51 nizka гривна
59 piSaNga красити = красный 
60 prasraMsin выкинуть, выкидышъ
62 bAhAdura богатырь
71 akzaravarjita неграмотный
72 antarveDa Антарабида
73 nOka ладiя
74 labDavarRa грамотный
75 labDavarRa грамотѣй
76 lal лелеять = лелѣять
77 varzASAwI Минаевъ, Пратимокша-сутра

Language = Greek:

61 barh ФРАГ = Greek!!! ΦΡΑΓ

SergeA commented 6 years ago

Three more corrections, which require fonts that support the Cyrillic Extended-B Unicode range. (On my computer they are viewable with the fonts DejaVu Serif and Old Standard TT.)

A657 ꙗ CYRILLIC SMALL LETTER IOTIFIED A
A651 ꙑ CYRILLIC SMALL LETTER YERU WITH BACK YER

18 kfte дѣлѩ, дѣльма = дѣлꙗ, дѣльма
39 talpa постелѩ = постелꙗ
82 Sruz слышати = слꙑшати

I am not sure about the last one, whether it makes any difference to spell it with ы or ꙑ. I am not an expert in OCS, but it seems to me they are optional graphical variants. In another example Böhtlingk spells it with ы:

20 kftvas краты

Perhaps this is a print error for кратꙑ; this is how I see it in the OCS dictionary.
http://ksana-k.ru/?p=803
http://ksana-k.ru/dict/stsl/sl0293.png
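To find which words need a font with Cyrillic Extended-B support, one option is to scan for code points in that block (U+A640 to U+A69F). A minimal Python sketch follows; the function name is hypothetical, and the sample words are the corrected spellings listed above, with the two special letters written as escapes.

```python
import unicodedata

# Cyrillic Extended-B block: U+A640..U+A69F.
EXT_B_START, EXT_B_END = 0xA640, 0xA69F

def extended_b_chars(text):
    """Return the distinct Cyrillic Extended-B characters in text, in order of first appearance."""
    seen = []
    for ch in text:
        if EXT_B_START <= ord(ch) <= EXT_B_END and ch not in seen:
            seen.append(ch)
    return seen

# \uA657 = iotified a, \uA651 = yeru with back yer.
words = ["дѣл\uA657", "постел\uA657", "сл\uA651шати", "краты"]
for word in words:
    for ch in extended_b_chars(word):
        # unicodedata.name falls back to "UNKNOWN" if the local Unicode database lacks the character.
        print(word, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
```

The last sample word, краты, uses only the basic Cyrillic block, so nothing is printed for it.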

funderburkjim commented 6 years ago

In case 38:

I'm removing the <lang n="Russian"> tag for Буслаева, Опытъ истор. гр. русскаго языка, since this is name of work within <ls> tag. This is to make it easier to link this with pwgauth. Also added this to the pwgbib.txt file.

Should we provide in pwgbib an English translation and/or other description of this Russian language work?

funderburkjim commented 6 years ago

Language names

We should probably aim to follow a standard spelling for language name in <lang n="LANGUAGE">.

I used language names consistent with ISO 639-2.

<lang n="Russian"> and <lang n="Old-Church-Slavonic">.

Why the hyphens in Old-Church-Slavonic ?

There is a clash in two standards:

the 639-2 standard spelling for language names sometimes has a space within a name, as 'Old Church Slavonic'.
the DTD (document-type definition) standard does not allow spaces in attribute values which are enumerated.
- Unallowed: <!ATTLIST lang n (arabic | Russian | greek | Greek | oldhebrew | Old Church Slavonic) #REQUIRED >

It seems like a good consistency and documentation feature to require specific spellings for the values of the n attribute of the lang tag.

So a compromise has to be made in our usage of one or the other of these two standards.

The compromise I made is to replace the space character with the hyphen character. Then the DTD clause may be written <!ATTLIST lang n (arabic | Russian | greek | Greek | oldhebrew | Old-Church-Slavonic) #REQUIRED > and validation of pwg.xml against pwg.dtd works fine.
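The space-to-hyphen compromise can be expressed as a small helper. This is a sketch only: the function names are hypothetical, and the token check is a simplified ASCII subset of what XML allows in an enumerated attribute value, not the full rule from the XML specification.

```python
import re

# Simplified name-token check: ASCII letters, digits, '.', '-', '_' and ':' only.
# The XML specification allows more characters; this subset is an assumption for illustration.
NMTOKEN_RE = re.compile(r'[A-Za-z0-9._:-]+')

def to_dtd_token(language_name):
    """Replace runs of whitespace with a hyphen so the name fits an enumerated attribute list."""
    return "-".join(language_name.split())

def is_nmtoken(value):
    """True if value consists only of the characters accepted by the simplified check."""
    return NMTOKEN_RE.fullmatch(value) is not None

name = "Old Church Slavonic"
print(is_nmtoken(name), to_dtd_token(name), is_nmtoken(to_dtd_token(name)))
```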

Capitalization

Note that the attribute list is case-sensitive. In the one case (among the 84) where the language was Greek, I spelled the attribute <lang n="Greek">, with a capital G. This is in conformity with the 639 language name standard. But we have written other Greek language tags as <lang n="greek">, with lower-case 'g'. So to make current pwg.xml validate, both spellings are needed in the pwg.dtd file, as shown.

All of this is like arguing about the number of angels that can dance on the head of a pin.

But sometime, when there's nothing better to do, we should standardize the language name spellings throughout PWG (e.g., greek -> Greek), and in the other dictionaries as well.
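A standardization pass of this kind could look roughly like the following Python sketch. The mapping table is hypothetical (only greek -> Greek is mentioned above), and the function rewrites nothing but the value of the n attribute.

```python
import re

# Hypothetical mapping from spellings in use to a preferred spelling.
CANONICAL = {"greek": "Greek", "russian": "Russian", "arabic": "Arabic"}

LANG_ATTR_RE = re.compile(r'(<lang n=")([^"]+)(")')

def standardize_lang_names(xml_text, mapping=CANONICAL):
    """Rewrite the n attribute of <lang> tags using mapping; unknown values are left alone."""
    def fix(match):
        name = match.group(2)
        return match.group(1) + mapping.get(name, name) + match.group(3)
    return LANG_ATTR_RE.sub(fix, xml_text)

before = '<lang n="greek">X</lang> <lang n="Old-Church-Slavonic">Y</lang>'
print(standardize_lang_names(before))
```

After such a pass, the lower-case alternatives could be dropped from the enumerated list in the DTD.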

funderburkjim commented 6 years ago

Cyrillic Extended-B Unicode

I made no change in the display program to add a font representing the characters requiring this portion of the Unicode code points. Nonetheless, when I view the 'kfte' example where such a character occurs, I see no problem. At the moment, I don't know whether this is a fortuitous accident with my browser/OS/font setup.

status of russion01.html

This display is in an in-between state now.

The ITEM LIST still shows the spellings BEFORE the corrections
The PWG display for any item will show the spellings AFTER the corrections.

gasyoun commented 6 years ago

The compromise I made is to replace the space character with the hyphen character.

It works, that's enough. ISO was not aware of DTD.

All of this is like arguing about the number of angels that can dance on the head of a pin. But sometime, when there's nothing better to do, we should standardize the language name spellings throughout PWG (e.g., greek -> Greek), and in the other dictionaries as well.

Dance, dance :dancer:

SergeA commented 6 years ago

38 tala Буслаева, Опытъ истор. гр. русскаго языка Ѳ. Буслаевъ. Опытъ исторической грамматики русскаго языка. Москва, 1858. (F. Buslaev. An experimental historical grammar of Russian language. Moscow, 1858.)

77 varzASAwI Минаевъ, Пратимокша-сутра И. Минаевъ. Пратимокша-сутра, буддійскій служебникъ. Санктпетербургъ, 1869. (I. Minaev (also Minayev, Minayeff). Prātimokṣa-sūtra, a buddhist ritual book. Sankt-Peterburg, 1869.)

SergeA commented 6 years ago

61 barh ФРАГ = Greek!!! ΦΡΑΓ

The Cyrillic letters ФРАГ should be replaced with the Greek letters ΦΡΑΓ.

SergeA commented 6 years ago

Cyrillic Extended-B Unicode

I made no change in the display program to add a font representing the characters requiring this portion of the Unicode code points. Nonetheless, when I view the 'kfte' example where such a character occurs, I see no problem. At the moment, I don't know whether this is a fortuitous accident with my browser/OS/font setup.

I also see it ok in the browser in my comp. But in the mail-program ꙗ and ꙑ are changing to squares. Tested in a Win7 comp without additional fonts - in the browser also squares.

funderburkjim commented 6 years ago

61 barh ФРАГ = Greek!!! ΦΡΑΓ

So seeing is NOT believing in this case! I didn't realize Russian and Greek have 'homoglyphs'. Will make the change to Greek Unicode.
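Since the two spellings look alike on screen, a check has to look at the code points rather than the glyphs. A minimal Python sketch, using the first word of each character's Unicode name as a rough script label (the function name is hypothetical):

```python
import unicodedata

def scripts_used(text):
    """Return the set of script labels (first word of the Unicode name) for the letters in text."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts

cyrillic = "\u0424\u0420\u0410\u0413"  # ФРАГ typed with Cyrillic letters
greek = "\u03A6\u03A1\u0391\u0393"     # ΦΡΑΓ typed with Greek letters
print(scripts_used(cyrillic))  # {'CYRILLIC'}
print(scripts_used(greek))     # {'GREEK'}
```

Running such a check over text tagged as Greek would flag any entry whose letters are actually Cyrillic, and vice versa.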

funderburkjim commented 6 years ago

Have added ls expansions for Buslaev and Minaev.

NOTE: There may be one more ls with Russian that needs expansion, 72 antarveDa Антарабида

This appears to me to be TWO references: (a) Антарабида and (b) WASSILJEW (an author)

funderburkjim commented 6 years ago

Tested in a Win7 comp without additional fonts - in the browser also squares.

Changed display to use Old Standard for Russian and OCS within <lang n="X">Y</lang> tags.

Does this work in that Win7 computer?

SergeA commented 6 years ago

72 antarveDa Антарабида

"Антарабида" is a supposed Sanskrit (or not Sanskrit?) word written in Cyrillic letters (the same way as Sanskrit words written with Latin letters in English text).

The source, Wassiljew, is: В. Васильевъ. Буддизмъ, его догматы, исторія и литература. Часть 1. Общее обозрѣніе. Санктпетербургъ, 1857. (V. Vasilyev. Buddhism, its dogmata, history and literature. Part 1. General overview. Sankt-Peterburg, 1857.)

There was also a German edition: W. Wassiljew. Der Buddhismus, seine Dogmen, Geschichte und Literatur. Theil 1: Allgemeine Uebersicht. St. Petersburg, 1860. But Böhtlingk gives the word in Cyrillic, "Антарабида", p. 55 of the Russian edition, not "Antarabida", p. 60 of the German edition.

SergeA commented 6 years ago

Does this work in that Win7 computer?

I mentioned that comp just to say there can be a font problem. As I said, that comp is without any additional fonts. And without fonts nothing can be done. If I install some fonts, perhaps it will work. But which font must I install? Perhaps the site should have some recommendations about fonts.

funderburkjim commented 6 years ago

which font must I install?

Let's assume you are using one of the Cologne displays, say the basic display for PWG. In this scenario, you should not have to manually download any fonts to the Win7 computer.

When required by the display, the browser takes care of downloading Old Standard Font from the Cologne server. This is a usage of what is called a 'web font'; this article describes the details of web programming.

If you 'inspect' one of the russian or ocs words in chrome browser you can see this:

Note1: I had to clear cached files in browser for this to work; Ctrl-F5 doesn't adequately clear things in russian02.html here since the displays are in Iframes.

Note2: This network (or web) font may not be in a place used by other programs on the computer, such as the mail program. If you need to install the Old Standard Font into the Windows OS Fonts, I can give you a link.

gasyoun commented 6 years ago

used by other programs on the computer, such as the mail program

Actually, that's not our goal - mail programs. So web font works well enough for web.

SergeA commented 6 years ago

Let's assume you are using one of the Cologne displays, say the basic display for PWG.

Ooops! I've tried the link with the Russian list from this thread. Now tested it in Win7 + Chrome63 with the basic PWG interface - yes, it works! )) Thanks for the explanation and for the picture of how to check. Now I tried on my comp with WinXP SP3 + Chrome49 and found it works both in PWG basic and in the Russian list page, though it doesn't show the correct web-font name and gives it as

Rendered Fonts
kpTDuo2x8027lBJI3wn-ew==—12 glyphs

The behavior of fonts and browsers is quite mysterious.

SergeA commented 6 years ago

So web font works well enough for web.

The question is - does it really work well enough, always and everywhere?

gasyoun commented 6 years ago

The only question that remains: why are only etymologies in Old Standard, and not all of the non-originally-Devanagari text? Rendering of fonts in mail exchange software is out of our reach.

funderburkjim commented 6 years ago

does it really work well enough, always and everywhere?

Impossible to tell. Browser technologies change all the time. My informal aim in developing web apps is that they work with modern browsers. I'm often unsure whether a particular new feature (say of Javascript ES6) should be used. For instance, if such a feature works with Chrome but not Firefox, I won't use it. Of course there are so-called Javascript transpilers which convert ES6 to ES5, but I've avoided workflows using such 'build' steps.

So, if it works on a venerable WINXP SP3, that's nice; but if it doesn't, that will have to remain an unfortunate case.

funderburkjim commented 6 years ago

why are only etymologies in Old Standard

This is a good question. In the current displays, there is some looseness in the specification of what CSS styles apply to what parts of the page.

My implicit game plan is:

Finish the meta-line/IAST conversion for all the dictionaries
Rewrite the displays so that all of them derive from a common code base.

It is in this second step that clearing up that CSS looseness can be addressed.

Currently about 50% of the dictionaries have been converted (see the tracker).

It might be that work can begin now on that second step for those converted dictionaries, without waiting on the conversion of all the dictionaries.

gasyoun commented 6 years ago

if it works on a venerable WINXP SP3, that's nice; but if it doesn't, that will have to remain an unfortunate case.

Agree. Keep in mind Serge is on WINXP and I myself in the countryside am on WINXP SP3.

It might be that work can begin now on that second step for those converted dictionaries, without waiting on the conversion of all the dictionaries.

Agree, so it's time for me to get in the game. As you state it's CSS looseness and not something intended to be so, I can play around with it. In February I'll be gone to Poona, so I guess only in March.

drdhaval2785 commented 3 years ago

@gasyoun All Russian and OCS handled here? Time to close?

Andhrabharati commented 2 years ago

My implicit game plan is:

* Finish the meta-line/IAST conversion for all the dictionaries

* Rewrite the displays so that all of them derive from a  common code base.

@funderburkjim

I guess you should convert all the works to Unicode (and IAST) at the earliest and keep them in these GitHub repos, in addition to whatever encoding you would prefer to continue working with.

It makes the collaboration work easier for people like me.

drdhaval2785 commented 2 years ago

These two items have already been done for all dictionaries as far as I remember. If there are any aberrations, they should be treated as bugs and be corrected.

Andhrabharati commented 2 years ago

Where can I access the Unicode (or IAST) files?

I am seeing only slp1 (Jim's version) or HK (Thomas's version ?) files everywhere (rather mostly).

Andhrabharati commented 2 years ago

If I had got the AP90 and PWG Unicode files, @funderburkjim would've happily/easily used all my corrections in them by now.

Andhrabharati commented 2 years ago

Even the MW99 was converted to IAST only after my joining here and asking specifically for it, not before that.

drdhaval2785 commented 2 years ago

SLP1 for Sanskrit and IAST for all other was the rule of thumb when we converted the various encodings. SLP1 has its inherent advantage, as it is a non-lossy transliteration scheme.

drdhaval2785 commented 2 years ago

SLP1 does not even require Unicode code points beyond ASCII. It is accommodated within ASCII itself. So SLP1 is Unicode compliant.

You may be intending to say IAST when you say Unicode.

drdhaval2785 commented 2 years ago

Sorry to be blunt, but the reason why much of your good work is not being used immediately is that it changes the markup or structure irreversibly.

As an observer of processes at Cologne for many years, I find that Jim (and over the years me too) adheres to the "invertibility principle" very religiously. Whenever a drastic change is made (like changing the encoding), Jim writes a converter in both directions. Only if applying the forward conversion and then the reverse conversion yields the original file are the drastic changes made. This ensures that drastic changes do not break anything unintentionally.

I see three ways in which your marvellous and fast work may be incorporated in Cologne.
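The round-trip test behind the invertibility principle can be illustrated in Python. This is a sketch under stated assumptions: the SLP1-to-IAST table below is a partial, illustrative subset, the function names are hypothetical, and none of it is the converter actually used at Cologne.

```python
import unicodedata

# Partial SLP1 -> IAST table (an illustrative subset, not Cologne's converter).
_RAW = {
    "a": "a", "A": "ā", "i": "i", "I": "ī", "u": "u", "U": "ū",
    "f": "ṛ", "e": "e", "E": "ai", "o": "o", "O": "au",
    "M": "ṃ", "H": "ḥ",
    "k": "k", "K": "kh", "g": "g", "G": "gh", "N": "ṅ",
    "c": "c", "C": "ch", "j": "j", "J": "jh", "Y": "ñ",
    "w": "ṭ", "W": "ṭh", "q": "ḍ", "Q": "ḍh", "R": "ṇ",
    "t": "t", "T": "th", "d": "d", "D": "dh", "n": "n",
    "p": "p", "P": "ph", "b": "b", "B": "bh", "m": "m",
    "y": "y", "r": "r", "l": "l", "v": "v",
    "S": "ś", "z": "ṣ", "s": "s", "h": "h",
}
# Normalize to NFC so every accented letter is a single code point.
SLP1_TO_IAST = {k: unicodedata.normalize("NFC", v) for k, v in _RAW.items()}
IAST_TO_SLP1 = {iast: slp1 for slp1, iast in SLP1_TO_IAST.items()}

def slp1_to_iast(text):
    """Forward conversion: one SLP1 character at a time; unknown characters pass through."""
    return "".join(SLP1_TO_IAST.get(ch, ch) for ch in text)

def iast_to_slp1(text):
    """Reverse conversion: try two-character units (kh, ai, ...) before single characters."""
    text = unicodedata.normalize("NFC", text)
    out, i = [], 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in IAST_TO_SLP1:
            out.append(IAST_TO_SLP1[pair])
            i += 2
        else:
            out.append(IAST_TO_SLP1.get(text[i], text[i]))
            i += 1
    return "".join(out)

def round_trips(text):
    """True if converting to IAST and back reproduces the original SLP1 text."""
    return iast_to_slp1(slp1_to_iast(text)) == text

for word in ["rAmAyaRa", "dIrGa", "ai"]:
    print(word, slp1_to_iast(word), round_trips(word))
```

The last sample, the SLP1 sequence "ai", does not survive the round trip in this simplified sketch because the reverse step reads it as the diphthong. Catching cases like that before a change is committed is the point of the check.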

You take the file from csl-orig by forking, make changes, and create a Pull Request.
You make changes directly to csl-orig and push the changes.
You specify a consistent format (let's say Y) in which you intend to use all dictionaries of Cologne (X). Jim or I can write scripts to convert from X to Y (script A) and from Y to X (script B).

I think the third way will suit you more. The only consideration is that you should not alter the markup in such a way that it is not amenable to machine handling.

Andhrabharati commented 2 years ago

SLP1 for Sanskrit and IAST for all other was the rule of thumb when we converted the various encodings.

Did not expect this from you, Dr. Dhaval!! IAST (International Alphabet of Sanskrit Transliteration) is meant only for Sanskrit transliteration in Roman letters, not for any other language.

Let me give an example,

rAmAyaRa : rāmāyaṇa : रामायण (SLP1 : IAST : Unicode)

By Unicode, I mean the native language lettering (if Skt- Devanagari etc.) for all the languages involved.

Andhrabharati commented 2 years ago

Sorry to be blunt, but the reason why much of your good work is not being used immediately is that it changes the markup or structure irreversibly.

The reason for this is very simple- I did all those portions in our format/style, which includes the conversion from one encoding to another; we don't do things in parts/batches, but as a whole set of processes involved altogether at one go.

If I had got the (converted) files from CDSL itself, they would have been straight away useful for further work, as was the case in MW99 work.

drdhaval2785 commented 2 years ago

You need to specify your requirements fully. I will work and provide you file in that format.

Andhrabharati commented 2 years ago

@drdhaval2785

There are not many requirements from my side; a fairly simple/single conversion (for all the works at once, not one-by-one against a specific request) is all that I asked for. https://github.com/sanskrit-lexicon/PWG/issues/39#issuecomment-888021563

@funderburkjim had agreed in principle, and gave a few binding points for my further working; but then nothing happened afterwards. https://github.com/sanskrit-lexicon/PWG/issues/39#issuecomment-888462630

BTW, he was still talking about only one work (PWG), not all the CDSL works!!
