sanskrit-lexicon / PWG

Boehtlingk und Roth Sanskrit Wörterbuch, 7 Bände Petersburg 1855-1875

Adding of Russian Etymologies Started #6

Closed gasyoun closed 9 years ago

gasyoun commented 10 years ago

See https://github.com/sanskrit-lexicon/PWG/blob/master/52-Russian-etym-in-PWG.txt. I hope I can master the rest as well. It would need an Old Church Slavic font to look authentic, but only 52 entries contain Russian words.

gasyoun commented 9 years ago

Finished in https://github.com/sanskrit-lexicon/CORRECTIONS/issues/7#issuecomment-89058657

gasyoun commented 9 years ago

Before:

<H1A><h><hc3>100</hc3><key1>ahi</key1><hc1>1</hc1><key2>a/hi</key2></h><body> <lex type="inh">m.</lex> <c>N._of_a_<as0>R2ishi</as0><as1><s>fzi</s></as1>_</c> <p><c>with_the_patron.</c>~<s>OSanasa</s></p> <c>and_of_another_</c> <p><c>with_the_patron.</c>~<s>pEdva</s></p>. <p><b><c><c1><ab>Zd.</ab></c1>~<etym>az8i</etym>~<c1>;_<ab>Lat.</ab></c1>~<etym>angui-s</etym>~~;~~<c1><ab>Gk.</ab>_<gk>1</gk>_,_<gk>2</gk>_,_<gk>3</gk>_,_and_<gk>4</gk>_;_<ab>Lith.</ab></c1>~<etym>ungury-s</etym>~~;~~<c1><ab>Russ.</ab>_$;</c1>_<ab>Armen.</ab></c>~<etym>o7z</etym>~<c>;_<ab>Germ.</ab></c>~<etym>unc</etym>.</b></p> </body><tail><MW>015869</MW> <pc>125,1</pc> <L>21811</L></tail></H1A>

After:

<H1A><h><hc3>100</hc3><key1>ahi</key1><hc1>1</hc1><key2>a/hi</key2></h><body> <lex type="inh">m.</lex> <c>N._of_a_<as0>R2ishi</as0><as1><s>fzi</s></as1>_</c> <p><c>with_the_patron.</c>~<s>OSanasa</s></p> <c>and_of_another_</c> <p><c>with_the_patron.</c>~<s>pEdva</s></p>. <p><b><c><c1><ab>Zd.</ab></c1>~<etym>az8i</etym>~<c1>;_<ab>Lat.</ab></c1>~<etym>angui-s</etym>~~;~~<c1><ab>Gk.</ab>_<gk>1</gk>_,_<gk>2</gk>_,_<gk>3</gk>_,_and_<gk>4</gk>_;_<ab>Lith.</ab></c1>~<etym>ungury-s</etym>~~;~~<c1><ab>Russ.</ab>ûgorj</c1>_<ab>Armen.</ab></c>~<etym>o7z</etym>~<c>;_<ab>Germ.</ab></c>~<etym>unc</etym>.</b></p> </body><tail><MW>015869</MW> <pc>125,1</pc> <L>21811</L></tail></H1A>

ahi

What should be done to convert the Anglicized Sanskrit az8i to aži? Has the time come for Unicode in the etymology section of MW, @funderburkjim? Continued from https://github.com/sanskrit-lexicon/GreekInSanskrit/issues/18.

gasyoun commented 9 years ago

Before

<H1B><h><hc3>100</hc3><key1>fBu</key1><hc1>1</hc1><key2>fBu/</key2></h><body>  <lex type="inh">m.</lex>  <p><b><c><ab>cf.</ab>~<c1><ab>Gk.</ab>_<gk>1</gk>_;_<ab>Lat.</ab></c1>~<etym>labor</etym>~<c1>;_<ab>Goth.</ab></c1></c>~<etym>arb-aiths</etym>~~;~~<c><ab>Angl.Sax.</ab>_$_;_<ab>Slav.</ab></c>~<etym>rab-u8</etym>.</b></p>  </body><tail><MW>026855</MW> <pc>226,2</pc> <L>38965</L></tail></H1B>

After

<H1B><h><hc3>100</hc3><key1>fBu</key1><hc1>1</hc1><key2>fBu/</key2></h><body>  <lex type="inh">m.</lex>  <p><b><c><ab>cf.</ab>~<c1><ab>Gk.</ab>_<gk>1</gk>_;_<ab>Lat.</ab></c1>~<etym>labor</etym>~<c1>;_<ab>Goth.</ab></c1></c>~<etym>arb-aiths</etym>~~;~~<c><ab>Angl.Sax.</ab>earfoð;_<ab>Slav.</ab></c>~<etym>rab-u8</etym>.</b></p>  </body><tail><MW>026855</MW> <pc>226,2</pc> <L>38965</L></tail></H1B>

fbu

gasyoun commented 9 years ago

Before

<H3><h><hc3>000</hc3><key1>vyodana</key1><hc1>3</hc1><key2>vy--o/dana</key2></h><body> <p><c>$</c>~ <lex type="hwalt">ind.</lex>~ <ab>accord.</ab>~<c>to</c>~<ls>Sa1y.</ls>~<c>=</c>~<s>viviDe 'nne labDe sati</s></p> <ls>RV._viii_,_52_,_9.</ls> </body><tail><MW>131042</MW> <pc>1029,1</pc> <L>208305</L></tail></H3>

After

<H3><h><hc3>000</hc3><key1>vyodana</key1><hc1>3</hc1><key2>vy--o/dana</key2></h><body> <p><c>s.</c>~ <lex type="hwalt">ind.</lex>~ <ab>accord.</ab>~<c>to</c>~<ls>Sa1y.</ls>~<c>=</c>~<s>viviDe 'nne labDe sati</s></p> <ls>RV._viii_,_52_,_9.</ls> </body><tail><MW>131042</MW> <pc>1029,1</pc> <L>208305</L></tail></H3>

vyodana

gasyoun commented 9 years ago

Before

<H2B><h><hc3>110</hc3><key1>Sala</key1><hc1>2</hc1><key2>Sala/</key2></h><body> <lex>m.</lex> <c>a_<ab>partic.</ab>_measure_of_length</c> <p><cf/>~<c>$-</c>~,~<s>paYcaS<sr1/></s>.~<etc/></p> </body><tail><mat/> <pc>1058,3</pc> <L>214077</L></tail></H2B>

After

<H2B><h><hc3>110</hc3><key1>Sala</key1><hc1>2</hc1><key2>Sala/</key2></h><body> <lex>m.</lex> <c>a_<ab>partic.</ab>_measure_of_length</c> <p><cf/>~<c>tri-</c>~,~<s>paYcaS<sr1/></s>.~<etc/></p> </body><tail><mat/> <pc>1058,3</pc> <L>214077</L></tail></H2B>

sala

tri-?

funderburkjim commented 9 years ago

@gasyoun Re az8i: Let me give a little background.

I think Thomas viewed the 'letter+number' system as a way to represent in digitizations any Latinate letter with its diacritics. I think he originally came up with this system in the 1990s as a way to code the Sanskrit words which appear in IAST in the Monier-Williams Dictionary, and he called it Anglicized Sanskrit in this context. Then (and this is speculation on my part) he realized that the same idea could be applied to code the non-Sanskrit, non-English words which appear in the etymology sections of many MW entries.

The advantage of the system is that a computer file with such coding is pure 7-bit ASCII (the lower 128 code points), the most ubiquitous encoding system; such text 'looks the same' when viewed in probably any text viewing system. Another advantage is that you don't actually need to know the language being represented. A third advantage is that the same number (e.g. a '1' for macron above) can be used following any letter to represent a specific diacritic, so the system is parsimonious.

A key disadvantage of this system in the Cologne digitizations is that the translation table was never stated explicitly by Thomas, except for the most frequently occurring IAST cases. Thus, when we see an '8', it is not known exactly which diacritic it is meant to represent.
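To make the 'letter+number' idea concrete, here is a minimal Python sketch of how such codes could be turned into Unicode. The table `AS_DIACRITICS` and the function `as_to_unicode` are hypothetical names, not part of any Cologne tooling. Only '1' = macron is stated above; the entries for '2' and '8' are guesses read off the examples in this thread (R2ishi, az8i to aži) and would need to be checked against the printed dictionary.

```python
import re
import unicodedata

# Hypothetical digit -> combining-diacritic table (an assumption, not Thomas's table).
AS_DIACRITICS = {
    "1": "\u0304",  # combining macron; '1' = macron is the one case stated in the thread
    "2": "\u0323",  # combining dot below; guessed from 'R2ishi'
    "8": "\u030C",  # combining caron; guessed from the proposed az8i -> aži
}

def as_to_unicode(text):
    """Replace each letter+digit pair by the letter with its diacritic.

    Pairs whose digit is not in the table are left untouched, so that
    unknown codes stay visible for later review.
    """
    def repl(match):
        letter, digit = match.group(1), match.group(2)
        mark = AS_DIACRITICS.get(digit)
        if mark is None:
            return match.group(0)
        # NFC composes letter + combining mark into one code point where one exists.
        return unicodedata.normalize("NFC", letter + mark)

    return re.sub(r"([A-Za-z])([0-9])", repl, text)
```

With this table, `as_to_unicode("az8i")` gives `aži`, while `o7z` is returned unchanged because no meaning for '7' is assumed. Whether '8' stands for the same mark in `rab-u8` is exactly the kind of question the missing translation table leaves open.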

Now, 20 years later, the Unicode system has become universally accepted as a way to represent the letters occurring in all modern languages. And there has even been some work to extend the system to non-modern languages, notably with the Vedic extensions, to represent the accents appearing in Vedic texts.

It is also true that not all letter-diacritic combinations appearing in the Sanskrit dictionaries have a corresponding Unicode representation.

That's the end of the background comment.

Regarding the coding in the Etymologies, my first impression is that replacing the 'A-S' coding with Unicode would be an improvement.

Ideally, I would like to see this developed as a separate sub-project before incorporating the result into the live dictionary. In other words, tackle all the etymologies at once (treat all the <etym> tags). Maybe a separate repository, like the GreekInSanskrit. The reason is so that the different sub-issues can be handled with some uniformity.
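As a possible first step for such a sub-project, the sketch below collects every letter+digit code that occurs inside `<etym>` tags and counts it, so the size of the replacement table can be seen before anything is converted. The function name `etym_code_counts` and the two sample strings are illustrative assumptions (the samples are cut down from the records quoted in this thread); real input would be the lines of the dictionary XML file.

```python
import re
from collections import Counter

ETYM_RE = re.compile(r"<etym>(.*?)</etym>")  # contents of each <etym> element
CODE_RE = re.compile(r"[A-Za-z][0-9]")       # one letter followed by one digit

def etym_code_counts(records):
    """Count letter+digit codes found inside <etym> tags only."""
    counts = Counter()
    for record in records:
        for word in ETYM_RE.findall(record):
            counts.update(CODE_RE.findall(word))
    return counts

# Shortened sample records, for illustration only.
sample = [
    "<ab>Zd.</ab>~<etym>az8i</etym>~<gk>1</gk>~<etym>o7z</etym>",
    "<etym>labor</etym>~<ab>Slav.</ab>~<etym>rab-u8</etym>",
]
counts = etym_code_counts(sample)
for code, n in sorted(counts.items()):
    print(code, n)
```

Restricting the search to `<etym>` keeps digits elsewhere in a record (for example in `<gk>` or `<pc>`) out of the inventory.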

funderburkjim commented 9 years ago

I was confused by the fact that the etymologies re 'MW' are being discussed in this 'PWG' issue.

gasyoun commented 9 years ago

The speculation seems true. I'm aware of why ASCII is great, but it's not needed now that we have UTF; it's no longer 1990. To make a list of the etymologies I will have to convert them to Unicode. Even on the website they are shown as junk. I guess the only open question is the full table of replacements? I've made an effort at the full table, so we could start and see what's left over. There are not hundreds of variants, actually. "Not all letter-diacritic combinations appearing in the Sanskrit dictionaries": indeed, that is the biggest issue, and it will never be solved fully. I do not think it's big enough to need a sub-project, but if you say so, so be it. I can check the languages and the words; coding is Jim's side. Let's take a ride?

funderburkjim commented 9 years ago

If you think the MW etymologies are not such a huge problem, it's fine to do it in one or more issues in an existing repository (either the MWS or the CORRECTIONS repository?).

So, your task is to replace the 'AS-coded' words in <etym> tags with Unicode?

Small important detail: Use the xml records from monier.xml (not from mw.xml). The difference is unrelated to the problem, but will make my life easier.

The difference is that monier.xml has Sanskrit accents represented by a symbol (/, \, or ^) coming *before* a vowel, while mw.xml has the symbol *after* a vowel. The current reference for MW is monier.xml. The reason for mw.xml is that it follows the current SLP1 convention for accent placement.