gasyoun closed this issue 9 years ago
Before:
<H1A><h><hc3>100</hc3><key1>ahi</key1><hc1>1</hc1><key2>a/hi</key2></h><body> <lex type="inh">m.</lex> <c>N._of_a_<as0>R2ishi</as0><as1><s>fzi</s></as1>_</c> <p><c>with_the_patron.</c>~<s>OSanasa</s></p> <c>and_of_another_</c> <p><c>with_the_patron.</c>~<s>pEdva</s></p>. <p><b><c><c1><ab>Zd.</ab></c1>~<etym>az8i</etym>~<c1>;_<ab>Lat.</ab></c1>~<etym>angui-s</etym>~~;~~<c1><ab>Gk.</ab>_<gk>1</gk>_,_<gk>2</gk>_,_<gk>3</gk>_,_and_<gk>4</gk>_;_<ab>Lith.</ab></c1>~<etym>ungury-s</etym>~~;~~<c1><ab>Russ.</ab>_$;</c1>_<ab>Armen.</ab></c>~<etym>o7z</etym>~<c>;_<ab>Germ.</ab></c>~<etym>unc</etym>.</b></p> </body><tail><MW>015869</MW> <pc>125,1</pc> <L>21811</L></tail></H1A>
After:
<H1A><h><hc3>100</hc3><key1>ahi</key1><hc1>1</hc1><key2>a/hi</key2></h><body> <lex type="inh">m.</lex> <c>N._of_a_<as0>R2ishi</as0><as1><s>fzi</s></as1>_</c> <p><c>with_the_patron.</c>~<s>OSanasa</s></p> <c>and_of_another_</c> <p><c>with_the_patron.</c>~<s>pEdva</s></p>. <p><b><c><c1><ab>Zd.</ab></c1>~<etym>az8i</etym>~<c1>;_<ab>Lat.</ab></c1>~<etym>angui-s</etym>~~;~~<c1><ab>Gk.</ab>_<gk>1</gk>_,_<gk>2</gk>_,_<gk>3</gk>_,_and_<gk>4</gk>_;_<ab>Lith.</ab></c1>~<etym>ungury-s</etym>~~;~~<c1><ab>Russ.</ab>ûgorj</c1>_<ab>Armen.</ab></c>~<etym>o7z</etym>~<c>;_<ab>Germ.</ab></c>~<etym>unc</etym>.</b></p> </body><tail><MW>015869</MW> <pc>125,1</pc> <L>21811</L></tail></H1A>
What should be done to convert the Anglicized Sanskrit az8i to aži? Has the time for Unicode come to the etymology section of MW, @funderburkjim? Continued from https://github.com/sanskrit-lexicon/GreekInSanskrit/issues/18.
Before
<H1B><h><hc3>100</hc3><key1>fBu</key1><hc1>1</hc1><key2>fBu/</key2></h><body> <lex type="inh">m.</lex> <p><b><c><ab>cf.</ab>~<c1><ab>Gk.</ab>_<gk>1</gk>_;_<ab>Lat.</ab></c1>~<etym>labor</etym>~<c1>;_<ab>Goth.</ab></c1></c>~<etym>arb-aiths</etym>~~;~~<c><ab>Angl.Sax.</ab>_$_;_<ab>Slav.</ab></c>~<etym>rab-u8</etym>.</b></p> </body><tail><MW>026855</MW> <pc>226,2</pc> <L>38965</L></tail></H1B>
After
<H1B><h><hc3>100</hc3><key1>fBu</key1><hc1>1</hc1><key2>fBu/</key2></h><body> <lex type="inh">m.</lex> <p><b><c><ab>cf.</ab>~<c1><ab>Gk.</ab>_<gk>1</gk>_;_<ab>Lat.</ab></c1>~<etym>labor</etym>~<c1>;_<ab>Goth.</ab></c1></c>~<etym>arb-aiths</etym>~~;~~<c><ab>Angl.Sax.</ab>earfoð;_<ab>Slav.</ab></c>~<etym>rab-u8</etym>.</b></p> </body><tail><MW>026855</MW> <pc>226,2</pc> <L>38965</L></tail></H1B>
Before
<H3><h><hc3>000</hc3><key1>vyodana</key1><hc1>3</hc1><key2>vy--o/dana</key2></h><body> <p><c>$</c>~ <lex type="hwalt">ind.</lex>~ <ab>accord.</ab>~<c>to</c>~<ls>Sa1y.</ls>~<c>=</c>~<s>viviDe 'nne labDe sati</s></p> <ls>RV._viii_,_52_,_9.</ls> </body><tail><MW>131042</MW> <pc>1029,1</pc> <L>208305</L></tail></H3>
After
<H3><h><hc3>000</hc3><key1>vyodana</key1><hc1>3</hc1><key2>vy--o/dana</key2></h><body> <p><c>s.</c>~ <lex type="hwalt">ind.</lex>~ <ab>accord.</ab>~<c>to</c>~<ls>Sa1y.</ls>~<c>=</c>~<s>viviDe 'nne labDe sati</s></p> <ls>RV._viii_,_52_,_9.</ls> </body><tail><MW>131042</MW> <pc>1029,1</pc> <L>208305</L></tail></H3>
Before
<H2B><h><hc3>110</hc3><key1>Sala</key1><hc1>2</hc1><key2>Sala/</key2></h><body> <lex>m.</lex> <c>a_<ab>partic.</ab>_measure_of_length</c> <p><cf/>~<c>$-</c>~,~<s>paYcaS<sr1/></s>.~<etc/></p> </body><tail><mat/> <pc>1058,3</pc> <L>214077</L></tail></H2B>
After
<H2B><h><hc3>110</hc3><key1>Sala</key1><hc1>2</hc1><key2>Sala/</key2></h><body> <lex>m.</lex> <c>a_<ab>partic.</ab>_measure_of_length</c> <p><cf/>~<c>tri-</c>~,~<s>paYcaS<sr1/></s>.~<etc/></p> </body><tail><mat/> <pc>1058,3</pc> <L>214077</L></tail></H2B>
tri-?
@gasyoun Re az8i: Let me give a little background.
I think Thomas viewed the 'letter+number' system as a way to represent in digitizations any Latinate letter with its diacritics. I think he originally came up with this system in the 1990s as a way to code the Sanskrit words which appear in IAST in the Monier-Williams Dictionary, and he called it Anglicized Sanskrit in this context. Then (and this is speculation on my part) he realized that the same idea could be applied to code the non-Sanskrit, non-English words which appear in the etymology sections of many MW entries.
The advantage of the system is that a computer file with such coding is pure 7-bit ASCII (the lower 128 code points), the most ubiquitous encoding system; such text 'looks the same' when viewed in probably any text viewing system. Another advantage is that you don't actually need to know the language being represented. A third advantage is that the same number (e.g. a '1' for macron above) can be used following any letter to represent a specific diacritic; thus, the system is parsimonious.
A key disadvantage of this system in the Cologne digitizations is that the translation table was never stated explicitly by Thomas, except for the most frequently occurring IAST cases. Thus, when we see an '8', it is not known exactly which diacritic it is meant to represent.
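A minimal Python sketch of how a 'letter+number' table could drive a conversion. Only '1' = macron is stated above. The `z8` pair is an assumption taken from the az8i → aži example earlier in this thread, and the names `DIGIT_MARKS`, `PAIR_OVERRIDES` and `as_to_unicode` are hypothetical. Codes with no known mapping are left unchanged and reported, so the gaps in the table stay visible.

```python
import re
import unicodedata

# Digit -> combining mark, valid after any letter.
# Only '1' (macron) is stated in this thread.
DIGIT_MARKS = {
    "1": "\u0304",  # combining macron
}

# Letter+digit pairs known only from a single example.
# ASSUMPTION: 'z8' -> 'ž' is inferred from az8i -> aži; it says nothing
# about what '8' means after other letters.
PAIR_OVERRIDES = {
    "z8": "\u017e",  # ž
}

def as_to_unicode(word):
    """Return (converted word, list of letter+digit codes left unconverted)."""
    unknown = []

    def repl(match):
        code = match.group(0)          # e.g. 'a1' or 'z8'
        if code in PAIR_OVERRIDES:
            return PAIR_OVERRIDES[code]
        mark = DIGIT_MARKS.get(code[1])
        if mark is None:
            unknown.append(code)       # keep the code as-is and report it
            return code
        # Compose letter + combining mark into one code point where possible.
        return unicodedata.normalize("NFC", code[0] + mark)

    return re.sub(r"[A-Za-z]\d", repl, word), unknown
```

Keying the uncertain cases by the full letter+digit pair, rather than by digit alone, avoids guessing what '8' means after letters for which no example exists.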
Now, 20 years later, the Unicode system has become universally accepted as a way to represent the letters occurring in all modern languages. And there has even been some work to extend the system to non-modern languages, notably with the Vedic extensions, to represent the accents appearing in Vedic texts.
It is also true that not all letter-diacritic combinations appearing in the Sanskrit dictionaries have a corresponding Unicode representation.
That's the end of the background comment.
Regarding the coding in the Etymologies, my first impression is that replacing the 'A-S' coding with Unicode would be an improvement.
Ideally, I would like to see this developed as a separate sub-project before incorporating the result into the live dictionary. In other words, tackle all the etymologies at once (treat all the <etym> tags). Maybe a separate repository, like the GreekInSanskrit. The reason is so that the different sub-issues can be handled with some uniformity.
I was confused by the fact that the etymologies re 'MW' are being discussed in this 'PWG' issue.
The speculation seems true. I'm aware of why ASCII is great, but it's not needed now that we have UTF; it's not 1990 anymore. To make a list of the etymologies I will have to convert them to Unicode. Even on the website they are shown as junk. I guess the only open question is the full table of replacements? I've made an effort to compile the full table, and we could start with it to see what's left over. There are not hundreds of variants, actually. "Not all letter-diacritic combinations appearing in the Sanskrit dictionaries": indeed, that is the biggest issue, and it will never be fully solved. I do not think it's big enough to need a sub-project, but if you say so, so be it. I can check the languages and the words; coding is Jim's side. Let's take a ride?
If you think the MW etymologies are not such a huge problem, it is fine to do it in one or more issues in an existing repository (either the MWS or the CORRECTIONS repository?).
So, your task is to replace the 'AS-coded' words in <etym> tags with Unicode?
Small important detail: Use the xml records from monier.xml (not from mw.xml). The difference is unrelated to the problem, but will make my life easier.
The difference is that monier.xml has Sanskrit accents represented by a symbol (/, \, or ^) coming *before* a vowel, while mw.xml has the symbol *after* a vowel. The current reference for MW is monier.xml. The reason for mw.xml is that it follows the current SLP1 convention for accent placement.
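To illustrate the difference in accent placement, here is a small Python sketch that moves an accent symbol from before a vowel to after it. It is not the actual conversion used for these files: the function name `accent_after_vowel` and the inputs in the comments are hypothetical, and the vowel set assumes SLP1 vowel letters. It should be applied only to the text of a key, never to a whole XML record, since closing tags such as </etym> also contain a '/' followed by a vowel.

```python
import re

# ASSUMPTION: SLP1 vowel letters; accent symbols are /, \ and ^.
SLP1_VOWELS = "aAiIuUfFxXeEoO"
ACCENT_BEFORE = re.compile(r"([/\\^])([" + SLP1_VOWELS + r"])")

def accent_after_vowel(key):
    """Move each accent symbol that precedes a vowel to follow that vowel."""
    # Swap the two captured groups: symbol+vowel becomes vowel+symbol.
    return ACCENT_BEFORE.sub(r"\2\1", key)

# Hypothetical inputs in 'accent before vowel' form:
#   accent_after_vowel("/ahi")   gives "a/hi"
#   accent_after_vowel("fB/u")   gives "fBu/"
```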
See https://github.com/sanskrit-lexicon/PWG/blob/master/52-Russian-etym-in-PWG.txt; I hope I can master the rest as well. It would need an Old Church Slavic font to look authentic, but there are only 52 entries that contain Russian words.