Closed gasyoun closed 3 years ago
The A tag currently works perfectly as an alphabet tag. Every time the A tag is used, it indicates an indisputable use of the Arabic alphabet.
Splitting it into smaller language tags would require an extra level of analysis. The added complication is that words can belong to more than one language (but not more than one alphabet). Some of these words are both Arabic and Persian. Some are Persian and Turkish. Many of the entries are names, which could be considered Arabic in origin, or shared words between all the Arabic-script languages. I can tell you whether or not the words are Arabic, but I can't tell you whether or not they're also Persian/Turkish. The dictionaries usually indicate the language of origin, but they don't always do so. So it would be a safe procedure to simply do the language coding based on what the dictionary says in the entry. In cases where the dictionary gives the word without indicating language, I can give my best guess, but this is not quite as watertight as leaving it as an alphabet tag.
I'm happy to do the work to retag these words. I don't think it'll add much value to the words in Arabic script, especially given that there are so few of them, but if this is part of a greater improvement of the text then I can see reason for it.
Yes, it's part of a global remake. And no - Monier does not state what language is used in these contexts. To know if it's Arabic or Persian/Turkish is still better than just Arabic script. So as I'm splitting the R with Cyrillic script into two subgroups, I would love to see us going deeper in other sections as well. Sure it works well as an alphabet tag, but that's not what ships are made for :neckbeard:
OK, well I'm certainly willing to help split the tags. Let me know if you all have decided you'd like to do this, and then let me know how best to do it. Questions I would have are:
@funderburkjim let's give @jlreeder a chance?
What other tags should I use?
Arabic or Persian/Turkish. No others are used, as far as I'm aware. Maybe Urdu?
What should I do in cases where the word exists in multiple languages?
Take the older language as basis.
What protocol would you want me to use to indicate when I am not certain?
I guess some kind of [?] would do, @funderburkjim ?
I'm glad to take on the challenge. I can certainly indicate which words are Arabic. I can also consult colleagues to get confirmation on determining Turkic or Farsi origin for others.
@jlreeder sounds thrilling. I guess these 3 directions would suffice. I myself can split Russian and Old Slavic - both in Cyrillic.
Sounds good. Just send me details on how I should do the labeling and I'll get started.
— Reply to this email directly or view it on GitHub https://github.com/sanskrit-lexicon/Cologne/issues/68#issuecomment-171868520 .
We will need @funderburkjim 's advice on this. If none, I'll propose.
I'm too involved with PWK-literary source work to think about this now.
I'll take a look when time permits.
Suggest you go ahead without me meanwhile.
@funderburkjim read when free from samsara
<H1><h><hc3>110</hc3><key1>ramala</key1><hc1>1</hc1><key2>ramala</key2></h><body> <lex>m.</lex> <c>or</c> <lex type="hw">n.</lex> <p><cf/>~<c>Arabic رمال </c>~<s>rammAl</s></p> <c>a_mode_of_fortune-telling_by_means_of_dice_<p>a_branch_of_divination_borrowed_from_the_Arabs</p></c> <ls>Cat.</ls> </body><tail><pc>868,2</pc> <L>175217</L></tail></H1>
The رمال has no tags around it. Shouldn't it? Not even A.
Not all instances of Arabic script in the printed MW book have the language mentioned, for example <c>= پادشاه ,_a_king</c>. Others do, so <p>fr._Arabic إِنْتِها </p> uses Arabic script for the Arabic language. But not all language "meta-data" is marked in the same way in the book: there are variants in abbreviations and hundreds of ways to represent it in XML:
<c>fr._the_<ab>Pers.</ab> خربوزه </c>
<c1>=_the_Persian شاه</c1>
<p><ab>Hind.</ab>_ ارهٿ</p>
<c1>in_<as0 type="ns">Hindu1sta1ni1</as0><as1>Hindustani</as1> پتهركي پهول </c1>
Not all instances of Arabic letters are inside a <p>, <c> or <c1> tag. So we can't use them in our additional, extra markup. @jlreeder can you extract with your regex all the instances of Arabic script used in MW, please? I would add around the Arabic word (and only around it, not other tags included) an additional language tag: <lang type="A">شاه</lex>.
A - Arabic
T - Turkish
P - Persian
H - Hindustani, Urdu
What else might be missing?
Sure! Let me take a crack at that and then we can check in.
Would you mind pointing me to the exact location of MW? So far my work with the project has been entirely within "Issues," so I'm not familiar with the file structure.
@jlreeder here: http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/web/webtc/download.html. You can download http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/downloads/monierxml.zip and monier.xml is the file.
@gasyoun Thanks! Now I've got what I need. Expect an update by the end of the week.
@jlreeder it's an honor to have you around! Thanks, eagerly waiting. It will become part of a book by mid-2016, if I manage to have a clean list of other language words as well by that time.
If you mean 'clean enough' by the word 'clean', you have it now. If you mean 'really clean', not in foreseeable future for sure.
@gasyoun I've made some small progress, but not enough to show. Working on cleanly extracting all of the Arabic. I'll send another update once that's complete.
@jlreeder understood, thanks for the mini update.
@gasyoun Apologies for my slow response time here. Now I've got some more bandwidth to tackle this and should be able to finish it up soon. Also, in the meantime I've learned how to write solid Python scripts, so it's much faster work.
Here are three examples of the output I plan to use, with the line number, the original line, and the line with the Arabic words surrounded by the tag you gave. Would you mind confirming that this is the format you're looking for? If it is, then I'll go through and correct all of the tags to the proper language.
Line: 17762
Match(es): ['ارهٿ']
Original Line:
<H3><h><hc3>110</hc3><key1>araGawwa</key1><hc1>3</hc1><key2>ara--Gawwa</key2></h><body> <lex>m.</lex> <c>a_wheel_or_machine_for_raising_water_from_a_well_<p><ab>Hind.</ab>_ ارهٿ</p></c> <ls>Pan5cat.</ls> </body><tail><pc>86,2</pc> <L>15014</L></tail></H3>
Line With Tags:
<H3><h><hc3>110</hc3><key1>araGawwa</key1><hc1>3</hc1><key2>ara--Gawwa</key2></h><body> <lex>m.</lex> <c>a_wheel_or_machine_for_raising_water_from_a_well_<p><ab>Hind.</ab>_ <lang type="A">ارهٿ</lex></p></c> <ls>Pan5cat.</ls> </body><tail><pc>86,2</pc> <L>15014</L></tail></H3>
Line: 20012
Match(es): ['العابدينا']
Original Line:
<H1><h><hc3>000</hc3><key1>allApadIna</key1><hc1>1</hc1><key2>allApadIna</key2></h><body> <lex>m.</lex> = العابدينا , <ab>N.</ab> of a king, <ls>Sa1h.</ls> (<ab>v.l.</ab>).</body><tail><pc>1316,3</pc><L supL="314380">16937.2</L></tail></H1>
Line With Tags:
<H1><h><hc3>000</hc3><key1>allApadIna</key1><hc1>1</hc1><key2>allApadIna</key2></h><body> <lex>m.</lex> = <lang type="A">العابدينا</lex> , <ab>N.</ab> of a king, <ls>Sa1h.</ls> (<ab>v.l.</ab>).</body><tail><pc>1316,3</pc><L supL="314380">16937.2</L></tail></H1>
Line: 21043
Match(es): ['شاه']
Original Line:
<H1><h><hc3>000</hc3><key1>avaraNgasAha</key1><hc1>1</hc1><key2>avaraNga-sAha</key2></h><body> <c>=_Aurungzeb_<p><c1>a_Muhammedan_king_of_the_17th_century</c1>~;~<s>sAha</s>~<c1>=_the_Persian شاه</c1></p>.</c> </body><tail><mul/> <MW>013086</MW> <pc>102,3</pc> <L>17894</L></tail></H1>
Line With Tags:
<H1><h><hc3>000</hc3><key1>avaraNgasAha</key1><hc1>1</hc1><key2>avaraNga-sAha</key2></h><body> <c>=_Aurungzeb_<p><c1>a_Muhammedan_king_of_the_17th_century</c1>~;~<s>sAha</s>~<c1>=_the_Persian <lang type="A">شاه</lex></c1></p>.</c> </body><tail><mul/> <MW>013086</MW> <pc>102,3</pc> <L>17894</L></tail></H1>
The code that produced the output above is here: link
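Since the linked script is not reproduced in this thread, here is a minimal Python sketch of the same idea, not Jason's actual code. The Unicode ranges, the function names `find_arabic` and `tag_arabic`, and the choice to treat space-separated Arabic words as one phrase are all assumptions made for illustration. The closing tag is written as `</lang>`.

```python
import re

# Assumed set of Arabic-script code point ranges: Arabic, Arabic Supplement,
# and the two Arabic Presentation Forms blocks. Adjust if MW needs more.
ARABIC_CHARS = "\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF"

# One or more Arabic-script words; words separated by a single space are
# kept together so that a phrase receives a single tag.
ARABIC_RUN = re.compile(f"[{ARABIC_CHARS}]+(?: [{ARABIC_CHARS}]+)*")


def find_arabic(line):
    """Return every run of Arabic script found in one line of the XML."""
    return ARABIC_RUN.findall(line)


def tag_arabic(line, script="A"):
    """Wrap each Arabic-script run, and only the run, in a lang tag."""
    return ARABIC_RUN.sub(
        lambda m: f'<lang type="{script}">{m.group(0)}</lang>', line
    )


sample = "<c1>=_the_Persian شاه</c1>"
print(find_arabic(sample))
print(tag_arabic(sample))
```

The `type` value is only a placeholder at this stage; the per-word language decision still has to be made by hand afterwards.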
Great to see you back again. I'm fine if @funderburkjim accepts.
I only wonder. In the examples you provided there is Persian, Hind. in the text. If you mark the language of all of them as A = Arabic, does it mean that all the words are non-Persian, non-Hind., but etymologically Arabic?
Ah, yes. To clarify, this is just an example to make sure that I've properly understood where to put the tags. I do not mean to say that all of these should be Arabic as opposed to Persian, etc. If this is good, I'll go back and make sure that the tags are correct for each language.
I have not followed this thread closely, and don't have any solid opinion on it.
My crude understanding is that Jason is adding some markup to the Arabic script of MW records to identify the language. i.e., there are several languages that are written in Arabic script. I guess this is similar to the fact that Latin and English are both normally written in the same script.
Based on this crude understanding, here are some comments.
<A>....</A>. Using the markup form <lang type="A">...</lang> would also be an acceptable way to indicate the same thing, Arabic script. In the araGawwa example I see <lang type="A">ارهٿ</lex>, which is not valid XML, as the closing tag should be 'lang', not 'lex'.
<lang script="A" lang="Turkish">...</lang>
Hope these comments aren't too far off the mark.
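A mismatched closing tag like the one noted above can be caught mechanically before any DTD validation. The following is a small sketch using Python's standard `xml.etree.ElementTree`; the helper name `well_formed` and the `<root>` wrapper element are illustrative assumptions, not part of the Cologne tooling.

```python
import xml.etree.ElementTree as ET


def well_formed(fragment):
    """Return (True, None) if the fragment parses, else (False, message)."""
    try:
        # Wrap in a dummy root so that fragments with several top-level
        # nodes, or leading text, can still be parsed.
        ET.fromstring(f"<root>{fragment}</root>")
        return True, None
    except ET.ParseError as err:
        return False, str(err)


bad = '<p><ab>Hind.</ab>_ <lang type="A">ارهٿ</lex></p>'
good = '<p><ab>Hind.</ab>_ <lang type="A">ارهٿ</lang></p>'

print(well_formed(bad))
print(well_formed(good))
```

This only checks well-formedness; whether `lang` is an allowed element at that position is a separate question for the DTD.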
@jlreeder Regarding the principles you use to decide whether a given instance of Arabic script represents one language or another:
This seems to me to be a subject worthy of a research article. The rough form of the article might provide a description of the reasoning for each of the Arabic words or phrases occurring in MW, taken one by one. No doubt in proceeding instance by instance, some of the different instances would be decided by similar reasoning. After all instances were examined, probably there would be a small number of principles which would explain all the instances.
Once all the classifications and their justifying principles are clear, we can re-examine the issue of adding markup to MW that reflects the distinctions.
Just a thought.
@funderburkjim the intention is 2. Just marking up the script is the boring and easy one. As there are only about 50 cases, I do not see room for separate research - it's research and a final product all in one. I would go for the production version <lang script="A" lang="Turkish">...</lang>, even if it's non-TEI.
scriptStmt (script statement) contains a citation giving details of the script used for a spoken text.
from http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CC.html does not make much sense anyway. Other tags have even less to do with us.
</lang> as suggested.
lang=Turkish and will raise the issue again if I read anything that makes me change my mind.
At that point I'll leave it to you all to decide what you want to incorporate into production.
- I'll make a deal with Jim :+1:
@gasyoun @funderburkjim I've finally found the time I needed to complete this. You'll find my suggestions for tags and reasoning behind them all in the repository add-language-tags. I'm not sure what format would be best for you to have the changes in. You'll see I've just created an xml doc with the original line numbers and the new lines with tags. Happy to tweak this if something else would be more convenient.
@jlreeder looks perfect, thanks! Perfect timing for a supplement for my reverse dictionary. Well structured and clean job. It's up to @funderburkjim to make a decision.
@marcis I'm glad it's still relevant to your work. Thanks for your kind words. I'd be happy to repeat the process for other dictionaries, too, if that would be helpful.
Your add-language-tags repository looks good!
Next step is to enhance the monier-williams xml dtd, mw.dtd.
mw.dtd is part of the xml download for mw;
Here is a direct link to mw.dtd.
To test it, you'll need to
xmllint --noout --valid mw.xml
When there is NO output, that means the new mw.xml is valid xml which conforms to the mw.dtd specification.
xmllint is available on MacOSX and Linux. You can do something similar in Windows, but I suggest against using Windows for this.
NOTE: After writing the above, I noticed that the name of the original MW dictionary in your repository is 'monier.xml', and not mw.xml. Currently, the two versions are only very slightly different, namely in how Sanskrit accents are represented in SLP1.
So, instead of 'mw.dtd', you should use monier.dtd
There is only one trivial, but significant, difference in the two dtds.
monier.dtd
<!ELEMENT monier (H1 | H2 | H3 | H4 | H1A | H2A | H3A | H4A |
mw.dtd
<!ELEMENT mw (H1 | H2 | H3 | H4 | H1A | H2A | H3A | H4A |
NOTE: When you validate with xmllint, there is a line in the xml file that specifies what dtd is used for the validation. Here is the line for monier.xml. You can probably guess what this DOCTYPE line looks like in mw.xml.
<!DOCTYPE monier SYSTEM "monier.dtd">
So, go ahead and do this step, and we'll proceed from there.
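For anyone scripting this step, the sketch below shows one possible way to do it from Python; it is an illustration, not part of the Cologne workflow. It reads the DOCTYPE line to see which DTD the file expects, then calls `xmllint --noout --valid` as described above. The function names `doctype_of` and `validate` are hypothetical, and `validate` assumes that `xmllint` is on the PATH and that the DTD sits next to the XML file.

```python
import re
import subprocess

# Matches a declaration such as: <!DOCTYPE monier SYSTEM "monier.dtd">
DOCTYPE = re.compile(r'<!DOCTYPE\s+(\S+)\s+SYSTEM\s+"([^"]+)"\s*>')


def doctype_of(xml_text):
    """Return (root element name, dtd file name), or None if no DOCTYPE."""
    match = DOCTYPE.search(xml_text)
    return match.groups() if match else None


def validate(path):
    """Run xmllint on path; True means exit code 0 and no messages."""
    result = subprocess.run(
        ["xmllint", "--noout", "--valid", path],
        capture_output=True,
        text=True,
    )
    return result.returncode == 0 and not result.stderr.strip()


header = '<?xml version="1.0" encoding="UTF-8"?>\n<!DOCTYPE monier SYSTEM "monier.dtd">\n'
print(doctype_of(header))
```

Calling `validate("monier.xml")` after editing the DTD would then mirror the manual check: silence means the file conforms.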
two versions are only very slightly different, namely in how Sanskrit accents are represented in SLP1.
Oh - and the only person who knows all the secrets is Jim :sleeping:
@funderburkjim Thanks for those clear instructions. That looks good. I'll send over an update when I've made some progress.
Hi @funderburkjim,
Alright, I've taken a number of steps here and the process is almost complete.
.dtd. I've attempted to do so, but have been unsuccessful. I think that if you update that file to incorporate the new tags, these validation errors should go away. A few characters aren't displaying well on GitHub's rendering, but they look fine in raw (and are not really relevant to the linting problems).
Standing by for next steps.
For the interest of others, Jason has introduced a <lang script="X" lang="Y">...</lang> tag for the Arabic script.
Here the script attribute value X is always A for Arabic.
The lang attribute value Y is one of Arabic | Hindustani | Persian | Turkish.
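A quick way to confirm that the tagged records stay within those four values is to tally the `lang` attribute. This is a rough Python sketch using only the standard library; the names `lang_counts` and `unexpected` are made up for illustration, and the sample record is a shortened, hypothetical fragment rather than a real MW line.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The four values listed above for the lang attribute.
ALLOWED = {"Arabic", "Hindustani", "Persian", "Turkish"}


def lang_counts(record):
    """Count the lang attribute values of all <lang> elements in a record."""
    root = ET.fromstring(record)
    # A missing attribute is counted under the empty string.
    return Counter(el.get("lang", "") for el in root.iter("lang"))


def unexpected(record):
    """Return the lang values in the record that are not in ALLOWED."""
    return sorted(set(lang_counts(record)) - ALLOWED)


record = (
    "<body><c1>=_the_Persian "
    '<lang script="A" lang="Persian">شاه</lang></c1> '
    '<lang script="A" lang="Arabic">رمال</lang></body>'
)
print(lang_counts(record))
print(unexpected(record))
```

Run over every tagged line, the same tally would also show how many words fall under each language.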
1. Would it be better to use Arabic than A for the script attribute value? Reason: then the value is self-explanatory, at least for English-language speakers.
2. Is there some official list of language names, and are the names used for the lang attribute values in this list? For instance, there is an ISO 639-2 standard which has some language names, along with abbreviations for them. Should we use these ISO standard abbreviations where possible? For instance ara urd per tur?
Hindustani does not appear in this ISO list, but Urdu does - that's the reason for that choice.
The Hindustani lang attribute value occurs only in 1 headword (slp1=priyadarSana) and agrees with Monier's comment. From a cursory reading of Hindustani Language, I'm not sure of the relation between Urdu and Hindustani.
Part of the reason for raising Question 2 is that we may choose to extend the <lang lang='Y'> tag to other languages in MW (notably Greek, but several others). It might be useful to think in terms of standardizing the values of 'Y', in anticipation of some future usage by students of various languages.
To the extent possible, our markup should reflect modern names of even ancient languages (I'm thinking of Old Norse, Old Hebrew, many others that appear in MW and probably other dictionaries).
We should also think of using this markup standard for language identification in other dictionaries. This would be one step in the markup standardization notion begun under #87.
@fxru Does your experience with TEI suggest anything with regard to this proposed <lang> element?
Would it be better to use Arabic than A for the script attribute value? Reason: then the value is self explanatory, at least for English-Language speakers.
If A is the only short one, there is no use in having it, because it could raise issues later since it's not obvious.
Should we use these ISO standard abbreviations where possible?
They are not obvious either. If I stumble upon ara or a, both can be read in different ways.
I'm not sure of the relation between Urdu and Hindustani
Same language in different scripts, but different words are used, so it is close, but not identical.
other languages in MW
There are dead languages that ISO does not care about. We can't predict them all and ISO does not have them, so let's ignore it. Let's write the names in full and not make it complicated.
@jlreeder is there any life on the Moon?
Full list of 32 language abbreviations (some languages have 2) from the list of abbreviations:
Aeol. AEolic
Angl.Sax. Anglo-Saxon
Arab. Arabic
Arm. Armorican or the language of Brittany
Armen. Armenian
Armor. Armorican
Boh. Bohemian
Bohem. Bohemian
Br. Breton
Bret. Breton
Eng. English
Gael. Gaelic
Germ. German
Gk. Greek
Goth. Gothic
Hib. Hibernian or Irish
Hind. Hindi
Icel. Icelandic
Ion. Ionic
Lat. Latin
Lett. Lettish
Lith. Lithuanian
Osset. Ossetic
Pers. Persian
Pra1k. Prakrit
Pra1kr. Prakrit
Pruss. Prussian
Russ. Russian
Sax. Saxon
Scot. Scotch or Highland-Scotch
Slav. Slavonic or Slavonian
Zd. Zend
I'm going to implement the lang tag in MW. The only differences from Jason's solution:
<lang script='Arabic' n='Persian'>xxx</lang>
Instead of Jason's:
<lang script='A' lang='Persian'>xxx</lang>
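The conversion between the two forms is mechanical. Purely as an illustration (this is not the script actually used at Cologne), a regex-based rewrite in Python could look like the following; the function name `to_cologne_form` is hypothetical, and the sketch assumes the attributes always appear in the order script, lang, with either quote style.

```python
import re

# Jason's form: <lang script="A" lang="Persian"> with single or double quotes.
JASON_TAG = re.compile(r"""<lang script=(['"])A\1 lang=(['"])(\w+)\2>""")


def to_cologne_form(line):
    """Rewrite script='A' lang='X' opening tags as script='Arabic' n='X'."""
    return JASON_TAG.sub(r"<lang script='Arabic' n='\3'>", line)


before = '<c1>=_the_Persian <lang script="A" lang="Persian">شاه</lang></c1>'
print(to_cologne_form(before))
```

Only the opening tag changes; the closing `</lang>` and the enclosed text are left untouched.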
@jlreeder
These changes to markup are now installed in mw.xml at Cologne.
Thanks for all the help and good ideas!
Thanks, @funderburkjim. Apologies for my radio silence - I've started a coding boot camp here in SF and have been strapped for time. It looks like you've gone ahead and incorporated everything quite well without further input from me. Are there any final threads you still need input on?
@jlreeder I'm thinking about changing the markup of at least Greek in mw, making use of the new lang tag. Currently, the markup for Greek has the form
I'm currently thinking that it would be better to replace this round-about system with <lang n="Greek">[Greek Unicode]</lang>.
What's holding me up is the Perseus angle. Although this seemed cute at the time, I think it is probably not of much interest, and there are also problems with the links to Perseus. So, my inclination is just to drop the Perseus part.
As a linguist, do you have any opinion on the above, especially the dropping of Perseus?
If there are issues with the Perseus integration and you feel that its main advantage is just adding "cuteness", I don't think that it is critical to retain it.
I would prioritize language markup over external references. In other words, I would agree with you to keep lang n="Greek" and drop the instance id. That will keep the markup in the dictionary cleaner.
If a Perseus integration is important, I would lean towards implementing one that allowed you to simply plug in the Greek word as opposed to saving the whole Perseus link, in case their URL structure changes. If that is possible, then there would be no need for an instance id on the Greek word in the dictionary, just the accurate Unicode transcription of the Greek.
Happy to chat about this further if you'd like.
If a Perseus integration is important
@jsonreeder I guess it's more important than just a cute Greek font.
in case their URL structure changes
Has it ever done so in the last decades, do you remember such a case?
Is Perseus integration just 'cute'?
I think that from our point of view, this "integration" probably is on the 'cute' side. It only is present in MW, and I think there are unresolved issues of getting to the 'right' entry in Perseus for a given Greek word. Since none of us is a Greek scholar, it is probably presumptuous of us to attempt a linking of Greek in Sanskrit dictionaries to standard Greek word study tools such as Perseus. Our best contribution is to have the Greek text clearly and consistently marked in the Sanskrit Dictionaries we provide. It would be up to someone with skills in both Sanskrit and Greek to make use of this material.
A similar comment would apply to all the other languages (e.g. Arabic) mentioned by the scholars who compiled these Sanskrit dictionaries and made passing comments on relations to words of other languages.
Over the years, almost all dictionaries have received their Greek words. Whatever is left, can be tackled in separate issues. Closing this.
https://github.com/sanskrit-lexicon/ArabicInSanskrit/issues/6 continued.
R is not an alphabet tag, because two different alphabets (which have many similar elements) are behind it. Old Slavic is not equal to Modern Russian. I believe the tags should be language tags. Even if A until today was an alphabet tag and not a language tag, it was because there were no Arabic linguists around. If I want to know all Arabic words in Monier-Williams, I do not care that the Sindh language in Kashmir might use the script as well. I always want to know about a language and not about a script. If I know the language I can extract the data about the script as well, but not always the other way round. @jlreeder do you agree? Would it be possible / needed to split A into several smaller tags?