sanskrit-lexicon / COLOGNE

Development of http://www.sanskrit-lexicon.uni-koeln.de/

Alphabet tags vs Language tags #68

Closed gasyoun closed 3 years ago

gasyoun commented 9 years ago

https://github.com/sanskrit-lexicon/ArabicInSanskrit/issues/6 continued. R is not an alphabet tag, because two different alphabets (which have many similar elements) are behind it. Old Slavic is not equal to Modern Russian. I believe the tags should be language tags. Even if A has until today been an alphabet tag and not a language tag, that was because there were no Arabic linguists around. If I want to know all Arabic words in Monier-Williams, I do not care that the Sindh language in Kashmir might use the script as well. I always want to know about some language and not about a script. If I know the language I can extract the data about the script as well, but not always the other way around. @jlreeder do you agree? Would it be possible / needed to split A into several smaller tags?

jsonreeder commented 9 years ago

The A tag currently works perfectly as an alphabet tag. Every time the A tag is used, it indicates an indisputable use of the Arabic alphabet.

Splitting it into smaller language tags would require an extra level of analysis. The added complication is that words can be more than one language (but not more than one alphabet). Some of these words are both Arabic and Persian. Some are Persian and Turkish. Many of the entries are names, which could be considered Arabic in origin, or shared words between all the Arabic-script languages. I can tell you whether or not the words are Arabic, but I can't tell you whether or not they're also Persian/Turkish. The dictionaries usually indicate the language of origin, but they don't always do so. So it would be a safe procedure to simply do the language coding based on what the dictionary says in the entry. In cases where the dictionary gives the word without indicating language, I can give my best guess, but this is not quite as watertight as leaving it as an alphabet tag.

I'm happy to do the work to retag these words. I don't think it'll add much value to the words in Arabic script, especially given that there are so few of them, but if this is part of a greater improvement of the text then I can see reason for it.

gasyoun commented 9 years ago

Yes, it's part of a global remake. And no - Monier does not state what language is used in these contexts. To know if it's Arabic or Persian/Turkish is still better than just Arabic script. So as I'm splitting the R with Cyrillic script into two subgroups, I would love to see us going deeper in other sections as well. Sure it works well as an alphabet tag, but that's not what ships are made for :neckbeard:

jsonreeder commented 9 years ago

OK, well I'm certainly willing to help split the tags. Let me know if you all have decided you'd like to do this, and then let me know how best to do it. Questions I would have are:

  • What other tags should I use?
  • What should I do in cases where the word exists in multiple languages?
  • What protocol would you want me to use to indicate when I am not certain?

gasyoun commented 8 years ago

@funderburkjim let's give @jlreeder a chance?

What other tags should I use?

Arabic or Persian/Turkish. None other that I'm aware of being used. Maybe Urdu?

What should I do in cases where the word exists in multiple languages?

Take the older language as basis.

What protocol would you want me to use to indicate when I am not certain?

I guess some kind of [?] would do, @funderburkjim ?

jsonreeder commented 8 years ago

I'm glad to take on the challenge. I can certainly indicate which words are Arabic. I can also consult colleagues to get confirmation on determining Turkic or Farsi origin for others.

gasyoun commented 8 years ago

@jlreeder sounds thrilling. I guess these 3 directions would suffice. I myself can split Russian and Old Slavic - both in Cyrillic.

jsonreeder commented 8 years ago

Sounds good. Just send me details on how I should do the labeling and I'll get started.

On Thu, Jan 14, 2016 at 8:22 PM, Marcis Gasuns notifications@github.com wrote:

@jlreeder https://github.com/jlreeder sounds thrilling. I guess these 3 directions would suffice. I myself can split Russian and Old Slavic - both in Cyrillic.


gasyoun commented 8 years ago

We will need @funderburkjim 's advice on this. If none, I'll propose.

funderburkjim commented 8 years ago

I'm too involved with PWK-literary source work to think about this now.

I'll take a look when time permits.

Suggest you go ahead without me meanwhile.

gasyoun commented 8 years ago

@funderburkjim read when free from samsara

<H1><h><hc3>110</hc3><key1>ramala</key1><hc1>1</hc1><key2>ramala</key2></h><body> <lex>m.</lex> <c>or</c> <lex type="hw">n.</lex> <p><cf/>~<c>Arabic رمال </c>~<s>rammAl</s></p> <c>a_mode_of_fortune-telling_by_means_of_dice_<p>a_branch_of_divination_borrowed_from_the_Arabs</p></c> <ls>Cat.</ls> </body><tail><pc>868,2</pc> <L>175217</L></tail></H1>

The رمال has no tags around it. Shouldn't it have some? Not even A.

gasyoun commented 8 years ago

Not all instances of Arabic script in the printed MW book have the language mentioned, for example <c>= پادشاه ,_a_king</c>. Others do, so <p>fr._Arabic إِنْتِها </p> uses Arabic script for the Arabic language. But not all language "meta-data" is marked in the same way in the book: there are variants in abbreviations and hundreds of ways in XML to represent it:

<c>fr._the_<ab>Pers.</ab> خربوزه </c>
<c1>=_the_Persian شاه</c1>
<p><ab>Hind.</ab>_ ارهٿ</p>
<c1>in_<as0 type="ns">Hindu1sta1ni1</as0><as1>Hindustani</as1> پتهركي پهول </c1>

Not all instances of Arabic letters are inside a <p>, <c> or <c1> tag, so we can't rely on those tags for our additional markup. @jlreeder can you extract with your regex all the instances of Arabic script used in MW, please? I would add around the Arabic word (and only around it, not other tags included) an additional language tag.

<lang type="A">شاه</lex>.

  • A - Arabic
  • T - Turkish
  • P - Persian
  • H - Hindustani, Urdu

What else might be missing?
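
A minimal Python sketch of the extraction requested above, not the script that was eventually used. It assumes that "Arabic script" can be approximated by the Unicode blocks Arabic, Arabic Supplement and the two Arabic Presentation Forms blocks; the names `ARABIC_RUN` and `find_arabic_runs` are made up for illustration. Words separated by a single space are kept together as one run, so a two-word phrase yields one match.

```python
import re

# Assumption: these Unicode ranges cover every Arabic-script character in monier.xml.
_AR = "[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]"

# One or more Arabic-script words, joined by single spaces.
ARABIC_RUN = re.compile(f"{_AR}+(?: {_AR}+)*")


def find_arabic_runs(line):
    """Return every run of Arabic-script text found in one line of XML."""
    return ARABIC_RUN.findall(line)
```

Because the pattern ignores the surrounding markup, it finds Arabic script whether or not it sits inside a <p>, <c> or <c1> element.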

jsonreeder commented 8 years ago

Sure! Let me take a crack at that and then we can check in.

Would you mind pointing me to the exact location of MW? So far my work with the project has been entirely within "Issues," so I'm not familiar with the file structure.

gasyoun commented 8 years ago

@jlreeder here http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/web/webtc/download.html you can download http://www.sanskrit-lexicon.uni-koeln.de/scans/MWScan/2014/downloads/monierxml.zip and monier.xml is the file.

jsonreeder commented 8 years ago

@gasyoun Thanks! Now I've got what I need. Expect an update by the end of the week.

gasyoun commented 8 years ago

@jlreeder it's an honor to have you around! Thanks, eagerly waiting. It will become part of a book by mid-2016, if I manage to have a clean list of other language words as well by that time.

drdhaval2785 commented 8 years ago

If you mean 'clean enough' by the word 'clean', you have it now. If you mean 'really clean', not in foreseeable future for sure.

jsonreeder commented 8 years ago

@gasyoun I've made some small progress, but not enough to show. Working on cleanly extracting all of the Arabic. I'll send another update once that's complete.

gasyoun commented 8 years ago

@jlreeder understood, thanks for the mini update.

jsonreeder commented 8 years ago

@gasyoun Apologies for my slow response time here. Now I've got some more bandwidth to tackle this and should be able to finish it up soon. Also, in the meantime I've learned how to write solid Python scripts, so it's much faster work.

Here are three examples of the output I plan to use, with the line number, the original line, and the line with the Arabic words surrounded by the tag you gave. Would you mind confirming that this is the format you're looking for? If it is, then I'll go through and correct all of the tags to the proper language.

Line: 17762
Match(es): ['ارهٿ']
Original Line:
<H3><h><hc3>110</hc3><key1>araGawwa</key1><hc1>3</hc1><key2>ara--Gawwa</key2></h><body> <lex>m.</lex> <c>a_wheel_or_machine_for_raising_water_from_a_well_<p><ab>Hind.</ab>_ ارهٿ</p></c> <ls>Pan5cat.</ls> </body><tail><pc>86,2</pc> <L>15014</L></tail></H3>
Line With Tags:
<H3><h><hc3>110</hc3><key1>araGawwa</key1><hc1>3</hc1><key2>ara--Gawwa</key2></h><body> <lex>m.</lex> <c>a_wheel_or_machine_for_raising_water_from_a_well_<p><ab>Hind.</ab>_ <lang type="A">ارهٿ</lex></p></c> <ls>Pan5cat.</ls> </body><tail><pc>86,2</pc> <L>15014</L></tail></H3>

Line: 20012
Match(es): ['العابدينا']
Original Line:
<H1><h><hc3>000</hc3><key1>allApadIna</key1><hc1>1</hc1><key2>allApadIna</key2></h><body> <lex>m.</lex> = العابدينا , <ab>N.</ab> of a king, <ls>Sa1h.</ls> (<ab>v.l.</ab>).</body><tail><pc>1316,3</pc><L supL="314380">16937.2</L></tail></H1>
Line With Tags:
<H1><h><hc3>000</hc3><key1>allApadIna</key1><hc1>1</hc1><key2>allApadIna</key2></h><body> <lex>m.</lex> = <lang type="A">العابدينا</lex> , <ab>N.</ab> of a king, <ls>Sa1h.</ls> (<ab>v.l.</ab>).</body><tail><pc>1316,3</pc><L supL="314380">16937.2</L></tail></H1>

Line: 21043
Match(es): ['شاه']
Original Line:
<H1><h><hc3>000</hc3><key1>avaraNgasAha</key1><hc1>1</hc1><key2>avaraNga-sAha</key2></h><body> <c>=_Aurungzeb_<p><c1>a_Muhammedan_king_of_the_17th_century</c1>~;~<s>sAha</s>~<c1>=_the_Persian شاه</c1></p>.</c> </body><tail><mul/> <MW>013086</MW> <pc>102,3</pc> <L>17894</L></tail></H1>
Line With Tags:
<H1><h><hc3>000</hc3><key1>avaraNgasAha</key1><hc1>1</hc1><key2>avaraNga-sAha</key2></h><body> <c>=_Aurungzeb_<p><c1>a_Muhammedan_king_of_the_17th_century</c1>~;~<s>sAha</s>~<c1>=_the_Persian <lang type="A">شاه</lex></c1></p>.</c> </body><tail><mul/> <MW>013086</MW> <pc>102,3</pc> <L>17894</L></tail></H1>

The code that produced the output above is here: link

gasyoun commented 8 years ago

Great to see you back again. I'm fine if @funderburkjim accepts. I only wonder: in the examples you provided, the text mentions Persian and Hind. If you mark the language everywhere as A = Arabic, does it mean all the words are non-Persian, non-Hind., but etymologically Arabic?

jsonreeder commented 8 years ago

Ah, yes. To clarify, this is just an example to make sure that I've properly understood where to put the tags. I do not mean to say that all of these should be Arabic as opposed to Persian, etc. If this is good, I'll go back and make sure that the tags are correct for each language.

On Sun, Apr 10, 2016 at 11:02 PM, Marcis Gasuns notifications@github.com wrote:

Great to see you back again. I'm fine if @funderburkjim https://github.com/funderburkjim accepts. I only wonder: in the examples you provided, the text mentions Persian and Hind. If you mark the language everywhere as A = Arabic, does it mean all the words are non-Persian, non-Hind., but etymologically Arabic?


funderburkjim commented 8 years ago

I have not followed this thread closely, and don't have any solid opinion on it.

My crude understanding is that Jason is adding some markup to the Arabic script of MW records to identify the language, since there are several languages that are written in Arabic script. I guess this is similar to the fact that Latin and English are both normally written in the same script.

Based on this crude understanding, here are some comments.

  1. In MW, the Arabic script is not currently marked in any way. I can see adding markup to indicate the presence of Arabic script as useful. For instance, in other dictionaries some ad-hoc markup exists such as <A>....</A>. Using the markup form <lang type="A">...</lang> would also be an acceptable way to indicate the same thing, Arabic script. In the araGawwa example I see <lang type="A">ارهٿ</lex> which is not valid xml, as the closing tag should be 'lang', not 'lex'
  2. If the intention is to indicate not only that some text is written in Arabic script but moreover to indicate the language represented by that Arabic script, then it would be necessary to add a representation of that language as a second piece of information. For instance, one could write something like <lang script="A" lang="Turkish">...</lang>
  3. It might be appropriate to use some generally accepted language designations. One source of this might be mentioned in the TEI system of markup. Jason, you should probably learn what such standard language designations are (are there several systems? is there one system that almost everyone uses?) Peter Scharf probably knows about this, if you don't already know about it.
  4. Regarding the language-script dichotomy in the context of MW markup, my inclination is to add markup to indicate 'this is Arabic script' (as in item 1 above), but to leave the 'language' specification (e.g., Turkish, Persian, etc.) in the research phase for now (i.e., not to add this secondary markup now to the production version of MW.)

Hope these comments aren't too far off the mark.
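
The two markup forms discussed in items 1 and 2 can be sketched in Python as follows. This is an illustration under stated assumptions, not the script used in the project: the helper names `tag_arabic` and `is_well_formed` are hypothetical, the script attribute is written as script="A" in both cases, and the Unicode ranges are a guess at what counts as Arabic script in MW. The well-formedness check uses the standard-library XML parser, which is enough to catch a mismatched closing tag such as </lex>.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: these Unicode ranges cover the Arabic script found in MW.
_AR = "[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]"
ARABIC_RUN = re.compile(f"{_AR}+(?: {_AR}+)*")


def tag_arabic(line, language=None):
    """Wrap each Arabic-script run in a <lang> element.

    With language=None only the script is recorded (item 1);
    otherwise a lang attribute is added as well (item 2).
    """
    if language is None:
        open_tag = '<lang script="A">'
    else:
        open_tag = f'<lang script="A" lang="{language}">'
    return ARABIC_RUN.sub(lambda m: f"{open_tag}{m.group(0)}</lang>", line)


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False
```

Running `is_well_formed` over every rewritten record before committing would flag the </lex> slip mentioned in item 1, since the parser rejects a closing tag that does not match its opening tag.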

funderburkjim commented 8 years ago

@jlreeder Regarding the principles you use to decide whether a given instance of Arabic script is representing one language or another.

This seems to me to be a subject worthy of a research article. The rough form of the article might provide a description of the reasoning for each of the Arabic words or phrases occurring in MW, taken one by one. No doubt in proceeding instance by instance, some of the different instances would be decided by similar reasoning. After all instances were examined, probably there would be a small number of principles which would explain all the instances.

Once all the classifications and their justifying principles are clear, we can re-examine the issue of adding markup to MW that reflects the distinctions.

Just a thought.

gasyoun commented 8 years ago

@funderburkjim the intention is 2. Marking up just the script is the boring and easy part. As there are only about 50 cases, I do not see room for separate research - it's research and a final product all in one. I would go for the production version <lang script="A" lang="Turkish">...</lang>, even if it's non-TEI.

scriptStmt (script statement) contains a citation giving details of the script used for a spoken text. from http://www.tei-c.org/release/doc/tei-p5-doc/en/html/CC.html does not make much sense anyway. Other tags have even less to do with us.

jsonreeder commented 8 years ago

gasyoun commented 8 years ago

At that point I'll leave it to you all to decide what you want to incorporate into production.

I'll make a deal with Jim :+1:

jsonreeder commented 8 years ago

@gasyoun @funderburkjim I've finally found the time I needed to complete this. You'll find my suggestions for tags and reasoning behind them all in the repository add-language-tags. I'm not sure what format would be best for you to have the changes in. You'll see I've just created an xml doc with the original line numbers and the new lines with tags. Happy to tweak this if something else would be more convenient.

gasyoun commented 8 years ago

@jlreeder looks perfect, thanks! Perfect timing for a supplement for my reverse dictionary. Well structured and clean job. It's up to @funderburkjim to make a decision.

jsonreeder commented 8 years ago

@marcis I'm glad it's still relevant to your work. Thanks for your kind words. I'd be happy to repeat the process for other dictionaries, too, if that would be helpful.

On Mon, Jul 25, 2016 at 1:11 AM, Marcis Gasuns notifications@github.com wrote:

@jlreeder https://github.com/jlreeder looks perfect, thanks! Perfect timing for a supplement for my reverse dictionary. Well structured and clean job. It's up to @funderburkjim https://github.com/funderburkjim to make a decision.


funderburkjim commented 8 years ago

Your add-language-tags repository looks good!

Next step is to enhance the monier-williams xml dtd, mw.dtd.

mw.dtd is part of the xml download for mw;
Here is a direct link to mw.dtd.

To test it, you'll need to

NOTE: After writing the above, I noticed that the name of the original MW dictionary in your repository is 'monier.xml', and not mw.xml. Currently, the two versions are only very slightly different, namely in how Sanskrit accents are represented in SLP1.

So, instead of 'mw.dtd', you should use monier.dtd

There is only one trivial, but significant, difference in the two dtds.

monier.dtd
<!ELEMENT  monier (H1 | H2 | H3 | H4 | H1A | H2A | H3A | H4A |

mw.dtd
<!ELEMENT  mw (H1 | H2 | H3 | H4 | H1A | H2A | H3A | H4A |

NOTE: When you validate with xmllint, there is a line in the xml file that specifies what dtd is used for the validation: Here is the line for monier.xml. You can probably guess what this DOCTYPE line looks like in mw.xml.

<!DOCTYPE monier SYSTEM "monier.dtd">

So, go ahead and do this step, and we'll proceed from there.
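
As a rough idea of what the DTD enhancement might involve, the Python sketch below generates declarations for a lang element. Everything in it is an assumption: the attribute names follow the <lang script="A" lang="Turkish"> form proposed earlier in the thread, the function name `lang_dtd_declarations` is hypothetical, and the real monier.dtd would also need lang added to the content model of every element that can contain it, which cannot be shown here because those declarations are not quoted in the thread.

```python
def lang_dtd_declarations(languages, scripts=("A",)):
    """Build DTD text declaring a <lang> element with script and lang attributes.

    Both attributes are declared as enumerations so that a validator
    rejects values outside the agreed lists.
    """
    script_enum = " | ".join(scripts)
    lang_enum = " | ".join(languages)
    return (
        "<!ELEMENT lang (#PCDATA)>\n"
        "<!ATTLIST lang\n"
        f"  script ({script_enum}) #REQUIRED\n"
        f"  lang ({lang_enum}) #IMPLIED>\n"
    )


if __name__ == "__main__":
    print(lang_dtd_declarations(["Arabic", "Hindustani", "Persian", "Turkish"]))
```

After pasting such declarations into the DTD, a check along the lines of `xmllint --noout --valid monier.xml` would report any record that violates them.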

gasyoun commented 8 years ago

two versions are only very slightly different, namely in how Sanskrit accents are represented in SLP1.

Oh - and the only person who knows all the secrets is Jim :sleeping:

jsonreeder commented 8 years ago

@funderburkjim Thanks for those clear instructions. That looks good. I'll send over an update when I've made some progress.

jsonreeder commented 7 years ago

Hi @funderburkjim,

Alright, I've taken a number of steps here and the process is almost complete.

Standing by for next steps.

funderburkjim commented 7 years ago

For the interest of others, Jason has introduced a <lang script="X" lang="Y">...</lang> tag for the Arabic script. Here the script attribute value X is always A for Arabic.

The lang attribute value Y is one of Arabic | Hindustani | Persian | Turkish.

Question 1:

Would it be better to use Arabic than A for the script attribute value? Reason: then the value is self explanatory, at least for English-Language speakers.

Question 2:

Is there some official list of language names, and are the names used for the lang attribute values in this list? For instance, there is an ISO 639-2 standard which has some language names, along with abbreviations for them. Should we use these ISO standard abbreviations where possible? For instance ara urd per tur ?

Hindustani does not appear in this ISO list, but Urdu does - that's the reason for that choice. The Hindustani lang attribute value occurs only in 1 headword (slp1=priyadarSana) and agrees with Monier's comment. From a cursory reading of Hindustani Language, I'm not sure of the relation between Urdu and Hindustani.
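
If the ISO abbreviations were adopted, the conversion from full names could look like the Python sketch below. The table holds only the four codes named in Question 2; the names `ISO_639_2` and `to_iso` are illustrative, and returning None for Hindustani is one possible way to handle a language without a code, not a decision made in this thread.

```python
# ISO 639-2 codes for the languages named in Question 2.
# Persian has two codes: "per" (bibliographic) and "fas" (terminologic).
# Hindustani has no entry, so it is deliberately left out.
ISO_639_2 = {
    "Arabic": "ara",
    "Persian": "per",
    "Turkish": "tur",
    "Urdu": "urd",
}


def to_iso(language_name):
    """Return the ISO 639-2 code for a language name, or None if there is none."""
    return ISO_639_2.get(language_name)
```

Keeping the full name in the XML and deriving the code on demand would let both conventions coexist.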

Part of the reason for raising Question 2 is that we may choose to extend the <lang lang='Y'> tag to other languages in MW (notably Greek, but several others). It might be useful to think in terms of standardizing the values of 'Y', in anticipation of some future usage by students of various languages. To the extent possible, our markup should reflect modern names of even ancient languages (I'm thinking of Old Norse, Old Hebrew, many others that appear in MW and probably other dictionaries).

We should also think of using this markup standard for language identification in other dictionaries. This would be one step in the markup standardization notion begun under #87.

@fxru Does your experience with TEI suggest anything with regard to this proposed <lang> element?

gasyoun commented 7 years ago

Would it be better to use Arabic than A for the script attribute value? Reason: then the value is self explanatory, at least for English-Language speakers.

If A is the only short one, there is no use in having it: it could cause issues later because it's not obvious.

Should we use these ISO standard abbreviations where possible?

They are not obvious either. If I stumble upon ara or a, both can be read in different ways.

I'm not sure of the relation between Urdu and Hindustani

Same language in different scripts, but different words are used, so it is close, but not identical.

other languages in MW

There are dead languages that ISO does not care about. We can't predict them all and ISO does not have them, so let's ignore it. Let's write the names in full and not make it complicated.

gasyoun commented 7 years ago

@jlreeder is there any life on the Moon?

Full list of the 32 language abbreviations (some languages have 2) from the list of abbreviations:

Aeol.       AEolic
Angl.Sax.   Anglo-Saxon
Arab.       Arabic
Arm.        Armorican or the language of Brittany
Armen.      Armenian
Armor.      Armorican
Boh.        Bohemian
Bohem.      Bohemian
Br.         Breton
Bret.       Breton
Eng.        English
Gael.       Gaelic
Germ.       German
Gk.         Greek
Goth.       Gothic
Hib.        Hibernian or Irish
Hind.       Hindi
Icel.       Icelandic
Ion.        Ionic
Lat.        Latin
Lett.       Lettish
Lith.       Lithuanian
Osset.      Ossetic
Pers.       Persian
Pra1k.      Prakrit
Pra1kr.     Prakrit
Pruss.      Prussian
Russ.       Russian
Sax.        Saxon
Scot.       Scotch or Highland-Scotch
Slav.       Slavonic or Slavonian
Zd.         Zend
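
Since some languages appear under two abbreviations, a lookup that groups abbreviations by language may help when turning <ab> markers into lang values. The Python sketch below uses a handful of rows from the list; the names `ROWS` and `group_by_language` are made up for illustration.

```python
from collections import defaultdict

# A few rows from the list above, as (abbreviation, language) pairs.
ROWS = [
    ("Boh.", "Bohemian"),
    ("Bohem.", "Bohemian"),
    ("Br.", "Breton"),
    ("Bret.", "Breton"),
    ("Gk.", "Greek"),
    ("Pra1k.", "Prakrit"),
    ("Pra1kr.", "Prakrit"),
]


def group_by_language(rows):
    """Map each language name to the list of abbreviations that stand for it."""
    by_language = defaultdict(list)
    for abbreviation, language in rows:
        by_language[language].append(abbreviation)
    return dict(by_language)
```
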

funderburkjim commented 7 years ago

I'm going to implement the lang tag in MW. The only differences from Jason's solution:

funderburkjim commented 7 years ago

@jlreeder

These markup changes are now installed in mw.xml at Cologne.

Thanks for all the help and good ideas!

jsonreeder commented 7 years ago

Thanks, @funderburkjim. Apologies for my radio silence - I've started a coding boot camp here in SF and have been strapped for time. It looks like you've gone ahead and incorporated everything quite well without further input from me. Are there any final threads you still need input on?

funderburkjim commented 7 years ago

@jlreeder I'm thinking about changing the markup of at least Greek in mw, making use of the new lang tag. Currently, the markup for Greek has the form n, where 'n' is an instance number, which, along with the 'L' code generates a key into an external database. This external database has not only the Greek text (in Beta format, as I recall) but also a link to Perseus Greek dictionary.

I'm currently thinking that it would be better to replace this round-about system with <lang n="Greek">[Greek Unicode]</lang>.

What's holding me up is the Perseus angle. Although this seemed cute at the time, I think it is probably not of much interest, and there are also problems with the links to Perseus. So, my inclination is just to drop the Perseus part.

As a linguist, do you have any opinion on the above, especially the dropping of Perseus?

jsonreeder commented 7 years ago

If there are issues with the Perseus integration and you feel that its main advantage is just adding "cuteness", I don't think that it is critical to retain it.

I would prioritize language markup over external references. In other words, I would agree with you to keep lang n="Greek" and drop the instance id. That will keep the markup in the dictionary cleaner.

If a Perseus integration is important, I would lean towards implementing one that allowed you to simply plug in the Greek word as opposed to saving the whole Perseus link, in case their URL structure changes. If that is possible, then there would be no need for an instance id on the Greek word in the dictionary, just the accurate Unicode transcription of the Greek.
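
The "plug in the Greek word" idea can be sketched in Python as follows. The URL template is a placeholder on example.org, not the real Perseus address (which is not given in this thread), and `LOOKUP_TEMPLATE` and `lookup_url` are hypothetical names. The point of the design is that only the Unicode word is stored in the dictionary, while the link is built at display time from a template kept in one place.

```python
from urllib.parse import quote

# Hypothetical URL pattern; replace with the real lookup address if one is chosen.
LOOKUP_TEMPLATE = "https://example.org/greek-lookup?word={word}"


def lookup_url(greek_word, template=LOOKUP_TEMPLATE):
    """Build a lookup link from the Greek word itself, percent-encoding it as UTF-8."""
    return template.format(word=quote(greek_word))
```

If the external site changes its URL structure, only the template needs updating and the dictionary markup stays untouched.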

Happy to chat about this further if you'd like.

gasyoun commented 7 years ago

If a Perseus integration is important

@jsonreeder I guess it's more important than just a cute Greek font.

in case their URL structure changes

Has it ever done so in the last decades, do you remember such a case?

funderburkjim commented 7 years ago

Is Perseus integration just 'cute'?

I think that from our point of view, this "integration" probably is on the 'cute' side. It only is present in MW, and I think there are unresolved issues of getting to the 'right' entry in Perseus for a given Greek word. Since none of us is a Greek scholar, it is probably presumptuous of us to attempt a linking of Greek in Sanskrit dictionaries to standard Greek word study tools such as Perseus. Our best contribution is to have the Greek text clearly and consistently marked in the Sanskrit Dictionaries we provide. It would be up to someone with skills in both Sanskrit and Greek to make use of this material.

A similar comment would apply to all the other languages (e.g. Arabic) mentioned by the scholars who compiled these Sanskrit dictionaries and made passing comments on relations to words of other languages.

drdhaval2785 commented 3 years ago

Over the years, almost all dictionaries have received their Greek words. Whatever is left, can be tackled in separate issues. Closing this.