Another 3 cases in MW72

funderburkjim commented 8 years ago

@jlreeder Jonathan found another 3 cases where Arabic alphabet text was incorrectly marked as Greek. I've changed the tags and include the cases here for you to complete. The case numbers come from the Greek list; you can ignore them.

page = 0364 case = 330, hw = tambIra, page=0364-c

83138 old <><g></g> the fourteenth Yoga.
83138 new <><Arabic>تمْوير</Arabic> the fourteenth Yoga.

page = 0365 case = 331, hw = taravI, page=0365-c

83329 old <P>.{#taravI#}¦ {%taravi1,%} in astrology = <g></g> quadra-
83329 new <P>.{#taravI#}¦ {%taravi1,%} in astrology = <Arabic>تربيع</Arabic> quadra-

page = 0376 case = 348, hw = tIra, page=0376-b

85741 old <>a sort of arrow [cf. the Pers. <g></g>]; ({%as%}), m. tin
85741 new <>a sort of arrow [cf. the Pers. <Arabic>تير</Arabic>]; ({%as%}), m. tin

jsonreeder commented 8 years ago

@funderburkjim Thanks for bringing this to my attention. I just responded to the thread Greek in MW72, part 2 of 6 #15 and updated the comments there, but I have copied my work to this thread as well.

In case you'd prefer to continue the conversation here, I've included the rest of my comment from the Greek thread:

I'm thinking back to that question of how to detect the instances of Arabic script that have been improperly tagged. Here I noticed some patterns.

Case 348 has the string cf. the Pers. <g>. You could probably do a simple search through the text to find all instances of that string. Every result would be relevant here, since the proper transcription would of course be cf. the Pers. <Arabic>. There are probably a few other strings that commonly occur before or after Arabic script.
Case 331 is an example of a word that comes from Arabic and that I have transcribed elsewhere. Perhaps searching through the text for all other instances of headwords that contain Arabic script would be fruitful.
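
A minimal sketch of the "common strings before Arabic script" idea, assuming the data is read line by line. The sample lines are copied from this thread, and the helper names `contexts_before` and `suspicious_greek` are hypothetical, not part of any existing program:

```python
import re
from collections import Counter

# Sample lines quoted from this thread; a real run would read mw72.xml instead.
lines = [
    "<>a sort of arrow [cf. the Pers. <g></g>]; ({%as%}), m. tin",
    "<>(= Pers. <Arabic>فيروزه</Arabic>).",
    "<>an insect; an elephant (Arabic <Arabic>فيل</Arabic> Persian <Arabic>پيل</Arabic>);",
    "<>ment; water, rain, (Sa1y. {%= alan4ka1ra%} or {%udaka,%}",
]

def contexts_before(tag, lines):
    """Count the word that immediately precedes each occurrence of `tag`."""
    pattern = re.compile(r"(\S+)\s*" + re.escape(tag))
    return Counter(m.group(1) for line in lines for m in pattern.finditer(line))

def suspicious_greek(lines, contexts):
    """Return lines where a <g> tag follows a word that also precedes <Arabic>."""
    pattern = re.compile(r"(\S+)\s*<g>")
    return [
        line for line in lines
        if any(m.group(1) in contexts for m in pattern.finditer(line))
    ]

arabic_contexts = contexts_before("<Arabic>", lines)
print(arabic_contexts.most_common())
for line in suspicious_greek(lines, arabic_contexts):
    print("check:", line)
```

On real data the context list would probably need manual pruning, since a word such as `Pers.` can also precede romanized text rather than Arabic script.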

gasyoun commented 8 years ago

@funderburkjim can we search for any Arabic strings with regex?

jsonreeder commented 8 years ago

If you do decide to write some regexes for Arabic and you run into any encoding issues, let me know. I've now worked with Arabic regexes in Python and found a few workarounds for common problems.

gasyoun commented 8 years ago

@jlreeder any sample solution to find any Arabic text, tagged or untagged?

jsonreeder commented 8 years ago

@gasyoun Well, here are some ideas:

You could do a regex search for something like Pers\.\s<.+> by which I mean "the string Pers. followed by any tag". This would have caught the mislabeled instance from above in this thread.
As an extension to that, you could do a quick search through the transcribed data to find the most common strings that occur before an <Arabic> tag, then search for all instances of those strings in the data to see if any other instances of the same strings are followed by erroneous tags.
To find Arabic letters you can use the Arabic equivalent of the Latin regex [a-zA-Z], which is [ا-ى], but I'm not 100% sure that this gives you ALL characters in the same way it does for Latin.
I've used the Python library unicodedata before. It is useful for automatically detecting metadata for Unicode characters. This is another way to look at Unicode text and determine whether it is Arabic, though not a very elegant one.
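
A small sketch of the last two ideas, using only the standard library. It compares the range [ا-ى] (U+0627 to U+0649) with two alternatives that are assumptions of this sketch rather than anything used in the thread: the whole Unicode "Arabic" block (U+0600 to U+06FF) and a `unicodedata` name check. The helper names are hypothetical:

```python
import re
import unicodedata

NARROW = re.compile("[\u0627-\u0649]")  # the range written as [ا-ى] above
BLOCK = re.compile("[\u0600-\u06FF]")   # assumption: the whole Unicode "Arabic" block

def is_arabic_char(ch):
    """True if the character's Unicode name begins with 'ARABIC'."""
    return unicodedata.name(ch, "").startswith("ARABIC")

def has_arabic(text):
    """True if any character in the text has an 'ARABIC ...' Unicode name."""
    return any(is_arabic_char(ch) for ch in text)

# ALEF, ALEF MAKSURA, YEH, PEH: the last two lie outside the narrow range.
for ch in "\u0627\u0649\u064A\u067E":
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        "narrow" if NARROW.match(ch) else "-",
        "block" if BLOCK.match(ch) else "-",
    )
```

The narrow range still finds a word as long as at least one of its letters falls inside U+0627 to U+0649, which is usually the case; the wider checks matter mainly when the goal is to match every character of a word.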

Let me know if you have specific cases or queries you'd like help with.

gasyoun commented 8 years ago

@jlreeder thanks, that is enough for a Wiki article! @funderburkjim do you think there are untagged cases of Arabic left?

funderburkjim commented 8 years ago

A search of mw72.xml for string 'Pers.' resulted in identifying two cases where text was wrongly marked as Greek.

page = 0597 case = 719, hw = peraja, page=0597-b

       <>days, biestings; fresh butter; nectar, Amr2ita.
       <P>.{#peraja#}¦ {%peraja,%} or {%peroja, am,%} n. a turquoise
136748 old <>(= Pers. <g></g>).
136748 new <>(= Pers. <Arabic>فيروزه</Arabic>).
       <P>.{#perA#}¦ {%pera1,%} f. a kind of musical instrument.
       <P>.{#peru#}¦ 2. {%peru, us, us, u%} (fr. rt. 1. {%pr2i;%} for 1.

page = 1015

case = 1219, hw = SuBa, page=1015-a

       <>ment; water, rain, (Sa1y. {%= alan4ka1ra%} or {%udaka,%}
       <>R2ig-veda VII. 82, 5); a fragrant wood ({%= padma-
233954 old <>ka1sht2ha%}); [cf. Pers. <g></g> {%khu1b.%}] {%--S4ubha-kara,
233954 new <>ka1sht2ha%}); [cf. Pers. <Arabic>خوب</Arabic> {%khu1b.%}] {%--S4ubha-kara,
       <>as, a1%} or {%i1, am,%} causing welfare, producing good,
       <>propitious, &c. {%--S4ubha-karman, a,%} n. a good or

@jlreeder These cases need your input.

The other 25 cases of 'Pers.' do not need your input:

Some already done, and installed
The three cases (tambIra, etc) which you have prepared, but which I haven't yet installed
Several cases where there was text represented in a 'Latin with diacritics' style, similar to the {%khu1b%} shown above under SuBa, but without the accompanying Arabic script.

gasyoun commented 8 years ago

So the magic worked, great.

funderburkjim commented 8 years ago

A similar examination using the search 'Arab' yielded one case from the first batch where an Arabic text was missed.

page = 0578-b

hw = pIlu

132310 old <>an insect; an elephant (Arabic <Arabic></Arabic> Persian <Arabic></Arabic>);
132310 new <>an insect; an elephant (Arabic <Arabic>فيل</Arabic> Persian <Arabic>پيل</Arabic>);

The first word is missing. Also, please double-check the 2nd word. That last character looks odd here, and in Emacs and Notepad++; this may be an artifact of my copy-pasting.

gasyoun commented 8 years ago

Yeah, looks anti-Unicode.

jsonreeder commented 8 years ago

I've updated the comments.

As for the third example (132310), as far as I can tell, the letter in question is correct, even though it is not displaying correctly. I'm talking about the first character, which displays on the right end. It should be پ, which is \u067E in Unicode. This character is commonly used on the web and seems to display fine there.

Perhaps Github's font simply doesn't display it.

Let me know if it also displays erroneously in the dictionary and we can keep discussing this.

gasyoun commented 8 years ago

As it's UTF8 friendly, strange to see it fade, because Github uses default popular web fonts with basic Arabic support included.

jsonreeder commented 8 years ago

You're correct. Looking back at it I realized what was going on: there was an additional offending character. I've removed it and updated the comment containing the text above. That string seems to be all good!
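
One way to spot a stray character like this is to list every code point in the string. The sketch below is only an illustration: the thread does not say which character was the offending one, so the U+200F appended here is a hypothetical stand-in, and `describe` and `unexpected` are made-up helper names:

```python
import unicodedata

def describe(text):
    """Return (code point, general category, Unicode name) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

def unexpected(text):
    """Return the characters whose Unicode name does not begin with 'ARABIC'."""
    return [ch for ch in text if not unicodedata.name(ch, "").startswith("ARABIC")]

clean = "\u067E\u064A\u0644"   # PEH, YEH, LAM: the Persian word under pIlu
dirty = clean + "\u200F"       # hypothetical stray RIGHT-TO-LEFT MARK

for row in describe(dirty):
    print(*row)
print("unexpected:", [f"U+{ord(ch):04X}" for ch in unexpected(dirty)])
```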

gasyoun commented 8 years ago

Hail to Jason!

funderburkjim commented 8 years ago

@jlreeder Hi, Jason -

I've installed the Arabic from cases above, finally!
The 'bad character' problem under pIlu seems solved. Thanks!
However, in that 'pIlu' example, there are TWO words in Arabic script, and the first one still needs to be filled in.

funderburkjim commented 8 years ago

I tried the Arabic regex filter which @jlreeder mentions above, and it works perfectly for mw72. Run on mw72.xml, the regex [ا-ى] selected 45 records.
As a check, I also selected records using the regex <Arabic>; and found the result to be identical.

We can also use the regex [ا-ى] on MW, where the Arabic text was inserted without an accompanying tag. Doing this, there are 87 lines of mw.xml with Arabic text.

This Gist contains the program and two filters mentioned. (I used copy/paste from emacs to put the file contents up).
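
The Gist's program is not reproduced here. As a rough sketch of this kind of check, assuming one record per line (a simplification) and using sample lines quoted in this thread, the two selections can be compared like this; `select` is a hypothetical helper:

```python
import re

ARABIC_CHARS = re.compile("[\u0627-\u0649]")  # the range written as [ا-ى]
ARABIC_TAG = re.compile(r"<Arabic>")

def select(lines, pattern):
    """Return the 1-based numbers of the lines that match the pattern."""
    return [n for n, line in enumerate(lines, start=1) if pattern.search(line)]

# Sample lines from this thread; a real run would iterate over mw72.xml.
sample = [
    "<><Arabic>تمْوير</Arabic> the fourteenth Yoga.",
    "<>(= Pers. <Arabic>فيروزه</Arabic>).",
    "<>an insect; an elephant (Arabic <Arabic></Arabic> Persian <Arabic></Arabic>);",
    "<P>.{#perA#}¦ {%pera1,%} f. a kind of musical instrument.",
]

by_chars = select(sample, ARABIC_CHARS)
by_tag = select(sample, ARABIC_TAG)

# Lines found by only one of the two filters, e.g. an <Arabic> tag left empty.
mismatches = sorted(set(by_chars) ^ set(by_tag))
print("by characters:", by_chars)
print("by tag:", by_tag)
print("mismatches:", mismatches)
```

An empty mismatch list corresponds to the "identical" result reported above; a non-empty one points at tags without text or text without tags.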

@jlreeder Minor question. In your note above that gives the regex, the regex 'wraps around'. Thus when I copied/pasted into this note (and into the program), the regex looks different. I think this is just a weakness in the 'wrapping' of the Github rendering of your note. Do you agree?

gasyoun commented 8 years ago

Can we add the Arabic markup to the possible 87 Arabic words?

jsonreeder commented 8 years ago

@funderburkjim

I've updated the Arabic that I had forgotten to input for the comment above (132310).
On the regex question, I agree that what you are witnessing is just a difference in the display and not an actual change to the string. Here's a quick explanation of what is behind this. Most of the time it's easier to look at a sequence of Arabic characters in order from right to left (because that's how they're read), so most programs display the sequences that way. Pretty much the only exception is with regular expressions, when it is simply easier to read the whole expression left to right. I use Sublime Text when I need that, since it doesn't give Arabic characters any special treatment (it just displays them from left to right). So sometimes when you copy and paste an Arabic string from one program to another the order looks reversed, but the string itself is almost never modified: it's just the display.
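
A quick way to confirm that only the display differs is to print the code points in stored order. This is a sketch with hypothetical helper names; the pattern is written with escapes so that an editor's right-to-left rendering cannot affect how it reads:

```python
import re
import unicodedata

pattern = "[\u0627-\u0649]"  # stored order: '[', ALEF, '-', ALEF MAKSURA, ']'

def code_points(text):
    """Return the code points of the string in stored (logical) order."""
    return [f"U+{ord(ch):04X}" for ch in text]

def compiles(regex):
    """True if Python's re module accepts the pattern."""
    try:
        re.compile(regex)
        return True
    except re.error:
        return False

print(code_points(pattern))
for ch in pattern:
    # Arabic letters have bidirectional class 'AL', which is why viewers lay them out right to left.
    print(unicodedata.bidirectional(ch), unicodedata.name(ch))

# If the two letters had really been swapped, the range would be invalid.
print(compiles(pattern), compiles("[\u0649-\u0627]"))
```

If a pasted regex still compiles and yields the same code point list, the apparent reversal is a rendering effect.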

gasyoun commented 8 years ago

Understood, display only.

funderburkjim commented 8 years ago

pIlu amendment now installed.

funderburkjim commented 8 years ago

@gasyoun I'm still considering your request to add <Arabic> markup to MW.

I think it is a good idea.

The concern holding me back relates to breaking 'downstream' code. I'm not worried about my own downstream code, as I can identify and patch it. But what about other non-Cologne versions (like Huet's)? Should I worry about possibly breaking those? Or just let them worry about that? Not sure.

gasyoun commented 8 years ago

You mean a single new tag can break the castle of Huet? One mathematician can break another one? Never would I believe that. @drdhaval2785 is coding around Huet's XMLs and did not break them. So why do you think you could, even if you would want to? :cactus:

funderburkjim commented 8 years ago

@gasyoun I guess you are basically joking, and are aware of the downstream breaking issues.

Am intrigued by 'Dhaval is coding around Huet's XMLs' -- Have no idea what that means.

gasyoun commented 8 years ago

Let @drdhaval2785 tell you when he gets back. Hundreds of lines of intriguing and practically applicable code around Huet's machine.

drdhaval2785 commented 3 years ago

Arabic language markup added everywhere in MW72. <lang n="arabic">إِدْبار</lang>.
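
How the markup was actually added is not described in this thread. Purely as an illustration, a sketch that wraps untagged runs of Arabic script in this tag might look like the following. The U+0600 to U+06FF range and the function name `tag_arabic` are assumptions, and older `<Arabic>` tags are not handled:

```python
import re

# Segments that already carry the markup are left untouched.
TAGGED = re.compile(r'(<lang n="arabic">.*?</lang>)')
# A run of Arabic-block characters, allowing whitespace between words.
ARABIC_RUN = re.compile(r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*")

def tag_arabic(line):
    """Wrap each untagged run of Arabic script in <lang n="arabic">...</lang>."""
    parts = TAGGED.split(line)
    # Even indices are the text outside existing tags; odd indices are tagged segments.
    for i in range(0, len(parts), 2):
        parts[i] = ARABIC_RUN.sub(
            lambda m: f'<lang n="arabic">{m.group(0)}</lang>', parts[i]
        )
    return "".join(parts)

line = "<>a sort of arrow [cf. the Pers. \u062A\u064A\u0631]; ({%as%}), m. tin"
print(tag_arabic(line))
```

Splitting on already tagged segments keeps the function safe to run twice on the same line.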

The other remarks of gasyoun do not deserve consideration right now. Closing.

sanskrit-lexicon / ArabicInSanskrit

Another 3 cases in MW72 #9