Closed funderburkjim closed 3 years ago
@funderburkjim Thanks for bringing this to my attention. I just responded to the thread Greek in MW72, part 2 of 6 #15
and updated the comments there, but I have copied my work to this thread as well.
In case you'd prefer to continue the conversation here, I've included the rest of my comment from the Greek thread:
I'm thinking back to that question of how to detect the instances of Arabic script that have been improperly tagged. Here I noticed some patterns, for example the string cf. the Pers. <g>. You could probably do a simple search through the text to find all instances of that string. Every result would be relevant here, since the proper transcription would of course be cf. the Pers. <Arabic>. There are probably a few other strings that commonly occur before or after Arabic script.
@funderburkjim can we search for any Arabic strings with regex?
If you do decide to write some regexes for Arabic and you run into any encoding issues, let me know. I've now worked with Arabic regexes in Python and found a few workarounds for common problems.
@jlreeder any sample solution to find any Arabic text, tagged or untagged?
@gasyoun Well, here are some ideas:
- Search on context with a regex such as Pers\.\s<.+>, by which I mean "the string Pers. followed by any tag". This would have caught the mislabeled instance from above in this thread.
- Use the Arabic counterpart of the Latin range [a-zA-Z], which is [ا-ى], but I'm not 100% sure that this gives you ALL characters in the same way it does for Latin.
- I have used the Python module unicodedata before (link). It is useful to automatically detect metadata for unicode characters. This is another way to look at unicode text and determine whether it is Arabic, though not a very elegant one.
Let me know if you have specific cases or queries you'd like help with.
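The three ideas above can be sketched in Python roughly as follows. This is only an illustration: the names PERS_TAG, ARABIC_BASIC, ARABIC_BLOCK and has_arabic_by_name are made up for this sketch and are not from any program in the thread. Note that the range [ا-ى] runs from U+0627 to U+0649, so it does not contain Persian letters such as پ (U+067E); the wider range U+0600 to U+06FF (the Arabic Unicode block) is shown as an assumed alternative.

```python
import re
import unicodedata

# Context pattern quoted in the comment: the string "Pers." followed by any tag.
PERS_TAG = re.compile(r"Pers\.\s<.+>")

# The literal range from the comment, written with escapes:
# U+0627 (ARABIC LETTER ALEF) through U+0649 (ARABIC LETTER ALEF MAKSURA).
ARABIC_BASIC = re.compile("[\u0627-\u0649]")

# Assumption for this sketch: the whole Arabic Unicode block, which also
# covers letters outside the basic range, such as U+067E (PEH).
ARABIC_BLOCK = re.compile("[\u0600-\u06FF]")


def has_arabic_by_name(text):
    """Return True if any character's Unicode name starts with 'ARABIC'."""
    return any(unicodedata.name(ch, "").startswith("ARABIC") for ch in text)


sample = "<>(= Pers. <g></g>)."
print(bool(PERS_TAG.search(sample)))         # context match on the mislabeled line
print(bool(ARABIC_BASIC.search("\u067e")))   # PEH is outside the basic range
print(bool(ARABIC_BLOCK.search("\u067e")))   # PEH is inside the Arabic block
print(has_arabic_by_name("\u062e\u0648\u0628"))
```

For line-level detection the narrower range is usually still enough, because most Arabic-script words contain at least one letter inside it; the wider block matters when every character has to be matched.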
@jlreeder thanks, that is enough for a Wiki article! @funderburkjim do you think there are untagged cases of Arabic left?
A search of mw72.xml for string 'Pers.' resulted in identifying two cases where text was wrongly marked as Greek.
page = 0597 case = 719, hw = peraja, page=0597-b
<>days, biestings; fresh butter; nectar, Amr2ita.
<P>.{#peraja#}¦ {%peraja,%} or {%peroja, am,%} n. a turquoise
136748 old <>(= Pers. <g></g>).
136748 new <>(= Pers. <Arabic>فيروزه</Arabic>).
<P>.{#perA#}¦ {%pera1,%} f. a kind of musical instrument.
<P>.{#peru#}¦ 2. {%peru, us, us, u%} (fr. rt. 1. {%pr2i;%} for 1.
page = 1015
case = 1219, hw = SuBa, page=1015-a
<>ment; water, rain, (Sa1y. {%= alan4ka1ra%} or {%udaka,%}
<>R2ig-veda VII. 82, 5); a fragrant wood ({%= padma-
233954 old <>ka1sht2ha%}); [cf. Pers. <g></g> {%khu1b.%}] {%--S4ubha-kara,
233954 new <>ka1sht2ha%}); [cf. Pers. <Arabic>خوب</Arabic> {%khu1b.%}] {%--S4ubha-kara,
<>as, a1%} or {%i1, am,%} causing welfare, producing good,
<>propitious, &c. {%--S4ubha-karman, a,%} n. a good or
@jlreeder These cases need your input.
The other 25 cases of 'Pers.' do not need your input:
So the magic worked, great.
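A search like the one for 'Pers.' could be narrowed to only the cases that need attention, that is, 'Pers.' directly followed by a Greek tag pair. The sketch below is a guess at how that might look and is not the search actually used; PERS_GREEK and flag_pers_greek are hypothetical names, and the sample lines are copied from the cases quoted above.

```python
import re

# Hypothetical pattern: "Pers." followed by a <g>...</g> pair, empty or not.
PERS_GREEK = re.compile(r"Pers\.\s*<g>.*?</g>")


def flag_pers_greek(lines):
    """Return (line_number, line) pairs where 'Pers.' is followed by <g>...</g>."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if PERS_GREEK.search(line)
    ]


sample = [
    "<>(= Pers. <g></g>).",
    "<P>.{#perA#}\u00a6 {%pera1,%} f. a kind of musical instrument.",
    "<>ka1sht2ha%}); [cf. Pers. <g></g> {%khu1b.%}] {%--S4ubha-kara,",
]

hits = flag_pers_greek(sample)
for number, line in hits:
    print(number, line)
```

Lines where 'Pers.' is followed by ordinary text or by an already correct tag are skipped, so only candidates for retagging are listed.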
A similar examination using the search 'Arab' yielded one case from the first batch where an Arabic text was missed.
page = 0578-b
hw = pIlu
132310 old <>an insect; an elephant (Arabic <Arabic></Arabic> Persian <Arabic></Arabic>);
132310 new <>an insect; an elephant (Arabic <Arabic>فيل</Arabic> Persian <Arabic>پيل</Arabic>);
The first word is missing. Also - Please double check the 2nd word. That last character looks odd here, and in Emacs and Notepad++ ; this may be an artifact of my copy-pasting.
Yeah, looks anti-Unicode.
I've updated the comments.
As for the third example (132310), as far as I can tell, the letter in question is correct, even though it is not displaying correctly. I'm talking about the first character, which displays on the right end. It should be پ, which is \u067E (unicode info). This is commonly used on the web, and seems to display fine (examples).
Perhaps Github's font simply doesn't display it.
Let me know if it also displays erroneously in the dictionary and we can keep discussing this.
As it's UTF8 friendly, strange to see it fade, because Github uses default popular web fonts with basic Arabic support included.
You're correct. Looking back at it I realized what was going on: there was an additional offending character. I've removed it and updated the comment containing the text above. That string seems to be all good!
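One way to spot a stray character of this kind is to list the code point, Unicode name and category of every character in the string. The sketch below assumes Python's standard unicodedata module; describe and suspicious are hypothetical helper names, and the zero-width space used in the example is only a stand-in, since the thread does not say which character was the offending one.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, general category) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in text
    ]


def suspicious(text):
    """Return the entries whose category is neither a letter (L*) nor a mark (M*)."""
    return [entry for entry in describe(text) if entry[2][0] not in "LM"]


clean = "\u067e\u064a\u0644"        # PEH, YEH, LAM
dirty = "\u067e\u200b\u064a\u0644"  # same word with a zero-width space (assumed example)

for entry in describe(dirty):
    print(entry)
print(suspicious(clean))
print(suspicious(dirty))
```

Invisible format characters fall in category Cf, so they stand out in the listing even when the rendered string looks normal.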
Hail to Jason!
@jlreeder Hi, Jason -
I tried the Arabic regex filter which @jlreeder mentions above, and it works perfectly for mw72.
The filter was run on mw72.xml, and 45 records were selected with the regex [ا-ى].
As a check, I also selected records using the regex <Arabic>, and found the result to be identical.
We can also use the regex [ا-ى] on MW, where the Arabic text was inserted without an accompanying tag. Doing this, there are 87 lines of mw.xml with Arabic text.
This Gist contains the program and two filters mentioned. (I used copy/paste from emacs to put the file contents up).
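For readers without access to the Gist, the cross-check described above (select by Arabic characters, select by tag, compare) might look something like this. It is not the program from the Gist, only a minimal sketch; select and compare_filters are hypothetical names and the sample lines are illustrative.

```python
import re

ARABIC_CHAR = re.compile("[\u0627-\u0649]")  # the range [ا-ى], written with escapes
ARABIC_TAG = re.compile(r"<Arabic>")


def select(lines, pattern):
    """Return the indexes of the lines that match the pattern."""
    return [index for index, line in enumerate(lines) if pattern.search(line)]


def compare_filters(lines):
    """Return (indexes matched only by characters, indexes matched only by tag)."""
    by_char = set(select(lines, ARABIC_CHAR))
    by_tag = set(select(lines, ARABIC_TAG))
    return by_char - by_tag, by_tag - by_char


sample = [
    "<>(= Pers. <Arabic>\u0641\u064a\u0631\u0648\u0632\u0647</Arabic>).",  # tagged, with text
    "an elephant (Arabic <Arabic></Arabic>);",                            # tagged, but empty
    "[cf. Pers. \u062e\u0648\u0628 {%khu1b.%}]",                          # text, but untagged
    "<P>.{#perA#} f. a kind of musical instrument.",                      # neither
]

only_char, only_tag = compare_filters(sample)
print(sorted(only_char), sorted(only_tag))
```

When both result sets are empty, the two filters agree, which is the situation reported for mw72.xml. A non-empty first set points at untagged Arabic text, and a non-empty second set at tags with no Arabic text inside.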
@jlreeder Minor question. In your note above that gives the regex, the regex 'wraps around'. Thus when I copied/pasted into this note (and into the program), the regex looks different. I think this is just a weakness in the 'wrapping' of the Github rendering of your note. Do you agree?
Can we add the Arabic markup to the possible 87 Arabic words?
@funderburkjim
132310
).Understood, display only.
pIlu amendment now installed.
@gasyoun I'm still considering your request to add <Arabic> markup to MW.
I think it is a good idea.
The concern holding me back relates to breaking 'downstream' code. I'm not worried about my own downstream code, as I can identify and patch it. But what about other non-Cologne versions (like Huet's)? Should I worry about possibly breaking those? Or just let them worry about that? -- Not sure.
You mean a single new tag can break the castle of Huet? One mathematician can break another one? Never would I believe that. @drdhaval2785 is coding around Huet's XMLs and did not break them. So why do you think you could, even if you wanted to? :cactus:
@gasyoun I guess you are basically joking, and are aware of the downstream breaking issues.
Am intrigued by 'Dhaval is coding around Huet's XMLs' -- Have no idea what that means.
Let @drdhaval2785 tell you when he gets back. Hundreds of lines of intriguing and practically applicable code around Huet's machine.
Arabic language markup added everywhere in MW72, for example <lang n="arabic">إِدْبار</lang>.
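Adding such markup to text where the Arabic was inserted without a tag could be done along the following lines. This is a sketch under stated assumptions, not the procedure actually used: tag_arabic is a hypothetical helper, the Arabic Unicode block U+0600 to U+06FF is assumed to cover all the text in question, and runs that are already wrapped are left alone so the function can be applied more than once.

```python
import re

# A run of Arabic-block characters, optionally several words separated by spaces.
ARABIC_RUN = r"[\u0600-\u06FF]+(?:\s+[\u0600-\u06FF]+)*"

# Group 1: text that is already wrapped. Group 2: a bare Arabic run.
PATTERN = re.compile(r'(<lang n="arabic">.*?</lang>)|(' + ARABIC_RUN + ")")


def tag_arabic(line):
    """Wrap bare Arabic runs in <lang n="arabic">...</lang>; keep wrapped runs as-is."""
    def wrap(match):
        if match.group(1):
            return match.group(1)
        return f'<lang n="arabic">{match.group(2)}</lang>'
    return PATTERN.sub(wrap, line)


line = "an elephant (Arabic \u0641\u064a\u0644 Persian \u067e\u064a\u0644);"
tagged = tag_arabic(line)
print(tagged)
print(tag_arabic(tagged) == tagged)
```

Any change of this kind would still need a check against downstream consumers of the XML, which is the concern raised earlier in the thread.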
The other remarks of gasyoun do not deserve consideration right now. Closing.
@jlreeder Jonathan found another 3 cases where Arabic alphabet text was incorrectly marked as Greek. I've changed the tags, and include them here for you to complete. The case numbers are from the Greek list; you can ignore them.
page = 0364 case = 330, hw = tambIra, page=0364-c
page = 0365 case = 331, hw = taravI, page=0365-c
page = 0376 case = 348, hw = tIra, page=0376-b