Internationalization

Open koheiw opened 6 years ago

koheiw commented 6 years ago

We discussed how to deal with CJK languages last year, but Hebrew and Arabic are also difficult to handle because of text direction. Words should be right-aligned and run from right to left, as in the screenshot, but text analysis tools often do not display them this way.

[Screenshot: screenshot_20180414_075425]

The Hebrew text file was provided by someone in Israel. Don't use Google Translate, because its result could be wrong.
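
As a rough illustration of where the direction problem applies, the sketch below flags which strings contain Hebrew or Arabic script characters, using ICU script properties through stringi. The helper name `is_rtl` and the sample vector are made up for this example, and the check covers only these two scripts, not every right-to-left script.

```r
library(stringi)

# hypothetical helper: TRUE if a string contains at least one
# Hebrew or Arabic script character
is_rtl <- function(x) {
    stri_detect_regex(x, "[\\p{Script=Hebrew}\\p{Script=Arabic}]")
}

# hypothetical sample: two RTL words, one Latin word, one number
words <- c("ישראל", "دولة", "Israel", "1948")
is_rtl(words)
```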


koheiw commented 6 years ago

Relatedly, we could discuss stopwords in non-English languages. The stopwords package is a collection of stopwords taken from various sources, but the lists for some languages (e.g. Japanese) look strange.

> head(stopwords::stopwords("he", "stopwords-iso"), 20)
 [1] "אבל"    "או"     "אולי"   "אותה"   "אותו"   "אותי"   "אותך"   "אותם"   "אותן"   "אותנו" 
[11] "אז"     "אחר"    "אחרות"  "אחרי"   "אחריכן" "אחרים"  "אחרת"   "אי"     "איזה"   "איך" 

> head(stopwords::stopwords("ja", "stopwords-iso"), 20)
 [1] "あそこ"   "あっ"     "あの"     "あのかた" "あの人"   "あり"     "あります" "ある"    
 [9] "あれ"     "い"       "いう"     "います"   "いる"     "う"       "うち"     "え"      
[17] "お"       "および"   "おり"     "おります"
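
To inspect a list more closely, something like the following could be used. It is only a sketch: it lists the sources and languages the stopwords package reports, then pulls out the single-character Japanese entries (such as "い" or "う" above), which is just one guess at what "strange" might mean here.

```r
library(stopwords)

# sources bundled with the package, and the languages one source covers
stopwords_getsources()
stopwords_getlanguages("stopwords-iso")

# single-character entries in the Japanese list
sw_ja <- stopwords("ja", source = "stopwords-iso")
short_ja <- sw_ja[nchar(sw_ja) == 1]
head(short_ja, 20)
```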

conjugateprior commented 6 years ago

RTL, mostly Arabic but also Hebrew, is a current annoyance in another (Python) project. I'd be interested in learning more.

koheiw commented 6 years ago

> tokens_tortl(toks)
tokens from 5 documents.
text1 :
[1] "מדינת" "ישראל" "נוסדה" "בשנת"  "1948"  ".‏"    

text2 :
 [1] "מְדִינַת"       "יִשְׂרָאֵל"       "(‏"           "בערבית"      ":‏"           "دولة"       
 [7] "إسرائيل"     ",‏"           "דַולַת"        "אִסְרַאאִיל"     ")‏"           ",‏"          
[13] "הנקראת"      "לרוב"        "יִשְׂרָאֵל"       ",‏"           "היא"         "מדינה"      
[19] "במזרח"       "התיכון"      ",‏"           "השוכנת"      "על"          "החוף"       
[25] "הדרום-‏מזרחי" "של"          "הים"         "התיכון"      ".‏"           "מדינת"      
[31] "ישראל"       "הוקמה"       "בשטחי"       "ארץ"         "ישראל"       ",‏"          
[37] "ביתו"        "הלאומי"      "וארץ"        "מולדתו"      "של"          "העם"        
[43] "היהודי"      ".‏"           "המדינה"      ",‏"           "שהכריזה"     "על"         
[49] "עצמאותה"     "בה'‏"         "באייר"       "תש\"‏ח"       ",‏"           "14"         
[55] "במאי"        "1948"        ",‏"           "היא"         "בעלת"        "משטר"       
[61] "של"          "דמוקרטיה"    "פרלמנטרית"   ".‏"          

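koheiw commented 6 years ago

char_tortl <- function(x) {
    # append a right-to-left mark (U+200F) after each punctuation mark or symbol
    # so that it stays on the correct side of the token when displayed
    stringi::stri_replace_all_regex(x, "([\\p{P}\\p{S}])", "$1\\u200F")
}

tokens_tortl <- function(x) {
    # types() comes from quanteda, which is assumed to be attached
    attr(x, "types") <- char_tortl(types(x))
    x
}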
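
A quick way to see what the replacement does is to run it on a plain character vector and check where the mark was added. This is a minimal sketch: the sample vector `toks` is made up, and `char_tortl()` is repeated so that the snippet runs on its own.

```r
library(stringi)

# same replacement as char_tortl() in this thread
char_tortl <- function(x) {
    stri_replace_all_regex(x, "([\\p{P}\\p{S}])", "$1\\u200F")
}

# hypothetical sample tokens
toks <- c("ישראל", ",", "1948", ".")
out <- char_tortl(toks)

# TRUE where a U+200F mark is present after the replacement
stri_detect_fixed(out, "\u200F")

# number of characters added to each token
stri_length(out) - stri_length(toks)
```

Only the punctuation tokens should change, because letters and digits do not match `\p{P}` or `\p{S}`.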

jwijffels commented 6 years ago

FYI, if you want extra annotations (POS tags, lemmas, dependencies), udpipe can provide them:

library(udpipe)
x <- readLines("hebrew.txt", encoding = "UTF-8")
udmodel <- udpipe_download_model(language = "hebrew")
udmodel <- udpipe_load_model(udmodel$file_model)
txt <- udpipe_annotate(udmodel, x)
txt <- as.data.frame(txt)  # one row per token
