Open koheiw opened 6 years ago
Relatedly, we could discuss stopwords for non-English languages. The stopwords package is a collection of stopword lists taken from various sources, but the lists for some languages (e.g. Japanese) look strange.
> head(stopwords::stopwords("he", "stopwords-iso"), 20)
[1] "אבל" "או" "אולי" "אותה" "אותו" "אותי" "אותך" "אותם" "אותן" "אותנו"
[11] "אז" "אחר" "אחרות" "אחרי" "אחריכן" "אחרים" "אחרת" "אי" "איזה" "איך"
> head(stopwords::stopwords("ja", "stopwords-iso"), 20)
[1] "あそこ" "あっ" "あの" "あのかた" "あの人" "あり" "あります" "ある"
[9] "あれ" "い" "いう" "います" "いる" "う" "うち" "え"
[17] "お" "および" "おり" "おります"
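One quick way to see why a list looks strange is to measure how many entries are a single character, such as the kana "い", "う", "え" and "お" in the output above. The sketch below does this for every source in the stopwords package that covers Japanese. `share_single_char` is a hypothetical helper name, and which sources besides stopwords-iso include "ja" depends on the installed version, so treat it as a rough diagnostic only.

```r
library(stopwords)

# Hypothetical helper (not part of the stopwords package):
# share of entries that are exactly one character long
share_single_char <- function(x) mean(nchar(x) == 1)

# Sources in the installed stopwords package that cover Japanese
sources <- stopwords_getsources()
has_ja <- vapply(sources, function(s) "ja" %in% stopwords_getlanguages(s), logical(1))
ja_sources <- sources[has_ja]

# Size of each list and its share of single-character entries
summary_ja <- data.frame(
  source = ja_sources,
  n = vapply(ja_sources, function(s) length(stopwords("ja", source = s)), integer(1)),
  single_char = vapply(ja_sources, function(s) share_single_char(stopwords("ja", source = s)), numeric(1)),
  row.names = NULL
)
print(summary_ja)
```

A high share of single-character entries does not prove a list is wrong, but it is a cheap signal that the list was built with a different tokenization in mind.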
RTL text, mostly Arabic but also Hebrew, is a current annoyance in another (Python) project. I'd be interested in learning more.
> tokens_tortl(toks)
tokens from 5 documents.
text1 :
[1] "מדינת" "ישראל" "נוסדה" "בשנת" "1948" "."
text2 :
[1] "מְדִינַת" "יִשְׂרָאֵל" "(" "בערבית" ":" "دولة"
[7] "إسرائيل" "," "דַולַת" "אִסְרַאאִיל" ")" ","
[13] "הנקראת" "לרוב" "יִשְׂרָאֵל" "," "היא" "מדינה"
[19] "במזרח" "התיכון" "," "השוכנת" "על" "החוף"
[25] "הדרום-מזרחי" "של" "הים" "התיכון" "." "מדינת"
[31] "ישראל" "הוקמה" "בשטחי" "ארץ" "ישראל" ","
[37] "ביתו" "הלאומי" "וארץ" "מולדתו" "של" "העם"
[43] "היהודי" "." "המדינה" "," "שהכריזה" "על"
[49] "עצמאותה" "בה'" "באייר" "תש\"ח" "," "14"
[55] "במאי" "1948" "," "היא" "בעלת" "משטר"
[61] "של" "דמוקרטיה" "פרלמנטרית" "."
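library(quanteda)
library(stringi)
# Append a right-to-left mark (U+200F) after every punctuation mark or symbol
# so that these characters are laid out with the surrounding right-to-left text
char_tortl <- function(x) {
    stri_replace_all_regex(x, "([\\p{P}\\p{S}])", "$1\\u200F")
}
# Apply the same conversion to the types of a tokens object
tokens_tortl <- function(x) {
    attr(x, "types") <- char_tortl(types(x))
    return(x)
}
char_tortl(txt)
tokens_tortl(toks)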
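For anyone who wants to check what the conversion does without a tokens object, here is a minimal sketch using only stringi. It applies the same regular expression to the tokens of text1 and counts the inserted marks. Two assumptions: `add_rlm` is a hypothetical name, and it passes U+200F as a literal character instead of an escape sequence in the replacement string, so it is a variant for illustration rather than the function above.

```r
library(stringi)

# Hypothetical helper (not part of quanteda or stringi): append a
# right-to-left mark (U+200F) after each punctuation mark or symbol.
# The mark is passed as a literal character, not as an escape sequence.
add_rlm <- function(x) {
  stri_replace_all_regex(x, "([\\p{P}\\p{S}])", paste0("$1", "\u200F"))
}

# Tokens of text1 from the output above
types <- c("מדינת", "ישראל", "נוסדה", "בשנת", "1948", ".")
marked <- add_rlm(types)

# Number of characters added to each token
added <- stri_length(marked) - stri_length(types)
# Number of right-to-left marks found in each token
n_marks <- stri_count_fixed(marked, "\u200F")
print(data.frame(added = added, marks = n_marks))
```

Only the full stop picks up a mark; the words and the number are left unchanged, which is why the conversion can be applied to the types of a tokens object without affecting word matching.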
FYI, if you want extra annotations (POS tags, lemmas, dependencies), you can use udpipe:
library(udpipe)
x <- readLines("hebrew.txt", encoding = "UTF-8")
udmodel <- udpipe_download_model(language = "hebrew")
udmodel <- udpipe_load_model(udmodel$file_model)
txt <- udpipe_annotate(udmodel, x)
View(as.data.frame(txt))
We discussed how to deal with CJK languages last year, but Hebrew and Arabic are also difficult to handle because of the direction of the text. Words should be right-aligned and run from right to left, as in the screenshot, but they often do not appear this way in text analysis tools. The Hebrew text file was provided by someone in Israel. Don't rely on Google Translate, because its results could be wrong.