Allow for non-latin alphabets in clean_labels

zkamvar commented 5 years ago

Currently, clean_labels() does not handle non-Latin characters:

x <-  data.frame(stringsAsFactors=FALSE,
                 Source = c("김, 국삼", "김, 명희", "정, 병호", "...", 
                            "たけだ, まさゆき", "ますだ, よしひこ", 
                            "やまもと, のぼる", "...", "Ρούτση, Άννα", 
                            "Καλούδης, Χρήστος", "Θεοδωράτου, Ελένη > Ezra"),
                 Transliteration = c("Gim, Gugsam", "Gim, Myeonghyi", 
                                     "Jeong, Byeongho", "...", "Takeda, Masayuki",
                                     "Masuda, Yoshihiko", "Yamamoto, Noboru",
                                     "...", "Roútsē, Ánna", "Kaloúdēs, Chrḗstos", 
                                     "Theodōrátou, Elénē")
 )
epitrix::clean_labels(x$Source)
#>  [1] ""     ""     ""     ""     ""     ""     ""     ""     ""     ""    
#> [11] "ezra"

This happens because the parser in clean_labels() transliterates Latin characters to ASCII but ignores non-Latin symbols, which then drop out of the result.

The solution is to first transliterate all symbols into Latin script and then transliterate the result into ASCII:

print(y <- stringi::stri_trans_general(x$Source, "ANY-Latin"))
#>  [1] "gim, gugsam"               "gim, myeonghui"           
#>  [3] "jeong, byeongho"           "..."                      
#>  [5] "takeda, masayuki"          "masuda, yoshihiko"        
#>  [7] "yamamoto, noboru"          "..."                      
#>  [9] "Roútsē, Ánna"              "Kaloúdēs, Chrḗstos"       
#> [11] "Theodōrátou, Elénē > Ezra"
print(z <- stringi::stri_trans_general(y, "Latin-ASCII"))
#>  [1] "gim, gugsam"               "gim, myeonghui"           
#>  [3] "jeong, byeongho"           "..."                      
#>  [5] "takeda, masayuki"          "masuda, yoshihiko"        
#>  [7] "yamamoto, noboru"          "..."                      
#>  [9] "Routse, Anna"              "Kaloudes, Chrestos"       
#> [11] "Theodoratou, Elene > Ezra"

Created on 2019-05-02 by the reprex package (v0.2.1)

zkamvar commented 5 years ago

The table is from the ICU Project Guide: http://userguide.icu-project.org/transforms/general, which I got from the stringi manual: http://www.gagolewski.com/software/stringi/manual/?manpage=stri_trans_general

zkamvar commented 5 years ago

In fact, this can be done in a single command, since transliterators can be combined. From the ICU guide:

Note that transliterators are often combined in sequence to achieve a desired transformation. This is analogous to the composition of mathematical functions. For example, given a script that converts lowercase ASCII characters from Latin script to Katakana script, it is convenient to first (1) separate input base characters and accents, and then (2) convert uppercase to lowercase. To achieve this, a compound transform can be specified as follows: NFKD; Lower; Latin-Katakana;

print(y <- stringi::stri_trans_general(x$Source, "ANY-Latin; Latin-ASCII"))
#>  [1] "gim, gugsam"               "gim, myeonghui"           
#>  [3] "jeong, byeongho"           "..."                      
#>  [5] "takeda, masayuki"          "masuda, yoshihiko"        
#>  [7] "yamamoto, noboru"          "..."                      
#>  [9] "Routse, Anna"              "Kaloudes, Chrestos"       
#> [11] "Theodoratou, Elene > Ezra"
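For illustration, here is a rough sketch of how the compound transform could sit in front of a label-cleaning step. The helper name to_ascii_label() and its simplified cleaning rules are hypothetical, not the actual clean_labels() implementation; only stringi::stri_trans_general() is a real call.

```r
# Hypothetical helper (not part of epitrix): transliterate to ASCII first,
# then apply a simplified stand-in for the label cleaning.
to_ascii_label <- function(x) {
  # Any script -> Latin, then Latin -> plain ASCII, as one compound transform
  x <- stringi::stri_trans_general(x, "Any-Latin; Latin-ASCII")
  x <- tolower(x)
  # Collapse every run of non-alphanumeric characters into one underscore
  x <- gsub("[^a-z0-9]+", "_", x)
  # Drop underscores left over at either end
  gsub("^_+|_+$", "", x)
}

res <- to_ascii_label(c("Ρούτση, Άννα", "Θεοδωράτου, Ελένη > Ezra"))
print(res)
```

Because the transliteration runs before anything is stripped, the non-Latin part of each label reaches the cleaning step as ASCII letters instead of being discarded.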

thibautjombart commented 5 years ago

I feared this would come and bite us at some point. Thanks for finding the monster's lair and slaying it. This is awesome. :) :)

zkamvar commented 5 years ago

To tie this in with #12, we could add the de-ASCII transform to the chain, before Latin-ASCII.
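Something along these lines, as an untested sketch. It assumes the "de-ASCII" transform ID is available in the ICU build that stringi links against, so the code checks stri_trans_list() first; the helper name translit_id() is made up for the example.

```r
# Hypothetical helper: build the compound transform ID, adding "de-ASCII"
# only when the local ICU build actually provides it (an assumption here).
translit_id <- function() {
  steps <- c("Any-Latin", "Latin-ASCII")
  if ("de-ASCII" %in% stringi::stri_trans_list()) {
    # German-specific rules must run before the generic Latin-ASCII step,
    # otherwise the umlauts are already gone by the time they would apply
    steps <- c("Any-Latin", "de-ASCII", "Latin-ASCII")
  }
  paste(steps, collapse = "; ")
}

id  <- translit_id()
out <- stringi::stri_trans_general(c("Müller", "Ρούτση, Άννα"), id)
print(out)
```

One thing to keep in mind is that the German rules would then apply to every label, not only to German ones.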

thibautjombart commented 5 years ago

Yup

reconhub / epitrix

Allow for non-latin alphabets in clean_labels #19