quanteda / quanteda.textplots

Plotting and visualisation for quanteda
GNU General Public License v3.0

On Mac, Japanese Characters are garbled in output generated by textplot_network #14

Open hidekoji opened 3 years ago

hidekoji commented 3 years ago

Describe the bug

When I run the script below on a Mac, Japanese characters are garbled in the plot. I found a similar issue, https://github.com/quanteda/quanteda/issues/1317, and followed the suggestion there, but it did not solve the problem.

macOS Big Sur (11.2.3)

library(quanteda)
library(dplyr)

## Load extra fonts
extrafont::font_import()

tokens <- Twitter_Search_Source1 %>%
  select(text) %>%
  quanteda::corpus() %>%
  quanteda::tokens(what = "word", remove_punct = TRUE, remove_numbers = TRUE,
                   remove_symbols = TRUE, remove_twitter = TRUE, remove_hyphens = TRUE,
                   remove_separators = TRUE, remove_url = TRUE)
stopwords_to_remove <- exploratory::get_stopwords(lang = "japanese")
tokens <- tokens %>% quanteda::tokens_remove(stopwords_to_remove, valuetype = "fixed")
tokens <- tokens %>% quanteda::tokens_remove(stringr::str_c("^[\\\u3040-\\\u309f]{1,", 2, "}$"), valuetype = "regex")
fcmat <- quanteda::fcm(tokens, context = "window", tri = FALSE)
feat <- names(topfeatures(fcmat, 30))
quanteda::fcm_select(fcmat, pattern = feat) %>%
  quanteda::textplot_network(min_freq = 0.5)
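
As a side check, the hiragana filter in the script above can be exercised on its own, independent of the plotting problem. This is only a minimal sketch: it assumes base R's grepl() with perl = TRUE as a stand-in for the regex matching done by quanteda::tokens_remove(), and the sample tokens and the names max_len, sample_tokens, is_short_hiragana and kept are made up for illustration, not taken from the Twitter data.

```r
# Build the pattern the same way the script does: a token that consists
# only of hiragana (U+3040 to U+309F) and is at most max_len characters long.
max_len <- 2  # hypothetical name; the script passes the literal 2
pattern <- paste0("^[\u3040-\u309f]{1,", max_len, "}$")

# Made-up sample tokens, written with \u escapes so the source stays ASCII:
sample_tokens <- c(
  "\u306e",                          # 1 hiragana character
  "\u3053\u308c",                    # 2 hiragana characters
  "\u3042\u308a\u304c\u3068\u3046",  # 5 hiragana characters
  "\u65e5\u672c"                     # 2 kanji characters
)

# TRUE marks tokens the filter would drop (short, hiragana-only tokens).
is_short_hiragana <- grepl(pattern, sample_tokens, perl = TRUE)

# Tokens that would survive the filter.
kept <- sample_tokens[!is_short_hiragana]
```

With these inputs, only the one- and two-character hiragana tokens are flagged for removal; the longer hiragana token and the kanji token are kept. Since the token text itself is filtered correctly, the garbling is presumably limited to how the labels are drawn.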

This generates the following plot:

[image: textplot_network output with garbled Japanese labels]

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS  10.16

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] quanteda_2.1.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5         pillar_1.4.7       compiler_4.0.2     tools_4.0.2        stopwords_2.1      digest_0.6.27      evaluate_0.14      lifecycle_0.2.0   
 [9] tibble_3.0.1       gtable_0.3.0       lattice_0.20-41    pkgconfig_2.0.3    rlang_0.4.10       Matrix_1.2-18      fastmatch_1.1-0    rstudioapi_0.13   
[17] yaml_2.2.1         xfun_0.20          dplyr_1.0.2        knitr_1.30         generics_0.1.0     vctrs_0.3.6        fs_1.5.0           grid_4.0.2        
[25] tidyselect_1.1.0   glue_1.4.2         data.table_1.13.6  R6_2.5.0           rmarkdown_2.6      ggplot2_3.3.3      purrr_0.3.4        magrittr_2.0.1    
[33] scales_1.1.1       ellipsis_0.3.1     htmltools_0.5.0    usethis_2.0.0      colorspace_2.0-0   stringi_1.5.3      RcppParallel_5.0.2 munsell_0.5.0     
[41] crayon_1.3.4  

Fonts loaded in R:

 > extrafont::fonts()
  [1] ".SF Compact Rounded"     ".Keyboard"               ".New York"               ".SF Compact"             "System Font"            
  [6] ".SF NS Mono"             ".SF NS Rounded"          "Academy Engraved LET"    "Andale Mono"             "Apple Braille"          
 [11] "AppleMyungjo"            "Arial Black"             "Arial"                   "Arial Narrow"            "Arial Rounded MT Bold"  
 [16] "Arial Unicode MS"        "Bodoni Ornaments"        "Bodoni 72 Smallcaps"     ""                        "Brush Script MT"        
 [21] "Comic Sans MS"           "Courier New"             "DIN Alternate"           "DIN Condensed"           "Georgia"                
 [26] "Impact"                  "Khmer Sangam MN"         "Lao Sangam MN"           "Luminari"                "Microsoft Sans Serif"   
 [31] "Noto Sans Adlam"         "Noto Sans Avestan"       "Noto Sans Bamum"         "Noto Sans Bassa Vah"     "Noto Sans Batak"        
 [36] "Noto Sans Bhaiksuki"     "Noto Sans Brahmi"        "Noto Sans Buginese"      "Noto Sans Buhid"         "Noto Sans Carian"       
 [41] "Noto Sans CaucAlban"     "Noto Sans Chakma"        "Noto Sans Cham"          "Noto Sans Coptic"        "Noto Sans Cuneiform"    
 [46] "Noto Sans Cypriot"       "Noto Sans Duployan"      "Noto Sans EgyptHiero"    "Noto Sans Elbasan"       "Noto Sans Glagolitic"   
 [51] "Noto Sans Gothic"        "Noto Sans HanifiRohg"    "Noto Sans Hanunoo"       "Noto Sans Hatran"        "Noto Sans ImpAramaic"   
 [56] "Noto Sans InsPahlavi"    "Noto Sans InsParthi"     "Noto Sans Kaithi"        "Noto Sans Kayah Li"      "Noto Sans Kharoshthi"   
 [61] "Noto Sans Khojki"        "Noto Sans Khudawadi"     "Noto Sans Lepcha"        "Noto Sans Limbu"         "Noto Sans Linear A"     
 [66] "Noto Sans Linear B"      "Noto Sans Lisu"          "Noto Sans Lycian"        "Noto Sans Lydian"        "Noto Sans Mahajani"     
 [71] "Noto Sans Mandaic"       "Noto Sans Manichaean"    "Noto Sans Marchen"       "Noto Sans MeeteiMayek"   "Noto Sans Mende Kikakui"
 [76] "Noto Sans Meroitic"      "Noto Sans Miao"          "Noto Sans Modi"          "Noto Sans Mongolian"     "Noto Sans Mro"          
 [81] "Noto Sans Multani"       "Noto Sans Nabataean"     "Noto Sans Newa"          "Noto Sans NewTaiLue"     "Noto Sans N'Ko"         
 [86] "Noto Sans Ogham"         "Noto Sans Ol Chiki"      "Noto Sans OldHung"       "Noto Sans Old Italic"    "Noto Sans OldNorArab"   
 [91] "Noto Sans Old Permic"    "Noto Sans OldPersian"    "Noto Sans OldSouArab"    "Noto Sans Old Turkic"    "Noto Sans Osage"        
 [96] "Noto Sans Osmanya"       "Noto Sans Pahawh Hmong"  "Noto Sans Palmyrene"     "Noto Sans PauCinHau"     "Noto Sans PhagsPa"      
[101] "Noto Sans Phoenician"    "Noto Sans PsaPahlavi"    "Noto Sans Rejang"        "Noto Sans Runic"         "Noto Sans Samaritan"    
[106] "Noto Sans Saurashtra"    "Noto Sans Sharada"       "Noto Sans Shavian"       "Noto Sans Siddham"       "Noto Sans SoraSomp"     
[111] "Noto Sans Sundanese"     "Noto Sans Syloti Nagri"  "Noto Sans Syriac"        "Noto Sans Tagalog"       "Noto Sans Tagbanwa"     
[116] "Noto Sans Tai Le"        "Noto Sans Tai Tham"      "Noto Sans Tai Viet"      "Noto Sans Takri"         "Noto Sans Thaana"       
[121] "Noto Sans Tifinagh"      "Noto Sans Tirhuta"       "Noto Sans Ugaritic"      "Noto Sans Vai"           "Noto Sans Wancho"       
[126] "Noto Sans WarangCiti"    "Noto Sans Yi"            "Noto Serif Ahom"         "Noto Serif Balinese"     "Party LET"              
[131] "Tahoma"                  "Times New Roman"         "Trattatello"             "Trebuchet MS"            "Verdana"                
[136] "Webdings"                "Wingdings"               "Wingdings 2"             "Wingdings 3"             "Yu Gothic"              
>