quanteda / quanteda

An R package for the Quantitative Analysis of Textual Data
https://quanteda.io
GNU General Public License v3.0

Save Chinese or Japanese wordcloud as pdf #1770

Closed yuanzhouIR closed 4 years ago

yuanzhouIR commented 4 years ago

I can save an English wordcloud plotted by textplot_wordcloud() as pdf, but when I save a Chinese or Japanese wordcloud as pdf, the file turns out to be blank. My wordcloud generating code is as follows:

textplot_wordcloud(dfm, min_count = 6, random_order = FALSE, max_words = 200,
                   min_size = .5, max_size = 2.8, font = "SimHei",
                   color = brewer.pal(8, "Dark2"))

I think it's because of the font. Can you fix this when updating the package?

kbenoit commented 4 years ago

We'd need more info about your system to pinpoint the problem, but I think it means you have not installed the SimHei font.

Please see https://github.com/quanteda/quanteda/issues/1317 for instructions, and let us know how that works for you.

yuanzhouIR commented 4 years ago

@kbenoit Thanks for your reply. I have installed the SimHei font and the wordcloud displays correctly in the plot window. I can save it as png but not pdf. My sessionInfo is as follows:

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.1

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] quanteda.corpora_0.87 extrafont_0.17 RColorBrewer_1.1-2 jiebaR_0.10.99
[5] jiebaRD_0.1 lubridate_1.7.4 forcats_0.4.0 stringr_1.4.0
[9] dplyr_0.8.3 purrr_0.3.3 readr_1.3.1 tidyr_1.0.0
[13] tibble_2.1.3 ggplot2_3.2.1 tidyverse_1.2.1 quanteda_1.5.1

loaded via a namespace (and not attached):
[1] httr_1.4.1 pkgload_1.0.2 jsonlite_1.6 modelr_0.1.5
[5] RcppParallel_4.4.4 assertthat_0.2.1 cellranger_1.1.0 remotes_2.1.0
[9] sessioninfo_1.1.1 Rttf2pt1_1.3.7 pillar_1.4.2 backports_1.1.5
[13] lattice_0.20-38 glue_1.3.1 extrafontdb_1.0 digest_0.6.22
[17] rvest_0.3.5 colorspace_1.4-1 Matrix_1.2-17 pkgconfig_2.0.3
[21] devtools_2.2.1 broom_0.5.2 haven_2.2.0 scales_1.0.0
[25] processx_3.4.1 generics_0.0.2 usethis_1.5.1 ellipsis_0.3.0
[29] withr_2.1.2 lazyeval_0.2.2 cli_1.1.0 magrittr_1.5
[33] crayon_1.3.4 readxl_1.3.1 memoise_1.1.0 ps_1.3.0
[37] stopwords_1.0 fs_1.3.1 nlme_3.1-142 xml2_1.2.2
[41] pkgbuild_1.0.6 tools_3.6.1 data.table_1.12.6 prettyunits_1.0.2
[45] hms_0.5.2 lifecycle_0.1.0 munsell_0.5.0 callr_3.3.2
[49] compiler_3.6.1 rlang_0.4.1 grid_3.6.1 rstudioapi_0.10
[53] labeling_0.3 testthat_2.3.0 gtable_0.3.0 curl_4.2
[57] R6_2.4.0 zeallot_0.1.0 fastmatch_1.1-0 rprojroot_1.3-2
[61] desc_1.2.0 stringi_1.4.3 Rcpp_1.0.3 vctrs_0.2.0
[65] spacyr_1.2 tidyselect_0.2.5

kbenoit commented 4 years ago

From RStudio:

[Screenshot, 2019-11-22 10:26:03]

works fine for me.

Another way:

library("quanteda")
set.seed(10)
dfmat1 <- dfm(corpus_subset(data_corpus_inaugural, President == "Obama"),
               remove = stopwords("english"), remove_punct = TRUE) %>%
    dfm_trim(min_termfreq = 3)

pdf(file = "wordcloud.pdf")
# basic wordcloud
textplot_wordcloud(dfmat1)
dev.off()

Note that I could not get ggplot2::ggsave("wordcloud.pdf") to produce a file that I could read, even though that should have worked.

yuanzhouIR commented 4 years ago

@kbenoit I can export English wordclouds as pdf, too. However, for Chinese or Japanese wordclouds, it does not work.

kbenoit commented 4 years ago

Can you create a reproducible example? This is almost surely a font issue.
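
One way to narrow down a font issue like this is to check whether the font family is registered with R's pdf() device at all. The following is a minimal sketch using only base R; pdf_knows_font() is a hypothetical helper name, not part of quanteda or grDevices. A family being listed does not by itself guarantee that its CJK glyphs will render in the output file.

```r
# Hypothetical helper (not from quanteda): report whether a font family
# is registered in the font database used by the pdf() device.
pdf_knows_font <- function(family) {
  family %in% names(grDevices::pdfFonts())
}

pdf_knows_font("Helvetica")  # one of the families registered by default
pdf_knows_font("SimHei")     # result depends on what has been registered locally
```

If the family is not registered, a device that takes its fonts from the system rather than from this database may be a better fit for CJK text.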

yuanzhouIR commented 4 years ago

@kbenoit The following code can produce a Japanese wordcloud. It displays correctly in my RStudio plot window. However, when I save the graph as pdf, the exported file is blank.

library(quanteda)
library(quanteda.corpora)
library(dplyr)

corp <- download(url = "https://www.dropbox.com/s/co12wpj08pzqz71/data_corpus_election2017tweets.rds?dl=1")
texts(corp) <- stringi::stri_trans_nfkc(texts(corp))

toks <- corp %>% tokens(remove_separator = FALSE, remove_url = TRUE) %>%
  tokens_select('^[一-龠]+$', valuetype = 'regex') %>% 
  tokens_remove("^[ぁ-ん]+$", valuetype = "regex")

tweet_dfm <- toks %>% dfm() %>% dfm_select(min_nchar = 2) %>%
  dfm_remove(c("選挙", "投票"))

textplot_wordcloud(tweet_dfm, max_words = 200, font = "SimHei",
                   color = RColorBrewer::brewer.pal(8, "Dark2"))

kbenoit commented 4 years ago

Make sure you have installed Cairo, using

brew install cairo

Then this should work:

cairo_pdf(file = "wordcloud.pdf")
textplot_wordcloud(tweet_dfm, max_words = 200, font = "SimHei",
                   color = RColorBrewer::brewer.pal(8, "Dark2"))
dev.off()

yuanzhouIR commented 4 years ago

@kbenoit Yes, it works! Thank you very much!
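
For reuse, the Cairo approach from this thread could be wrapped roughly as follows. This is only a sketch under stated assumptions: save_plot_pdf() and its arguments are hypothetical names, not part of quanteda, and it relies on capabilities("cairo") to decide whether cairo_pdf() can be used in the current R build. The demo call draws a plain base plot so that the block runs without quanteda installed.

```r
# Hypothetical helper (not from quanteda): draw a base-graphics plot into a
# PDF file, preferring the Cairo-based device when this R build supports it.
save_plot_pdf <- function(path, draw, ...) {
  if (isTRUE(unname(capabilities("cairo")))) {
    grDevices::cairo_pdf(filename = path, ...)
  } else {
    # pdf() produced blank CJK output for the reporter in this thread
    warning("Cairo is not available; falling back to pdf()")
    grDevices::pdf(file = path, ...)
  }
  # Close the device even if the drawing function throws an error
  on.exit(grDevices::dev.off(), add = TRUE)
  draw()
  invisible(path)
}

# Demo with a plain base plot, written to a temporary directory
out <- save_plot_pdf(file.path(tempdir(), "wordcloud.pdf"),
                     function() plot(1:10))
```

With quanteda loaded, the drawing function would instead wrap the wordcloud call from above, for example function() textplot_wordcloud(tweet_dfm, max_words = 200, font = "SimHei").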