juliasilge / tidytext

Text mining using tidy tools :sparkles::page_facing_up::sparkles:
https://juliasilge.github.io/tidytext/

Error with `unnest_tokens()` token = "ngrams" and non-standard column names #67

Closed zkamvar closed 7 years ago

zkamvar commented 7 years ago

`unnest_tokens()` fails when `token = "ngrams"` if the incoming data frame has non-standard (non-syntactic) column names, e.g. names containing spaces or commas. The error originates from `group_by_` (see: https://github.com/tidyverse/dplyr/issues/2891).

library("tidyr")
library("tibble")
library("tidytext")
tribble(~a, ~`b, yeah`, ~b, 
  1, "a", "some sentence.", 
  2, "b", "another sentence with more words", 
  3, "c", "a sentence that has more things") %>% 
  unnest_tokens(output = word, input = b, token = "ngrams", n = 2)
#> Error in parse(text = x): <text>:1:2: unexpected ','
#> 1: b,
#>      ^
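The error message points at `parse(text = x)`, which suggests the column name is being parsed as R code somewhere along the way. The following is a minimal base R sketch of that failure mode, not the actual dplyr or tidytext internals; the variable names `col`, `parse_result`, `sym`, and `df` are made up for illustration.

```r
# A non-syntactic column name, as in the report above.
col <- "b, yeah"

# Parsing the bare string as R code fails, because "b, yeah" is not a
# valid R expression. Capture the error message instead of stopping.
parse_result <- tryCatch(
  parse(text = col),
  error = function(e) conditionMessage(e)
)

# Converting the string to a symbol never goes through the parser,
# so any column name can be represented this way.
sym <- as.name(col)

# The symbol can then be evaluated against a data frame as usual.
df <- data.frame(a = 1:3)
df[[col]] <- c("a", "b", "c")
column_values <- eval(sym, df)
```

Under these assumptions, `parse_result` holds an "unexpected ','" message like the one reported, while `column_values` contains the column contents.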
Session info

``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, darwin15.6.0
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  tz       America/Chicago
#>  date     2017-06-20
#> Packages -----------------------------------------------------------------
#>  package     * version date       source
#>  assertthat    0.2.0   2017-04-11 CRAN (R 3.4.0)
#>  backports     1.1.0   2017-05-22 CRAN (R 3.4.0)
#>  base        * 3.4.0   2017-04-21 local
#>  broom         0.4.2   2017-02-13 CRAN (R 3.4.0)
#>  compiler      3.4.0   2017-04-21 local
#>  datasets    * 3.4.0   2017-04-21 local
#>  devtools      1.13.2  2017-06-02 CRAN (R 3.4.0)
#>  digest        0.6.12  2017-01-27 CRAN (R 3.4.0)
#>  dplyr         0.7.0   2017-06-09 CRAN (R 3.4.0)
#>  evaluate      0.10    2016-10-11 CRAN (R 3.4.0)
#>  foreign       0.8-68  2017-04-24 CRAN (R 3.4.0)
#>  formatR       1.5     2017-04-25 CRAN (R 3.4.0)
#>  glue          1.0.0   2017-04-17 CRAN (R 3.4.0)
#>  graphics    * 3.4.0   2017-04-21 local
#>  grDevices   * 3.4.0   2017-04-21 local
#>  grid          3.4.0   2017-04-21 local
#>  htmltools     0.3.6   2017-04-28 CRAN (R 3.4.0)
#>  janeaustenr   0.1.5   2017-06-10 CRAN (R 3.4.0)
#>  knitr         1.16    2017-05-18 CRAN (R 3.4.0)
#>  lattice       0.20-35 2017-03-25 CRAN (R 3.4.0)
#>  magrittr      1.5     2014-11-22 CRAN (R 3.4.0)
#>  Matrix        1.2-10  2017-04-28 CRAN (R 3.4.0)
#>  memoise       1.1.0   2017-04-21 CRAN (R 3.4.0)
#>  methods     * 3.4.0   2017-04-21 local
#>  mnormt        1.5-5   2016-10-15 CRAN (R 3.4.0)
#>  nlme          3.1-131 2017-02-06 CRAN (R 3.4.0)
#>  parallel      3.4.0   2017-04-21 local
#>  plyr          1.8.4   2016-06-08 CRAN (R 3.4.0)
#>  psych         1.7.5   2017-05-03 CRAN (R 3.4.0)
#>  purrr         0.2.2.2 2017-05-11 cran (@0.2.2.2)
#>  R6            2.2.2   2017-06-17 cran (@2.2.2)
#>  Rcpp          0.12.11 2017-05-22 cran (@0.12.11)
#>  reshape2      1.4.2   2016-10-22 CRAN (R 3.4.0)
#>  rlang         0.1.1   2017-05-18 CRAN (R 3.4.0)
#>  rmarkdown     1.6     2017-06-15 cran (@1.6)
#>  rprojroot     1.2     2017-01-16 CRAN (R 3.4.0)
#>  SnowballC     0.5.1   2014-08-09 cran (@0.5.1)
#>  stats       * 3.4.0   2017-04-21 local
#>  stringi       1.1.5   2017-04-07 CRAN (R 3.4.0)
#>  stringr       1.2.0   2017-02-18 CRAN (R 3.4.0)
#>  tibble      * 1.3.3   2017-05-28 CRAN (R 3.4.0)
#>  tidyr       * 0.6.3   2017-05-15 CRAN (R 3.4.0)
#>  tidytext    * 0.1.3   2017-06-19 CRAN (R 3.4.0)
#>  tokenizers    0.1.4   2016-08-29 cran (@0.1.4)
#>  tools         3.4.0   2017-04-21 local
#>  utils       * 3.4.0   2017-04-21 local
#>  withr         1.0.2   2016-06-20 CRAN (R 3.4.0)
#>  yaml          2.1.14  2016-11-12 CRAN (R 3.4.0)
```

juliasilge commented 7 years ago

This is now fixed, after updating some of tidytext's internals to rlang/tidyeval.

library(tidyverse)
library(tidytext)

tribble(~a, ~`b, yeah`, ~b, 
        1, "a", "some sentence.", 
        2, "b", "another sentence with more words", 
        3, "c", "a sentence that has more things") %>% 
  unnest_tokens(output = word, input = b, token = "ngrams", n = 2)

#> # A tibble: 10 x 3
#>        a `b, yeah`             word
#>    <dbl>     <chr>            <chr>
#>  1     1         a    some sentence
#>  2     2         b another sentence
#>  3     2         b    sentence with
#>  4     2         b        with more
#>  5     2         b       more words
#>  6     3         c       a sentence
#>  7     3         c    sentence that
#>  8     3         c         that has
#>  9     3         c         has more
#> 10     3         c      more things

Let me know if you run into any issues with using the tidyeval framework with tidytext! :smiley:
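For readers unfamiliar with tidyeval, here is a rough sketch of the general pattern: column names held as strings are turned into symbols with `rlang::syms()` and spliced into `group_by()` with `!!!`, so they are never parsed as code. This is an illustration only and is not taken from tidytext's source; it assumes dplyr 0.7.0 or later, and `group_cols` and `grouped` are hypothetical names.

```r
library(dplyr)

# Same shape of data as in the report, including the non-syntactic name.
df <- tibble::tribble(
  ~a, ~`b, yeah`, ~b,
  1, "a", "some sentence.",
  2, "b", "another sentence with more words",
  3, "c", "a sentence that has more things"
)

# Group by every column except the text column, starting from strings.
group_cols <- setdiff(names(df), "b")

# syms() builds a list of symbols; !!! splices them in as separate arguments.
grouped <- group_by(df, !!!rlang::syms(group_cols))

# Inspect the grouping variables that were picked up.
group_vars(grouped)
```

Because symbols are built directly from the strings, names such as `b, yeah` pass through without being parsed.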

zkamvar commented 7 years ago

Thank you so much for the fix!

P.S. Love that commit message 😃
