tidymodels / textrecipes

Extra recipes for Text Processing
https://textrecipes.tidymodels.org/

custom_token argument in step_tokenize() doesn't like it when main argument isn't x #248

Open gaohuachuan opened 1 year ago

gaohuachuan commented 1 year ago

The problem

I created a function, cn_seg(), for Chinese word segmentation. It takes a character vector as input and returns a list of character vectors, as required. But when I set custom_token = cn_seg, step_tokenize() throws an error.

Reproducible example

words <- c("下面是不分行输出的结果", "下面是不输出的结果")

library(jiebaR)                           # For Chinese word segmentation

cn_seg <- function(text) {                      
  engine <- worker(bylines = TRUE)
  segment(text, engine)
}

cn_seg(words)

cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))

recipe(~ words, data = cn_text) |> 
  step_tokenize(words, custom_token = cn_seg) |> 
  show_tokens(content)
#> Error in `step_tokenize()`:
#> Caused by error in `token()`:
#> ! unused argument (x = data[, 1, drop = TRUE])
#> Run `rlang::last_trace()` to see where the error occurred.

EmilHvitfeldt commented 1 year ago

Hello @gaohuachuan! 👋 Thanks for reporting!

I found two things. First, although it isn't documented, the custom tokenization function is called with its input passed as the named argument x (that is what the "unused argument (x = ...)" error is pointing at), so the function's parameter has to be named x. That should either be fixed or documented correctly.

Second, you should reference the same variable in show_tokens() as you used in step_tokenize(), so it should be show_tokens(words) instead of show_tokens(content).

words <- c("下面是不分行输出的结果", "下面是不输出的结果")

library(jiebaR)
#> Loading required package: jiebaRD

cn_seg <- function(x) {
  engine <- worker(bylines = TRUE)
  segment(x, engine)
}

cn_seg(words)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"

library(textrecipes)

cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))

recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = cn_seg) |>
  show_tokens(words)
#> [[1]]
#> [1] "下面" "是"   "不"   "分行" "输出" "的"   "结果"
#> 
#> [[2]]
#> [1] "下面" "是"   "不"   "输出" "的"   "结果"
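
The "unused argument" error from the original report can be reproduced with base R alone, which may help show why the parameter name matters. This is only a sketch: call_tokenizer() is a hypothetical stand-in for the internal call (the real textrecipes code may look different), and seg_text() and seg_x() are made-up whitespace tokenizers used in place of jiebaR.

```r
# Hypothetical stand-in for how a step might call a user-supplied tokenizer.
# The only assumption taken from the error message is that the input is
# passed by name, as `x = ...`.
call_tokenizer <- function(token, data) {
  token(x = data)
}

# Two made-up tokenizers that differ only in the name of their parameter.
seg_text <- function(text) strsplit(text, " ", fixed = TRUE)
seg_x    <- function(x)    strsplit(x, " ", fixed = TRUE)

sentences <- c("a b c", "d e")

# Works: the named argument `x` matches the parameter `x`.
ok <- call_tokenizer(seg_x, sentences)

# Fails: `seg_text` has no parameter called `x`, so R reports an unused
# argument. The error message is captured instead of stopping the script.
msg <- tryCatch(
  call_tokenizer(seg_text, sentences),
  error = function(e) conditionMessage(e)
)
```

Because R matches named arguments by name before position, a function whose only parameter is called text cannot accept x = ..., no matter what the function body does.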

gaohuachuan commented 1 year ago

Thanks for your reply. The problem with my code is the x argument.
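
If renaming the parameter is inconvenient (for example, when the tokenizer comes from another package), one possible workaround is to wrap it in a function whose parameter is named x. The sketch below is an assumption-laden illustration: as_x_tokenizer() is a hypothetical helper, not part of textrecipes, and split_on_space() is a made-up tokenizer standing in for cn_seg().

```r
# Hypothetical helper (not part of textrecipes): adapts a tokenizer whose
# parameter has some other name so that it can be called as f(x = ...).
as_x_tokenizer <- function(tokenizer) {
  force(tokenizer)  # evaluate the argument now rather than lazily
  function(x) tokenizer(x)
}

# Made-up tokenizer whose parameter is named `text`, like the original cn_seg().
split_on_space <- function(text) strsplit(text, " ", fixed = TRUE)

wrapped <- as_x_tokenizer(split_on_space)

# The wrapped function accepts the named argument `x` and passes it on
# positionally to the original tokenizer.
tokens <- wrapped(x = c("a b c", "d e"))
```

Under the same assumption, something like custom_token = as_x_tokenizer(cn_seg) should behave like the renamed version shown earlier in the thread, though that has not been checked against textrecipes here.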