Open gaohuachuan opened 1 year ago
Hello @gaohuachuan! 👋 thanks for reporting!
I found two things. First, although it wasn't documented, it appears that the custom tokenization function is expected to take its input through an argument named x. That should be fixed or documented correctly.
Second, you should reference the same variable in show_tokens() as you used in step_tokenize(). So it should be show_tokens(words) instead of show_tokens(content).
words <- c("下面是不分行输出的结果", "下面是不输出的结果")
library(jiebaR)
#> Loading required package: jiebaRD
cn_seg <- function(x) {
  engine <- worker(bylines = TRUE)
  segment(x, engine)
}
cn_seg(words)
#> [[1]]
#> [1] "下面" "是" "不" "分行" "输出" "的" "结果"
#>
#> [[2]]
#> [1] "下面" "是" "不" "输出" "的" "结果"
library(textrecipes)
cn_text <- tibble(words = c("下面是不分行输出的结果", "下面是不输出的结果"))
recipe(~ words, data = cn_text) |>
  step_tokenize(words, custom_token = cn_seg) |>
  show_tokens(words)
#> [[1]]
#> [1] "下面" "是" "不" "分行" "输出" "的" "结果"
#>
#> [[2]]
#> [1] "下面" "是" "不" "输出" "的" "结果"
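To check a custom tokenizer before passing it to step_tokenize(), something like the sketch below may help. It assumes the contract discussed in this thread: the function takes its input through an argument named x and returns a list with one character vector per input element. Both ws_seg() and check_tokenizer() are hypothetical names made up for illustration and are not part of textrecipes or jiebaR; ws_seg() splits on whitespace only so that the example runs in base R.

```r
# Hypothetical stand-in tokenizer: splits on whitespace instead of calling jiebaR.
# Its argument is named `x`, following the observation in this thread.
ws_seg <- function(x) {
  strsplit(x, "[[:space:]]+")
}

# Hypothetical helper: checks the assumed contract for a custom tokenizer.
check_tokenizer <- function(f, input) {
  out <- f(input)
  stopifnot(
    "x" %in% names(formals(f)),                 # argument must be named `x`
    is.list(out),                               # result is a list ...
    length(out) == length(input),               # ... with one element per input
    all(vapply(out, is.character, logical(1)))  # ... each a character vector
  )
  invisible(out)
}

tokens <- check_tokenizer(ws_seg, c("one two three", "four five"))
print(tokens)
```

If the check passes for your own function (for example cn_seg above), it should be safe to try it as custom_token; if it fails, the stopifnot() message points at the part of the contract that was not met.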
Thanks for your reply. The problem with my code is the x argument.
The problem
I created a function cn_seg() for Chinese word segmentation. The function takes a character vector as input and outputs a list of character vectors, as requested. But when I set custom_token = cn_seg, it throws an error.
Reproducible example