apsteinmetz closed 1 year ago
Hello @apsteinmetz 👋
That is an interesting application! I don't think it is worth adding a dedicated step for this. You can handle it with a single call to str_replace_all() inside step_mutate():
library(tidymodels)
library(textrecipes)
library(stringr)
# Keep only the text column and the outcome
tate_text <- tate_text |>
  select(medium, year)
# Add one example row that contains a negation
tate_nots <- tibble(medium = "Etching on not canvas", year = 2000)
tate_text <- bind_rows(tate_nots, tate_text)
# Named vector of regex pattern = replacement: glue a negation to the next word
negations <- c(
  "not +" = "not_",
  "no +" = "no_"
)
recipe(year ~ medium, data = tate_text) |>
  step_mutate(medium = str_replace_all(medium, negations)) |>
  step_tokenize(medium) |>
  step_tf(medium) |>
  prep() |>
  bake(new_data = NULL) |>
  select(contains("not_"))
#> # A tibble: 4,285 × 1
#>    tf_medium_not_canvas
#>                   <int>
#>  1                    1
#>  2                    0
#>  3                    0
#>  4                    0
#>  5                    0
#>  6                    0
#>  7                    0
#>  8                    0
#>  9                    0
#> 10                    0
#> # ℹ 4,275 more rows
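One possible refinement, offered as a sketch and not as part of the reprex above: the patterns "not +" and "no +" are unanchored, so they can also match the tail of a longer word such as "knot" or "piano". Adding a word boundary (`\\b`) should restrict the match to the standalone words. The names `negations_bounded`, `media`, `loose`, and `bounded` and the second example string are made up for illustration.

```r
library(stringr)

# Hypothetical variant of `negations` with word boundaries
# (an assumption for illustration, not taken from the reply above)
negations_bounded <- c(
  "\\bnot +" = "not_",
  "\\bno +" = "no_"
)

# Made-up example strings; the second has "no " and "not " inside longer words
media <- c("Etching on not canvas", "Piano wire and knot on board")

# Unanchored patterns also fire on the ends of "Piano" and "knot"
loose <- str_replace_all(media, c("not +" = "not_", "no +" = "no_"))
loose
#> [1] "Etching on not_canvas"        "Piano_wire and knot_on board"

# Anchored patterns only fire on the standalone words "not" and "no"
bounded <- str_replace_all(media, negations_bounded)
bounded
#> [1] "Etching on not_canvas"        "Piano wire and knot on board"
```

If this behaviour is what you want, `negations_bounded` could be passed to str_replace_all() in the step_mutate() call in place of `negations`.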
I have seen articles suggesting that creating unique tokens for negated words can improve results (e.g. https://analyticsindiamag.com/when-to-use-negation-handling-in-sentiment-analysis/). Would this make sense for a step function?
I do it crudely now by using these approaches
Created on 2023-03-30 with reprex v2.0.2