tidymodels / textrecipes

Extra recipes for Text Processing
https://textrecipes.tidymodels.org/

step_negation function? #224

Closed apsteinmetz closed 1 year ago

apsteinmetz commented 1 year ago

I have seen articles suggesting that creating unique tokens for negated words can improve results (e.g. https://analyticsindiamag.com/when-to-use-negation-handling-in-sentiment-analysis/). Would this make sense for a step function?
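For instance (a toy sketch of my own, not from the article), the idea is to fuse each negator onto the word that follows it, so that "not like" becomes the single token not_like:

library(stringr)
str_replace_all("I do not like green eggs and ham", "not ", "not_")
#> [1] "I do not_like green eggs and ham"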

I do it crudely now using the two approaches below:

library(tidymodels)
library(textrecipes)
library(tidytext)
library(stringr)
#> 
#> Attaching package: 'stringr'
#> The following object is masked from 'package:recipes':
#> 
#>     fixed

tate_text <- tate_text |>
  select(medium, year)

tate_nots <- tibble(medium = "Etching on not canvas", year = 2000)
tate_text <- bind_rows(tate_nots, tate_text)

# APPROACH 1: preprocess raw data
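# (note: str_replace() only rewrites the first "not " in each string;
#  str_replace_all() would be needed if a medium could contain several negations)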
tate_text |> 
  mutate(medium = str_replace(medium, "not ", "not_"))
#> # A tibble: 4,285 × 2
#>    medium                                                   year
#>    <chr>                                                   <dbl>
#>  1 Etching on not_canvas                                    2000
#>  2 Video, monitor or projection, colour and sound (stereo)  1990
#>  3 Etching on paper                                         1990
#>  4 Etching on paper                                         1990
#>  5 Etching on paper                                         1990
#>  6 Oil paint on canvas                                      1990
#>  7 Oil paint on canvas                                      1990
#>  8 Acrylic paint on paper                                   1990
#>  9 Woodcut on paper                                         1990
#> 10 Oil paint and wax on canvas                              1990
#> # ℹ 4,275 more rows

# APPROACH 2: process tokenized data
detect_negations <- function(tokens, negation_words = c("not")) {
  # prefix each token that follows a negation word with "not_",
  # then drop the negation words themselves
  tokens <- tokens |> rowid_to_column(var = "word_num")
  not_words_rows <- tokens |>
    filter(word %in% negation_words) |>
    pull(word_num)
  tokens |>
    filter(!(word_num %in% not_words_rows)) |>
    mutate(word = ifelse(word_num %in% (not_words_rows + 1), paste0("not_", word), word))
}

unnest_tokens(tate_text,word,medium) |> 
  detect_negations()
#> # A tibble: 20,944 × 3
#>    word_num  year word      
#>       <int> <dbl> <chr>     
#>  1        1  2000 etching   
#>  2        2  2000 on        
#>  3        4  2000 not_canvas
#>  4        5  1990 video     
#>  5        6  1990 monitor   
#>  6        7  1990 or        
#>  7        8  1990 projection
#>  8        9  1990 colour    
#>  9       10  1990 and       
#> 10       11  1990 sound     
#> # ℹ 20,934 more rows

Created on 2023-03-30 with reprex v2.0.2

EmilHvitfeldt commented 1 year ago

Hello @apsteinmetz 👋

That is an interesting application! I don't think it would be worth adding another step; in this case, you could handle it with a single call to str_replace_all():

library(tidymodels)
library(textrecipes)
library(stringr)

tate_text <- tate_text |>
  select(medium, year)

tate_nots <- tibble(medium = "Etching on not canvas", year = 2000)
tate_text <- bind_rows(tate_nots, tate_text)

negations <- c(
  "not +" = "not_",
  "no +" = "no_"
)
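# (str_replace_all() accepts a named vector: the names are regex patterns and
#  the values are replacements, so "not +" matches the negator plus any run of
#  spaces and fuses it onto the following word)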

recipe(year ~ medium, data = tate_text) |>
  step_mutate(medium = str_replace_all(medium, negations)) |>
  step_tokenize(medium) |>
  step_tf(medium) |>
  prep() |>
  bake(new_data = NULL) |>
  select(contains("not_"))
#> # A tibble: 4,285 × 1
#>    tf_medium_not_canvas
#>                   <int>
#>  1                    1
#>  2                    0
#>  3                    0
#>  4                    0
#>  5                    0
#>  6                    0
#>  7                    0
#>  8                    0
#>  9                    0
#> 10                    0
#> # ℹ 4,275 more rows
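If you would rather keep the negation handling inside the tokenization step itself, here is a rough sketch (untested here): step_tokenize() also accepts your own tokenizer through its custom_token argument, which should return a list of character vectors. The negation_tokenizer() helper below is just illustrative and assumes simple whitespace splitting is good enough for your text.

negation_tokenizer <- function(x) {
  # reuse the named pattern vector from above, then split on whitespace
  x <- str_replace_all(tolower(x), negations)
  str_split(x, "\\s+")
}

recipe(year ~ medium, data = tate_text) |>
  step_tokenize(medium, custom_token = negation_tokenizer) |>
  step_tf(medium) |>
  prep() |>
  bake(new_data = NULL) |>
  select(contains("not_"))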
github-actions[bot] commented 1 year ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.