tidymodels / recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
https://recipes.tidymodels.org

text recipe steps #192

Closed topepo closed 4 years ago

topepo commented 6 years ago

For this project, the idea is to have steps that can be used to process text data (contained in a new package). I've made placeholders in that project for some obvious processing candidates.

@EmilHvitfeldt has volunteered to get started. Perhaps @juliasilge, @skeydan, and others might have some suggestions and opinions. I'd be happy to include tensorflow methods for text processing too.

I can add anyone interested in helping to the project page. We can use this issue to discuss ideas and kick around implementation questions.

topepo commented 6 years ago

One question that I have (not being an expert in this area) is this... the data will start with each row having a text field with the unparsed text ("the lazy fox..."). At what point should we break it up into words? Should we? Some endpoints (e.g. tf-idf) will need the text cut up, but others (n-grams) need the sequences.

EmilHvitfeldt commented 6 years ago

Okay I'll start by throwing in my 2 cents.

I like the name we have so far; I've been unable to think of anything better myself. I do think we should consider whether recipes.text would be better, to align with the naming convention of other packages, such as broom.mixed for broom.

One question that I have (not being an expert in this area) is this... the data will start with each row having a text field with the unparsed text ("the lazy fox..."). At what point should we break it up into words? Should we? Some endpoints (e.g. tf-idf) will need the text cut up, but others (n-grams) need the sequences.

Very valid question. I would prefer that we don't break them up into words, by which I mean that we don't expand the rows the way tidytext does. I'm not saying that the tidytext framework is bad, just that it makes the data hard to work with in modeling. I would suggest that in a general workflow we keep the text field for the duration of the preprocessing phase and then drop it at the end.

Instead of replacing the text variable, we would let step_token create a list column that could then be used in later steps such as step_stem, step_tf_idf, step_featurehashing, or step_stopwords, which would eventually give us some vectors that can be used for modelling. This way we don't lose the original unparsed text and we stay in a tidy format (one observation per row). Let me know what you think. Another benefit is that you don't lose an observation made up entirely of stopwords; you would just get an empty list element in the tokenized, stopword-filtered column.
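
A hypothetical sketch of what that could look like as a recipe, using the step names floated in this thread (none of these steps exist yet; the data, outcome, and column names are made up):

library(recipes)

# `reviews` is an imagined data frame with an outcome `score` and a raw text
# column `txt`; the step_* functions are the proposed, not-yet-written ones
rec <- recipe(score ~ txt, data = reviews) %>%
  step_tokenize(txt) %>%   # character column -> list column of tokens
  step_stopwords(txt) %>%  # drop stopwords, still one row per observation
  step_stem(txt) %>%       # stem the remaining tokens
  step_tf_idf(txt)         # list column -> numeric feature columns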

EmilHvitfeldt commented 6 years ago

Talking about steps, I'll just list some of the actions that come to mind here, not suggesting that all of them could or should be implemented.

Embeddings

As this project mentions, we would also like some word embeddings. Here we have the choice of handling two cases: pre-trained and train-yourself. Interesting embeddings would be

tf-idf and family

We would of course want to be able to do tf-idf, but I think we should also have tf (term frequency) and plain word counts, i.e. bag of words. We might even want to count only how many times certain words appear (a limited bag of words).
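
Just to make the bag-of-words case concrete, a minimal sketch with base R plus the tokenizers package (the example text is made up); a step would produce something like the counts matrix below, one row per document and one column per word:

library(tokenizers)

txt <- c("the lazy fox", "the quick fox jumps over the lazy dog")
tokens <- tokenize_words(txt)
vocab  <- sort(unique(unlist(tokens)))

# one row per document, one column per word in the vocabulary
counts <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))
tf     <- counts / rowSums(counts)  # term frequency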

Dimensionality reduction

We would want step_featurehashing, to enable feature hashing, possibly with a choice of hashing algorithms. As far as I know MurmurHash3 is a popular choice.
We should also be able to simply select the n most used words, or the top 90% of most used words.
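
A rough sketch of the hashing trick using the MurmurHash3 implementation in the digest package (algo = "murmur32"); the helper name and bucket count are made up for illustration, not a proposed API:

library(digest)

hash_bucket <- function(token, n_buckets = 1024L) {
  hex <- digest(token, algo = "murmur32", serialize = FALSE)
  # drop the leading hex digit so strtoi() stays within integer range
  (strtoi(substr(hex, 2, 8), base = 16L) %% n_buckets) + 1L
}

# each token is mapped to one of n_buckets columns, collisions and all
vapply(c("the", "lazy", "fox"), hash_bucket, integer(1))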

Tokenizers

Stemming/Lemmatizing

For stemming there are already a couple of packages we can rely on.

We might also want to include a dictionary stemmer.

(Edit: hunspell is a dictionary stemmer.)
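
To compare the two flavors, a quick sketch with an algorithmic stemmer (SnowballC) and the dictionary-based hunspell; the example words are arbitrary:

library(SnowballC)
library(hunspell)

words <- c("walking", "walked", "walks")
wordStem(words, language = "english")  # algorithmic (Snowball/Porter) stemming
hunspell_stem(words)                   # dictionary lookup, candidate stems per word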

Stopwords

Here we have to be careful, as stopwords are both subject- and language-specific. We should allow the exclusion of the n most used words, as they will contain a lot of the stopwords. And we need to allow the use of a custom stopword list. We do have some packages

(Edit: tidytext uses the stopwords package.)
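
For reference, a quick look at what the stopwords package provides, plus how a custom list could be bolted on (the extra words are arbitrary):

library(stopwords)

head(stopwords("en"))                      # default snowball source
length(stopwords("en", source = "smart"))  # other curated lists are available
custom_stopwords <- c(stopwords("en"), "fox", "lazy")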

Sentiment

Here we have the same problem as we do with stopwords.

Other

jwijffels commented 6 years ago

FWIW, you might also be interested in adding the crfsuite R package (https://github.com/bnosac/crfsuite) to this list of options. It does predictions. I think I'll upload it to CRAN in the coming weeks, after I've updated udpipe on CRAN.

juliasilge commented 6 years ago

A couple of initial reactions:

tidytext largely uses the stopwords package now; I would say that's the central place to go for stopwords in R.

hunspell is a dictionary stemmer.

Whether to tokenize the text (and how) is a step in the recipe IMO, and depends on your modeling strategy. For most text modeling strategies, an early step is to tokenize, whether you are doing deep learning or regularized regression. (Note on that link: I think I have an error in the ROC code that I need to dig out.)

EmilHvitfeldt commented 6 years ago

A couple of initial reactions:
tidytext largely uses the stopwords package now; I would say that's the central place to go for stopwords in R
hunspell is a dictionary stemmer

Perfect, thanks for the corrections! I don't think I really wanted to depend on tidytext (it seems like a big dependency); I included it in the list more for reference.

I agree that tokenizing is a step. And I'm sorry if I haven't been entirely clear. My consideration is more about how the text should be stored between each step.

I'm thinking that step_tokenize should take in a character vector of length n and return a list of length n.

Then step_stem, step_stopwords, step_only_include_top_n_words and similar steps would take in a list and return a stemmed or stopword-filtered list.

And step_tf_idf, step_hashing would take in a list and return m variables depending on previous specifications.
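
A minimal sketch of those shapes using existing packages (tokenizers, stopwords, SnowballC) rather than the proposed steps, just to show a character vector of length n staying a list of length n until the final step:

library(tokenizers)
library(stopwords)
library(SnowballC)

txt <- c("sample one has this text", "while sample two is in the next row")

tokens   <- tokenize_words(txt)                              # list of length n
filtered <- lapply(tokens, setdiff, y = stopwords("en"))     # still length n
stemmed  <- lapply(filtered, wordStem, language = "english") # still length n

str(stemmed)
# a final step (tf-idf, hashing, ...) would turn this list into m numeric columns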

topepo commented 6 years ago

Whether to tokenize the text (and how) is a step in the recipe IMO,

I agree that tokenizing is a step.

Keep in mind that recipes doesn't restrict step order, so we'd have to be clear about which steps require untokenized text (or paste it back together internally).

I'm thinking that step_tokenize should take in a character vector of length n and return a list of length n.

Lists aren't that tidy. I'd rather have something like:

> dat <- 
+   data_frame(
+     txt = c(
+       "sample one has this text",
+       "while sample two is in the next row"
+     ),
+     sample = 1:2)
> 
> dat %>% 
+   group_by(sample) %>%
+   unnest_tokens(word, txt)
# A tibble: 13 x 2
# Groups:   sample [2]
   sample word  
    <int> <chr> 
 1      1 sample
 2      1 one   
 3      1 has   
 4      1 this  
 5      1 text  
 6      2 while 
 7      2 sample
 8      2 two   
 9      2 is    
10      2 in    
11      2 the   
12      2 next  
13      2 row   

or perhaps a list column:

> tibble(
+   text = 
+     dat %>% 
+     group_by(sample) %>%
+     unnest_tokens(word, txt) %>%
+     split(.$sample)
+ )
# A tibble: 2 x 1
  text            
  <list>          
1 <tibble [5 × 2]>
2 <tibble [8 × 2]>

unnest() can be used to get back to the concatenated version.

Naming is always hard but I want to avoid repeated_underscores_in_names =]

EmilHvitfeldt commented 6 years ago

Keep in mind that recipes doesn't restrict step order, so we'd have to be clear about which steps require untokenized text (or paste it back together internally).

Very true, some of the embeddings are based on characters instead of words.

I don't really like the first option, as you can have something like this happen:

dat <- 
   data_frame(
     txt = c(
       "sample one has this text",
       "let me be behind many"
     ),
     sample = 1:2)

dat %>% 
   unnest_tokens(word, txt)
#> # A tibble: 10 x 2
#>    sample word  
#>     <int> <chr> 
#>  1      1 sample
#>  2      1 one   
#>  3      1 has   
#>  4      1 this  
#>  5      1 text  
#>  6      2 let   
#>  7      2 me    
#>  8      2 be    
#>  9      2 behind
#> 10      2 many

dat %>%
  unnest_tokens(word, txt) %>%
  anti_join(stop_words)
#> Joining, by = "word"
#> # A tibble: 2 x 2
#>   sample word  
#>    <int> <chr> 
#> 1      1 sample
#> 2      1 text

By removing stop words we lost all the words in sample two, and now it is hard to get that back.

Lists aren't that tidy.

I know :( But I was kinda thinking along the lines of the list-column route. Still not super tidy.

dat %>% 
  unnest_tokens(word, txt) %>%
  nest(word)
#> # A tibble: 2 x 2
#>   sample data            
#>    <int> <list>          
#> 1      1 <tibble [5 × 1]>
#> 2      2 <tibble [5 × 1]>

EmilHvitfeldt commented 6 years ago

How would we start this package? Is it something I create or you create (@topepo) and move to tidymodels when it is more ripe?

topepo commented 6 years ago

How would we start this package?

Check out the usethis package for good ways of creating a new package. tidymodels/embed is an example of a package that contains only new steps.

Is it something I create or you create (@topepo) and move to tidymodels when it is more ripe?

Have at it yourself (and I can help if you need). My process is to keep it in my personal account until I'm close to submitting to CRAN.

I suggest starting with something simple like stop words and/or stemming. Let me know the repo name and I'll keep watch.
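
A minimal usethis sketch of that kickoff (the package name just anticipates the one mentioned later in the thread; the path and file names are arbitrary):

library(usethis)

create_package("~/projects/textrecipes")  # new package skeleton

# then, from within the new project:
use_git()                                 # initialize version control
use_package("recipes")                    # declare the dependency in DESCRIPTION
use_r("step_stopwords")                   # start with a simple step, as suggested
use_test("step_stopwords")                # and its tests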

skeydan commented 6 years ago

Hi Max, hi all, thanks for your patience, here finally I am too :-)

Some first comments from my side...

For embeddings, I'd also say it makes sense to have both train-yourself and pre-trained as options.

For tokenizing and converting to integers, I think I basically always end up using the same keras workflow:

1) fit_on_texts --- trains the tokenizer
2) texts_to_sequences --- converts text to integers
3) pad_sequences --- pads or truncates to the same length

e.g.,

library(keras)

top_k <- 5000
tokenizer <- text_tokenizer(
  num_words = top_k,
  oov_token = "<unk>",
  filters = '!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')

tokenizer$fit_on_texts(sample_captions)

train_captions_tokenized <-
  tokenizer %>% texts_to_sequences(train_captions)

train_captions_padded <-  pad_sequences(
  train_captions_tokenized,
  maxlen = max_length,
  padding = "post",
  truncating = "post"
)

There's also a parallel way of converting to 1-hot instead of to integers, but I don't see that used often...
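
For completeness, that parallel one-hot route with the same keras tokenizer would be something like this (reusing the objects from the chunk above):

# one row per text, one column per word in the vocabulary, 0/1 entries
train_captions_onehot <- tokenizer %>%
  texts_to_matrix(train_captions, mode = "binary")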

jonthegeek commented 5 years ago

Did this ever happen? I'm going to end up writing a lot of this in the next few days if it didn't; I would love to use the code if it exists and contribute if it doesn't.

EmilHvitfeldt commented 5 years ago

Hello @jonthegeek! We have textrecipes, which includes a dozen or so steps.

jonthegeek commented 5 years ago

Fantastic!

github-actions[bot] commented 3 years ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.