One question that I have (not being an expert in this area) is this... the data will start with each row having a text field with the unparsed text ("the lazy fox..."). At what point should we break it up into words? Should we? Some endpoints (e.g. tf-idf) will need them cut up, but others (n-grams) need the sequences.
Okay I'll start by throwing in my 2 cents.
I like the name we have so far; I've been unable to think up anything better myself. I do think we should consider whether recipes.text would better align with the naming convention of other packages, such as broom.mixed for broom.
One question that I have (not being an expert in this area) is this... the data will start with each row having a text field with the unparsed text ("the lazy fox..."). At what point should we break it up into words? Should we? Some endpoints (e.g. tf-idf) will need them cut up, but others (n-grams) need the sequences.
Very valid question. I would prefer that we don't break them up into words, by which I mean that we don't expand the rows the way tidytext does. I'm not saying that the tidytext framework is bad, just that it makes it hard to work with in modeling. I would suggest that in a general workflow we keep the text field for the duration of the preprocessing phase and then drop it by the end.
Instead of replacing the text variable, we would let step_tokenize create a list column that could then be used in future steps such as step_stem, step_tf_idf, step_featurehashing or step_stopwords, which would eventually give us some vectors that can be used for modelling. This way we don't lose the original unparsed text, and we stay in a tidy format (1 observation per row). Let me know what you think. Another benefit is that you don't lose an observation made up entirely of stopwords; you would just get an empty list element in the tokenized_stopword_filtered_column.
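To make that concrete, here is a rough base-R sketch of the list-column idea (not a proposed API; strsplit() is only a stand-in for a real tokenizer, and the data and stopword list are made up for illustration):

library(tibble)

dat <- tibble(
  text = c("The quick brown fox", "of the and")  # second row is stopwords only
)

# A stand-in for tokenization: one row per observation is kept,
# with the tokens stored in a list column next to the original text.
dat$tokens <- strsplit(tolower(dat$text), " ")

# Removing stopwords leaves an empty element rather than dropping the row.
stopwords <- c("the", "of", "and")
dat$tokens <- lapply(dat$tokens, function(x) x[!x %in% stopwords])

dat  # still 2 rows; the second tokens element is character(0)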
Talking about steps: I'll just list some of the actions that come to mind here, not suggesting that all of them could or should be implemented.
As this project mentions, we would also like some word embeddings. Here we have the choice to handle the two cases: pretrained and train-yourself. Interesting embeddings would be
We would of course want to be able to do tf-idf, but I think we should also have tf (term frequency) and word counts, i.e. bag of words. We might even want to have it count how many times certain words appear (a limited bag of words). (A small sketch of this is included at the end of this list of ideas.)
We would want step_featurehashing to enable feature hashing, possibly with multiple different hashing algorithms. As far as I know, MurmurHash3 is a popular choice.
Also be able to simply select the n most used words, or the top 90% most used words.
For stemming there are already a couple of packages we can rely on.
We might also want to include a dictionary stemmer.
(Edit: hunspell is a dictionary stemmer.)
Here we have to be careful, as stopwords are both subject- and language-specific. We should allow the exclusion of the n most used words, as they will contain a lot of the stopwords. And we need to allow the use of a custom stopwords list. We do have some packages
(Edit: tidytext uses the stopwords package.)
Here we have the same problem as we do with stopwords.
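To make the tf / bag-of-words idea above a bit more concrete, here is a rough base-R sketch of going from a token list column to counts and one common tf-idf weighting (the eventual step could of course use a different formulation; the token data is made up):

# A token list column like the one step_tokenize would produce.
tokens <- list(
  c("quick", "brown", "fox"),
  c("quick", "quick", "dog")
)

# Bag of words: one row per document, one column per vocabulary term.
vocab <- sort(unique(unlist(tokens)))
counts <- t(sapply(tokens, function(x) table(factor(x, levels = vocab))))

# Term frequency: counts scaled by document length.
tf <- counts / lengths(tokens)

# Inverse document frequency (one value per term), then tf-idf.
idf <- log(length(tokens) / colSums(counts > 0))
tf_idf <- sweep(tf, 2, idf, "*")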
FWIW, you might be interested in also adding the crfsuite R package (https://github.com/bnosac/crfsuite) to this list of options. It does predictions. I think I'll upload it to CRAN in the coming weeks after I have updated udpipe on CRAN.
A couple of initial reactions:
Whether to tokenize the text (and how) is a step in the recipe IMO, and depends on your modeling strategy. For most text modeling strategies, an early step is to tokenize, whether you are doing deep learning or regularized regression. (:arrow_left: Note on that link: I think I have an error in the ROC code that I need to dig out.)
A couple of initial reactions: tidytext largely uses the stopwords package now; I would say that's the central place to go for stopwords in R. hunspell is a dictionary stemmer.
Perfect, thanks for the corrections! I don't think I really want to depend on tidytext (it seems like a big dependency); I included it in the list more for reference.
I agree that tokenizing is a step, and I'm sorry if I haven't been entirely clear. My consideration is more about how the text should be stored between each step.
I'm thinking that step_tokenize should take in a character vector of length n and return a list of length n. Then step_stem, step_stopwords, step_only_include_top_n_words and similar steps would take in a list and return a stemmed or stopword-filtered list. And step_tf_idf, step_hashing would take in a list and return m variables depending on previous specifications.
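Roughly the shapes I have in mind, sketched as plain functions rather than actual step implementations (the names and bodies are just placeholders):

# character vector of length n -> list of length n
tokenize_text <- function(x) {
  strsplit(tolower(x), "\\s+")
}

# list of length n -> list of length n (stand-in for stemming / stopword removal;
# as a toy example it just drops very short tokens)
trim_tokens <- function(tokens) {
  lapply(tokens, function(x) x[nchar(x) > 2])
}

# A final step would take the list of length n and return n rows by m numeric
# columns (counts, tf-idf, hashes, ...), e.g. building on the bag-of-words
# counts sketched earlier in the thread.

trim_tokens(tokenize_text(c("the lazy fox", "a quick brown dog")))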
Whether to tokenize the text (and how) is a step in the recipe IMO,
I agree that tokenizing is a step.
Keep in mind that recipes doesn't restrict step order, so we'd have to be clear about which steps require untokenized text (or paste it internally).
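For the "paste it internally" fallback, something as simple as this could work on a token list column (just a sketch):

# Collapse each list element back into a single string for steps
# that need untokenized text.
retokenize_as_text <- function(tokens) {
  vapply(tokens, paste, character(1), collapse = " ")
}

retokenize_as_text(list(c("the", "lazy", "fox"), character(0)))
#> [1] "the lazy fox" ""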
I'm thinking that step_tokenize should take in a character vector of length n and return a list of length n.
Lists aren't that tidy. I'd rather have something like:
> library(dplyr)
> library(tidytext)
> dat <-
+ data_frame(
+ txt = c(
+ "sample one has this text",
+ "while sample two is in the next row"
+ ),
+ sample = 1:2)
>
> dat %>%
+ group_by(sample) %>%
+ unnest_tokens(word, txt)
# A tibble: 13 x 2
# Groups: sample [2]
sample word
<int> <chr>
1 1 sample
2 1 one
3 1 has
4 1 this
5 1 text
6 2 while
7 2 sample
8 2 two
9 2 is
10 2 in
11 2 the
12 2 next
13 2 row
or perhaps a list column:
> tibble(
+ text =
+ dat %>%
+ group_by(sample) %>%
+ unnest_tokens(word, txt) %>%
+ split(.$sample)
+ )
# A tibble: 2 x 1
text
<list>
1 <tibble [5 × 2]>
2 <tibble [8 × 2]>
unnest() can be used to get back to the concatenated version.
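For example, if that list-column tibble were assigned to a name (say nested, purely hypothetical):

# Expands the nested tibbles back out to one row per token.
nested %>% tidyr::unnest(text)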
Naming is always hard but I want to avoid repeated_underscores_in_names =]
Keep in mind that recipes doesn't restrict step order, so we'd have to be clear about which steps require untokenized text (or paste it internally).
Very true, some of the embeddings are based on characters instead of words.
I don't really like the first option, as you can have something like this happen:
library(dplyr)
library(tidytext)

dat <-
data_frame(
txt = c(
"sample one has this text",
"let me be behind many"
),
sample = 1:2)
dat %>%
unnest_tokens(word, txt)
#> # A tibble: 10 x 2
#> sample word
#> <int> <chr>
#> 1 1 sample
#> 2 1 one
#> 3 1 has
#> 4 1 this
#> 5 1 text
#> 6 2 let
#> 7 2 me
#> 8 2 be
#> 9 2 behind
#> 10 2 many
dat %>%
unnest_tokens(word, txt) %>%
anti_join(stop_words)
#> Joining, by = "word"
#> # A tibble: 2 x 2
#> sample word
#> <int> <chr>
#> 1 1 sample
#> 2 1 text
By removing stop words we lost all the words in sample two, and now it is hard to get that back.
Lists aren't that tidy.
I know :( But I was kinda thinking along the lines of the list column route. Still not super tidy.
dat %>%
unnest_tokens(word, txt) %>%
nest(word)
#> # A tibble: 2 x 2
#> sample data
#> <int> <list>
#> 1 1 <tibble [5 × 1]>
#> 2 2 <tibble [5 × 1]>
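Building on that, the stopword filtering could then happen inside the list column, so the row for sample two survives as an empty tibble instead of disappearing (just a sketch, reusing stop_words from tidytext and map() from purrr; based on the output earlier, sample one keeps 2 words and sample two keeps 0):

library(purrr)

dat %>%
  unnest_tokens(word, txt) %>%
  nest(word) %>%
  mutate(data = map(data, ~ filter(.x, !word %in% stop_words$word)))
#>   sample data
#>    <int> <list>
#> 1      1 <tibble [2 × 1]>
#> 2      2 <tibble [0 × 1]>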
How would we start this package? Is it something I create or you create (@topepo) and move to tidymodels when it is more ripe?
How would we start this package?
Check out the usethis package for good ways of creating a new package. tidymodels/embed is an example of a package that contains only new steps.
Is it something I create or you create (@topepo) and move to tidymodels when it is more ripe?
Have at it yourself (and I can help if you need). My process is to keep it in my personal account until I'm close to submitting to CRAN.
I suggest starting with something simple like stop words and/or stemming. Let me know the repo name and I'll keep watch.
Hi Max, hi all, thanks for your patience, here finally I am too :-)
Some first comments from my side...
For embeddings, I'd also say it makes sense to have both train-yourself as well as pre-trained as options.
For train-yourself, there already is the tfembed package, right? I've always wanted to check how you implemented it, Max; would you like me to take a closer look?
For pre-trained, TF Hub (https://www.tensorflow.org/hub/modules/google/elmo/2) has ELMo, and we already have some basic hub functionality in tfdatasets.
That is very raw and untested though; the last time I used it, it filled up my complete temp space of 32G and I had to kill it. It's probably worth another try.
For tokenizing and converting to integers, I think I basically always end up using the same workflow in keras:
1) fit_on_texts --- trains the tokenizer
2) texts_to_sequences --- converts text to integers
3) pad_sequences --- pads or truncates to the same length
e.g.,
library(keras)

# sample_captions, train_captions, and max_length are assumed to be defined elsewhere
top_k <- 5000
tokenizer <- text_tokenizer(
  num_words = top_k,
  oov_token = "<unk>",
  filters = '!"#$%&()*+.,-/:;=?@[\\]^_`{|}~ ')
tokenizer$fit_on_texts(sample_captions)

train_captions_tokenized <-
  tokenizer %>% texts_to_sequences(train_captions)

train_captions_padded <- pad_sequences(
  train_captions_tokenized,
  maxlen = max_length,
  padding = "post",
  truncating = "post"
)
There's also a parallel way of converting to 1-hot instead of to integers, but I don't see that used often...
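If that route is ever needed, I believe texts_to_matrix() in the keras R package covers it; a minimal sketch reusing the tokenizer above (mode can also be "count" or "tfidf"):

# 0/1 indicators per word instead of integer sequences.
train_captions_onehot <- texts_to_matrix(
  tokenizer,
  train_captions,
  mode = "binary"
)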
Did this ever happen? I'm going to end up writing a lot of this in the next few days if it didn't, would love to use code if it exists and contribute if it doesn't.
Hello @jonthegeek! We have textrecipes, which includes a dozen or so steps.
Fantastic!
For this project, the idea is to have steps that can be used to process text data (contained in a new package). I've made placeholders in that project for some obvious processing candidates.
@EmilHvitfeldt has volunteered to get started. Perhaps @juliasilge, @skeydan, and others might have some suggestions and opinions. I'd be happy to include tensorflow methods for text processing too.
I can add anyone interested in helping to the project page. We can use this issue to discuss ideas and kick around implementation questions.