ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Comply with text interchange format, perhaps also adding vignette #49

Closed lmullen closed 6 years ago

lmullen commented 7 years ago

The aim is to meet the standards in this proposal from the text workshop:

https://github.com/ropensci/textworkshop17/issues/14

dselivanov commented 7 years ago

I was thinking about the implications of having a single flat data.frame as the default format, and there are several drawbacks. It would be inconsistent with what base strsplit and stringi::stri_split* produce, namely a list of vectors, so constructing a data frame from that output takes time and consumes memory. For POS tagging and DTM construction it would then need to be split back into a list of vectors again. So I'm afraid that making it the only format would introduce too much overhead.
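
For example, just to illustrate the shape in question (my own snippet, not from the package):

> strsplit(c("a b c", "d e"), " ")
[[1]]
[1] "a" "b" "c"

[[2]]
[1] "d" "e"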

kbenoit commented 7 years ago

I agree with @dselivanov about the drawbacks of a data.frame-only format. There is also the issue that parallelisation works mainly (perhaps only) on lists. However, I don't think we decided to change the default list format, but rather to create coercion methods so the data.frame can serve as the interchange format.

BTW the performance disadvantages of data.frames vanish if you make them a data.table for the operations you cite, but I can understand there might be a reluctance to use a non-base object class for an interchange format.
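
A rough sketch of that round trip (illustrative only, not benchmarked here):

> library(data.table)
> toks <- list(d1 = c("a", "b"), d2 = c("c", "d", "e"))
> dt <- data.table(docid = rep(names(toks), lengths(toks)),
+                  token = unlist(toks, use.names = FALSE))
> dt[, .(tokens = list(token)), by = docid]  # regroup into a list of tokens per document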

lmullen commented 7 years ago

I agree that we definitely need to keep the current list of tokens format for tokenizers, for both performance and compatibility reasons.

You can see the interoperability branch for some functions that I've added to convert between list and data frame formats. https://github.com/ropensci/tokenizers/tree/interoperability

The function to convert to a data frame is really slow. It doesn't need to be, since either a tidyr::unnest() or a dplyr::bind_rows() call would be faster than do.call(rbind).
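
For comparison, a sketch of the slow pattern versus one faster alternative (the actual function on the branch differs in its details):

> toks <- tokenize_words(c(d1 = "a b c d e f", d2 = "g h i j k"))
> # slow: one small data frame per document, then repeated copying in rbind
> df <- do.call(rbind, lapply(names(toks), function(id)
+   data.frame(docid = id, token = toks[[id]], stringsAsFactors = FALSE)))
> # faster: dplyr::bind_rows() binds all the pieces in one pass
> df <- dplyr::bind_rows(lapply(names(toks), function(id)
+   data.frame(docid = id, token = toks[[id]], stringsAsFactors = FALSE)))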

dselivanov commented 7 years ago

It's possible to build the data.frame without lapply and rbind. I'll send a PR.

lmullen commented 7 years ago

🚀

kbenoit commented 7 years ago

You just need to make sure there are names for each "document" before this:

> toks <- tokenize_words(c(d1 = "a b c d e f", d2 = "g h i j k"))
> data.frame(docid = rep(names(toks), lengths(toks)), token = unlist(toks))
    docid token
d11    d1     a
d12    d1     b
d13    d1     c
d14    d1     d
d15    d1     e
d16    d1     f
d21    d2     g
d22    d2     h
d23    d2     i
d24    d2     j
d25    d2     k

or

> data.frame(docid = rep(names(toks), lengths(toks)), token = unlist(toks, use.names = FALSE))
   docid token
1     d1     a
2     d1     b
3     d1     c
4     d1     d
5     d1     e
6     d1     f
7     d2     g
8     d2     h
9     d2     i
10    d2     j
11    d2     k

if you don't want row names

lmullen commented 7 years ago

Ah, makes sense. I didn't realize rep() was vectorized in that way.
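
For anyone following along, this is the vectorized behavior in question: rep() accepts a vector for times, repeating each element the corresponding number of times, which is what builds the docid column above:

> rep(c("d1", "d2"), times = c(2, 3))
[1] "d1" "d1" "d2" "d2" "d2"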