lmullen closed this issue 6 years ago
I was thinking about the implications of having a single flat data.frame as the single default format. There are several drawbacks. It would be inconsistent with what base strsplit and stringi::stri_split* produce - a list of vectors. Constructing a data frame from that takes time and consumes memory, and for POS tagging and DTM construction it would need to be split back into a list of vectors again. So I'm afraid that making it the only format would introduce too much overhead.
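For concreteness, the round trip described above can be sketched in base R (the docid/token column names are just illustrative, not a settled interface):

```r
# A list of token vectors, the shape strsplit()/stri_split*() return
toks <- list(d1 = c("a", "b", "c"), d2 = c("d", "e"))

# Flatten into the proposed single data frame (costs time and memory) ...
df <- data.frame(docid = rep(names(toks), lengths(toks)),
                 token = unlist(toks, use.names = FALSE))

# ... and then split straight back into a list of vectors
# for POS tagging or DTM construction
toks2 <- split(df$token, df$docid)
```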
I agree with @dselivanov about the drawbacks of a data.frame-only format. There is also the issue that parallelisation works mainly (only?) on lists. However, I don't think we decided to change the default list format, but rather to create coercion methods so the data.frame can serve as the interchange format.
BTW the performance disadvantages of data.frames vanish if you make them a data.table for the operations you cite, but I can understand there might be a reluctance to use a non-base object class for an interchange format.
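As a hedged sketch of the data.table point (same illustrative docid/token layout as above; requires the data.table package):

```r
library(data.table)

toks <- list(d1 = c("a", "b", "c", "a"), d2 = c("b", "c"))
dt <- data.table(docid = rep(names(toks), lengths(toks)),
                 token = unlist(toks, use.names = FALSE))

# Grouped operations such as per-document term counts (the first step
# toward a document-term matrix) avoid base data.frame copying overhead:
counts <- dt[, .N, by = .(docid, token)]

# And the list-of-vectors shape is recovered with a single grouped call:
relisted <- dt[, .(tokens = list(token)), by = docid]
```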
I agree that we definitely need to keep the current list of tokens format for tokenizers, for both performance and compatibility reasons.
You can see the interoperability branch for some functions that I've added to convert between list and data frame formats: https://github.com/ropensci/tokenizers/tree/interoperability
The function to convert to a data frame is really slow. I know it doesn't need to be that slow, since either a tidyr::unnest() or dplyr::bind_rows() call would speed things up over do.call(rbind).
It is possible to make the data.frame without lapply() and rbind(). I will send a PR.
🚀
You just need to make sure there are names for each "document" before this:
> toks <- tokenize_words(c(d1 = "a b c d e f", d2 = "g h i j k"))
> data.frame(docid = rep(names(toks), lengths(toks)), token = unlist(toks))
docid token
d11 d1 a
d12 d1 b
d13 d1 c
d14 d1 d
d15 d1 e
d16 d1 f
d21 d2 g
d22 d2 h
d23 d2 i
d24 d2 j
d25 d2 k
or
> data.frame(docid = rep(names(toks), lengths(toks)), token = unlist(toks, use.names = FALSE))
docid token
1 d1 a
2 d1 b
3 d1 c
4 d1 d
5 d1 e
6 d1 f
7 d2 g
8 d2 h
9 d2 i
10 d2 j
11 d2 k
if you don't want row names
Ah, makes sense. I didn't realize rep() was vectorized in that way.
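For reference, the two vectorized pieces of that trick, shown separately (values are illustrative):

```r
# `times` can be a vector: each element is repeated its own number of times
rep(c("d1", "d2"), times = c(3, 2))
#> [1] "d1" "d1" "d1" "d2" "d2"

# lengths() supplies exactly those per-element counts for a list
lengths(list(d1 = c("a", "b", "c"), d2 = c("d", "e")))
#> d1 d2
#>  3  2
```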
The aim is to meet the standards for this proposal from the text workshop.
https://github.com/ropensci/textworkshop17/issues/14