Closed: dselivanov closed this issue 8 years ago.
I have a strong opinion that functions should always return values of the same type. Generally it drastically simplifies the logic of subsequent functions - there is no need for additional type checks. If you really want to keep simplification to a vector, it could be a good idea to add a `simplify` argument which is `FALSE` by default.

Another question is whether we should perform lowercasing by default. I don't have strong arguments against it, but I'm still not sure...
See, for example, how purrr fights against weakly typed functions and why that makes sense.
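To make that concrete, here is a small sketch of the type-stability contrast (mine, not from the thread): base R's `sapply()` changes its return type depending on the input, while purrr's typed variants guarantee the output type or fail loudly.

```r
# sapply() is type-unstable: its return type depends on the input
sapply(c("a b c", "d e"), function(x) strsplit(x, " ")[[1]])
#> returns a named list here (uneven lengths), but a matrix or a plain
#> vector for other inputs

# purrr's typed variants guarantee the output type
library(purrr)
map_chr(c("one", "two"), toupper)
#> [1] "ONE" "TWO"
map(c("a b c", "d e"), ~ strsplit(.x, " ")[[1]])
#> always a list, regardless of the input's shape
```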
You beat me to the question. :-)
I see your point about functions always returning the same kind of output. That's certainly the design direction that stringi and stringr take, since they always return a list. I could eliminate a lot of code if I just followed stringi and stringr on this.
The thing I don't like about that approach is that it forces the function to return a higher-order data structure, so a character vector input always gets turned into a list of character vectors.
I'm thinking that the best thing to do is to have the argument `simplify = FALSE` as the default, as you suggested. Would that be an acceptable choice for you, @dselivanov?
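For illustration, roughly how those defaults would play out (function names from the package; the exact `simplify` semantics were still being settled at this point):

```r
library(tokenizers)

# Default: always a list, one element per input document
tokenize_words("The quick brown fox.")
#> [[1]]
#> [1] "the"   "quick" "brown" "fox"

# With simplify = TRUE, a length-one input could drop the wrapping list
tokenize_words("The quick brown fox.", simplify = TRUE)
#> [1] "the"   "quick" "brown" "fox"
```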
On the question of lowercasing by default: it seems to me that with certain kinds of tokens you almost always want lowercase. With words or n-grams, I've never not made things lowercase. But with other kinds of "tokens" (loosely defined), such as sentences or paragraphs, I don't want things lowercased unless I do it explicitly. So that's what I've tried to do with the defaults.
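A quick sketch of that split in defaults (output shown as I'd expect it; check the actual function signatures):

```r
# Words: lowercased by default
tokenize_words("It Was The Best Of Times.")
#> [[1]]
#> [1] "it"    "was"   "the"   "best"  "of"    "times"

# Sentences: case preserved unless you lowercase explicitly
tokenize_sentences("It was the best of times. It was the worst of times.")
#> [[1]]
#> [1] "It was the best of times."  "It was the worst of times."
```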
Two other questions for you, @dselivanov:
Once we nail this down, I'll add tests.
For me `simplify = FALSE` is definitely fine.
Regarding the lowercasing: I agree that 99% of the time NLP tasks operate on lowercased tokens. But before tokenization, users usually perform preprocessing with regular expressions and lowercasing. I haven't observed many cases where users perform tokenization on raw text. So maybe it's worth leaving the tokenizers with only their main function...
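For instance, a typical pipeline of the kind described here might look like this (a hypothetical helper, just to illustrate preprocessing happening before tokenization):

```r
# Normalize with regexes and lowercasing *before* tokenizing
prep <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://\\S+", " ", x)   # strip URLs
  x <- gsub("[^a-z' ]", " ", x)        # keep letters, apostrophes, spaces
  gsub("\\s+", " ", trimws(x))         # collapse whitespace
}

strsplit(prep("Check out https://example.com -- it's GREAT!"), " ")
#> [[1]]
#> [1] "check" "out"   "it's"  "great"
```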
For the n-gram tokenizer I have a C++ implementation, and it is even exported to the public text2vec API as `ngrams`. I'll be happy to contribute it to tokenizers. I will polish it a little bit and open a PR.
Also, think about stopwords for n-grams. Should we include them or not? Personally, I think it's worth adding such an argument.
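One way such an argument could behave - drop the stopwords first, then shingle what remains (the function and argument names here are illustrative, not a committed API):

```r
ngrams_with_stopwords <- function(words, n = 2, stopwords = character()) {
  words <- words[!words %in% stopwords]        # remove stopwords first
  if (length(words) < n) return(character())
  vapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams_with_stopwords(c("the", "quick", "brown", "fox"),
                      n = 2, stopwords = c("the", "a"))
#> [1] "quick brown" "brown fox"
```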
I'll go through, then, and add `simplify = FALSE` parameters and revise the documentation.
The n-gram tokenizers are already in C++; they just have R wrappers. Happy to get improvements, though.
https://github.com/lmullen/tokenizers/blob/master/src/shingle_ngrams.cpp
https://github.com/lmullen/tokenizers/blob/master/src/skip_ngrams.cpp
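For readers unfamiliar with the pattern, a toy sketch of a C++ function with an auto-generated R wrapper via Rcpp (a minimal bigram shingler, not the actual shingle_ngrams.cpp code):

```r
# Compile a C++ function and expose it to R in one step; Rcpp converts
# R character vectors to std::vector<std::string> and back automatically
Rcpp::cppFunction('
std::vector<std::string> bigrams_cpp(std::vector<std::string> words) {
  std::vector<std::string> out;
  for (size_t i = 0; i + 1 < words.size(); ++i)
    out.push_back(words[i] + " " + words[i + 1]);
  return out;
}')

bigrams_cpp(c("the", "quick", "brown", "fox"))
#> [1] "the quick"   "quick brown" "brown fox"
```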
Would you mind sending a PR for stop words?
I'm basically done with this package for its first CRAN version. All the functions have a consistent interface, as explained in the README. And I've added tests. I'm going to send this to CRAN tomorrow, @dselivanov, unless there is something you'd like to see changed that would make it more useful for text2vec.
This package is on CRAN now, though it's not out on all the CRAN mirrors yet.
https://cran.r-project.org/web/packages/tokenizers/index.html