ropensci / tokenizers

Fast, Consistent Tokenization of Natural Language Text
https://docs.ropensci.org/tokenizers

Design discussion #1

Closed: dselivanov closed this issue 8 years ago

dselivanov commented 8 years ago

I have a strong opinion that functions should always return values of the same type. Generally it drastically simplifies the logic of subsequent functions - there's no need to add extra type checks. If you really want to keep the simplification to a vector, it could be a good idea to add a `simplify` argument which should be FALSE by default.
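
For illustration, here is a minimal sketch of that contract - the function name is hypothetical, and the splitting is delegated to stringi just to make it runnable:

```r
# Hypothetical sketch: always return a list of character vectors,
# and only collapse to a plain character vector when the caller opts in.
tokenize_words_sketch <- function(x, simplify = FALSE) {
  out <- stringi::stri_split_boundaries(x, type = "word",
                                        skip_word_none = TRUE)
  if (simplify && length(out) == 1) out[[1]] else out
}

tokenize_words_sketch("A type-stable default.")                   # list of length 1
tokenize_words_sketch("A type-stable default.", simplify = TRUE)  # character vector
```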

Another question is whether we should perform lowercasing by default. I don't have strong arguments against this, but I'm still not sure...

dselivanov commented 8 years ago

See, for example, purrr - how it fights against weakly typed functions and why that makes sense.
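
purrr encodes the return type in the function name, so callers never have to guess what comes back:

```r
library(purrr)

words <- list("one two", "three")
map(words, toupper)      # always a list, no matter the input
map_chr(words, toupper)  # always a character vector, or an error if that's impossible
```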

lmullen commented 8 years ago

You beat me to the question. :-)

I see your point about functions always returning the same kind of output. That's certainly the design direction that stringi and stringr take, since their splitting functions always return a list. I could eliminate a lot of code if I just went with stringi and stringr on this.

The thing I don't like about that is that it forces the function to return a higher-order data structure, so a character vector input always gets turned into a list of character vectors.
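
For example, this is the behavior in question:

```r
# Both splitters return a list even for a single input string:
stringi::stri_split_fixed("one two three", " ")
#> [[1]]
#> [1] "one"   "two"   "three"

stringr::str_split("one two three", " ")  # same shape
```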

I'm thinking that the best thing to do is have the argument simplify = FALSE as the default, as you suggested. Would that be an acceptable choice for you, @dselivanov?

lmullen commented 8 years ago

On the question of lowercasing by default. It seems to me that with certain kinds of tokens you almost always want lowercase. So with words or n-grams, I've never not made things lowercase. But with other kinds of "tokens" (loosely defined) such as sentences or paragraphs I don't want things lowercase unless I do it explicitly. So that's what I've tried to do with the defaults.
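
Roughly, and assuming the current function names, that means defaults like this:

```r
# Word-level tokens: lowercased by default.
tokenize_words("The QUICK Brown Fox")
#> [[1]]
#> [1] "the"   "quick" "brown" "fox"

# Sentence-level "tokens": case preserved unless you ask for it.
tokenize_sentences("First sentence. Second SENTENCE.")
#> [[1]]
#> [1] "First sentence."  "Second SENTENCE."

tokenize_sentences("First sentence. Second SENTENCE.", lowercase = TRUE)
```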

lmullen commented 8 years ago

Two other questions for you, @dselivanov:

  1. Are there any other tokenizers you might want for this to be useful for text2vec?
  2. Any other problems with the design of the interface?

Once we nail this down, I'll add tests.

dselivanov commented 8 years ago

For me, simplify = FALSE is definitely fine.

Regarding the lowercasing: I agree that 99% of the time NLP tasks operate on lowercased tokens. But before tokenization, users usually perform preprocessing with regular expressions and lowercasing; I haven't observed many cases where a user tokenizes raw text. So maybe it's worth leaving the tokenizers with only their main function...

For the n-gram tokenizer, I have a C++ implementation, and it is even exported in the public text2vec API as ngrams. I'll be happy to contribute it to tokenizers - I will polish it a little and open a PR.

Also think about stopwords for n-grams: should we include them or not? Personally I think it's worth adding such an argument.
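
Something like this is what I have in mind - the signature is hypothetical, with stopwords dropped before the shingles are built:

```r
# Hypothetical: drop stopwords first, then build the n-grams.
tokenize_ngrams("the quick brown fox", n = 2,
                stopwords = c("the", "a", "an"))
#> [[1]]
#> [1] "quick brown" "brown fox"
```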

lmullen commented 8 years ago

I'll go through, then, and add `simplify = FALSE` parameters and revise the documentation.

The n-gram tokenizers are already in C++; they just have R wrappers. Happy to get improvements, though.

https://github.com/lmullen/tokenizers/blob/master/src/shingle_ngrams.cpp
https://github.com/lmullen/tokenizers/blob/master/src/skip_ngrams.cpp

Would you mind sending a PR for stop words?

lmullen commented 8 years ago

I'm basically done with this package for its first CRAN version. All the functions have a consistent interface, as explained in the README. And I've added tests. I'm going to send this to CRAN tomorrow, @dselivanov, unless there is something you'd like to see changed that would make it more useful for text2vec.
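
A minimal sketch of that pattern - every tokenizer takes a character vector, returns a list of character vectors, and accepts simplify:

```r
library(tokenizers)

docs <- c("One document.", "Another document here.")

tokenize_words(docs)       # list of character vectors, one per document
tokenize_characters(docs)  # same shape, different tokens
tokenize_words(docs[1], simplify = TRUE)  # a single input can be simplified
```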

lmullen commented 8 years ago

This package is on CRAN now, though it's not out on all the CRAN mirrors yet.

https://cran.r-project.org/web/packages/tokenizers/index.html