Closed piroor closed 6 years ago
Sorry for the delay. I'll try to take a look tomorrow.
I'm super behind, but I'll get to this soon.
I finally took a deeper look at this. It looks really nice! If you're interested in continuing, I'll happily merge this once it's done and documented.
Thank you for reviewing! My hands are full at the moment, so I'll write the documentation for this next month.
@piroor no worries, I can relate. I'll probably roll out a release soon, and catch this PR on the next one.
I will keep an eye on this and will look into the code later when I get some cycles to spare.
Thanks @ibnesayeed
Sorry for the long delay. I added descriptions for the newly introduced (separated) classes and modules.
Moreover, I've added more changes to make the tokenizer and filters customizable. Usage of the new options is documented in docs/bayes.md.
@piroor I'll take a look tomorrow.
This looks pretty good overall. I need to dig in a bit more once we handle #172 in the next day or so. I'll try to target this for a 2.3 release in the next week.
Thanks for your patience!
Finally, I got a chance to look at it today. It generally looks good to me, except for a few places where passing a method would have been easier but a module is required instead. For example, the :tokenizer and :token_filters options could just accept their corresponding methods rather than a module that implements those methods under very specific names. Having default implementations in modules is still fine as long as we pass methods rather than modules, like below:
filters = [
CatFilter.filter,
ClassifierReborn::TokenFilters::Stopword.filter,
]
classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer.tokenize, token_filters: filters
This signature will make it easier to write an inline custom tokenizer or filter, while more complex ones can be wrapped in a module when necessary.
@ibnesayeed the code you suggested won't work as you expected, because
filters = [
CatFilter.filter,
ClassifierReborn::TokenFilters::Stopword.filter,
]
filters is not an array of the methods themselves; it is an array of the values returned by calling those methods.
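To illustrate the difference in plain Ruby: writing Module.some_method invokes the method right away, while Module.method(:some_method) gives back a Method object that can be stored and called later. The sketch below uses a hypothetical stand-in module (DemoCatFilter), not the actual classes from this PR.

```ruby
# Hypothetical stand-in for a filter module; not the PR's actual implementation.
module DemoCatFilter
  def self.filter(tokens)
    tokens.reject { |t| t == 'cat' }
  end
end

# Calling the method yields its return value (a filtered array), not the method.
returned_value = DemoCatFilter.filter(%w[cat dog])

# Object#method yields a Method object that can be stored and invoked later.
method_object = DemoCatFilter.method(:filter)

# An array of callables can then be applied one after another.
filters = [method_object]
result = filters.reduce(%w[cat dog bird]) { |tokens, f| f.call(tokens) }
```

So an array meant to hold filters would need method objects (or lambdas), not the results of calling them.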
But I agree that the options should accept a lambda. So I think I should rename both fixed method names, tokenize and filter, to call; then the options can accept either a module or a lambda.
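A minimal sketch of why the rename helps, assuming the consumer only ever invokes call on whatever it is given. All names here (DemoWhitespaceTokenizer, demo_tokenize, drop_short) are hypothetical and are not the API introduced by this PR.

```ruby
# Hypothetical module-based tokenizer that exposes `call` as a module method.
module DemoWhitespaceTokenizer
  def self.call(text)
    text.split
  end
end

# Hypothetical inline filter written as a lambda; lambdas already respond to `call`.
drop_short = ->(tokens) { tokens.reject { |t| t.length < 3 } }

# The consumer relies only on `call`, so modules and lambdas are interchangeable.
def demo_tokenize(text, tokenizer:, token_filters: [])
  token_filters.reduce(tokenizer.call(text)) { |tokens, f| f.call(tokens) }
end

text = 'a cat sat on the mat'
with_module = demo_tokenize(text, tokenizer: DemoWhitespaceTokenizer, token_filters: [drop_short])
with_lambda = demo_tokenize(text, tokenizer: ->(s) { s.split }, token_filters: [drop_short])
```

With this shape, a simple tokenizer or filter can be written inline, and a more complex one can live in a module that defines call.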
As of commit 958d3a0, the :tokenizer and :token_filters options accept lambdas.
This looks good to me
The code LGTM! (I have not tested it though).
Thanks!!
Thanks for the contribution!
Now I'm trying to separate tokenizing operations from the hasher, as the first step for #131. I introduced these new modules and classes:
Tokenizer::Whitespace
Tokenizer::Token
TokenFilter::Stopword
TokenFilter::Stemmer
For testability and flexibility, they are kept separate for now. As the next step, I'm planning to introduce a mechanism for switching the tokenizer and related modules.
How about this approach?