jekyll / classifier-reborn

A general classifier module to allow Bayesian and other types of classifications. A fork of cardmagic/classifier.
https://jekyll.github.io/classifier-reborn/
GNU Lesser General Public License v2.1
551 stars 109 forks source link

Separate tokenizer from hasher #162

Closed piroor closed 6 years ago

piroor commented 7 years ago

Now I'm trying to separate tokenizing operations from the hasher, as the first step for #131. I introduced these new modules and classes:

For testability and flexibility, they are stayed separated for now. Next step, I'm planning to introduce some mechanism to switch the tokenizer and related modules.

How about this approach?

Ch4s3 commented 7 years ago

Sorry for the delay. I'll try to take a look tomorrow.

Ch4s3 commented 6 years ago

I'm super, behind, but I'll get to this soon.

Ch4s3 commented 6 years ago

I finally took a deeper look at this. It looks really nice! If you're interested in continuing, I'll happily merge this once it's done and documented.

piroor commented 6 years ago

Thank you for reviewing! Currently my hands are full, so I'll write document for this in the next month.

Ch4s3 commented 6 years ago

@piroor no worries, I can relate. I'll probably roll out a release soon, and catch this PR on the next one.

ibnesayeed commented 6 years ago

I will keep an eye on this and will look into the code later when I get some cycles to spare.

Ch4s3 commented 6 years ago

Thanks @ibnesayeed

piroor commented 6 years ago

Sorry for this large delay. I added descriptions for newly introduced (separated) classes and modules.

Moreover, I've added more changes to make tokenizer and filters customizable. Usage of new options are added to docs/bayes.md.

Ch4s3 commented 6 years ago

@piroor I'll take a look tomorrow.

Ch4s3 commented 6 years ago

This looks pretty good overall. I need to dig in a bit more once we handle #172 in the next day or so. I'll try to target this for a 2.3 release in the next week.

Thanks for you patience!

ibnesayeed commented 6 years ago

Finally, I got a chance to look at it today. It is generally looking good to me except a few places where passing a method would have been easier, but a module is required instead. For example, the :tokenizer and token_filters options could just accept their corresponding methods rather than a module that implements those methods with very specific names. Having some default implementation in modules is still fine as long as we pass methods rather than the modules like below:

filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]
classifier = ClassifierReborn::Bayes.new tokenizer: BigramTokenizer.tokenize, token_filters: filters

This signature will make it easier to write an inline custom tokenizer or filter, while more complex ones can be wrapped in a module when necessary.

piroor commented 6 years ago

@ibnesayeed the code you suggested won't work as you expected, because

filters = [
  CatFilter.filter,
  ClassifierReborn::TokenFilters::Stopword.filter,
]

the filters are not array of methods themselves, it is an array of returned values from those methods.

But I agree that the option should accept lambda. So I think I should rename both fixed method name tokenize and filter to call, then the option can accept both module and lambda.

piroor commented 6 years ago

After the commit 958d3a0, now :tokenizer and :token_filters options accept lambda.

Ch4s3 commented 6 years ago

This looks good to me

ibnesayeed commented 6 years ago

The code LGTM! (I have not tested it though).

piroor commented 6 years ago

Thanks!!

Ch4s3 commented 6 years ago

Thanks for the contribution!