koaning / tokenwiser

Bag of, not words, but tricks!
https://koaning.github.io/tokenwiser/
Apache License 2.0
68 stars, 7 forks

Support for sklearn partial pipelines that take custom doc features as input. #43

Closed by narayanacharya6 3 years ago

narayanacharya6 commented 3 years ago

Hey, the library looks promising and I completely agree with the motivation behind it!

I have a question based on your blog post about using custom sklearn models as part of a spaCy pipeline. The example in the blog post uses the HashingVectorizer from sklearn directly. I'd like to swap that out and instead feed my PartialPipeline, which would host only the classifier, with custom features that an earlier pipeline component extracts from the Doc. So the entire pipeline would look something like this: tokenizer >> custom_featurizer (sets an extension on the Doc holding the features) >> partial pipeline (only the classifier, which uses the features from the previous component).

The above question does not look well worded, so I'd be happy to add more color to the question if it does not make sense.

narayanacharya6 commented 3 years ago

I think the fastest way to achieve what I want would be to create a subclass of SklearnCat and change the update method to call partial_fit with the features from my previous pipeline component instead of the texts.

If there is a better way to do this, please do let me know :)
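
A minimal sketch of the idea, independent of tokenwiser: the classifier is updated with `partial_fit` on precomputed document-level features instead of raw text. Everything named here (`doc_features`, `FeatureCat`, the three features, the labels) is hypothetical and only for illustration. It does not reproduce the actual `SklearnCat` interface, and in a real spaCy pipeline the features would be read from a custom extension on the Doc rather than computed from a string.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def doc_features(text):
    """Hypothetical document-level features:
    character count, whitespace token count, share of uppercase characters."""
    tokens = text.split()
    n_chars = len(text)
    n_upper = sum(ch.isupper() for ch in text)
    return [n_chars, len(tokens), n_upper / max(n_chars, 1)]


class FeatureCat:
    """Hypothetical stand-in for a categoriser that trains on
    precomputed features rather than on raw text."""

    def __init__(self, classes):
        # partial_fit needs the full set of classes on its first call.
        self.classes = np.array(classes)
        self.model = SGDClassifier(random_state=0)

    def update(self, feature_rows, labels):
        # One incremental training step on a batch of feature rows.
        X = np.asarray(feature_rows, dtype=float)
        self.model.partial_fit(X, labels, classes=self.classes)

    def predict(self, feature_rows):
        return self.model.predict(np.asarray(feature_rows, dtype=float))


# Toy batch, made up for the example.
texts = ["HELLO THERE", "hi", "STOP SHOUTING NOW", "ok fine"]
labels = ["loud", "quiet", "loud", "quiet"]

cat = FeatureCat(classes=["loud", "quiet"])
cat.update([doc_features(t) for t in texts], labels)
preds = cat.predict([doc_features("WHAT")])
```

In the subclassing approach, the body of `update` is roughly what would replace the call that passes texts to the underlying sklearn pipeline.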

koaning commented 3 years ago

Just to confirm, you've seen these featurizers? Also, what kind of features from the Doc would you want to use that might contribute to a better classification?

narayanacharya6 commented 3 years ago

I have some very simple document-level features which have worked well for the task at hand. If I understand correctly, the featurizers you mention incorporate token-level information.

I did end up subclassing SklearnCat and overriding the methods responsible for partial_fit, predict, etc. I ran into some other issues, but I guess those have more to do with spaCy's training behaviour than with tokenwiser :)

Closing the issue.