[FEA] TF-IDF feature vectorizer

rapidsai / cuml

cuML - RAPIDS Machine Learning Library

https://docs.rapids.ai/api/cuml/stable/

Apache License 2.0


Closed JohnZed closed 4 years ago

JohnZed commented 4 years ago

For NLP problems, TF-IDF is a common pre-processing approach. scikit-learn has a simple interface for this: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
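For readers unfamiliar with the weighting, here is a minimal pure-Python sketch of TF-IDF. It is not cuML or scikit-learn code. It assumes whitespace tokenization and lowercasing (both simplifications), uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` that scikit-learn documents as its default, and L2-normalizes each row. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return (vocabulary, rows); each row is an L2-normalized TF-IDF vector."""
    # Assumption: naive lowercase + whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(docs)

    # Document frequency: number of documents containing each token.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency.
    idf = [math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab]

    rows = []
    for toks in tokenized:
        row = [0.0] * len(vocab)
        for tok, count in Counter(toks).items():
            row[index[tok]] = count * idf[index[tok]]
        norm = math.sqrt(sum(v * v for v in row))
        rows.append([v / norm for v in row] if norm else row)
    return vocab, rows


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, rows = tfidf(docs)
```

Note that the vocabulary and IDF weights are learned from the corpus, so this transform is stateful: the fitted state has to be kept around to transform new documents consistently.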

To allow stateless transformations, it would be nice to support the hashing vectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) variant as well.
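To illustrate what "stateless" means here, a rough sketch of the hashing trick in plain Python follows. Each token is mapped to a column by hashing it, so no vocabulary has to be fitted or stored. This is an assumption-laden illustration: it uses MD5 from the standard library in place of the MurmurHash3 that scikit-learn's `HashingVectorizer` uses, tokenizes on whitespace, and skips the signed-hash and normalization steps. The name `hashing_vectorize` and the default of 16 columns are hypothetical.

```python
import hashlib


def hashing_vectorize(doc, n_features=16):
    """Map a document to a fixed-width count vector without a fitted vocabulary."""
    row = [0] * n_features
    for tok in doc.lower().split():
        # A stable hash (unlike Python's salted built-in hash()) picks the column.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "little") % n_features
        row[bucket] += 1
    return row


a = hashing_vectorize("the cat sat")
b = hashing_vectorize("the cat sat")
```

Because the column index depends only on the token and `n_features`, any worker can transform any document independently and get the same result. The trade-off is that distinct tokens may collide in one column and the mapping cannot be inverted back to words.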

VibhuJawa commented 4 years ago

I can take this up, as I currently have some initial work for both.

CC: @randerzander

cjnolet commented 4 years ago

@VibhuJawa

Are you also going to be providing a CountVectorizer and a TfidfTransformer with your implementation?

VibhuJawa commented 4 years ago

@cjnolet, yup, the plan currently is to have these three:

I was waiting on some of the strings refactor to merge into cudf before I start this implementation. We can move it up if that will help.

CC: @randerzander

cjnolet commented 4 years ago

@VibhuJawa, have you been able to make any progress on these yet?

cjnolet commented 4 years ago

These are in cuml 0.15. Closing.