Closed by JohnZed 4 years ago
I can take this up, as I currently have some initial work for both.
CC: @randerzander
@VibhuJawa Are you also going to be providing a CountVectorizer and a TfidfTransformer with your implementation?
@cjnolet, yup, the plan currently is to have these three:
CountVectorizer
HashingVectorizer
TfidfTransformer
I was waiting on some of the strings refactor to merge into cudf before I start this implementation. We can move it up if that will help.
CC: @randerzander
@VibhuJawa, have you been able to make any progress on these yet?
These are in cuml 0.15. Closing.
For NLP problems, TF-IDF is a common pre-processing approach. scikit-learn has a simple interface for this: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html
To allow stateless transformations, it would be nice to support the hashing vectorizer (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html) variant as well.
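To illustrate what "stateless" means here, below is a minimal pure-Python sketch, not the scikit-learn or cuml implementation. It assumes whitespace tokenization, an MD5-based hash (scikit-learn uses a signed MurmurHash3 instead), and a tiny hypothetical feature count `N_FEATURES`. The names `fit_vocabulary`, `hash_token`, and `hashing_vectorize` are made up for this example.

```python
import hashlib

N_FEATURES = 16  # hypothetical; kept tiny for illustration


def fit_vocabulary(docs):
    """Stateful approach: the column index of a token depends on the corpus seen at fit time."""
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def hash_token(token, n_features=N_FEATURES):
    """Map a token to a column index using only the token itself (no fitted state)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little") % n_features


def hashing_vectorize(doc, n_features=N_FEATURES):
    """Stateless approach: count tokens into hashed buckets; collisions share a bucket."""
    vec = [0] * n_features
    for token in doc.lower().split():
        vec[hash_token(token, n_features)] += 1
    return vec


# The vocabulary changes with the corpus, while the hashed vector for a document does not.
vocab_a = fit_vocabulary(["the cat sat", "the dog ran"])
vocab_b = fit_vocabulary(["the dog ran", "the cat sat"])
row = hashing_vectorize("the cat sat on the mat")
```

Because no vocabulary has to be learned or stored, each document (or each partition of a distributed dataset) can be transformed independently. The trade-off is that hash collisions merge unrelated tokens and the mapping cannot be inverted back to token strings.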