Open rth opened 5 years ago
deprecate TfidfVectorizer because it provides no benefit over a pipeline and creates much confusion.
One limiting factor is that pipelines don't support `get_feature_names()` for now, so to get the previous behavior one would need to do something like (`pipe.named_steps['vect'].get_feature_names()`), but hopefully that will be addressed in the near future.
As mentioned by @jnothman in https://github.com/scikit-learn/scikit-learn/pull/14748#issuecomment-530160881,
+1 on my side to deprecate it.
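
For reference, a minimal sketch of the pipeline equivalent mentioned above. The step names `vect` and `tfidf` and the toy corpus are illustrative assumptions. The sketch also assumes that recent scikit-learn releases expose `get_feature_names_out()` in place of `get_feature_names()`, so it falls back from one to the other.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Toy corpus, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]

# Pipeline equivalent of TfidfVectorizer with default parameters.
# The step names "vect" and "tfidf" are arbitrary labels.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
X_pipe = pipe.fit_transform(docs)
X_tfidf = TfidfVectorizer().fit_transform(docs)

# Both approaches should produce the same matrix.
same_output = np.allclose(X_pipe.toarray(), X_tfidf.toarray())

# Feature names have to be fetched from the counting step.
# Assumption: newer releases provide get_feature_names_out();
# older ones only have get_feature_names(), as written in this issue.
vect = pipe.named_steps["vect"]
get_names = getattr(vect, "get_feature_names_out", None) or vect.get_feature_names
feature_names = list(get_names())
```

With default parameters the two matrices should match, so the main thing lost by using the pipeline is the convenience of asking the final object for feature names directly.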
I would also deprecate the `norm` parameter (for removal) in `HashingVectorizer`, as its combination with `TfidfTransformer` using default parameters currently produces nonsense results due to `norm='l2'`. That would also resolve https://github.com/scikit-learn/scikit-learn/issues/6972

Generally, I have also seen experienced Python users do very weird things with text vectorizers (particularly as soon as customization is involved). Taking some time to think about how we would ideally like their API to look for 1.0, and whether we can get partially there without major disruption, would, I think, be useful.
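
To make the `norm` concern above concrete, here is a small sketch of what `HashingVectorizer` hands to `TfidfTransformer` with and without its default normalization. It does not attempt to reproduce the linked issue. The toy corpus and the choice of `norm=None, alternate_sign=False` as the way to obtain raw counts are my assumptions, not something stated in this thread.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Toy corpus, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]

# Defaults: norm='l2', so every row is already scaled to unit length
# before any IDF weighting could be applied.
X_default = HashingVectorizer().transform(docs)
default_row_norms = np.linalg.norm(X_default.toarray(), axis=1)

# Assumed alternative: no normalization and no sign flipping, so the
# matrix holds plain non-negative term counts.
X_counts = HashingVectorizer(norm=None, alternate_sign=False).transform(docs)
tokens_per_doc = np.asarray(X_counts.sum(axis=1)).ravel()

# Normalization is then left entirely to TfidfTransformer.
pipe = Pipeline([
    ("hash", HashingVectorizer(norm=None, alternate_sign=False)),
    ("tfidf", TfidfTransformer()),
])
X_hashed_tfidf = pipe.fit_transform(docs)
```

If `norm` were removed from `HashingVectorizer`, the second configuration would presumably no longer need the explicit `norm=None`.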
In particular, I find the way we currently suggest customizing the behavior, by subclassing `CountVectorizer` to update the analyzer, really awkward. I am pondering some way to separate the pipeline that processes a single document and returns n-grams from the `CountVectorizer` class. Basically, making passing analyzers easier while re-using part of the existing processing.

Anyway, we can start with easy deprecations.
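
As a rough illustration of the analyzer point above, the sketch below reuses the existing per-document processing through `build_analyzer()` and passes a wrapped callable via the `analyzer` parameter instead of subclassing. The function `analyze_without_digits`, its digit-filtering rule, and the toy corpus are hypothetical, and this is only one way it could be done with the current API, not a proposal for the new design.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Existing per-document pipeline (preprocessing, tokenization, n-grams),
# obtained from a throwaway vectorizer and reused as-is.
base_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()


def analyze_without_digits(doc):
    """Hypothetical customization: drop every n-gram containing a digit."""
    return [
        ngram
        for ngram in base_analyzer(doc)
        if not any(ch.isdigit() for ch in ngram)
    ]


# Toy corpus, for illustration only.
docs = ["Room 101 is free", "Room 202 is taken"]

# Pass the callable directly; no subclass of CountVectorizer is needed.
vect = CountVectorizer(analyzer=analyze_without_digits)
X = vect.fit_transform(docs)
vocabulary = sorted(vect.vocabulary_)
```

The awkward part that remains is that the base analyzer has to be built from a separate `CountVectorizer` instance, which is the kind of coupling the separation discussed above would remove.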