scikit-learn / scikit-learn

scikit-learn: machine learning in Python
https://scikit-learn.org
BSD 3-Clause "New" or "Revised" License

Add Sparse Matrix Support For HistGradientBoostingClassifier #15336

Open jmwoloso opened 4 years ago

jmwoloso commented 4 years ago

Description

Hi!

I'm receiving the error below when attempting to pass a sparse matrix to HistGradientBoostingClassifier. The matrix is the result of using CountVectorizer and TfidfTransformer on input text.

In my case, the size of the text prohibits converting the sparse matrix to a dense one (I run out of memory).

Steps/Code to Reproduce

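import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
# On scikit-learn < 1.0 (including the 0.21.3 listed under Versions), this estimator
# is experimental: import sklearn.experimental.enable_hist_gradient_boosting first.
from sklearn.ensemble import HistGradientBoostingClassifier

df = pd.read_csv(...)

vectorizer = CountVectorizer()
tfidf = TfidfTransformer()
clf = HistGradientBoostingClassifier()

# Both steps return SciPy sparse matrices.
vecs = vectorizer.fit_transform(df.loc[:, "very_large_text"])
vecs = tfidf.fit_transform(vecs)

# Raises the TypeError shown under Actual Results, because vecs is sparse.
clf.fit(vecs, df.loc[:, "label"])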

Expected Results

No error is thrown.

Actual Results

TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense numpy array.

Versions

System:
    python: 3.7.3 (default, Oct 1 2019, 18:28:53) [GCC 5.4.0 20160609]
    executable: /local_disk0/pythonVirtualEnvDirs/virtualEnv-3631eab5-084b-4139-952e-5aff594ac1bb/bin/python
    machine: Linux-4.15.0-1050-azure-x86_64-with-debian-stretch-sid

Python deps:
    pip: 19.0.3
    setuptools: 40.8.0
    sklearn: 0.21.3
    numpy: 1.16.2
    scipy: 1.2.1
    Cython: 0.29.6
    pandas: 0.24.2

thomasjpfan commented 4 years ago

Thank you for posting this feature request. We can discuss what kind of semantics we want for sparse matrix support, i.e. whether to treat zeros as missing values or as literal zeros. LightGBM uses a parameter to decide which semantics to use.
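To make the two candidate semantics concrete, here is a minimal sketch on a tiny matrix. The helper densify and its zero_as_missing flag are hypothetical names used only for illustration, not scikit-learn API, and the sketch densifies the data, which is exactly what the reporter cannot afford at full scale.

```python
import numpy as np
from scipy import sparse


def densify(X, zero_as_missing=False):
    """Convert a SciPy sparse matrix to a dense float array.

    Hypothetical helper for illustration only. With zero_as_missing=True,
    every entry that is not explicitly stored becomes NaN; otherwise it
    stays a literal 0.0.
    """
    dense = X.toarray().astype(np.float64)
    if zero_as_missing:
        # Mark the positions that are explicitly stored in the sparse matrix.
        coo = X.tocoo()
        stored = np.zeros(X.shape, dtype=bool)
        stored[coo.row, coo.col] = True
        # Everything else is treated as a missing value.
        dense[~stored] = np.nan
    return dense


X = sparse.csr_matrix(np.array([[0.0, 2.0, 0.0],
                                [1.0, 0.0, 3.0]]))

literal = densify(X)                        # implicit zeros stay 0.0
missing = densify(X, zero_as_missing=True)  # implicit zeros become NaN
```

For TF-IDF features, a zero means the term does not occur in the document, so the literal-zero reading matches how other estimators interpret the same matrix.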

jmwoloso commented 4 years ago

No problem. Without knowing the full extent of what is required, I'd be happy to try and tackle it with your guidance on where to look, etc.

jnothman commented 4 years ago

Zero semantics would be consistent with every other estimator (except for pairwise data).

NicolasHug commented 3 years ago

For reference, I noted some implementation suggestions in https://github.com/scikit-learn/scikit-learn/issues/16885

I believe @StealthyKamereon wants to give it a shot.

Regarding the semantics of zeros: we could add a boolean parameter zero_as_missing, as LightGBM does. This is not necessary for a first version, though; we should treat zeros as literal zeros to keep the PR as small as possible.

StealthyKamereon commented 3 years ago

Following what you said regarding the semantics of zeros, I think that in addition to the zero_as_missing parameter there should be a categorical_missing_values parameter, which would set the missing values for categorical features. Or maybe something like zero_as: str or list of ndarray of shape (n_cats,), default="missing"

Apoorvgarg-creator commented 6 months ago

Can anyone suggest a temporary workaround for this problem?