scikit-adaptation / skada

Domain adaptation toolbox compatible with scikit-learn and pytorch
https://scikit-adaptation.github.io/
BSD 3-Clause "New" or "Revised" License

[BUG] ValueError raised when using the GaussianReweightDensityAdapter #67

Closed YanisLalou closed 5 months ago

YanisLalou commented 5 months ago

Using this code snippet:

from skada import GaussianReweightDensityAdapter, make_da_pipeline
from skada.datasets import fetch_office31_surf_all
# Only needed for the workaround of appending a final estimator
from sklearn.linear_model import LogisticRegression

# Load the Office-31 SURF features for all domains
domain_dataset_office31 = fetch_office31_surf_all()

# Pipeline whose last (and only) step is the adapter
pipeline = make_da_pipeline(
    GaussianReweightDensityAdapter(),
)

# Pack the data in leave-one-domain-out form, then fit
X, y, sample_domain = domain_dataset_office31.pack_lodo()
pipeline.fit(X=X, y=y, sample_domain=sample_domain)  # raises ValueError

You'll get this error: ValueError: Found array with 0 sample(s) (shape=(0, 800)) while a minimum of 1 is required by StandardScaler.

It's also worth noting that adding a LogisticRegression at the end of the pipeline makes the error disappear.

tgnassou commented 5 months ago

I think the pipeline treats its last step differently from the steps before it, which would explain why adding LogisticRegression fixes the problem. When the adapter is the last step, the pipeline seems to remove the samples with negative values in sample_domain (i.e., the target samples), but I don't know why.

kachayev commented 5 months ago

The transform removes target values because it 'prepares' the output for the next step, which is expected to be an estimator that has no access to targets (we can't fit on masked labels).

kachayev commented 5 months ago

I meant the selector, not the transformer; sorry for the confusion. We could prevent this from happening by trying to guess what type of estimator is used, but I'm not sure how deep we want to go with that.
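To illustrate how the filtering described above can leave the adapter with an empty array, here is a minimal sketch in plain Python. It assumes, as stated in this thread, that target samples carry negative sample_domain labels. The drop_targets function is a hypothetical stand-in for the selector's filtering, not skada's actual code, and the toy data is made up for the example.

```python
def drop_targets(X, sample_domain):
    """Hypothetical stand-in for the filtering step.

    Keeps only rows whose domain label is non-negative (assumed to be
    source samples) and drops rows with negative labels (targets).
    """
    kept = [(x, d) for x, d in zip(X, sample_domain) if d >= 0]
    X_kept = [x for x, _ in kept]
    domains_kept = [d for _, d in kept]
    return X_kept, domains_kept


# Toy data: two source rows (label 1) and two target rows (label -2)
X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
sample_domain = [1, 1, -2, -2]

X_filtered, domains_filtered = drop_targets(X, sample_domain)

# An adapter that now looks for target rows finds none, which is the
# kind of situation where a downstream scaler would see 0 samples.
X_target = [x for x, d in zip(X_filtered, domains_filtered) if d < 0]
print(len(X_filtered), len(X_target))
```

If the adapter needs both source and target samples to estimate densities, filtering before it runs would plausibly explain the "0 sample(s)" message, though the exact code path inside skada is not shown in this thread.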

YanisLalou commented 5 months ago

We could just check the estimator's methods: if it has a 'transform' method, we don't remove the negative values; if it has a 'predict' method, we remove them.
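A rough sketch of that check, using plain duck typing. The should_drop_targets helper and the two dummy classes are hypothetical names introduced only for illustration; they are not part of skada or scikit-learn. Giving 'transform' precedence when an estimator exposes both methods is an assumption of this sketch, not something decided in this thread.

```python
class DummyAdapter:
    """Hypothetical transformer-like step: exposes fit and transform."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X


class DummyClassifier:
    """Hypothetical predictor-like step: exposes fit and predict."""

    def fit(self, X, y=None):
        return self

    def predict(self, X):
        return [0 for _ in X]


def should_drop_targets(estimator):
    """Decide whether target samples should be removed before fitting.

    Assumption: a step with 'transform' is treated as an adapter or
    transformer and keeps the targets, even if it also has 'predict'.
    A step with only 'predict' gets the targets removed.
    """
    if hasattr(estimator, "transform"):
        return False
    return hasattr(estimator, "predict")


print(should_drop_targets(DummyAdapter()))
print(should_drop_targets(DummyClassifier()))
```

Estimators that expose both methods (some clustering models do) are the ambiguous case, so a real implementation would need to pick a rule for them explicitly.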

kachayev commented 5 months ago

Yes, this is how the default sklearn pipeline makes the distinction.