make_da_pipeline does not remove target labels during estimator fit

antoinedemathelin commented 8 months ago

Hi everyone,

When looking at the plot_label_comparison example, I observe very accurate predictions of the DA methods for cases where DA should fail. After some investigation, it seems that the estimator is fitted on both source and target labels, while it should be fitted only on source. I made a little example with KLIEP highlighting this behavior:

import numpy as np
from sklearn.linear_model import LinearRegression
from skada import KLIEP

X_source = np.ones(10)
X_target = X_source

y_source = np.ones(10)
y_target = np.zeros(10)

X = np.concatenate((X_source, X_target))
y = np.concatenate((y_source, y_target))
sample_domain = np.ones(X.shape[0])
sample_domain[X_source.shape[0]:] *= -2

lr = LinearRegression()

kliep = KLIEP(base_estimator=lr)

kliep.fit(X.reshape(-1, 1), y, sample_domain=sample_domain)
kliep.predict(X.reshape(-1, 1))

>>> array([0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5,
       0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])

Here, the source and target inputs are the same X_source = X_target while the labels are different y_source = 1 and y_target = 0. If, the target labels are unknown, kliep should predict 1, while it predicts 0.5 here. I suspect, then, that it has fitted the linear regressor on target labels too.

When looking at the code of make_da_pipeline I do not see where the target labels are removed before fitting the estimator ?

kachayev commented 8 months ago

Hi @antoinedemathelin, the removal happens in the Selector, e.g. here for a single source/single target implementation. Would the result be different if you use DomainAwareDataset to construct X, y, sample_domain tuple with pack_train?

kachayev commented 8 months ago

I ran the example with proper masking of the inputs, found an old bug with regression masks. Please check out this PR #86

kachayev commented 7 months ago

Hi @antoinedemathelin! I just noticed that the issue was automatically closed when the related PR was merge. Did you have a chance to verify it works as expected now? (or should we reopen the issue)

scikit-adaptation / skada

make_da_pipeline does not remove target labels during estimator fit #83