scikit-adaptation / skada

Domain adaptation toolbox compatible with scikit-learn and pytorch
https://scikit-adaptation.github.io/
BSD 3-Clause "New" or "Revised" License

KLIEP pipeline does not use the sample_weight returned by KLIEPAdapter when fitting the estimator #89

Closed antoinedemathelin closed 5 months ago

antoinedemathelin commented 5 months ago

Hi everyone,

I think there is an issue with KLIEP: it provides the same results as the estimator fitted without sample_weight. It seems that the sample_weight returned by KLIEPAdapter is not passed to the fit method of the estimator. Here is a small example showing that KLIEP gives the same predictions as the unweighted estimator, while it should match the estimator fitted with the weights returned by KLIEPAdapter:

import numpy as np
from sklearn.linear_model import LinearRegression
from skada import KLIEP, KLIEPAdapter
from skada.datasets import DomainAwareDataset

X_source = np.array([[-1.],
                     [0.],
                     [1.]])

X_target = np.array([[1.]])

y_source = np.abs(X_source.ravel())
y_target = np.ones(1)

# Pack source and target into one training set; pack_train masks the
# target labels (with NaN) so only source labels are used for fitting.
dataset = DomainAwareDataset([(X_source, y_source, 's'), (X_target, y_target, 't')])
X, y, sample_domain = dataset.pack_train(as_sources=['s'], as_targets=['t'])

# Full pipeline: KLIEP reweighting followed by a linear regression.
pipeline_model = KLIEP(base_estimator=LinearRegression(), gamma=1.)

# Manual baseline: run the adapter alone, then fit one estimator with
# the returned weights and one without.
kliep_adapt = KLIEPAdapter(gamma=1.)
kliep_estimator = LinearRegression()
noweighting_estimator = LinearRegression()

pipeline_model.fit(X, y, sample_domain=sample_domain)

kliep_adapt.fit(X, y, sample_domain=sample_domain)
weighting_outputs = kliep_adapt.transform(X, y, sample_domain=sample_domain, allow_source=True)

sample_weight = weighting_outputs["sample_weights"]

# Source samples are those with non-masked labels.
source_idx = ~np.isnan(y)
kliep_estimator.fit(X[source_idx], y[source_idx], sample_weight[source_idx])

noweighting_estimator.fit(X[source_idx], y[source_idx])

print("Pipeline Model preds:", pipeline_model.predict(X_target))
print("No Weighting + Estimator preds:", noweighting_estimator.predict(X_target))
print("KLIEP Weighting + Estimator preds:", kliep_estimator.predict(X_target))

>>> Pipeline Model preds: [0.66666667]
>>> No Weighting + Estimator preds: [0.66666667]
>>> KLIEP Weighting + Estimator preds: [0.96991182]

PS: I think the issue is not KLIEP-specific but is encountered by all reweighting classes.

antoinedemathelin commented 5 months ago

I see that the key for the sample_weight argument is "sample_weights" in the dictionary returned by KLIEPAdapter, while it is "sample_weight", without the "s", in the estimator's fit method. That may explain the issue, but I am not sure...
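
Concretely, reusing the variables from the example above, the adapter's output key carries a trailing "s" while sklearn estimators expect the singular keyword:

# Key in the adapter's output dictionary (note the trailing "s"):
w = weighting_outputs["sample_weights"]
# Keyword expected by sklearn estimators' fit methods (no "s"):
kliep_estimator.fit(X[source_idx], y[source_idx], sample_weight=w[source_idx])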

kachayev commented 5 months ago

There are 3 issues here 😃

  1. Typo in the parameter name sample_weights, as you rightfully mentioned
  2. The selector checks whether the routing parameter is defined for the caller method before performing the override, while it should check the routing keys of the callee (I will fix it)
  3. In your code, you need to let LinearRegression know that it should receive sample_weight, like this:

KLIEP(base_estimator=LinearRegression().set_fit_request(sample_weight=True), gamma=1.)
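
For completeness, a sketch of the end-to-end corrected call, reusing X, y, sample_domain and X_target from the example above (and assuming the sample_weights key typo from point 1 is also fixed):

from sklearn.linear_model import LinearRegression
from skada import KLIEP

# Depending on the sklearn version, metadata routing may first need to be
# enabled: import sklearn; sklearn.set_config(enable_metadata_routing=True)
model = KLIEP(
    base_estimator=LinearRegression().set_fit_request(sample_weight=True),
    gamma=1.,
)
model.fit(X, y, sample_domain=sample_domain)
print(model.predict(X_target))  # expected to now match the KLIEP-weighted fit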

There's a related (quite old) task in the list to throw an error when the adapter returns weights but the downstream estimator does not accept them.

P.S. Nice catch! And kudos for finding out how to pass weights manually.

rflamary commented 5 months ago

It seems to me that the set_fit_request should be done inside the KLIEP class/function, so that we can raise an error if sample_weight is not available?

kachayev commented 5 months ago

In such a case, how would we know whether the estimator accepts weights in the first place?

rflamary commented 5 months ago

Well, we can look at the fit function definition? This would be important because I'm pretty sure the error when weights are not available is cryptic, and it would be nice if we could detect that before fitting, no?

kachayev commented 5 months ago

the error when weights are not available is cryptic

Which one?

can look at the fit function definition

Metadata routing does this for us (it tracks signatures, as far as I recall). 'sample_weight' is an odd edge case, as it's defined in many fit signatures simply because older versions of sklearn were designed that way (before routing was introduced). I need to check the sklearn code to see how widespread this is.
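
For reference, sklearn already ships a small helper that performs the signature check suggested above (a sketch; whether skada should rely on it is the open question here):

from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.utils.validation import has_fit_parameter

# True iff the estimator's fit() declares a sample_weight parameter.
print(has_fit_parameter(LinearRegression(), "sample_weight"))    # True
print(has_fit_parameter(KNeighborsRegressor(), "sample_weight")) # False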

kachayev commented 5 months ago

Oh, and KLIEPAdapter cannot turn on 'sample_weight' for the downstream estimator in the pipeline, as those are separate pipeline steps; they cannot (and by design should not) alter other steps. The decision whether to pass weights is made in the Selector, which analyses the routing to see if the estimator requested weights.
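
A rough illustration of what the Selector can consult (a sketch using sklearn's metadata-routing API, not skada's actual selector code):

import sklearn
from sklearn.linear_model import LinearRegression

# Recent sklearn versions require enabling routing before set_fit_request.
sklearn.set_config(enable_metadata_routing=True)

est = LinearRegression().set_fit_request(sample_weight=True)
# The routing object records which metadata fit() requests; a selector can
# inspect it before deciding whether to forward sample_weight.
print(est.get_metadata_routing())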

antoinedemathelin commented 5 months ago

I am not sure I understand every message in this discussion. I made a suggestion for performing reweighting with estimators that don't use sample_weight, see #7

There is also the question of how to handle a sample_weight argument given by the user in the fit method of the pipeline (perhaps that is the point of your last message?). My guess is that this sample_weight vector should be used in the reweighting (all reweighting techniques can be written to handle sample_weight), and the new sample_weight (returned by the weighting method) should be given to the estimator.

But it is not an easy question... You could also combine both sample_weight vectors (by multiplying them?). I prefer the first solution, because the meaning of giving a sample weight to the pipeline is to reweight the distribution; it makes sense, then, to perform the reweighting on the reweighted distribution. (I don't know if that's very clear...)
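
For concreteness, the two options in a toy sketch (toy_reweight is a hypothetical stand-in for a reweighting adapter, not skada API):

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = rng.normal(size=10)
w_user = rng.uniform(0.5, 1.5, size=10)  # weights supplied by the user

def toy_reweight(X, sample_weight=None):
    # Hypothetical reweighting: returns normalized weights, optionally
    # starting from user-provided ones.
    w = np.ones(len(X)) if sample_weight is None else np.asarray(sample_weight, float)
    return w / w.sum()

# Option 1 (preferred above): the reweighting consumes the user weights,
# and only its output is passed to the estimator.
w_new = toy_reweight(X, sample_weight=w_user)
LinearRegression().fit(X, y, sample_weight=w_new)

# Option 2: compute adaptation weights independently, then combine with
# the user weights by multiplication.
w_combined = toy_reweight(X) * w_user
LinearRegression().fit(X, y, sample_weight=w_combined)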

kachayev commented 5 months ago

My guess is that this sample_weight vector should be used in the reweighting (all reweighting techniques can be written to handle sample_weight), and the new sample_weight (returned by the weighting method) should be given to the estimator.

This is how it is implemented right now: if the adapter yields sample_weight, we replace the weights that were given in the original call to the pipeline. This is somewhat uncommon for a sklearn pipeline, which only allows the X samples to be updated (i.e. transformed). That's the reason we have a special type of output (AdaptationOutput) and the code in the selectors to process it throughout the rest of the steps.
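
Roughly, the replacement rule described above as a toy sketch (semantics only, not skada's actual selector code):

def resolve_fit_params(caller_params, adapter_output):
    # Parameters produced by the adapter (e.g. sample_weight) take
    # precedence over those passed in the original pipeline call.
    params = dict(caller_params)
    params.update(adapter_output)
    return params

print(resolve_fit_params({"sample_weight": [1., 1., 1.]},
                         {"sample_weight": [.2, .3, .5]}))
# {'sample_weight': [0.2, 0.3, 0.5]}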

I guess if there's a need for more flexibility, like multiplying different weights, it should be done in a custom estimator class that combines adaptation and fit/predict (like JDOT does). That way we keep the logic of adapter/estimator pipelines straightforward.

rflamary commented 5 months ago

This one is fixed, no? I see many .set_fit_request(sample_weight=True) everywhere ;)

kachayev commented 5 months ago

Yeah, it has been fixed already.