[Open] raamana opened this issue 2 years ago
I totally agree.
Maybe we could drop the None value as default for y in both functions, and then add an if condition to raise an error if a second argument is not passed, something like this:
if y is None:
    raise ValueError("A second argument, corresponding to the confounders, should be passed")
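In context, that guard might look something like this (just a sketch, keeping the current y name for illustration):

def fit(self, X, y=None):
    """Sketch: the confounds argument is no longer silently optional."""
    if y is None:
        raise ValueError("A second argument, corresponding to "
                         "the confounders, should be passed")
    return self._fit(X, y)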
This, and a better clarification in the documentation.
In fact, we should not even call it y :), and call it C.
Also, if we stop caring about sklearn, we can even get rid of the additional layers of calls ._fit() and ._transform(), although there might be other considerations we need to worry about (the ability to use sklearn pipeline infra etc.). I need to think about this.
100% agreed. That would avoid misinterpretations.
Also, if we are going to stop caring about sklearn, and this is just a matter of personal preference, I would require X and C to be passed by keyword, e.g. doing fit(self, *, X, C). That way the user always knows what they are passing to the functions. Right now things can be passed positionally, and someone could confuse the order of X and C. I think this may happen even to experienced sklearn users, since the order of X and y in both fit and transform has a logic behind it (X -> y).
Good idea, I think that should be possible; let me double-check how to enforce it (or please share if you know the details already). btw, would you want to take a crack at this? :)
Sure Pradeep. This is what I was referring to. Let's take the example for Residualize:
import numpy as np
from sklearn.base import clone
from sklearn.utils.validation import (check_array, check_consistent_length,
                                      check_is_fitted)

# BaseDeconfound and get_model are defined elsewhere in base.py

class Residualize(BaseDeconfound):
    """
    Deconfounding estimator class that residualizes the input features by
    subtracting the contributions from the confound variables.

    Example methods: Linear, Kernel Ridge, Gaussian Process Regression etc.
    """

    def __init__(self, model='linear'):
        """Constructor"""

        super().__init__(name='Residualize')
        self.model = model

    def fit(self,
            *,  # arguments after the bare * are keyword-only
            X,  # variable names chosen to correspond to sklearn when possible
            C,  # C is the confound variables here, not the target!
            ):
        """
        Fits the residualizing model, estimating the contributions of the
        confounding variables (C) to the given [training] feature set X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The training input samples.

        C : ndarray
            Array of covariates, shape (n_samples, n_covariates).
            This does not refer to the target as is typical in scikit-learn.

        Returns
        -------
        self : object
            Returns self.
        """

        return self._fit(X, C)  # which itself must return self

    def _fit(self, in_features, confounds=None):
        """Actual fit method"""

        in_features = check_array(in_features)
        confounds = check_array(confounds, ensure_2d=False)

        # turning it into 2D, in case it is just a column
        if confounds.ndim == 1:
            confounds = confounds[:, np.newaxis]

        try:
            check_consistent_length(in_features, confounds)
        except ValueError:
            raise ValueError('X (features) and C (confounds) '
                             'must have the same number of rows/samplets!')

        self.n_features_ = in_features.shape[1]

        regr_model = clone(get_model(self.model))
        regr_model.fit(confounds, in_features)
        self.model_ = regr_model

        return self

    def transform(self, *, X, C):
        """
        Transforms the given feature set by residualizing the [test] features,
        subtracting the contributions of their confounding variables.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The test input samples.

        C : ndarray
            Array of covariates, shape (n_samples, n_covariates).
            This does not refer to the target as is typical in scikit-learn.

        Returns
        -------
        residuals : ndarray, shape (n_samples, n_features)
            The residualized features.
        """

        return self._transform(X, C)

    def _transform(self, test_features, test_confounds):
        """Actual deconfounding of the test features"""

        check_is_fitted(self, ('model_', 'n_features_'))
        test_features = check_array(test_features, accept_sparse=True)

        if test_features.shape[1] != self.n_features_:
            raise ValueError('number of features must be {}. Given {}'
                             ''.format(self.n_features_, test_features.shape[1]))

        if test_confounds is None:  # during estimator checks
            return test_features  # do nothing

        test_confounds = check_array(test_confounds, ensure_2d=False)
        if test_confounds.ndim == 1:
            test_confounds = test_confounds[:, np.newaxis]
        check_consistent_length(test_features, test_confounds)

        # test features as can be explained/predicted by their covariates
        test_feat_predicted = self.model_.predict(test_confounds)
        residuals = test_features - test_feat_predicted

        return residuals
As you can see, I changed the name of y to C, and added an asterisk between self and the X and C arguments in fit and transform. Then, if you try to run things like this:
res = Residualize()
res.fit(X, C)
it will give you an error, because you now have to pass things by keyword, not by position. That is, you have to write the following for it to work:
res = Residualize()
res.fit(X=X, C=C)
It's more wordy, but less prone to errors in my opinion, as the user explicitly knows what they are passing to the functions.
thanks Javier - can you link me to some docs on how this * trick works?
Maybe this could work? https://python-3-for-scientists.readthedocs.io/en/latest/python3_advanced.html
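In short, everything after a bare * in a def signature becomes keyword-only (PEP 3102). A minimal sketch with a toy function:

def demo(*, X, C):
    """Toy function: X and C must be passed by keyword."""
    return X, C

demo(X=1, C=2)  # OK
demo(1, 2)      # TypeError: demo() takes 0 positional arguments but 2 were given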
I see, thanks. But that would allow weird behaviour like .transform(random_arg1, random_arg2, X=X, C=C), no? I know people won't do it, but you see my point?
No, that's not possible, Pradeep. Users will still be able to pass only two arguments, in this case X and C. Other weird scenarios like the one you've posted would give an error.
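A quick check with hypothetical arrays, assuming the Residualize sketch above:

import numpy as np

X = np.random.rand(10, 3)  # hypothetical feature matrix
C = np.random.rand(10, 1)  # hypothetical confound column

res = Residualize()
res.fit(X=X, C=C)  # OK: both passed by keyword
# res.fit(X, C)  # TypeError: fit() takes 1 positional argument but 3 were given
# res.transform(1, 2, X=X, C=C)  # TypeError: positional arguments are rejected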
I know that using keyword-only arguments might not be standard, but I find it clearer, particularly in our case, where the ordering logic of the input data and covariates does not follow as straightforwardly as in the usual sklearn scenario (input data -> target).
Anyway, it was just a suggestion and can easily be changed at any moment in the future if wanted, for example, if we see that users find the current behaviour confusing. We need test users :-)
I like the idea, just want to make sure there will be no side effects etc
Pradeep, I've made a little progress today on this issue, but there's something that we need to decide before I carry on pushing my changes.
If we finally decide to raise an error when the second input argument to fit and transform is not supplied, as you originally proposed, then we should remove/comment out the test_estimator_API function in the tests module, otherwise the tests won't pass, due to the check_estimator function giving an error.
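For context, that test is essentially the following (a sketch of the usual pattern, not the exact code in the tests module):

from sklearn.utils.estimator_checks import check_estimator

def test_estimator_API():
    # fails once fit/transform require C by keyword, since sklearn's
    # checks call fit(X, y) positionally and probe transform without confounds
    check_estimator(Residualize())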
If we finally do this, we could also remove lines like these in base.py, which were included specifically to pass the mentioned test:
if test_confounds is None:  # during estimator checks
    return test_features  # do nothing
sounds good - we can do whatever is needed to achieve our objectives. Let's discuss this when we discuss the other PR #21 too?
sounds good!
Issue #19 reminds me of how some users can be confused, given the code leaves the second argument to .fit() and .transform() optional with y=None. The only reason we have y=None is to try to follow sklearn conventions and to pass their tests, but given we can't pass them anyway, we should tighten things up and make it an error to not supply the second [necessary] input argument.

cc @jrasero @jameschapman19