[Open] raamana opened this issue 2 years ago
I totally agree.
Maybe we could drop the None value as default for y in both functions, and then add an if condition to raise an error if a second argument is not passed, something like this:
if y is None:
    raise ValueError("A second argument, corresponding to the confounders, should be passed")
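In context, that guard might look something like this (just a sketch, keeping the current y name for illustration):

def fit(self, X, y=None):
    """Sketch: the confounds argument is no longer silently optional."""
    if y is None:
        raise ValueError("A second argument, corresponding to "
                         "the confounders, should be passed")
    return self._fit(X, y)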
This, and a better clarification in the documentation.
In fact, we should not even call it y :), and call it C.
Also, if we stop caring about sklearn, we can even get rid of the additional layers of calls ._fit() and ._transform(), although there might be other considerations we need to worry about (the ability to use sklearn pipeline infra etc.). I need to think about this.
100% agreed. That would avoid misinterpretations.
Also, if we are going to stop caring about sklearn, and this is just a matter of personal preference, I would require X and C to be passed by keyword, e.g. doing fit(self, *, X, C). That way the user always knows what they are passing to the functions. Right now things can be passed positionally, and someone could confuse the order of X and C. I think this may happen even to experienced sklearn users, since the order of X and y in both fit and transform has a logic behind it (X -> y).
Good idea, I think that should be possible; let me double-check how to enforce it (or please share if you know the details already). btw, would you want to take a crack at this? :)
Sure Pradeep. This is what I was referring to. Let's take the example for Residualize:
import numpy as np
from sklearn.base import clone
from sklearn.utils.validation import (check_array, check_consistent_length,
                                      check_is_fitted)

# BaseDeconfound and get_model are defined elsewhere in base.py

class Residualize(BaseDeconfound):
    """
    Deconfounding estimator class that residualizes the input features by
    subtracting the contributions from the confound variables.

    Example methods: Linear, Kernel Ridge, Gaussian Process Regression etc.
    """

    def __init__(self, model='linear'):
        """Constructor"""

        super().__init__(name='Residualize')
        self.model = model

    def fit(self,
            *,  # arguments after the bare * are keyword-only
            X,  # variable names chosen to correspond to sklearn when possible
            C,  # C is the confound variables here, not the target!
            ):
        """
        Fits the residualizing model, estimating the contributions of the
        confounding variables (C) to the given [training] feature set X.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The training input samples.

        C : ndarray
            Array of covariates, shape (n_samples, n_covariates).
            This does not refer to the target as is typical in scikit-learn.

        Returns
        -------
        self : object
            Returns self.
        """

        return self._fit(X, C)  # which itself must return self

    def _fit(self, in_features, confounds=None):
        """Actual fit method"""

        in_features = check_array(in_features)
        confounds = check_array(confounds, ensure_2d=False)

        # turning it into 2D, in case it is just a column
        if confounds.ndim == 1:
            confounds = confounds[:, np.newaxis]

        try:
            check_consistent_length(in_features, confounds)
        except ValueError:
            raise ValueError('X (features) and C (confounds) '
                             'must have the same number of rows/samplets!')

        self.n_features_ = in_features.shape[1]

        regr_model = clone(get_model(self.model))
        regr_model.fit(confounds, in_features)
        self.model_ = regr_model

        return self

    def transform(self, *, X, C):
        """
        Transforms the given feature set by residualizing the [test] features,
        subtracting the contributions of their confounding variables.

        Parameters
        ----------
        X : {array-like, sparse matrix}, shape (n_samples, n_features)
            The test input samples.

        C : ndarray
            Array of covariates, shape (n_samples, n_covariates).
            This does not refer to the target as is typical in scikit-learn.

        Returns
        -------
        residuals : ndarray, shape (n_samples, n_features)
            The residualized features.
        """

        return self._transform(X, C)

    def _transform(self, test_features, test_confounds):
        """Actual deconfounding of the test features"""

        check_is_fitted(self, ('model_', 'n_features_'))
        test_features = check_array(test_features, accept_sparse=True)

        if test_features.shape[1] != self.n_features_:
            raise ValueError('number of features must be {}. Given {}'
                             ''.format(self.n_features_, test_features.shape[1]))

        if test_confounds is None:  # during estimator checks
            return test_features  # do nothing

        test_confounds = check_array(test_confounds, ensure_2d=False)
        if test_confounds.ndim == 1:
            test_confounds = test_confounds[:, np.newaxis]
        check_consistent_length(test_features, test_confounds)

        # test features as can be explained/predicted by their covariates
        test_feat_predicted = self.model_.predict(test_confounds)
        residuals = test_features - test_feat_predicted

        return residuals
As you can see, I changed the name of y to C, and added an asterisk between self and the X and C arguments in fit and transform. Then, if you try to run things like this:
res = Residualize()
res.fit(X, C)
it will give you an error, because you now have to pass things by keyword, not by position. That is, you have to write the following for it to work:
res = Residualize()
res.fit(X=X, C=C)
It's more wordy, but less prone to errors in my opinion, as the user explicitly knows what they are passing to the functions.
thanks Javier - can you link me to some docs on how this * trick works?
Maybe this could work? https://python-3-for-scientists.readthedocs.io/en/latest/python3_advanced.html
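In short, everything after a bare * in a def signature becomes keyword-only (PEP 3102). A minimal sketch with a toy function:

def demo(*, X, C):
    """Toy function: X and C must be passed by keyword."""
    return X, C

demo(X=1, C=2)  # OK
demo(1, 2)      # TypeError: demo() takes 0 positional arguments but 2 were given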
I see, thanks. But that would allow weird behaviour like .transform(random_arg1, random_arg2, X=X, C=C), no? I know people won't do it, but you see my point?
No, that's not possible, Pradeep. Users will still be able to pass only two arguments, in this case X and C. Other weird scenarios like the one you've posted would give an error.
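A quick check with hypothetical arrays, assuming the Residualize sketch above:

import numpy as np

X = np.random.rand(10, 3)  # hypothetical feature matrix
C = np.random.rand(10, 1)  # hypothetical confound column

res = Residualize()
res.fit(X=X, C=C)  # OK: both passed by keyword
# res.fit(X, C)  # TypeError: fit() takes 1 positional argument but 3 were given
# res.transform(1, 2, X=X, C=C)  # TypeError: positional arguments are rejected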
I know that using keyword-only arguments might not be standard, but I find it clearer, particularly in our case, where the ordering logic of the input data and covariates does not follow as straightforwardly as in the usual sklearn scenario (input data -> target).
Anyway, it was just a suggestion and can easily be changed at any moment in the future if wanted, for example, if we see that users find the current behaviour confusing. We need test users :-)
I like the idea, just want to make sure there will be no side effects etc
Pradeep, I've made a little progress today on this issue, but there's something that we need to decide before I carry on pushing my changes.
If we finally decide to raise an error when the second input argument to fit and transform is not supplied, as you originally proposed, then we should remove/comment out the test_estimator_API function in the tests module, otherwise the tests won't pass, due to the check_estimator function giving an error.
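For context, that test is essentially the following (a sketch of the usual pattern, not the exact code in the tests module):

from sklearn.utils.estimator_checks import check_estimator

def test_estimator_API():
    # fails once fit/transform require C by keyword, since sklearn's
    # checks call fit(X, y) positionally and probe transform without confounds
    check_estimator(Residualize())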
If we finally do this, we could also remove lines like these in base.py, which were included specifically to pass the mentioned test:
if test_confounds is None:  # during estimator checks
    return test_features  # do nothing
sounds good - we can do whatever is needed to achieve our objectives. Let's discuss this when we discuss the other PR #21 too?
sounds good!
Issue #19 reminds me of how some users can be confused, given the code leaves the second argument to .fit() and .transform() optional with y=None. The only reason we have y=None is to try to follow sklearn conventions and to pass their tests, but given we can't pass them anyway, we should tighten things up and make it an error to not supply the second [necessary] input argument.

cc @jrasero @jameschapman19