Open katerinakarampasi opened 6 years ago
You can remove data at training time (in fit) but not at transform/predict time. Providing labels to the FeatureExtractor at transform time would leak these labels on the test data. If you want to leave out some points from the training, you can do it in the fit function of the classifier.
Ok, thank you. I don't know if I should open a new topic for this, but what exactly is the quality check that we are provided with for the fMRI and the anatomy data?
The quality check was done manually. Basically, it consisted of a visual inspection of the pre-processing steps (registration, segmentation) and an inspection of the motion parameters.
Ok thank you.
Hi, how to remove bad data during FeatureExtractor or Classifier still confuses me. Sorry this may be a very basic question, but it's been confusing for a few days. I tried to impute bad data during Feature extraction, but it seemed it made the model worse. If I understand it correctly, the FeatureExtractor is supposed to return only new_X rather than both new_X and new_y. So it is hard to remove bad samples at this stage.
But when I put this step in the Classifier, under fit, using
def fit(self, X, y):
    X_new = X[some_good_idx]
    y_new = y[some_good_idx]
    self.clf.fit(X_new, y_new)

def predict(self, X):
    return self.clf.predict(X)

def predict_proba(self, X):
    return self.clf.predict_proba(X)
it crashed when running the CV evaluation, with the error `X has a different shape than during fitting`.
Can you submit it? I can look at the trace there.
Modifying the starting kit, this should be something like this.
from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin


class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        # get only the anatomical information
        X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]
        return X
from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def fit(self, X, y):
        # keep only the training samples that passed the anatomy QC
        X_select = X['anatomy_select'] == 1
        self.clf.fit(X[X_select], y[X_select.values])
        return self

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)
I tried it and it works locally with cross_validate and ramp_test_submission.
Thank you, guys. I have tested the modified anatomy code, it works. So I will double-check with my code to see if I can figure that out.
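I think the error occurred because I was excluding the QC columns (i.e. anatomy_select) in fit but not in predict, so the classifier saw a different number of columns at prediction time. It should be fine to keep that column in: in the training data it will be all ones after filtering, so it should be removed by feature selection.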
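For anyone hitting the same shape error: if you would rather keep the QC flag out of the features entirely, one option is to drop it in both fit and predict, so the classifier always sees the same columns. Below is a minimal sketch of that idea, not the starting kit's code. The `_features` helper, the toy column names `anatomy_a` and `anatomy_b`, and the toy QC values are assumptions made up for illustration; only `anatomy_select` comes from this thread.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

QC_COLUMN = 'anatomy_select'  # QC flag column mentioned in this thread


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def _features(self, X):
        # Hypothetical helper: drop the QC flag so that fit and predict
        # are always given the same set of columns.
        return X.drop(columns=[QC_COLUMN])

    def fit(self, X, y):
        # Rows are filtered at training time only.
        keep = (X[QC_COLUMN] == 1).values
        self.clf.fit(self._features(X)[keep], np.asarray(y)[keep])
        return self

    def predict(self, X):
        # No rows are removed here: one prediction per input sample.
        return self.clf.predict(self._features(X))

    def predict_proba(self, X):
        return self.clf.predict_proba(self._features(X))


# Toy data (made up): 40 samples, half of them flagged as good (QC == 1).
rng = np.random.RandomState(0)
X_toy = pd.DataFrame({
    'anatomy_a': np.linspace(-1, 1, 40),
    'anatomy_b': rng.randn(40),
    QC_COLUMN: np.tile([1, 1, 0, 2], 10),
})
y_toy = (X_toy['anatomy_a'] > 0).astype(int).values

model = Classifier().fit(X_toy, y_toy)
proba = model.predict_proba(X_toy)
```

The point is that samples are only dropped in fit, while the column handling is identical in fit, predict and predict_proba, so the number of features never changes between training and prediction.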
Based on the instructions, my understanding is that we have to provide the two basic components, FeatureExtractor() and Classifier(). I would like to access the whole data set and exclude some samples, so afterwards I will have to exclude their corresponding labels as well. I can exclude the data based on a condition each time the FeatureExtractor is called, but I can't do the same for the labels there. So my question is whether we have to run all of these steps before the FeatureExtractor is called (which would solve my problem) or not.