Can we exclude certain data and labels based on a condition?

ramp-kits / autism

Data Challenge on Autism Spectrum Disorder detection

https://paris-saclay-cds.github.io/autism_challenge/


Can we exclude certain data and labels based on a condition? #28

Open katerinakarampasi opened 6 years ago

katerinakarampasi commented 6 years ago

Based on the instructions, my understanding is that we have to provide two basic components, FeatureExtractor() and Classifier(). I would like to access the whole dataset and exclude some samples, which means I then have to exclude their corresponding labels as well. I can exclude samples based on a condition each time the FeatureExtractor is called, but I can't do the same for the labels through it. So my question is whether we should execute all of these commands before FeatureExtractor is called (because that would solve my problem) or not.

kegl commented 6 years ago

You can remove data at training time (in fit) but not at transform/predict time. Providing labels to the FeatureExtractor at transform time would leak these labels on the test data. If you want to leave out some points from the training, you can do it in the fit function of the classifier.

katerinakarampasi commented 6 years ago

OK, thank you. I don't know whether I should open a new topic, but what is the quality check that we are provided with for the fMRI and anatomy data?

glemaitre commented 6 years ago

The quality check was done manually. Basically, it consisted of a visual inspection of the pre-processing steps (registration, segmentation) and an inspection of the motion parameters.

katerinakarampasi commented 6 years ago

Ok thank you.

zh1peng commented 6 years ago

Hi, I am still confused about how to remove bad data in the FeatureExtractor or the Classifier. Sorry, this may be a very basic question, but it has been confusing me for a few days. I tried to impute bad data during feature extraction, but that seemed to make the model worse. If I understand correctly, the FeatureExtractor is supposed to return only new_X rather than both new_X and new_y, so it is hard to remove bad samples at this stage.

But when I put this step in the fit method of the Classifier, like this:

def fit(self, X, y):
    # keep only the samples that passed the check
    X_new = X[some_good_idx]
    y_new = y[some_good_idx]
    self.clf.fit(X_new, y_new)
    return self

def predict(self, X):
    return self.clf.predict(X)

def predict_proba(self, X):
    return self.clf.predict_proba(X)

it crashed when running the CV evaluation with the error `X has a different shape than during fitting`.

kegl commented 6 years ago

Can you submit it? I can look at the trace there.

glemaitre commented 6 years ago

Modifying the starting kit, it should look something like this:

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin

class FeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X_df, y):
        return self

    def transform(self, X_df):
        # get only the anatomical information
        X = X_df[[col for col in X_df.columns if col.startswith('anatomy')]]
        return X 

from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

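    def fit(self, X, y):
        # train only on the rows where anatomy_select equals 1;
        # predict and predict_proba still receive every row
        X_select = X['anatomy_select'] == 1
        self.clf.fit(X[X_select], y[X_select.values])
        return self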

    def predict(self, X):
        return self.clf.predict(X)

    def predict_proba(self, X):
        return self.clf.predict_proba(X)

glemaitre commented 6 years ago

I tried it, and it works locally with both cross_validate and ramp_test_submission.
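For anyone who wants a quick sanity check without the challenge data, here is a minimal sketch on synthetic data. The feature names `anatomy_feat_a` and `anatomy_feat_b`, the random values, and the use of 2 as the "not selected" flag are assumptions made up for illustration; only `anatomy_select` and the `Classifier` structure come from the code above. It shows that rows are dropped at fit time only, while prediction still returns one row per input sample.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def fit(self, X, y):
        # Filter rows (and the matching labels) at training time only.
        keep = X['anatomy_select'] == 1
        self.clf.fit(X[keep], y[keep.values])
        return self

    def predict_proba(self, X):
        # No filtering here: every sample must get a prediction.
        return self.clf.predict_proba(X)


# Synthetic stand-in for the extracted features (hypothetical columns).
rng = np.random.RandomState(0)
n_samples = 40
X = pd.DataFrame({
    'anatomy_feat_a': rng.randn(n_samples),
    'anatomy_feat_b': rng.randn(n_samples),
    # Every fourth row is flagged as not selected (toy convention).
    'anatomy_select': np.where(np.arange(n_samples) % 4 == 3, 2, 1),
})
y = np.tile([0, 1], n_samples // 2)

model = Classifier().fit(X, y)
proba = model.predict_proba(X)
print(proba.shape)  # one row of class probabilities per input sample
```

The column set is identical in fit and predict_proba, which is what avoids the shape mismatch reported earlier in the thread.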

zh1peng commented 6 years ago

Thank you, guys. I have tested the modified anatomy code and it works, so I will double-check my own code to see if I can figure out the problem.

I think the error was caused by my trying to exclude the QC columns (i.e. anatomy_select) inside fit, so the model was trained on fewer columns than it later received in predict. It should be fine to keep that column: after the filtering its values will all be ones, and it will be removed by feature selection.
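If you would rather not feed the QC column to the model at all, one option is to drop it in every method that calls the pipeline, not only in fit. The sketch below is an assumption-laden illustration, not the starting kit's code: the helper `_features`, the constant `QC_COLUMN`, and the synthetic data are all hypothetical names introduced here.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

QC_COLUMN = 'anatomy_select'  # QC column name taken from the thread


class Classifier(BaseEstimator):
    def __init__(self):
        self.clf = make_pipeline(StandardScaler(), LogisticRegression())

    def _features(self, X):
        # Hypothetical helper: drop the QC flag so that fit and predict
        # always hand the same set of columns to the pipeline.
        return X.drop(columns=[QC_COLUMN])

    def fit(self, X, y):
        # Use the QC flag to choose training rows, then discard it.
        keep = (X[QC_COLUMN] == 1).values
        self.clf.fit(self._features(X)[keep], np.asarray(y)[keep])
        return self

    def predict(self, X):
        return self.clf.predict(self._features(X))

    def predict_proba(self, X):
        return self.clf.predict_proba(self._features(X))


# Toy data with made-up feature columns, for illustration only.
rng = np.random.RandomState(0)
n_samples = 40
X = pd.DataFrame({
    'anatomy_feat_a': rng.randn(n_samples),
    'anatomy_feat_b': rng.randn(n_samples),
    QC_COLUMN: np.where(np.arange(n_samples) % 4 == 3, 2, 1),
})
y = np.tile([0, 1], n_samples // 2)

model = Classifier().fit(X, y)
pred = model.predict(X)  # works: same two columns as during fit
```

Dropping the column only inside fit, as in the crashing version, leaves predict with one extra column, which is consistent with the shape error quoted above.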