ramp-kits / autism

Data Challenge on Autism Spectrum Disorder detection
https://paris-saclay-cds.github.io/autism_challenge/

FeatureExtractor() inside cross_validation #35

Closed · zh1peng closed this 6 years ago

zh1peng commented 6 years ago

When I do some feature selection in FeatureExtractor() or Classifier() based on the relationship between X and y, the different X_split and y_split in each CV fold make the selected features differ from fold to fold, which gives a different AUC each time. So I was wondering: would it be better to run FeatureExtractor() outside of cross-validation? Something like:

    # FeatureExtractor and Classifier come from the RAMP submission files
    from sklearn.model_selection import cross_validate

    features = FeatureExtractor()
    X_new = features.fit_transform(X, y)
    cross_validate(Classifier(), X_new, y)

I'm not a hundred percent sure about this. Maybe if the feature selection is stable enough, the variance it introduces could be ignored.

When I tried to build FC (functional connectivity) maps with motion regressed out and to compute graph measures on the FC, the feature extraction took 10 minutes. If FeatureExtractor() ran outside the cross-validation process, it would save a lot of time. This is just a small issue for me now, as I gave up on building the FC maps and computing the graph measures myself.
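One way to get most of that time back without moving the extractor out of cross-validation is to cache the expensive step on disk, so each subject is processed only once no matter how many folds call the extractor. A minimal sketch with joblib.Memory; extract_features and the random time series below are hypothetical stand-ins for the actual FC and graph-measure pipeline:

    import numpy as np
    from joblib import Memory

    # Cache extracted features on disk so each input is processed once,
    # even if the extractor is invoked again in every CV fold.
    memory = Memory(location="./feature_cache", verbose=0)

    @memory.cache
    def extract_features(timeseries):
        # Hypothetical stand-in for the expensive FC + graph-measure step.
        fc = np.corrcoef(timeseries.T)             # region-by-region FC matrix
        return fc[np.triu_indices_from(fc, k=1)]   # vectorized upper triangle

    rng = np.random.default_rng(0)
    ts = rng.standard_normal((150, 30))   # 150 timepoints, 30 regions
    feats = extract_features(ts)          # computed and written to cache
    feats = extract_features(ts)          # second call loads from the cache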

glemaitre commented 6 years ago

> So I was wondering: would it be better to run FeatureExtractor() outside of cross-validation?

Be aware that you would then be making the selection on the full public data set, so the held-out folds have already been seen by the selector and the cross-validation scores become optimistic. If the features are noisy, you are almost certain to overfit the public data set and get really bad results on the private set.
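To make the alternative concrete: keeping the selection inside the model means it is refit on each fold's training split only. A minimal scikit-learn sketch (my own illustration, not the kit's code; SelectKBest and LogisticRegression are placeholder choices):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_validate
    from sklearn.pipeline import make_pipeline

    X, y = make_classification(n_samples=200, n_features=500, random_state=0)

    # With the selector inside the pipeline, cross_validate refits it on the
    # training split of every fold, so the held-out fold never leaks into
    # the feature selection.
    model = make_pipeline(SelectKBest(f_classif, k=20),
                          LogisticRegression(max_iter=1000))
    scores = cross_validate(model, X, y, cv=5, scoring="roc_auc")
    print(scores["test_score"])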

You can refer to the following example, which shows nested cross-validation; it can be used for such a selection and for hyperparameter exploration.
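The linked example is not reproduced here, but the pattern it refers to is nested cross-validation: an inner search tunes the selection and hyperparameters on training data only, while an outer loop scores the whole tuned procedure. A rough sketch under the same placeholder choices as above:

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV, cross_validate
    from sklearn.pipeline import Pipeline

    X, y = make_classification(n_samples=200, n_features=500, random_state=0)

    pipe = Pipeline([("select", SelectKBest(f_classif)),
                     ("clf", LogisticRegression(max_iter=1000))])

    # Inner loop: choose how many features to keep using training data only.
    inner = GridSearchCV(pipe, param_grid={"select__k": [10, 20, 50]},
                         cv=3, scoring="roc_auc")

    # Outer loop: estimate the performance of the entire tuned procedure.
    scores = cross_validate(inner, X, y, cv=5, scoring="roc_auc")
    print(scores["test_score"].mean())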