Is data shuffling required before creating model?

janhurst / unisa-tbi

Decision Support Tool for suspected Traumatic Brain Injuries

https://unisa-tbi.azurewebsites.net


Is data shuffling required before creating model? #24

Closed karthikkunala closed 4 years ago

karthikkunala commented 4 years ago

I came across sklearn.model_selection.StratifiedShuffleSplit. Do we need to shuffle the records before splitting them into train, test, and validation sets? I have seen some examples that compare the mean of the target (if continuous) across the splits to check how much the distributions differ.
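For reference, a minimal sketch of what StratifiedShuffleSplit does, assuming a binary target. The data here is synthetic and the names `X`, `y`, and `splitter` are placeholders, not the project's actual dataset or code. The splitter shuffles the rows itself and keeps the class ratio roughly equal in both parts, so no separate shuffling step is needed beforehand.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in data (assumption): 1000 rows, about 10% positive class.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)

# One shuffled, stratified 80/20 split.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y))

# Compare the positive rate in each part against the full dataset.
print("overall positive rate:", y.mean())
print("train positive rate:  ", y[train_idx].mean())
print("test positive rate:   ", y[test_idx].mean())
```

For a continuous target, the same kind of check would compare `y[train_idx].mean()` and `y[test_idx].mean()` directly, although StratifiedShuffleSplit itself expects class labels to stratify on.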

janhurst commented 4 years ago

The split routine I used splits the data randomly. I haven't checked whether it produces the same distribution of samples in each split, and that is something we need to be careful about.
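One way to run that check is sketched below. It assumes the hold-out is done with sklearn's `train_test_split` (the thread does not say which routine is actually used) and uses synthetic data; `positive_rate_gap` is a hypothetical helper written for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 500 rows, about 15% positive class.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (rng.random(500) < 0.15).astype(int)

# Plain random hold-out: rows are shuffled, class ratio is left to chance.
_, _, y_train_rand, y_test_rand = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# Stratified hold-out: rows are shuffled and the class ratio is preserved.
_, _, y_train_strat, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y
)

def positive_rate_gap(y_train, y_test):
    """Absolute difference in positive-class rate between train and test."""
    return abs(y_train.mean() - y_test.mean())

print("random split gap:    ", positive_rate_gap(y_train_rand, y_test_rand))
print("stratified split gap:", positive_rate_gap(y_train_strat, y_test_strat))
```

A large gap on the random split would suggest that the hold-out set is not representative, which matters most when the positive class is rare.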

doughnuted commented 4 years ago

Is this not something that happens with cross-validation?

janhurst commented 4 years ago

> Is this not something that happens with cross-validation?

Yep. Right now we are just doing a simple hold-out split, and that is what @karthikkunala was referring to.

We should probably do a stratified k-fold CV; we just haven't got there yet :)
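A rough sketch of what stratified k-fold CV could look like with sklearn. The dataset comes from `make_classification` and the model is a plain `LogisticRegression`; both are placeholders chosen for illustration, not the project's data or model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (assumption): roughly 10% positives.
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.9, 0.1], random_state=0
)

# shuffle=True randomises row order before the folds are cut;
# stratification keeps the class ratio roughly equal in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Positive-class rate of each held-out fold, to confirm the folds are balanced.
fold_rates = [y[test_idx].mean() for _, test_idx in cv.split(X, y)]
print("positive rate per fold:", fold_rates)

# Cross-validated ROC AUC, one score per fold.
model = LogisticRegression(max_iter=1000)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print("ROC AUC per fold:", scores)
print("mean ROC AUC:", scores.mean())
```

Compared with a single hold-out, this uses every row for testing exactly once and gives a spread of scores rather than one number.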