openml / openml-data

For tracking issues related to OpenML datasets

teachingAssistant is missing "ignore_attribute" for index #56

Open amueller opened 1 year ago

amueller commented 1 year ago

teachingAssistant has an ID attribute that should be ignored by default. This is particularly bad because there's an ordering in the data, so the ID is informative:

[image]

In fact, ignoring everything but the ID gives near-perfect results:

cross_validate(RandomForestClassifier(), df[['ID']], df['class'], scoring="roc_auc_ovr", cv=StratifiedKFold(shuffle=True))
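
For anyone who wants to see the effect without downloading the dataset, here is a minimal self-contained sketch of the same check. It does not use the real teachingAssistant data: the frame below is a synthetic stand-in (an assumption for illustration) whose rows are sorted by class, so a running `ID` column encodes the label. The `noise` column and the `mean_auc` helper are hypothetical names added for comparison.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in (NOT the real teachingAssistant data): rows are sorted
# by class, so a running ID leaks the label.
rng = np.random.default_rng(0)
labels = np.repeat([1, 2, 3], 50)  # three classes, 50 rows each, in order
df = pd.DataFrame({
    "ID": np.arange(1, len(labels) + 1),
    "noise": rng.normal(size=len(labels)),  # uninformative baseline feature
    "class": labels,
})

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)


def mean_auc(columns):
    """Mean one-vs-rest ROC AUC using only the given feature columns."""
    scores = cross_validate(
        RandomForestClassifier(random_state=0),
        df[columns], df["class"],
        scoring="roc_auc_ovr", cv=cv,
    )
    return scores["test_score"].mean()


id_auc = mean_auc(["ID"])
noise_auc = mean_auc(["noise"])
print(f"ID only: {id_auc:.3f}  noise only: {noise_auc:.3f}")
```

On data ordered like this, the ID-only score should come out far above the noise-only score, which is the same symptom as in the call above.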

amueller commented 1 year ago

new dataset version: https://openml.org/search?type=data&status=active&id=45688&sort=runs