openml / automlbenchmark

OpenML AutoML Benchmarking Framework
https://openml.github.io/automlbenchmark
MIT License

Loading test dataset failed with inferred data type from train dataset #434

Open yinweisu opened 2 years ago

yinweisu commented 2 years ago

Currently, the logic for loading a dataset uses pre-determined data types (see _ensure_loaded). This causes loading to fail when the inferred data type is incorrect. In my case, one column in the training data has no values at all, so its dtype is inferred as float. In the test data, however, a couple of rows contain strings in that column.
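A minimal reproduction of this failure mode, written directly against pandas rather than the framework's loader. The column names and CSV contents are made up for illustration, and the assumption is that the dtypes inferred from the training file are passed unchanged when reading the test file:

```python
import io

import pandas as pd

# Hypothetical data: column "b" is entirely empty in train, but holds a string in test.
train_csv = "a,b\n1,\n2,\n"
test_csv = "a,b\n3,foo\n4,\n"

train = pd.read_csv(io.StringIO(train_csv))
inferred = train.dtypes.to_dict()  # the all-empty column "b" is inferred as float64

try:
    # Reusing the train dtypes forces "foo" to be parsed as a float.
    pd.read_csv(io.StringIO(test_csv), dtype=inferred)
    failed = False
except ValueError as err:
    failed = True
    print(f"loading test data failed: {err}")
```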

Innixma commented 2 years ago

More generally, imagine int8 at training time but int16 at test time, or int64 at train but float64 at test because of NaN, etc. These are hard cases, but they do appear in real user workflows, especially with messy data. AutoML systems should try their best to handle them (or at least avoid crashing if feasible).
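The int64 versus float64 case can be seen with plain pandas. This is only an illustration with invented column names, not code from the benchmark: a single missing value in the test split changes the inferred dtype, and casting back to the train dtype fails.

```python
import io

import pandas as pd

# Hypothetical splits: "x" is complete in train, but has one missing value in test.
train = pd.read_csv(io.StringIO("id,x\n1,10\n2,20\n"))
test = pd.read_csv(io.StringIO("id,x\n3,30\n4,\n"))

print(train["x"].dtype, test["x"].dtype)  # int64 float64

try:
    # NaN has no int64 representation, so forcing the train dtype raises.
    test["x"].astype(train["x"].dtype)
    cast_failed = False
except ValueError:
    cast_failed = True
```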

PGijsbers commented 2 years ago

So, if I understand correctly, using an unannotated train/test split from a custom file results in an error in the benchmark framework itself, because it tries to apply dtypes inferred from the training data to the test data? If so, I would indeed consider that a bug. What should the desired behavior be in this case? Try to apply the inferred dtype and, if it is incongruent with the test data, instead infer from the combined data at test time? Or, in that case, maybe always treat it as object?
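A rough sketch of what the first option could look like, assuming CSV input and plain pandas. The helper name load_split is hypothetical and is not part of the framework; "infer from the combined data" is approximated here by letting pd.concat pick a common dtype per column.

```python
import io

import pandas as pd


def load_split(train_csv: str, test_csv: str):
    """Hypothetical helper (not the framework's API): load a train/test pair of CSV texts."""
    train = pd.read_csv(io.StringIO(train_csv))
    try:
        # First attempt: reuse the dtypes inferred from the training data.
        test = pd.read_csv(io.StringIO(test_csv), dtype=train.dtypes.to_dict())
    except (ValueError, TypeError):
        # Fallback: infer the test dtypes freely, then derive a common
        # per-column dtype by concatenating both frames.
        test = pd.read_csv(io.StringIO(test_csv))
        schema = pd.concat([train, test], ignore_index=True).dtypes.to_dict()
        train, test = train.astype(schema), test.astype(schema)
    return train, test


# Column "b" is empty in train and contains a string in test.
train, test = load_split("a,b\n1,\n2,\n", "a,b\n3,foo\n4,\n")
print(train.dtypes)
print(test.dtypes)
```

With this approach both splits end up with the same dtypes, at the cost of changing the train dtypes after the fact whenever the fallback is taken.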

sebhrusen commented 2 years ago

> Try to apply the inferred dtype and, if it is incongruent with the test data, instead infer from the combined data at test time?

sounds reasonable to me. Anyone against this suggestion?

Innixma commented 2 years ago

I think falling back to inferring on the combined data is reasonable.

yinweisu commented 2 years ago

+1