yinweisu opened this issue 2 years ago
More generally, imagine int8 at training time but int16 at test time, or int64 at train but float64 at test because of NaN, etc. These are hard cases, but they do appear in real user workflows, especially with messy data. AutoML systems should try their best to handle them (or at least avoid crashing where feasible).
So if I understand correctly, using an unannotated train/test split from a custom file results in an error within the benchmark framework itself, because it tries to apply dtypes inferred from the training data to the test data? If so, I would indeed consider that a bug. What should be the desired behavior in this case? Try to apply the inferred dtype, and if it is incongruent with the test data, instead infer from the combined data at test time? Or in that case maybe always treat it as object?
> Try to apply the inferred dtype, and if it is incongruent with the test data, instead infer from the combined data at test time?

This sounds reasonable to me. Anyone against this suggestion?
I think fallback to inferring on combined is reasonable.
+1
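The fallback discussed above could be sketched roughly as follows. This is a minimal illustration using pandas directly, not the framework's actual loading code; the function name `load_with_dtype_fallback` and the use of `read_csv` are assumptions for the sake of the example:

```python
import pandas as pd

def load_with_dtype_fallback(train_path, test_path):
    """Hypothetical sketch: infer dtypes from the training file, and if
    the test file is incompatible with them, fall back to inferring
    dtypes from the combined data and re-reading both files."""
    train = pd.read_csv(train_path)
    try:
        # Try to load the test data with the dtypes inferred from train.
        test = pd.read_csv(test_path, dtype=train.dtypes.to_dict())
    except (ValueError, TypeError):
        # Fallback: infer dtypes from train and test combined,
        # then re-read both files with those combined dtypes.
        combined = pd.concat(
            [pd.read_csv(train_path), pd.read_csv(test_path)],
            ignore_index=True,
        )
        dtypes = combined.dtypes.to_dict()
        train = pd.read_csv(train_path, dtype=dtypes)
        test = pd.read_csv(test_path, dtype=dtypes)
    return train, test
```

With an all-missing training column that turns out to hold strings in the test file, the fallback would load both frames with an `object` dtype for that column instead of raising.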
Currently, the logic for loading a dataset involves using pre-determined data types: `_ensure_loaded`. This causes loading to fail when the inferred data type is incorrect. In my case, I have a column in the training data with no values at all, so its dtype is inferred as float. However, in the test data, a couple of rows in that column actually contain strings.
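The failure mode described can be reproduced with pandas alone (independently of the framework's `_ensure_loaded`): a training column with no values is inferred as `float64`, and applying that dtype to test data containing strings raises an error. The in-memory CSV strings here are made up for illustration:

```python
import io
import pandas as pd

# Training data: column "b" has no values at all, so pandas
# infers its dtype as float64 (all NaN).
train = pd.read_csv(io.StringIO("a,b\n1,\n2,\n"))
print(train["b"].dtype)  # float64

# Applying the train-inferred dtypes to test data where "b"
# contains a string fails with a ValueError.
try:
    pd.read_csv(io.StringIO("a,b\n3,x\n4,\n"),
                dtype=train.dtypes.to_dict())
except ValueError as e:
    print("loading failed:", e)
```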