mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Confusion between : train, val, test. #573

Open adrienpacifico opened 1 year ago

adrienpacifico commented 1 year ago

Hi, I think there are some elements that create confusion (at least for me) between the different data sets in mljar.

One example is the learning-curve graphs created by mljar: [image]

The "test" set in the learning curve is actually the validation set (or cross-validation set)! That can be quite confusing.
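To make the distinction concrete: under k-fold cross-validation, the held-out part of each fold is a *validation* fold, while a true test set is data the whole tuning process never touches. A minimal scikit-learn sketch of the split (the variable names are just illustrative):

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(100, dtype=float).reshape(50, 2)
y = np.arange(50) % 2

# A true test set is carved off before any tuning or cross-validation.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Under k-fold CV, each held-out fold is a *validation* fold -- this is
# what the learning curve currently labels "test".
for train_idx, val_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_dev):
    X_train, X_val = X_dev[train_idx], X_dev[val_idx]
    # fit on (X_train, y_dev[train_idx]), report the metric on X_val
    print(len(X_train), len(X_val), len(X_test))
```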

It would be nice to have an option to pass a held-out test set, so that all the metrics currently produced on the validation set would also be produced on the test set.

It could be done like this:

automl.fit(
    X,
    y,
    sample_weight = df_train_undersampled.weight_discounting.values,
    cv=cv, 
    X_test = X_test,
    y_test = y_test
)

PS: it could also be nice to offer an X_val, y_val option in addition to cv (cv provides a lot more flexibility, but it is not user-friendly if you just want to use a single validation set).
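Until such an API exists, one workaround is to keep the test set out of fit() entirely and compute test metrics yourself after training. A minimal sketch, using a scikit-learn classifier as a stand-in for the AutoML object (with mljar-supervised you would call automl.fit / automl.predict_proba the same way):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
# Hold the test set out of training and tuning entirely.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LogisticRegression(max_iter=1000)  # stand-in for AutoML(...)
model.fit(X_train, y_train)

# Test metric computed separately, on data never seen during validation.
test_logloss = log_loss(y_test, model.predict_proba(X_test))
print(f"test logloss: {test_logloss:.4f}")
```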

pplonski commented 1 year ago

@adrienpacifico you are right, labels should be changed to validation.

For test data there should be a separate function, score() or evaluate().

adrienpacifico commented 1 year ago

https://github.com/mljar/mljar-supervised/blob/31e95852ccdd1a0338292965474d8ad1f9c36885/supervised/automl.py#L89-L92

Do you think documentation/docstring (e.g. line 90) should also be modified?

For test data there should be a separate function, score() or evaluate().

Why so? Users would have a very easy way to separate train, val, and test that way.

pplonski commented 1 year ago

Do you think documentation/docstring (e.g. line 90) should also be modified?

Yes, all confusing places should be updated.

Why so? Users would have a very easy way to separate train, val, and test that way.

  1. I'm a little afraid that it might be too complex for users.
  2. A user can have several test datasets, for example new test data every day; then it would be nice to just evaluate against the new data.
  3. Adding an evaluate() function might be a little simpler than extending the training function with new arguments.
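Point 2 above could be sketched as a standalone evaluate() helper that takes a fitted model and any fresh test batch. This is a hypothetical API, not part of mljar-supervised; the signature and metric names are assumptions, and a scikit-learn classifier stands in for the AutoML object:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(model, X_test, y_test, metrics=("logloss", "auc", "accuracy")):
    """Score a fitted binary classifier on a fresh test set,
    e.g. new test data arriving every day (hypothetical helper)."""
    proba = model.predict_proba(X_test)[:, 1]
    preds = (proba >= 0.5).astype(int)
    scorers = {
        "logloss": lambda: log_loss(y_test, proba),
        "auc": lambda: roc_auc_score(y_test, proba),
        "accuracy": lambda: accuracy_score(y_test, preds),
    }
    return {name: scorers[name]() for name in metrics}

# Usage: fit once, then evaluate against each new test batch.
X, y = make_classification(n_samples=400, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = evaluate(model, X_te, y_te)
print(scores)
```

Keeping evaluation in a separate call means fit() stays simple and the same trained model can be re-scored on every new batch without retraining.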