mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License
3.03k stars, 403 forks

Preprocessing of data #531

Open picousse opened 2 years ago

picousse commented 2 years ago

Hi, I might have missed it, but I can't find any information on how the data is preprocessed. Is there some documentation on that part? Which columns are used in the end? How are they encoded?

Can I extract the preprocessed dataframe exactly as it is fed to the models? I would like to do some benchmarking, and that requires the same base dataset.

picousse commented 2 years ago

While digging into the AutoML instance, I'm also seeing that the different models get different datasets. E.g. if I feed in the titanic dataset, then XGBoost gets data from the columns Age, Cabin, Embarked, Name, Sex, and Ticket, whilst the default neural network gets the columns Age, Cabin, Embarked, Fare, Name, Pclass, Sex, SibSp, and Ticket.

Does that make sense? If the different models don't get the same input, is it fair to compare them afterwards?
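
To make the mismatch concrete, here is a minimal sketch that diffs the two column lists I reported. It is plain Python and does not call the mljar-supervised API; the variable names are my own, and the lists are simply copied from my observation on the titanic dataset.

```python
# Column lists copied from the observation in this comment (titanic dataset).
# Names are normalized to the dataset's own capitalization (e.g. "Age");
# that normalization is an assumption, not something read from the library.
xgboost_columns = {"Age", "Cabin", "Embarked", "Name", "Sex", "Ticket"}
neural_network_columns = {
    "Age", "Cabin", "Embarked", "Fare", "Name",
    "Pclass", "Sex", "SibSp", "Ticket",
}

# Columns that only one of the two models receives.
only_neural_network = sorted(neural_network_columns - xgboost_columns)
only_xgboost = sorted(xgboost_columns - neural_network_columns)

# Columns that both models receive.
shared = sorted(xgboost_columns & neural_network_columns)

print("Only neural network:", only_neural_network)
print("Only XGBoost:", only_xgboost)
print("Shared:", shared)
```

By this comparison, the neural network sees Fare, Pclass, and SibSp, which XGBoost does not, while every column XGBoost sees is also given to the neural network.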