mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Failure to properly preprocess categorical data #652

Open williamty opened 1 year ago

williamty commented 1 year ago

Some categorical columns in my dataset are stored as numbers, so I checked the data_info.json file to see whether they were preprocessed as categorical. Unfortunately, none of them were recognized as categorical by mljar. I then used the following code to convert these columns to categorical manually.

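# enum.txt lists the names of the categorical columns, one per line
with open('enum.txt', 'r') as enum_file:
    categorical_columns = enum_file.read().splitlines()
# df is the already-loaded pandas DataFrame
for col in categorical_columns:
    df[col] = df[col].astype("category")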

After doing this, I got an error:

ValueError: pandas dtypes must be int, float or bool.
Fields with bad pandas dtypes: target: category

It seems that mljar can't preprocess categorical data stored as numbers.

pplonski commented 1 year ago

It should handle the category data type. This might be a bug.
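
The error message above names the field `target` as the one with the category dtype, which suggests the target column was among the columns listed in enum.txt and was cast along with the features. A minimal sketch of one possible workaround, assuming the target column is literally named `target` and should keep its original numeric dtype: cast only the feature columns. The helper name `cast_features_to_category` and the toy data are hypothetical, and whether this avoids the error inside mljar is not confirmed in this thread.

```python
import pandas as pd


def cast_features_to_category(df, categorical_columns, target_col="target"):
    """Cast the listed columns to category dtype, leaving the target untouched.

    Hypothetical helper for illustration; not part of mljar-supervised.
    """
    out = df.copy()
    for col in categorical_columns:
        if col == target_col:
            # Skip the target so it keeps its original numeric dtype.
            continue
        out[col] = out[col].astype("category")
    return out


# Toy data standing in for the reporter's DataFrame (assumed, not from the issue).
df = pd.DataFrame(
    {
        "color_id": [1, 2, 1, 3],
        "size_id": [10, 20, 10, 10],
        "target": [0, 1, 0, 1],
    }
)

# Stands in for the contents of enum.txt; note that it includes the target.
categorical_columns = ["color_id", "size_id", "target"]

converted = cast_features_to_category(df, categorical_columns)
print(converted.dtypes)
```

If the target itself is categorical, keeping it as plain integers or strings rather than a pandas category dtype may be another option to try, but that is also an assumption here.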