mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

TypeError: ufunc 'isnan' not supported for the input types #147

Open shahules786 opened 3 years ago

shahules786 commented 3 years ago

Code to reproduce the error

df = pd.read_csv('/home/shahul/Downloads/train.csv.zip').sample(10000)
y = df['target']
X = df.drop(['target'],axis=1)

a = AutoML(total_time=30,tuning_mode="Normal")

a.fit(X, y)

The error happens because np.isnan() is applied to object-dtype data inside np.nanmedian(), which is called in

/mljar-supervised/supervised/preprocessing/preprocessing_utils.py

Here I have used the BNP Paribas dataset as the training data.

pplonski commented 3 years ago

Thank you @shahules786 for reporting. This might be a bug.

Could you give a minimal code example with generated data, so anyone can easily reproduce it (without any downloads)?
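For reference, a download-free reproduction along these lines should trigger the same error (a sketch with made-up data; the column values are assumptions, not taken from the original dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical generated data: a numeric column where a missing value is
# encoded as the string "?", which makes the whole column object dtype
X = pd.DataFrame({"feature": [1, 2, 3, "?"] * 25})

# The failing preprocessing step reduces to calling np.nanmedian on the
# object-dtype values, which raises a TypeError
raised = False
try:
    np.nanmedian(X["feature"].to_numpy())
except TypeError:
    raised = True
print("TypeError raised:", raised)
```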

abtheo commented 3 years ago

The issue is that the input y data in df['target'] has dtype object.

@shahules786 you can simply cast your data to the intended type using .astype().

I went pretty deep into this issue (before giving up and deciding that throwing an exception was the simplest fix), so I'm happy to discuss further if ya like :)

shahules786 commented 3 years ago

Hey @abtheo, yes, you're right. As indicated in the issue, the error is caused by the use of object dtypes. To solve it, you need to convert the object values to float before passing them to np.nanmedian(), which is used in /mljar-supervised/supervised/preprocessing/preprocessing_utils.py. You can reproduce the error just by doing np.isnan([1, 2, 2, 3, "nan"]) and fix it with pd.isna([1, 2, 2, 3, "nan"]), because pandas isna supports object dtypes.
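The contrast described above can be checked in isolation (a minimal sketch using only NumPy and pandas):

```python
import numpy as np
import pandas as pd

values = [1, 2, 2, 3, "nan"]  # mixed list -> object-dtype array

# np.isnan cannot handle object dtype and raises TypeError
raised = False
try:
    np.isnan(values)
except TypeError:
    raised = True
print("np.isnan raised TypeError:", raised)

# pd.isna accepts object input; note the *string* "nan" is not missing
print(pd.isna(values))
```

Note that pd.isna reports the string "nan" as not missing, so coercion to a real NA marker is still needed before imputation.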

Also, you're most welcome to contribute; join our Slack group :)

pplonski commented 3 years ago

AutoML should work with object dtypes in the X and y variables. If y has missing values, such rows are dropped from training. Missing values are detected with the pd.isnull method, so maybe the problem is with the missing-value representation. It should be None or pd.NA, a value for which pd.isnull returns True. Maybe the solution to the problem is smarter detection of missing values.
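A quick illustration of what pd.isnull does and does not recognize (a sketch with plain pandas, nothing mljar-specific):

```python
import numpy as np
import pandas as pd

# pd.isnull recognizes the standard missing-value markers...
print(pd.isnull(None))    # True
print(pd.isnull(pd.NA))   # True
print(pd.isnull(np.nan))  # True

# ...but a sentinel string such as "?" is an ordinary value to pandas
print(pd.isnull("?"))     # False
```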

shahules786 commented 3 years ago

@pplonski Isn't AutoML filling missing values using standard methods like mean, median, mode, etc?

pplonski commented 3 years ago

@shahules786 you are right, we are filling missing values. But first, you need to recognize the missing value. A missing value is not always represented as None, and that might cause the problem. I've seen a dataset where missing values were set as the character "?"; in such a case, AutoML will fail.
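A sketch of the "?" situation with plain pandas; mapping the sentinel to a real NA marker first is one possible way to make detection work (the replace call is an assumption about the fix, not existing mljar-supervised behavior):

```python
import pandas as pd

s = pd.Series(["a", "?", "b"], dtype=object)

# "?" is an ordinary string to pandas, so it is not detected as missing...
print(pd.isnull(s).tolist())

# ...until it is mapped to a real NA marker first
cleaned = s.replace("?", pd.NA)
print(pd.isnull(cleaned).tolist())
```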

shahules786 commented 3 years ago

@pplonski Yes, that should be it. Maybe before filling missing values with any method, we can apply pd.to_numeric(..., errors='coerce') and ensure that all the values have float dtype. I think that will fix this issue; for example, np.nanmedian(pd.to_numeric([1, 2, 3, "?"], errors='coerce')) runs with no errors.
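The suggested coercion can be sketched like this (plain NumPy/pandas, no mljar code):

```python
import numpy as np
import pandas as pd

raw = [1, 2, 3, "?"]

# Coerce unparseable entries to NaN, then impute as usual
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)                # [1., 2., 3., nan]
print(np.nanmedian(numeric))  # 2.0
```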

pplonski commented 3 years ago

Sounds good @shahules786. What should we do with categorical features? And what about the performance of this method? Will it be fast for large datasets?

shahules786 commented 3 years ago

@pplonski This is only possible if AutoML imputes missing values after encoding the categorical features. Is that the case?

pplonski commented 3 years ago

We first impute missing values and then do the encoding.

shahules786 commented 3 years ago

@pplonski OK, but we don't impute categorical values using np.nanmedian(); that is where the issue comes from. So I think it will be okay to go with the above solution.
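For illustration, a minimal sketch of imputing the two kinds of columns differently (median for numeric, most frequent value for categorical; the column names and data are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "num": [1.0, 2.0, np.nan, 4.0],
    "cat": ["a", None, "b", "a"],
})

# Numeric column: fill with the median (what np.nanmedian computes)
df["num"] = df["num"].fillna(df["num"].median())

# Categorical column: fill with the most frequent value instead
df["cat"] = df["cat"].fillna(df["cat"].mode()[0])

print(df)
```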

abtheo commented 3 years ago

Practically, to handle objects of unknown content, I see three main use cases we need to cover:

#Case 1: to be encoded as CATEGORICAL
strs_as_object = np.array(["A", "B", "C"], dtype=object)

#Case 2: to be encoded as DISCRETE (or equivalently, CONTINUOUS for floats)
nums_as_object = np.array([1, 2, "3"], dtype=object)

#Case 3: to be encoded as CATEGORICAL
mixed_input_object = np.array([1, "B", 3], dtype=object)

Currently, Cases 1 and 2 work as expected, but we are not handling Case 3. The ambiguous object type causes NumPy issues all over the place.
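One hedged sketch of telling the three cases apart, using coercion failures as the signal (infer_kind is a hypothetical helper written for this comment, not part of mljar-supervised):

```python
import numpy as np
import pandas as pd

def infer_kind(values):
    """Hypothetical helper: classify object data as 'numeric' or 'categorical'."""
    series = pd.Series(values, dtype=object)
    coerced = pd.to_numeric(series, errors="coerce")
    # If coercion produced NaNs where the input was not already missing,
    # at least one entry is non-numeric, so treat the column as categorical
    originally_missing = pd.isna(series)
    if (coerced.isna() & ~originally_missing).any():
        return "categorical"
    return "numeric"

print(infer_kind(np.array(["A", "B", "C"], dtype=object)))  # categorical
print(infer_kind(np.array([1, 2, "3"], dtype=object)))      # numeric
print(infer_kind(np.array([1, "B", 3], dtype=object)))      # categorical
```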

As one example, using Case 3 as the input to AutoML.fit(y=mixed_input_object) causes the following error at Line 44 of preprocessing_utils.py:

unique_cnt = len(np.unique(x[~pd.isnull(x)]))

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
TypeError: '<' not supported between instances of 'str' and 'int'
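The failure above can be reproduced in isolation, since np.unique sorts its input and Python cannot order str against int (a minimal sketch):

```python
import numpy as np

mixed = np.array([1, "B", 3], dtype=object)

# np.unique sorts its input; comparing str with int raises TypeError
raised = False
try:
    np.unique(mixed)
except TypeError:
    raised = True
print("np.unique raised TypeError:", raised)
```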

As another example, using Case 3 as the input to AutoML.fit(x=mixed_input_object) causes problems with to_parquet(), as seen here: https://github.com/pandas-dev/pandas/issues/21228

To solve both of these issues, the very first thing we should do is validate the type of the data. Here is my proposed solution:

y_train_type = PreprocessingUtils.get_type(y_train)

if y_train_type == PreprocessingUtils.DISCRETE or y_train_type == PreprocessingUtils.CONTINUOUS:
    y_train = pd.to_numeric(y_train, errors='coerce')

if y_train_type == PreprocessingUtils.CATEGORICAL:
    y_train = pd.Series([str(y) for y in y_train], name="target")

pplonski commented 3 years ago

@abtheo thank you for the explanation! I see the problem now.