shahules786 opened this issue 3 years ago
Thank you @shahules786 for reporting. This might be a bug.
Could you give a minimal code example with generated data, so anyone can easily reproduce it (without any downloads)?
The issue is that the input Y data in df['Target'] is of dtype 'object'. @shahules786, you can simply cast your data to the intended type using .astype().
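For concreteness, a minimal sketch of that cast. The column name Target comes from the thread; the toy values are assumptions:

```python
import pandas as pd

# Toy frame standing in for the reporter's data; the column name "Target"
# is taken from the thread, the values are made up.
df = pd.DataFrame({"Target": pd.Series([1, 2, 3], dtype=object)})
print(df["Target"].dtype)   # object

# Cast to the intended numeric type so downstream numpy calls work.
df["Target"] = df["Target"].astype(float)
print(df["Target"].dtype)   # float64
```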
I went pretty deep into this issue (before giving up and deciding that throwing an exception was the simplest fix), so I'm happy to discuss further if ya like :)
hey @abtheo Yes, you're right. As indicated in the issue, the error is caused by the use of object types; to solve it you need to convert the object-typed values to float before passing them to np.nanmedian(), used in /mljar-supervised/supervised/preprocessing/preprocessing_utils.py. You can reproduce the error with
np.isnan([1, 2, 2, 3, "nan"])
and fix it with
pd.isna([1, 2, 2, 3, "nan"])
because pandas' isna supports object types.
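A runnable version of that comparison, using the same sample list as above:

```python
import numpy as np
import pandas as pd

data = [1, 2, 2, 3, "nan"]  # mixed object input, as in the report

# np.isnan rejects object-typed input outright.
try:
    np.isnan(data)
except TypeError as err:
    print("np.isnan failed:", err)

# pd.isna handles object dtype; note the string "nan" is an ordinary
# value here, not a missing-value marker, so nothing is flagged.
print(pd.isna(data))
```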
Also, you're most welcome to contribute; join our Slack group :)
The AutoML should work with object types in X and y variables. If y has missing values, then such rows are dropped from the training. The missing values are detected with the pd.isnull method, so maybe the problem is with the missing-value representation: it should be None or pd.NA, i.e. a value for which pd.isnull returns True. Maybe the solution to the problem is smarter detection of missing values.
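A small sketch of what pd.isnull does and does not catch; the sample values are assumptions chosen to illustrate the point:

```python
import numpy as np
import pandas as pd

# pd.isnull recognises None, np.nan and pd.NA as missing.
# A literal "?" is NOT recognised, which is the failure mode discussed here.
y = pd.Series([1.0, None, np.nan, pd.NA, "?"], dtype=object)
print(pd.isnull(y).tolist())   # [False, True, True, True, False]
```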
@pplonski Isn't AutoML filling missing values using standard methods like mean, median, mode, etc.?
@shahules786 you are right, we are filling missing values. But first you need to recognize the missing value, and it is not always represented as None. That might cause the problem. I've seen a dataset where missing values were set as the character "?"; in such a case, the AutoML will fail.
@pplonski Yes, that should be it. Maybe before filling missing values with any method, we can do pd.to_numeric(..., errors='coerce') and ensure that all the points are of float dtype. I think that will fix this issue; for example,
np.nanmedian(pd.to_numeric([1, 2, 3, "?"], errors='coerce'))
runs with no errors.
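A runnable sketch of that coercion step, using the same sample list as above:

```python
import numpy as np
import pandas as pd

raw = [1, 2, 3, "?"]

# errors='coerce' turns the unparseable "?" into NaN ...
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced)                 # [ 1.  2.  3. nan]

# ... which np.nanmedian then simply ignores.
print(np.nanmedian(coerced))   # 2.0
```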
Sounds good @shahules786. What should we do with categorical features? And what about the performance of this method: will it work fast for large datasets?
@pplonski This is only possible if AutoML imputes missing values after encoding the categorical features. Is that the case?
We first impute missing values and then do the encoding.
@pplonski OK, but we don't impute categorical values using np.nanmedian(), which is where the issue is coming from. So I think it will be okay to go with the above solution.
Practically, to handle objects of unknown content, I see three main use cases we need to cover:
# Case 1: to be encoded as CATEGORICAL
strs_as_object = np.array(["A", "B", "C"], dtype=object)
# Case 2: to be encoded as DISCRETE (or equivalently, CONTINUOUS for floats)
nums_as_object = np.array([1, 2, "3"], dtype=object)
# Case 3: to be encoded as CATEGORICAL
mixed_input_object = np.array([1, "B", 3], dtype=object)
Currently, Cases 1 and 2 work as expected; however, we are not handling Case 3. The ambiguous object type causes NumPy issues all over the place.
As one example, using Case 3 as the input to AutoML.fit(y=mixed_input_object) causes the following error to occur at line 44 of preprocessing_utils.py:
unique_cnt = len(np.unique(x[~pd.isnull(x)]))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
TypeError: '<' not supported between instances of 'str' and 'int'
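The failure can be reproduced outside AutoML with just the np.unique call from that line (a minimal sketch; mixed_input_object is Case 3 above):

```python
import numpy as np
import pandas as pd

# Case 3 from above: ambiguous object array mixing int and str.
mixed_input_object = np.array([1, "B", 3], dtype=object)

# np.unique sorts its input, and Python 3 refuses to order str against int,
# hence the TypeError raised inside preprocessing_utils.py.
try:
    np.unique(mixed_input_object[~pd.isnull(mixed_input_object)])
except TypeError as err:
    print(err)   # '<' not supported between instances of 'str' and 'int'
```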
As another example, using Case 3 as the input to AutoML.fit(x=mixed_input_object) causes problems with to_parquet(), as seen here: https://github.com/pandas-dev/pandas/issues/21228
To solve both of these issues, the very first thing we should do is validate the type of the data. Here is my proposed solution:
y_train_type = PreprocessingUtils.get_type(y_train)
if y_train_type == PreprocessingUtils.DISCRETE or y_train_type == PreprocessingUtils.CONTINUOUS:
    y_train = pd.to_numeric(y_train, errors='coerce')
if y_train_type == PreprocessingUtils.CATEGORICAL:
    y_train = pd.Series([str(y) for y in y_train], name="target")
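A self-contained sketch of that validation idea. detect_type and validate_target are hypothetical helpers standing in for PreprocessingUtils.get_type and the surrounding code; the real constants and dispatch live in mljar-supervised:

```python
import pandas as pd

# Hypothetical stand-ins for the PreprocessingUtils type constants.
DISCRETE, CONTINUOUS, CATEGORICAL = "discrete", "continuous", "categorical"

def detect_type(values):
    # Treat the column as numeric only if every entry coerces cleanly.
    coerced = pd.to_numeric(pd.Series(values, dtype=object), errors="coerce")
    if coerced.notna().all():
        return CONTINUOUS if (coerced % 1 != 0).any() else DISCRETE
    return CATEGORICAL

def validate_target(y_train):
    y_type = detect_type(y_train)
    if y_type in (DISCRETE, CONTINUOUS):
        return pd.to_numeric(pd.Series(y_train), errors="coerce")
    # CATEGORICAL: normalise everything to str so np.unique can sort it.
    return pd.Series([str(y) for y in y_train], name="target")

print(validate_target([1, 2, "3"]).tolist())   # Case 2 -> [1, 2, 3]
print(validate_target([1, "B", 3]).tolist())   # Case 3 -> ['1', 'B', '3']
```

With this, Case 3 is forced to a homogeneous str series before any sorting or median computation happens.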
@abtheo thank you for explanations! I see the problem right now.
Code to reproduce the error: the error happens due to the use of np.isnan() on object dtype, which happens in np.nanmedian(), used in /mljar-supervised/supervised/preprocessing/preprocessing_utils.py. Here I have used the BNP Paribas dataset as the training data.