pplonski / my_ml_service

My Machine Learning Web Service
MIT License

Error "y contains previously unseen labels: 'Private'" (Training may be unable to encode all categoricals) #4

Closed: dezoito closed this issue 4 years ago

dezoito commented 4 years ago

When first testing the RandomForestClassifier class I got an error:

python manage.py test apps.ml.tests

{'status': 'Error', 'message': "y contains previously unseen labels: 'Private'"}

I believe that, because of the 30% test/train split, no person with the workclass "Private" ended up in the training data, and thus that value was never encoded to a number in the artifact generated from the training dataset.
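To illustrate the mechanism described above, here is a minimal sketch in plain Python. It is a toy stand-in for a label encoder, not the code from this project: the sample values, the `mapping` dict, and the `encode` helper are all made up for illustration, and the error text is copied from the message reported above.

```python
# Toy stand-in for a label encoder (hypothetical, for illustration only).
# The mapping is built from the training values alone, so any category that
# appears only in the test data has no numeric code.
train_workclass = ["State-gov", "Self-emp-inc", "State-gov", "Federal-gov"]
test_workclass = ["Private", "State-gov"]

# Assign an integer code to every category seen during training.
mapping = {label: code for code, label in enumerate(sorted(set(train_workclass)))}


def encode(values, mapping):
    """Encode values with a fitted mapping, failing on unseen categories."""
    encoded = []
    for value in values:
        if value not in mapping:
            # Same wording as the error reported in this issue.
            raise ValueError(f"y contains previously unseen labels: {value!r}")
        encoded.append(mapping[value])
    return encoded


try:
    encode(test_workclass, mapping)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because the mapping depends on which rows land in the training split, a different random split can make the error appear or disappear, which may be why rerunning the notebook helped.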

Rerunning the training and artifact generation in the Jupyter notebook seemed to fix it for me.

(Posting this just in case someone gets stuck due to this error, as I have no suggestions on how to stop this from happening in the first place)

pplonski commented 4 years ago

Yes, the problem is with an unseen category.

pplonski commented 4 years ago

In the AutoML package that I'm working on I have a try ... except block for such situations. You can check details here: https://github.com/mljar/mljar-supervised/blob/master/supervised/preprocessing/label_encoder.py#L14
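As a rough sketch of the try ... except idea mentioned above (this is not the code from the linked file; the `SafeLabelEncoder` class and the `UNSEEN_CODE` fallback value of -1 are assumptions chosen for illustration), an encoder can catch the lookup failure and return a reserved code instead of raising:

```python
class SafeLabelEncoder:
    """Hypothetical encoder that tolerates categories not seen during fit."""

    # Assumed sentinel for unseen categories; any value outside 0..n-1 would do.
    UNSEEN_CODE = -1

    def __init__(self):
        self.mapping_ = {}

    def fit(self, values):
        # Assign integer codes to the categories present in the training data.
        self.mapping_ = {
            label: code for code, label in enumerate(sorted(set(values)))
        }
        return self

    def transform(self, values):
        encoded = []
        for value in values:
            try:
                encoded.append(self.mapping_[value])
            except KeyError:
                # Unseen category: fall back to the sentinel instead of failing.
                encoded.append(self.UNSEEN_CODE)
        return encoded


encoder = SafeLabelEncoder().fit(["State-gov", "Federal-gov"])
codes = encoder.transform(["Private", "State-gov"])
print(codes)
```

Whether a sentinel code is acceptable depends on the model. A tree-based classifier will simply treat it as one more value, while other models may need a different strategy for unknown categories.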

dezoito commented 4 years ago

Will do. Thank you for the awesome work.