scikit-learn-contrib / DESlib

A Python library for dynamic classifier and ensemble selection
BSD 3-Clause "New" or "Revised" License

"invalid value encountered in true_divide" #198

Closed jayahm closed 4 years ago

jayahm commented 4 years ago

Hi,

I got the following warnings when running "Probabilistic DES"

C:\Users\razai0002\Anaconda3\lib\site-packages\deslib\des\probabilistic\base.py:133: RuntimeWarning: invalid value encountered in true_divide
  competences = competences / sum_potential.reshape(-1, 1)
C:\Users\razai0002\Anaconda3\lib\site-packages\deslib\des\probabilistic\base.py:164: 
RuntimeWarning: invalid value encountered in greater
  selected_classifiers = (competences > selection_threshold)

Where should I check to fix this?

Menelau commented 4 years ago

@jayahm Hi,

This sounds like the sum_potential variable contains NaN values. Which of the probabilistic DES methods are you using? There are five methods of this type in the library: DES-RRC, DES-KL, Exponential, Logarithmic, and MinimumDifference.
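
For reference, a minimal sketch of running those five methods on synthetic data. The import path deslib.des.probabilistic is inferred from the traceback above, and the data setup is an assumption, not the reporter's code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from deslib.des.probabilistic import (RRC, DESKL, Exponential, Logarithmic,
                                      MinimumDifference)

# Synthetic data split into training, DSEL (dynamic selection) and test sets.
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_dsel, y_train, y_dsel = train_test_split(X_train, y_train, test_size=0.5, random_state=0)

# Pool of base classifiers shared by all five probabilistic DES methods.
pool = BaggingClassifier(n_estimators=10, random_state=0).fit(X_train, y_train)

for Method in (RRC, DESKL, Exponential, Logarithmic, MinimumDifference):
    des = Method(pool_classifiers=pool)
    des.fit(X_dsel, y_dsel)                    # DSEL is used for competence estimation
    print(Method.__name__, des.score(X_test, y_test))
```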

Can you provide a code snippet where this problem occurs? That would really help me find the issue.

jayahm commented 4 years ago

May I know what sum_potential is?

I used all the methods under probabilistic.

jayahm commented 4 years ago

Here is the code: https://www.dropbox.com/sh/snzoq5zu8x6f922/AAARc-uTY5R6IDeqVw11T2Kra?dl=0

jayahm commented 4 years ago

May I get an update on this? I'm afraid there is something wrong with my dataset that has caused the issue.

Menelau commented 4 years ago

Hello @jayahm,

I'm checking that right now and will give you an answer, probably tomorrow. Even if there is a problem with the dataset (I still need to check whether that is the cause), I will probably still need to change the code to be more robust to such problems and/or give back meaningful error messages to the user when they occur.

Menelau commented 4 years ago

@jayahm Hello,

I carefully investigated this problem, and it is being caused by an outlier in the test data whose feature values are orders of magnitude larger than any other (the instance with index 399 in X_test). Because this instance is so extreme, its distance to every data point in X_dsel is very large. The potential function used by these models relies on that distance information to weight the influence of each neighbor by its distance to the query, and with such large distances it underflows to 0. That causes a division by zero later and, consequently, NaN values in the competence estimates.
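
A minimal NumPy sketch of this failure mode, using a Gaussian-style potential exp(-d**2) and made-up distance values purely for illustration (this is not the library's actual code):

```python
import numpy as np

# Distances from an extreme outlier query to its neighbors in the DSEL set (made-up values).
distances = np.array([1e3, 2e3, 5e3])

# The Gaussian-style potential underflows to exactly 0 for such large distances.
potentials = np.exp(-distances ** 2)          # -> array([0., 0., 0.])
sum_potential = potentials.sum()              # -> 0.0

# Normalizing then divides 0 by 0, producing NaN and the
# "invalid value encountered in true_divide" RuntimeWarning
# (the exact message varies across NumPy versions).
competences = potentials / sum_potential      # -> array([nan, nan, nan])
```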

I will send a patch to make this computation more robust: if this problem arises, a value is assigned to the entries that become 0, so the division by zero does not occur and this warning is eliminated (and the user is possibly warned about potential problems in the data). However, you should also check your dataset to see whether the example causing the error should indeed be part of the data, or whether it was a mistake in data collection and should be fixed.
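
One common way to guard such a normalization, shown only as an illustration and not necessarily the exact patch applied in DESlib:

```python
import numpy as np

def normalize_competences(competences):
    """Row-normalize competences without producing NaN when a row sums to 0."""
    sum_potential = competences.sum(axis=1)
    sum_potential[sum_potential == 0] = 1.0   # leave all-zero rows at zero instead of computing 0/0
    return competences / sum_potential.reshape(-1, 1)

print(normalize_competences(np.array([[0.0, 0.0], [0.3, 0.7]])))
# [[0.  0. ]
#  [0.3 0.7]]
```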

jayahm commented 4 years ago

@Menelau ,

Can normalization solve the issue in the dataset?

Menelau commented 4 years ago

@jayahm ,

It can probably solve your problem, especially if you use a normalization method that is robust to outliers, such as RobustScaler in scikit-learn:

RobustScaler: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.RobustScaler.html
Normalization with outliers: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py
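
A minimal sketch of that preprocessing step, assuming the X_train / X_dsel / X_test splits used earlier in the thread; the scaler is fitted on the training data only and reused for the other splits:

```python
from sklearn.preprocessing import RobustScaler

# RobustScaler centers and scales with the median and IQR, so a single
# extreme outlier has much less influence on the scaling statistics.
scaler = RobustScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_dsel_s = scaler.transform(X_dsel)
X_test_s = scaler.transform(X_test)
```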

I also sent a patch making the code more robust in order to avoid division by zero, so the version on the master branch should not give you these warnings anymore.

jayahm commented 4 years ago

So, I should update the library?

Menelau commented 4 years ago

Yes, you should update to the new master code.