scikit-learn-contrib / boruta_py

Python implementations of the Boruta all-relevant feature selection method.
BSD 3-Clause "New" or "Revised" License

Fixed "Tuple Index Out of range error", unit test and example notebook #48

Closed guitarmind closed 5 years ago

guitarmind commented 5 years ago

This PR relates to #47, a bug I introduced in #46. We should only check the size of the first dimension of the not_selected array, since it is already 0 when all features are relevant.

This time I have also double-checked that the unit tests pass (I fixed a small issue in them as well).
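For illustration, here is a minimal sketch of why the two-dimension check fails. It assumes not_selected is a 1-D NumPy array of feature indices; the helper names old_check and fixed_check are hypothetical and do not appear in boruta_py.py.

```python
import numpy as np


def old_check(not_selected):
    # Check in the style quoted in this thread: it also reads shape[1].
    return not_selected.shape[0] > 0 and not_selected.shape[1] > 0


def fixed_check(not_selected):
    # Only the first dimension is inspected, so 1-D arrays are safe.
    return not_selected.shape[0] > 0


# Assumed shapes: a 1-D index array, empty when every feature is relevant.
all_relevant = np.array([], dtype=int)
some_rejected = np.array([2, 5], dtype=int)

# An empty array short-circuits on shape[0], so both checks agree.
print(old_check(all_relevant), fixed_check(all_relevant))

# A non-empty 1-D array has a one-element shape tuple, so shape[1] fails.
try:
    old_check(some_rejected)
except IndexError as exc:
    print("old check failed:", exc)

print(fixed_check(some_rejected))
```

The old check only breaks once at least one feature is left unselected, which may explain why it can go unnoticed on some datasets.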

Here is the output log of unit test:

$ python unit_tests.py
..
----------------------------------------------------------------------
Ran 2 tests in 11.859s

OK

While testing correctness, I also found some compatibility issues with Python (I'm using 3.6.5) and Pandas in the example notebook and fixed them, so it should work with current versions as well!

freshnemo commented 5 years ago

Hi @guitarmind, I tried the Madalon_Data_Set notebook (.ipynb) provided with the package. It failed with the error "TypeError: unhashable type: slice", raised from "pandas/core/generic.py line 2487 : res = cache.get(item)". I am using Python 3.6.4 and pandas 0.23.4. Could you share the package versions of your test environment?

guitarmind commented 5 years ago

Hi @freshnemo,

Are you using my forked version? The error TypeError: unhashable type: slice is actually what I was trying to fix in the Madalon_Data_Set notebook. It occurs because X needs to be a numpy array for the slicing at line 402 of the boruta_py.py source to work:

x_cur = np.copy(X[:, x_cur_ind])

In the PR I made a change in the notebook to get X in numpy format:

y = data.pop('target')
X = data.copy().values
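As a rough sketch of the difference, the snippet below uses a small made-up DataFrame (the column names f0 and f1 and all values are hypothetical, not the notebook's data) to show that NumPy-style slicing works after the conversion but not on the DataFrame itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the notebook's dataset.
data = pd.DataFrame({
    "f0": [1.0, 2.0, 3.0],
    "f1": [4.0, 5.0, 6.0],
    "target": [0, 1, 0],
})

y = data.pop("target")
x_cur_ind = np.array([1])

# Indexing the DataFrame with NumPy-style [rows, columns] syntax fails.
# The exact exception type depends on the pandas version, so catch broadly.
try:
    data[:, [1]]
    frame_slicing_failed = False
except Exception as exc:
    frame_slicing_failed = True
    print(type(exc).__name__)

# After converting to a NumPy array, the slicing from boruta_py.py works.
X = data.copy().values
x_cur = np.copy(X[:, x_cur_ind])
print(x_cur.shape)
```

With a DataFrame, data.iloc[:, x_cur_ind] would be the positional equivalent, but the quoted line in boruta_py.py indexes X directly, so passing an array is the simpler option here.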

Note that this PR has not been merged, so the changes are not applied yet. Could you share the full stack trace so I can see more details? Thanks.

Test environment:

freshnemo commented 5 years ago

Yes. When I used the fork you provided, Boruta_py runs at first but soon stops. With 100 iterations, Boruta usually stops at iteration 45 and shows the error from #47, "tuple index out of range", at "if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:". My Python version is 3.6.4 and pandas is 0.23.4. In addition, I found an example on Kaggle that uses Boruta_py and works. This is why I suspect Python 3.6.4 might have a bug.

guitarmind commented 5 years ago

So what is the stack trace of the error? The line "if not_selected.shape[0] > 0 and not_selected.shape[1] > 0:" has been changed in the PR.

freshnemo commented 5 years ago

Oh, sorry, I did not check the "Files changed" tab. I have now modified the code as in your PR. Thanks for your help, the code runs now.

guitarmind commented 5 years ago

Good to know that 👍

silverstone1903 commented 5 years ago

Edit: Seems like it's working. 😄

Hi @guitarmind,

When I change line 336 to if not_selected.shape[0] > 0: I get this error:

IndexError                                Traceback (most recent call last)
<timed eval> in <module>()

<ipython-input-40-4c6a084e678c> in fit(self, X, y)
    199         """
    200 
--> 201         return self._fit(X, y)
    202 
    203     def transform(self, X, weak=False):

<ipython-input-40-4c6a084e678c> in _fit(self, X, y)
    312         tentative = np.where(dec_reg == 0)[0]
    313         # ignore the first row of zeros
--> 314         tentative_median = np.median(imp_history[1:, tentative], axis=0)
    315         # which tentative to keep
    316         tentative_confirmed = np.where(tentative_median

IndexError: too many indices for array

Before the change it only raised the tuple index error at the end, but the function otherwise worked properly. Do you have any idea?