Add shuffle parameter to fit

jmrichardson commented 2 years ago

Hi,

Added shuffle parameter to fit:

selector.fit(X, y, shuffle=False)

Default is True matching sklearn's train_test_split default.

Thank you for your merge consideration.

jvdd commented 2 years ago

Hi @jmrichardson

Thx for your contribution! I guess this PR relates to issue #14? I quickly checked your additions, and your code looks fine! I would advise to add a test for this new functionality (but feel free to let me know if you're not comfortable adding tests - I'll gladly do this myself).

I also discussed this PR with @JarneVerhaeghe - he believes that the impact of having the same split in each powershap iteration will be minimal (BUT we have not empirically checked this yet)

Cheers, Jeroen

PS: one test fails due to smth with categorical input (I'll fix this as this has nothing to do with your PR)

jmrichardson commented 2 years ago

Hi @jvdd ,

If you don't mind creating the test, I would appreciate it. I don't use poetry and ran into some issues with getting the environment setup.

I am not familiar with how powershap iterations work and just came across this package today, so very interested in if having the same split would impact effectiveness? If there is a way to pass a CV iterator (ie rolling window) for the number or iterations, perhaps that may be an alternative? I am not sure if this is possible and how much effort it would require but just a thought.

Thanks again and looking forward to using this great library with my project! :)

jmrichardson commented 2 years ago

Hi @jvdd

I've added the ability to send a CV iterator. In my use case, I used a time series iterator but any iterator yielding train/test split indices should work:

        cv = GapKFold(n_splits=10, gap_before=1, gap_after=1)  # TS CV iterator
        self.selector = PowerShap(model=CatBoostClassifier(n_estimators=250, verbose=0, use_best_model=True), verbose=True)
        self.selector.fit(X, y, cv=cv)

There is code to ensure the number of splits matches loop_its. The train/test indices are yielded into a list of tuples which is then referenced by the iteration count in shap_explainer.py.

Thank you for your consideration and please let me know if it fails any tests, or conflicts with powerhap's design/operation.

jvdd commented 2 years ago

Hi @jmrichardson,

I'll have a look at it right now! I'll add some tests for the new functionality

PS: just fixed the failing test in PR #16

JarneVerhaeghe commented 2 years ago

Hi @jmrichardson !

Firstly, thanks for the contribution and the ideas! I checked the consistency and the algorithmic properties by changing the shuffle to false in the train test split and this has no negative impact on performance and is a nice added feature!

I also checked the code for the CV iterator. The code looks decent, @jvdd is also taking a look at it. However, I saw you added it before the first call of the explain function (there are multiple of these calls in the automatic mode) which has an impact on the algorithm, mainly the automatic mode. In the automatic mode we do not need to specifiy iterations as it automatically finds the minimum best number, however adding the CV iterator with fixed folds will impede compatibility with the automatic mode. I do see a lot of merit in adding a custom CV iterator though. A first idea I have is to put the CV iterator code in the explain function and use a random fold every powershap iteration instead of a set fold. This removes the interleaving between the folds and the amount of iterations and opens the possibility to have more iterations than folds. This is good for the statistical tests for determining the relevance of the features. This is just a quick comment and I will also look into it!

jvdd commented 2 years ago

Hi @jmrichardson,

I implemented shuffle and cv myself and found out the following;

shuffle only works when stratifiy is False (see this open issue https://github.com/scikit-learn/scikit-learn/issues/16962)
cv should be an object that implements the BaseCrossValidator interface that has a .split method (with arguments X, y, groups) -> this allows us to decorate this object to create an infinite & stateful split yielder (see POC implementation below)

As a result of these findings I created PR #17 There I implemented cv support (and did not implement shuffle support)

I'm interested in hearing your opinion!

Do you think supporting shuffle is still necessary, since PR #17 supports cv
Do you think we should also support arrays / vanilla iterators as cv argument?

jmrichardson commented 2 years ago

hi @jvdd ,

Wow! Great job! I agree there is no need for the shuffle parameter with cv support. I was running into an issue with iterating the cv splitter with next() in my PR and just resorted to iterating to arrays prior. It was a hack and don't see any reason to support arrays. I have tested the PR with my use case and works wonderfully! Thank you so much for adding support for cv! I will close both this PR and also the issue.

Thanks again and appreciate the help!

jvdd commented 2 years ago

Glad to hear that it works for your use-case @jmrichardson! Thank you as well for your time & efforts (for creating this PR + discussing it with us) :smile:

PS: powershap v0.0.7 includes the cv support :tada:

predict-idlab / powershap

Add shuffle parameter to fit #15