ploomber / sklearn-evaluation

Machine learning model evaluation made easy: plots, tables, HTML reports, experiment tracking and Jupyter notebook analysis.
https://sklearn-evaluation.ploomber.io
Apache License 2.0

Updated feature selection tutorial #304

Closed gtduncan closed 1 year ago

gtduncan commented 1 year ago

Describe your changes

Edits https://github.com/ploomber/sklearn-evaluation/pull/294 so that it passes the lint check and the tutorial appears in the navbar

Original tutorial: @bbeat2782


:books: Documentation preview :books:: https://sklearn-evaluation--304.org.readthedocs.build/en/304/

coveralls commented 1 year ago

Pull Request Test Coverage Report for Build 4327038359

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report. To ensure accuracy in future PRs, please see these guidelines. A quick fix for this PR: rebase it; your next report should be accurate.


Totals Coverage Status
Change from base Build 4317956189: 0.0%
Covered Lines: 3228
Relevant Lines: 3429

💛 - Coveralls

idomic commented 1 year ago

@gtduncan looks like you have a conflict in changelog.md. Please mark the PR as ready for review when it's ready. Also, I can't see the guide via the link; please fix that

gtduncan commented 1 year ago

The docs build was failing, and I found that one of the cells was timing out: specifically, the forward selection cell in the 2.1 Forward Selection portion. I ran just that cell, found it was taking 90 seconds to run, and changed the execution timeout in conf.py to mitigate this. Let me know if there's something else you'd want to do about it, because it's a pretty large jump. I'm also confused about how the docs build passed on @bbeat2782's CI. I also added @neelasha23's requested edits from the original PR
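For reference, a timeout change of the kind described above might look like the fragment below. This is only a sketch: it assumes the docs are built with MyST-NB 0.14 or later (the thread only mentions conf.py and an nbclient timeout), and the value 180 is illustrative, not the one used in this PR.

```python
# conf.py (fragment), assuming a Sphinx build that executes notebooks via MyST-NB >= 0.14.
# Maximum number of seconds a single notebook cell may run before
# nbclient raises CellTimeoutError. 180 is an illustrative value only.
nb_execution_timeout = 180
```

Older MyST-NB releases used the name `execution_timeout` for the same setting, so the right key depends on the installed version.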

neelasha23 commented 1 year ago

docs build is failing @gtduncan

edublancas commented 1 year ago

I think the problem is here:

nbclient.exceptions.CellTimeoutError: A cell timed out while it was being executed, after 120 seconds.
The message was: Cell execution timed out.
Here is a preview of the cell contents:
-------------------
['from sklearn.feature_selection import SequentialFeatureSelector', 'from sklearn.ensemble import RandomForestClassifier', 'from sklearn_evaluation import plot', 'rfc = RandomForestClassifier()', 'forward_select = SequentialFeatureSelector(']
...
[')', 'forward_select.fit(X_clf_train, y_clf_train)', 'features = forward_select.get_feature_names_out()', 'rfc.fit(X_clf_train[features], y_clf_train)', 'plot.feature_importances(rfc)']
-------------------

let's try reducing the data size or any other parameter that affects runtime

worst case, we can increase the cell timeout but that should be our last resort since it'll slow down the doc building process
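As a rough illustration of the runtime levers mentioned above, the sketch below runs a sequential selection on a deliberately small problem. Everything here is an assumption for illustration: the synthetic data stands in for X_clf_train and y_clf_train (which are not shown in this thread), and the row count, n_estimators, n_features_to_select and cv values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

# Synthetic stand-in for the tutorial data: few rows and few columns.
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)

# Fewer trees means each individual fit is cheaper.
rfc = RandomForestClassifier(n_estimators=10, random_state=0)

# Fewer CV folds and fewer features to remove mean fewer fits overall.
backward_select = SequentialFeatureSelector(
    rfc, direction="backward", n_features_to_select=4, cv=2
)
backward_select.fit(X, y)

selected = backward_select.get_support()  # boolean mask over the 6 columns
print(selected.sum())
```

Whether any of these knobs is enough for the real tutorial data depends on its size, which is not stated here.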

gtduncan commented 1 year ago

I was able to get the cell you mentioned to run quicker by changing the n_features_to_select parameter from 'auto' to 0.1, as seen here: forward_select = SequentialFeatureSelector(rfc, direction='forward', n_features_to_select=0.1). However, the following cell calls backward_select = SequentialFeatureSelector(rfc, direction='backward', n_features_to_select='auto'), and regardless of how I change the n_features_to_select parameter there, the cell takes around two and a half minutes to execute. Any ideas on how to reduce this runtime?
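One possible reason for the gap between the two directions: a greedy sequential selector scores every remaining candidate feature with cross-validation at each step, and backward selection needs many more steps when only a few features are kept. The helper below is a hypothetical back-of-the-envelope counter, not part of scikit-learn, and the 30-column dataset is an assumed size since the real one is not given in this thread.

```python
def sfs_fit_count(n_features, n_to_select, direction, cv=5):
    """Estimate how many estimator fits a greedy sequential selection needs."""
    if direction == "forward":
        steps = n_to_select               # add one feature per step
    else:
        steps = n_features - n_to_select  # remove one feature per step
    # At step i there are (n_features - i) candidates, each scored with cv folds.
    candidates = sum(n_features - i for i in range(steps))
    return candidates * cv


n_features = 30  # hypothetical column count
n_to_select = 3  # roughly what n_features_to_select=0.1 gives for 30 columns

forward_fits = sfs_fit_count(n_features, n_to_select, "forward")
backward_fits = sfs_fit_count(n_features, n_to_select, "backward")
print(forward_fits, backward_fits)  # 435 2295
```

Under these assumptions backward selection needs about five times as many fits, and each of those fits also trains on more columns than the forward ones do.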

edublancas commented 1 year ago

Reducing the number of rows will help with runtime (I'm guessing that's the rfc parameter). How large is it?

gtduncan commented 1 year ago

The rfc parameter is RandomForestClassifier(). I've tried lowering the n_estimators parameter in that model as well, which seems to make the forward selection run pretty quickly, but the backward selection still times out. It does run locally, but I think it may just be too slow for the CI... I'll keep looking into solutions

edublancas commented 1 year ago

I think we should make it a non-runnable cell (create it as a Markdown cell in Jupyter) and just copy in whatever output it produces