interpretml / interpret

Fit interpretable models. Explain blackbox machine learning.
https://interpret.ml/docs
MIT License
6.2k stars 726 forks

Explainers stopped working on DigitalOcean Droplet #266

Open jayswartz opened 3 years ago

jayswartz commented 3 years ago

Hello, great repo! I installed the code and ran my first tests on Friday 20210723 with excellent results.

I am training numerous models in a JupyterLab notebook with MLflow integrated. I included calls to ebm.explain_global() and ebm.explain_local(X_test, y_test). I trained 75 models on Friday, and both explain_global and explain_local worked well, adding inline clickable analyses. I then trained another 708 models across four additional notebooks.

When I opened the notebooks today, Monday 20210726, no explainers had been generated in the four additional notebooks. In the first notebook, where there used to be explainers, there are now file-not-found icons.

The following questions come to mind:

  1. Is there a limit to the number of explainers?
  2. Is there an issue with running multiple concurrent notebooks?
  3. Could this be related to any of the recent code updates?
  4. Any suggestions on where to look for debugging?

Code that worked perfectly on Friday that now does not generate output:

        ebm_global = ebm.explain_global()
        show(ebm_global)

        ebm_local = ebm.explain_local(X_test, y_test)
        show(ebm_local)

Again, great repo, just stuck on getting the explainers to run.

jayswartz commented 3 years ago

When I run the notebook on my local laptop, rather than on the DigitalOcean Droplet that exhibits the bug, the explainers work.

Based on this, and the fact that I was running multiple notebooks, a new question:

  1. Is there some form of throttling or other limiter that may have been triggered by the large volume of models?

jayswartz commented 3 years ago

I've been fitting EBM models all day today on the DigitalOcean Droplet, without the benefit of explainers, to build a base of predictions I can measure against ground truth. I've been encountering a variety of solvable errors and restarting the notebook. I restarted the notebook at 1:20 PM MT and the explainers are now running!?! The run just prior to this failed on a type error (error data provided below). For background, I am training models to predict property listings for the 783 largest counties in the US. I processed ~80 million records twice, and this is the sole type error encountered, so it appears to be an edge case.

Could the error below have cleared the explainer error?

In any event, this is certainly performing as I would expect an Alpha release to perform.

The error from the prior notebook run, which is related to EBM:

    RemoteTraceback                           Traceback (most recent call last)
    RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/home/jay/anaconda3/lib/python3.8/multiprocessing/pool.py", line 125, in worker
        result = (True, func(*args, **kwds))
      File "/home/jay/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 595, in __call__
        return self.func(*args, **kwargs)
      File "/home/jay/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 262, in __call__
        return [func(*args, **kwargs)
      File "/home/jay/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 262, in <listcomp>
        return [func(*args, **kwargs)
      File "/home/jay/anaconda3/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py", line 513, in fit_parallel
        self.inter_indices_, self.inter_scores_ = self._build_interactions(
      File "/home/jay/anaconda3/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py", line 566, in _build_interactions
        scores_train = EBMUtils.decision_function(
      File "/home/jay/anaconda3/lib/python3.8/site-packages/interpret/glassbox/ebm/utils.py", line 390, in decision_function
        for _, _, scores in scores_gen:
      File "/home/jay/anaconda3/lib/python3.8/site-packages/interpret/glassbox/ebm/utils.py", line 363, in scores_by_feature_group
        scores = tensor[tuple(sliced_X)]
    TypeError: 'NoneType' object is not subscriptable
    """

The above exception was the direct cause of the following exception:

    TypeError                                 Traceback (most recent call last)

    <ipython-input-…> in <module>
        758 # Fit & Predict EBM synthetic listing_closed
        759 if run_fit_predict_ebm_syn_lc:
    --> 760     fit_predict_data = prep_fit_score_ebm_lc_syn_mlf(
        761         TRACKING_URI,
        762         ARTIFACT_LOCATION,

    <ipython-input-…> in prep_fit_score_ebm_lc_syn_mlf(TRACKING_URI, ARTIFACT_LOCATION, EXPERIMENT_NAME, gamma_weight, max_depth_setting, child_weight, test_percentage, prediction_path, baseline_feature_set, county_count, county_fips, data_version, end_date_yymmdd, experiment, experimental_feature_set, investigation, merge_version, normalized_feature_set, output_report, output_version, run_name, run_normalize_features, transform_df, transform_path, version)
        141     f.write('Classifier Model EBM \n')
        142     # Fit model to training data
    --> 143     ebm.fit(X_train, y_train)
        144     y_pred = ebm.predict(X_test)
        145

    ~/anaconda3/lib/python3.8/site-packages/interpret/glassbox/ebm/ebm.py in fit(self, X, y, sample_weight)
        995         )
        996
    --> 997         estimators = provider.parallel(BaseCoreEBM.fit_parallel, train_model_args_iter)
        998
        999     def select_pairs_from_fast(estimators, n_interactions):

    ~/anaconda3/lib/python3.8/site-packages/interpret/provider/compute.py in parallel(self, compute_fn, compute_args_iter)
         18
         19     def parallel(self, compute_fn, compute_args_iter):
    ---> 20         results = Parallel(n_jobs=self.n_jobs, backend='multiprocessing')(
         21             delayed(compute_fn)(*args) for args in compute_args_iter
         22         )

    ~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
       1059
       1060         with self._backend.retrieval_context():
    -> 1061             self.retrieve()
       1062         # Make sure that we get a last message telling us we are done
       1063         elapsed_time = time.time() - self._start_time

    ~/anaconda3/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
        938         try:
        939             if getattr(self._backend, 'supports_timeout', False):
    --> 940                 self._output.extend(job.get(timeout=self.timeout))
        941             else:
        942                 self._output.extend(job.get())

    ~/anaconda3/lib/python3.8/multiprocessing/pool.py in get(self, timeout)
        769             return self._value
        770         else:
    --> 771             raise self._value
        772
        773     def _set(self, i, obj):

    TypeError: 'NoneType' object is not subscriptable

interpret-ml commented 3 years ago

Hi @jayswartz - just got to this, sounds like a mess!

We haven't tried it on a DigitalOcean Droplet before, so we're not sure about anything environment-specific to it. In terms of show not working: on non-local machines you could be running into a myriad of issues, including port forwarding, since by default the visualizations are served via a Flask server. To get around this behavior, run the code below and the visualizations will be injected as inline JavaScript within the notebook itself.

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())
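
(Note: set the provider before the show() calls you want rendered inline; it takes effect for subsequent calls.)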

This last problem you're facing is pretty weird; the quickest way to triage it would be to roll back to interpret v0.2.5 and see if the same problem occurs. It looks like either the characteristics of the data you're processing have changed and we've hit an edge case, or a bug was introduced into the latest release after some refactoring.
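
For reference, the rollback is just a version pin (assuming a pip-managed environment; conda users would pin the conda package instead):

    pip install interpret==0.2.5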

jayswartz commented 3 years ago

Thank you! This works.

I'm rerunning with this to see if the type error resurfaces. If it does, I'll do the rollback.

Could someone point me to where to find the dataframes that back show? I'd like to gather the data together across the 783 models I'm training, to look for patterns that explain the performance variation between models.

jayswartz commented 3 years ago

There appear to be memory leaks or table overflows in the JS alternative. I can only get through runs of 8-9 counties before the notebook fails with a single JupyterLab error message, Error code: SIGTRAP, which indicates an out-of-memory situation. I suspect that the numerous show calls are too much for this type of application. A mod that pushes the show output to external JSON, CSV, or other formats and releases the memory would enable me to proceed, while also adding some utility.

I can still use interpret, but it would be VERY helpful if someone could tell me how to get at, or save, the underlying data structures that populate the show function, so I don't have to fork the code and crawl through it to find them.

interpret-ml commented 3 years ago

Hi @jayswartz,

Haven't seen that error before on the client, but your intuition sounds right; I will have to check this. If it's available, do you see this error in Jupyter Notebook as well?

You can call .data on the explanation with an optional index for selecting a feature or instance, depending on whether you're using a global or local explanation. This is what is used for the visualizations.

## Access the underlying data as a dictionary
ebm_global = ebm.explain_global()
# show(ebm_global) 
data_di = ebm_global.data(0)  # Get first feature as data.
data_di = ebm_global.data()  # Get overall/summary data if available.
data_di = ebm_global.data(-1)  # Get everything in one dictionary.

## Access the underlying visualization (i.e. Plotly Figure).
ebm_global.visualize(0)  # Get figure of first feature.
ebm_global.visualize()  # Get figure of overall/summary.
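
For the batch use case above, here's a minimal sketch of persisting explanations to disk instead of keeping show() widgets alive in the notebook (the file paths and county label are illustrative; it assumes visualize() returns a Plotly Figure as noted above, and that the explanation exposes feature_names):

    import json

    ebm_global = ebm.explain_global()

    # Dump the full explanation dictionary; default=str copes with numpy
    # arrays and other values the json module can't serialize natively.
    with open("ebm_global_county_01001.json", "w") as f:  # illustrative path
        json.dump(ebm_global.data(-1), f, default=str)

    # Write each per-feature figure to standalone HTML via Plotly's
    # write_html, so nothing has to stay live in the notebook.
    for i in range(len(ebm_global.feature_names)):  # assumes feature_names is populated
        ebm_global.visualize(i).write_html(f"ebm_county_01001_feature_{i}.html")

Dropping the references afterwards (e.g. del ebm_global) and skipping show() entirely should keep notebook memory roughly flat across counties.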

jayswartz commented 3 years ago

Thank you! D'oh, I should have tried something obvious like this. Getting the one dictionary is exactly what I needed.

FYI, after a few more runs, I managed to lock up Anaconda as well as Chrome with an error -9, again indicating memory. I commented out the explainers to see if I can get a clean run for all 783 counties. It's made it to 21 so far, so it looks promising for production use. I can build up a subset of runs with the explainers on, as long as I don't go above 7 counties.