Closed AlJohri closed 6 years ago
Can you show us the entire error?
hi @echo66, since it's not an error, that's all there was in the logs. This is the article I'm running the classifier on: https://gist.githubusercontent.com/AlJohri/c32da2a4d9378f42f2f80b00c74ffe71/raw/5eac3730fa634f1acbb19b253e22008649e52082/article.json
potomac-api_1 | 2018-01-19 13:47:10,010 Starting new HTTP connection (1): content-api-wrapper.internal.clavis.nile.works
potomac-api_1 | 2018-01-19 13:47:10,051 http://********************:80 "GET /api/v1/articles?url=https://www.washingtonpost.com/powerpost/shutdown-looms-as-senate-democrats-dig-in-against-gop-spending-plan/2018/01/19/f4370868-fccd-11e7-a46b-a3614530bd87_story.html?hpid=hp_rhp-top-table-main_shutdown-740am-desktoptablet:homepage/story HTTP/1.1" 200 None
potomac-api_1 | /usr/local/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
potomac-api_1 | return _compile(pattern, flags).split(string, maxsplit)
How are you using LIME? Is it behind a REST endpoint? From the looks of your logs, there is nothing related to LIME code. Where is the `re` module being called? The only place LIME uses it is at https://github.com/marcotcr/lime/blob/7b4c7a5353087f47e40f8af3640e4225ce6ccb65/lime/lime_text.py#L97 . So, without looking into your code, I suspect you are using an empty pattern for the split.
Yeah, sorry, I don't have time right now to look much further, but I appreciate your response. I saw the warning after introducing the LIME code, so I'm confident it wasn't there before.
Yes, you're right, I added it as part of a REST API. The invocation seems fairly standard from the docs:
if explain:
    explanations = []
    top_topics = [x['name'] for x in sorted(responses, key=lambda x: x['decision_score'], reverse=True)][:2]
    for topic in top_topics:
        clf = classifiers[topic]['clf']
        transformer = classifiers[topic]['transformer']
        vectorizer = classifiers[topic]['vectorizer']
        c = make_pipeline(vectorizer, transformer, clf)
        explainer = LimeTextExplainer(verbose=True)
        explanation = explainer.explain_instance(text, c.predict_proba, num_features=10)
        explanations.append({"topic": topic, "features": explanation.as_list()})
    ret['explanations'] = explanations
_note: I can probably do this using a single LimeTextExplainer, but I wasn't able to figure out how to adapt my multiple binomial classifiers into a single predict_proba function that explain_instance can understand just yet. That's an issue for another time; I'm working on a multiclass, multilabel problem_
The tokenized version of the text and the output of the API is attached here: https://gist.githubusercontent.com/AlJohri/cd919cf6f4ec284877febdc1b828dd26/raw/8fa2496a972f4590ccab772b8778f0792c629d6c/response.json
Off the top of my head, I'm guessing that one of the tokens in the bag of words is an empty token?
It's because our default split expression matches an empty string, due to the '|$' in it. I don't remember why I put the '|$' in the split expression; it doesn't seem to do anything. Anyway, this is just a warning that the expression will not work in future versions of Python (it works now).
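A minimal sketch of what's going on, assuming (per the line of `lime_text.py` linked above) that the splitter appends `'|$'` to the user's split expression: a pattern that can match the empty string is what made `re.split` emit this exact FutureWarning on Python 3.5/3.6. Dropping the `'|$'` yields the same non-empty tokens on every Python version:

```python
import re

text = 'hello world'

# With '|$' appended, the pattern can match the empty string at the end of
# the text. On Python 3.5/3.6 that empty match made re.split emit:
#   FutureWarning: split() requires a non-empty pattern match.
pattern_with_dollar = r'(\W+)|$'

# Without the '|$', the pattern never matches empty, so no warning is
# raised, and filtering out empty/None pieces gives the same tokens.
tokens = [t for t in re.split(r'(\W+)', text) if t]
```

Here `tokens` comes out as `['hello', ' ', 'world']`: the capturing group keeps the separators, which is why LIME can reconstruct the original string around them.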
As to using a single LimeTextExplainer: note that an instance of LimeTextExplainer does not depend on a specific model - you can use the same explainer to explain different models, by changing the predict_proba function when you call explain_instance. It doesn't matter much anyway because the constructor of LimeTextExplainer doesn't do much actual work.
@marcotcr thanks for checking that out!
For the LimeTextExplainer: if I created a pseudo predict_proba that runs predict_proba on each of the topics and returns the combined response, would that make things faster?
def predict_proba(texts):
    probas = []
    for topic in topics:
        vectorizer = classifiers[topic]['vectorizer']
        transformer = classifiers[topic]['transformer']
        clf = classifiers[topic]['clf']
        c = make_pipeline(vectorizer, transformer, clf)
        proba = c.predict_proba(texts)
        probas.append(proba)
    return np.array(probas)  # or np.vstack(probas)?
I was imagining something like the above code (untested)
If you have N labels and put the prediction probabilities of each label in a column in the output of predict_proba, and then call explain_instance with the parameter labels=range(N), it will be faster than calling explain_instance N times.
@marcotcr thanks! I will try that
Hey, I'm getting this message when using LIME. I haven't dug in further, but hopefully this gives a start.
thanks!