marcotcr / lime

Lime: Explaining the predictions of any machine learning classifier
BSD 2-Clause "Simplified" License

FutureWarning: split() requires a non-empty pattern match #147

Closed AlJohri closed 6 years ago

AlJohri commented 6 years ago

Hey, I'm getting this message when using lime. I haven't dug in further, but hopefully this gives a start.

potomac-api_1  | /usr/local/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
potomac-api_1  |   return _compile(pattern, flags).split(string, maxsplit)

thanks!

echo66 commented 6 years ago

Can you show us the entire error?

AlJohri commented 6 years ago

Hi @echo66, since it's not an error, that's all there was in the logs. This is the article I'm running the classifier on: https://gist.githubusercontent.com/AlJohri/c32da2a4d9378f42f2f80b00c74ffe71/raw/5eac3730fa634f1acbb19b253e22008649e52082/article.json

potomac-api_1  | 2018-01-19 13:47:10,010 Starting new HTTP connection (1): content-api-wrapper.internal.clavis.nile.works
potomac-api_1  | 2018-01-19 13:47:10,051 http://********************:80 "GET /api/v1/articles?url=https://www.washingtonpost.com/powerpost/shutdown-looms-as-senate-democrats-dig-in-against-gop-spending-plan/2018/01/19/f4370868-fccd-11e7-a46b-a3614530bd87_story.html?hpid=hp_rhp-top-table-main_shutdown-740am-desktoptablet:homepage/story HTTP/1.1" 200 None
potomac-api_1  | /usr/local/lib/python3.6/re.py:212: FutureWarning: split() requires a non-empty pattern match.
potomac-api_1  |   return _compile(pattern, flags).split(string, maxsplit)

echo66 commented 6 years ago

How are you using LIME? Is it behind a REST endpoint? From the looks of your logs, there is nothing related to LIME code. Where is the re module being called? The only place LIME uses it is at https://github.com/marcotcr/lime/blob/7b4c7a5353087f47e40f8af3640e4225ce6ccb65/lime/lime_text.py#L97 . So, without looking at your code, I suspect you are using an empty pattern for the split.

AlJohri commented 6 years ago

Yeah, sorry, I don't have time right now to look much further, but I appreciate your response. I saw the warning after introducing the lime code, so I'm confident it wasn't there before.

Yes, you're right, I added it as part of a REST API. The invocation seems fairly standard from the docs:

    from sklearn.pipeline import make_pipeline
    from lime.lime_text import LimeTextExplainer

    if explain:
        explanations = []
        # explain the two topics with the highest decision scores
        for topic in [x['name'] for x in sorted(responses, key=lambda x: x['decision_score'], reverse=True)][:2]:
            clf = classifiers[topic]['clf']
            transformer = classifiers[topic]['transformer']
            vectorizer = classifiers[topic]['vectorizer']
            c = make_pipeline(vectorizer, transformer, clf)
            explainer = LimeTextExplainer(verbose=True)
            explanation = explainer.explain_instance(text, c.predict_proba, num_features=10)
            explanations.append({"topic": topic, "features": explanation.as_list()})
        ret['explanations'] = explanations

_Note: I can probably do this with a single LimeTextExplainer, but I haven't yet figured out how to adapt my multiple binomial classifiers into a single predict_proba function that explain_instance can understand; that's an issue for another time. I'm working on a multiclass, multilabel problem._

The tokenized version of the text and the output of the API is attached here: https://gist.githubusercontent.com/AlJohri/cd919cf6f4ec284877febdc1b828dd26/raw/8fa2496a972f4590ccab772b8778f0792c629d6c/response.json

Off the top of my head, I'm guessing that one of the tokens in the bag of words is an empty token?

marcotcr commented 6 years ago

It's because our default split expression matches an empty string, because of the '|$' in it. I don't remember why I put the '|$' in the split expression; it doesn't seem to do anything. Anyway, this is just a warning that the expression will stop working in future versions of Python (it works for now).
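
For reference, a minimal snippet that reproduces the warning on Python 3.6, assuming the default split expression ends up compiled as something like r'(\W+)|$':

    import re

    # Under Python 3.6 this emits:
    #   FutureWarning: split() requires a non-empty pattern match.
    # because the '|$' alternative lets the pattern match the empty string.
    re.split(r'(\W+)|$', 'hello world')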

As to using a single LimeTextExplainer: note that an instance of LimeTextExplainer does not depend on a specific model - you can use the same explainer to explain different models by changing the predict_proba function you pass to explain_instance. It doesn't matter much anyway, since the constructor of LimeTextExplainer doesn't do much actual work.
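
As a quick sketch of that reuse (model_a and model_b stand in for any two fitted pipelines that expose predict_proba):

    explainer = LimeTextExplainer()
    # same explainer, different models: only the classifier_fn changes
    exp_a = explainer.explain_instance(text, model_a.predict_proba, num_features=10)
    exp_b = explainer.explain_instance(text, model_b.predict_proba, num_features=10)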

AlJohri commented 6 years ago

@marcotcr thanks for checking that out!

For the LimeTextExplainer: if I created a pseudo predict_proba that runs predict_proba on each of the topics and returns the combined response, would that make things faster?

    import numpy as np
    from sklearn.pipeline import make_pipeline

    def predict_proba(texts):
        probas = []
        for topic in topics:
            vectorizer = classifiers[topic]['vectorizer']
            transformer = classifiers[topic]['transformer']
            clf = classifiers[topic]['clf']
            c = make_pipeline(vectorizer, transformer, clf)
            proba = c.predict_proba(texts)  # shape (n_texts, 2) per binary classifier
            probas.append(proba)
        return np.array(probas)  # or np.vstack(probas)?

I was imagining something like the above code (untested)

marcotcr commented 6 years ago

If you have N labels, put the prediction probabilities of each label in a column of predict_proba's output and call explain_instance with the parameter labels=range(N); that will be faster than calling explain_instance N times.
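
A minimal sketch of what that could look like, reusing classifiers, topics, and text from the snippets above and taking each binary classifier's positive-class column (the exact combination is an assumption, not something tested here):

    import numpy as np
    from sklearn.pipeline import make_pipeline
    from lime.lime_text import LimeTextExplainer

    def combined_predict_proba(texts):
        columns = []
        for topic in topics:
            c = make_pipeline(classifiers[topic]['vectorizer'],
                              classifiers[topic]['transformer'],
                              classifiers[topic]['clf'])
            # column 1 of each binary classifier's output is P(label=1)
            columns.append(c.predict_proba(texts)[:, 1])
        return np.column_stack(columns)  # shape (n_texts, n_labels)

    explainer = LimeTextExplainer(class_names=topics)
    explanation = explainer.explain_instance(
        text, combined_predict_proba,
        num_features=10, labels=range(len(topics)))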

AlJohri commented 6 years ago

@marcotcr thanks! I will try that