marcotcr / lime

Lime: Explaining the predictions of any machine learning classifier

Add PySpark MLlib support #115

Closed: rjurney closed this issue 5 years ago

rjurney commented 6 years ago

"All we require is that the classifier implements a function that takes in raw text or a numpy array and outputs a probability for each class."

To operate within its distributed context, Spark MLlib necessarily takes more complicated data structures as input. Would it be possible to add support for Spark? If so, I might do it, but I'm not sure it is possible.

marcotcr commented 6 years ago

It is certainly possible, but having never used Spark, I don't know whether it will be easy or hard. If the prediction function just needs extra boilerplate or an object that encapsulates the instance, it would be very easy; all one would have to do is redefine the prediction function:

def new_predict_fn(texts):
    # Convert the raw texts into whatever input structure the Spark model
    # expects (transform_texts is a placeholder for that conversion step).
    spark_objects = transform_texts(texts)
    # Run the Spark model (spark_predict is likewise a placeholder) and
    # return per-class probabilities for each input.
    predictions = spark_predict(spark_objects)
    return predictions

If Spark really needs other data (I don't know what it would need, but maybe there's something), one would need to basically implement other Explainers similar to LimeTextExplainer and LimeTabularExplainer, or inherit from these explainers and redefine the explain_instance function to factor out the extra data.
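To make that concrete, here is a rough sketch of such a wrapped prediction function for the text case, assuming an active SparkSession named spark and a fitted pyspark.ml PipelineModel named pipeline_model that reads a "text" column and produces a "probability" vector column. Those names, and the class labels, are illustrative assumptions, not part of lime or of this issue.

import numpy as np
from lime.lime_text import LimeTextExplainer

def spark_predict_proba(texts):
    # LIME passes a list of perturbed raw strings; wrap them in a one-column
    # Spark DataFrame matching the pipeline's expected input column.
    df = spark.createDataFrame([(t,) for t in texts], ["text"])
    # Run the fitted pipeline and pull the class probabilities back to the
    # driver as an (n_samples, n_classes) numpy array, which is what LIME expects.
    rows = pipeline_model.transform(df).select("probability").collect()
    return np.array([row["probability"].toArray() for row in rows])

explainer = LimeTextExplainer(class_names=["negative", "positive"])
# exp = explainer.explain_instance(some_document, spark_predict_proba, num_features=6)

Since LIME only ever sees a numpy-in, numpy-out function, no changes to LimeTextExplainer itself would be needed; the trade-off is that every batch of perturbed samples makes a round trip through Spark.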

FlorentPajot commented 6 years ago

Hello @rjurney, what are your expectations regarding a Spark implementation? Are you interested in parallelizing the explain_instance() method?

GlaserIngo commented 6 years ago

I would also be very interested in support for PySpark MLlib!

FavioVazquez commented 6 years ago

👍

talgo10 commented 6 years ago

I would like to help work on this! PySpark MLlib first, and Spark (Scala) MLlib as a next phase.

evants commented 6 years ago

👍

J2Niklas commented 6 years ago

I would also like to help work on this 👍

magedHelmy commented 6 years ago

May I kindly ask whether there are any updates on this issue?

jamesdvance commented 5 years ago

I am running into memory issues with the explain_instance() method. I would love a Spark implementation.

marcotcr commented 5 years ago

A Spark implementation is definitely not in my plans. If anyone wants to do it, I'm happy to merge :)

hellenlima commented 5 years ago

As a short term solution, it's possible to define a classifier_fn and a split_expression that converts the input to PySpark types, performs the processing in the PySpark environment and then converts the output back to Numpy arrays, as expected by the implemented explain_instance. But, obviously, by doing that we would still have memory problems for very large datasets.

> If Spark really needs other data (I don't know what it would need, but maybe there's something), one would need to basically implement other Explainers similar to LimeTextExplainer and LimeTabularExplainer, or inherit from these explainers and redefine the explain_instance function to factor out the extra data.

That is the reason I believe this is the best option ^

Ps.: Sorry for resurrecting this issue, but I also think it would be very very useful to have this support!

yeamusic21 commented 5 years ago

Microsoft has an implementation called LIME on Spark

https://github.com/Azure/mmlspark

Examples are hard to find for tabular data, but here is what I've found that has been a bit helpful.

fajim1 commented 2 years ago

> As a short term solution, it's possible to define a classifier_fn and a split_expression that converts the input to PySpark types, performs the processing in the PySpark environment and then converts the output back to Numpy arrays, as expected by the implemented explain_instance. But, obviously, by doing that we would still have memory problems for very large datasets.
>
> > If Spark really needs other data (I don't know what it would need, but maybe there's something), one would need to basically implement other Explainers similar to LimeTextExplainer and LimeTabularExplainer, or inherit from these explainers and redefine the explain_instance function to factor out the extra data.
>
> That is the reason I believe this is the best option ^
>
> Ps.: Sorry for resurrecting this issue, but I also think it would be very very useful to have this support!

Is there any example of this implementation? Thank you.
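For reference, here is a similar rough sketch for the tabular case, following the short-term approach described above. It assumes an active SparkSession named spark, a fitted pyspark.ml PipelineModel named pipeline_model that consumes the listed feature columns and outputs a "probability" vector column, and a numpy training matrix X_train; every name here is an illustrative assumption rather than part of lime's API.

import numpy as np
from lime.lime_tabular import LimeTabularExplainer

feature_names = ["f0", "f1", "f2", "f3"]  # illustrative feature column names

def spark_predict_proba(X):
    # LIME hands this function an (n_samples, n_features) numpy array of
    # perturbed rows; wrap them in a Spark DataFrame with the feature columns.
    rows = [tuple(float(v) for v in row) for row in X]
    df = spark.createDataFrame(rows, feature_names)
    # Run the fitted pipeline and collect the class probabilities back to the
    # driver as an (n_samples, n_classes) numpy array.
    out = pipeline_model.transform(df).select("probability").collect()
    return np.array([r["probability"].toArray() for r in out])

explainer = LimeTabularExplainer(X_train, feature_names=feature_names,
                                 class_names=["class_0", "class_1"])
# exp = explainer.explain_instance(X_train[0], spark_predict_proba, num_features=4)

As noted earlier in the thread, the perturbed samples are still generated on the driver, so this works around the API mismatch but not the memory pressure that comes with very large datasets.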