Closed rjurney closed 5 years ago
It is certainly possible, but having never used Spark I don't know if it will be easy or hard. If the prediction function just needs extra boilerplate or an object that encapsulates the instance, it would be very easy. All one would have to do would be to redefine the prediction function:
```python
def new_predict_fn(texts):
    # Convert the raw texts into whatever objects Spark expects
    spark_objects = transform_texts(texts)
    # Run the Spark model and return its predictions to LIME
    predictions = spark_predict(spark_objects)
    return predictions
```
If Spark really needs other data (I don't know what it would need, but maybe there's something), one would need to basically implement other Explainers similar to LimeTextExplainer and LimeTabularExplainer, or inherit from these explainers and redefine the explain_instance function to factor out the extra data.
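To make the wrapper idea above concrete, here is a minimal sketch of what the contract looks like. Note that `transform_texts` and `spark_predict` are hypothetical placeholder names from the comment above, not part of LIME or PySpark; they are stubbed out here, and the only real requirement is that the wrapper returns an `(n_samples, n_classes)` NumPy array of probabilities, which is what LIME's explainers expect:

```python
import numpy as np

def transform_texts(texts):
    # Hypothetical: build whatever input structure the Spark model needs,
    # e.g. rows of (id, text) that a Spark pipeline could ingest.
    return [(i, t) for i, t in enumerate(texts)]

def spark_predict(spark_objects):
    # Hypothetical: a real version would call model.transform() on the
    # cluster and collect the probability column back to the driver.
    # Here we return dummy two-class probabilities of the right shape.
    return np.array([[0.3, 0.7]] * len(spark_objects))

def new_predict_fn(texts):
    spark_objects = transform_texts(texts)
    predictions = spark_predict(spark_objects)
    # LIME expects an (n_samples, n_classes) numpy array of probabilities
    return np.asarray(predictions)
```

The only Spark-specific work lives inside the two helpers; LIME itself never needs to know Spark is involved.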
Hello, @rjurney, what are your expectations regarding a Spark implementation? Are you interested in parallelizing the explain_instance() method?
I would be also very interested in support for PySpark MLlib!
👍
I would like to help work on this! PySpark MLlib first, and Spark-Scala MLlib as a next phase.
👍
I would also like to help out working on this 👍
May I kindly ask what the updates are regarding this issue?
Am running into memory issues with the explain_instance() method. Would love a Spark implementation
A spark implementation is definitely not in my plans. If anyone wants to do it, I'm happy to merge : )
As a short-term solution, it's possible to define a classifier_fn and a split_expression that convert the input to PySpark types, perform the processing in the PySpark environment, and then convert the output back to NumPy arrays, as expected by the implemented explain_instance. But, obviously, by doing that we would still have memory problems for very large datasets.
> If Spark really needs other data (I don't know what it would need, but maybe there's something), one would need to basically implement other Explainers similar to LimeTextExplainer and LimeTabularExplainer, or inherit from these explainers and redefine the explain_instance function to factor out the extra data.
That is the reason I believe this is the best option ^
Ps.: Sorry for resurrecting this issue, but I also think it would be very very useful to have this support!
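A rough sketch of that short-term workaround, with pure-Python stand-ins: split_expression is the tokenization regex that LimeTextExplainer accepts, and classifier_fn is the callable passed to explain_instance. The spark_pipeline_predict helper is hypothetical (a real version would ship the texts to a PySpark pipeline and collect the probability column); here it just returns dummy scores of the right shape:

```python
import re
import numpy as np

# The regex LimeTextExplainer would use to split text into tokens
split_expression = r'\W+'

def spark_pipeline_predict(texts):
    # Hypothetical Spark round-trip: convert texts to PySpark types,
    # run the pipeline, collect probabilities back to the driver.
    # Dummy stand-in: positive-class score grows with token count.
    pos = np.array([min(len(re.split(split_expression, t)) / 10.0, 1.0)
                    for t in texts])
    return np.column_stack([1.0 - pos, pos])

def classifier_fn(texts):
    # LIME calls this with a list of perturbed strings and expects an
    # (n_samples, n_classes) numpy array of probabilities back.
    return spark_pipeline_predict(texts)
```

In real use, classifier_fn and split_expression would be passed to LimeTextExplainer and explain_instance, so all Spark conversion stays hidden inside the callable — with the memory caveat noted above, since the probabilities are collected to the driver.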
Microsoft has an implementation called LIME on Spark
https://github.com/Azure/mmlspark
Examples are hard to find for tabular data, but here is what I've found that has been a bit helpful.
> As a short-term solution, it's possible to define a classifier_fn and a split_expression that convert the input to PySpark types, perform the processing in the PySpark environment, and then convert the output back to NumPy arrays, as expected by the implemented explain_instance. But, obviously, by doing that we would still have memory problems for very large datasets.
>
> If Spark really needs other data (I don't know what it would need, but maybe there's something), one would need to basically implement other Explainers similar to LimeTextExplainer and LimeTabularExplainer, or inherit from these explainers and redefine the explain_instance function to factor out the extra data.
>
> That is the reason I believe this is the best option ^
Is there any example of this implementation? Thank you!
"All we require is that the classifier implements a function that takes in raw text or a numpy array and outputs a probability for each class."
To operate within the distributed context in which it runs, Spark MLlib takes more complicated data structures as input. Would it be possible to add support for Spark? If it is, I might do it, but I'm not sure it is possible.