microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License
5.07k stars 831 forks

[BUG] Error when using custom Transformer with TabularSHAP in SynapseML #1936

Open TakuyaInoue-github opened 1 year ago

TakuyaInoue-github commented 1 year ago

SynapseML version

0.10.1

System information

Describe the problem

Hello,

I encountered an issue when using the TabularSHAP module in SynapseML together with a custom Transformer. I received the following error message: `AttributeError: 'SimpleTransformer' object has no attribute '_to_java'`.

I believe this issue is caused either by a bug in the TabularSHAP implementation or by an incomplete implementation of my custom Transformer. Could you please help me determine which it is? If it is the latter, any suggestions for improving my implementation would be greatly appreciated.

Thank you in advance for your assistance.

Code to reproduce issue

# Imports assumed by this reproduction (not listed in the original report);
# also assumes an active SparkSession `spark`.
from pyspark import keyword_only
from pyspark.ml import Pipeline, Transformer
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.param import Param, Params
from pyspark.ml.param.shared import HasInputCol, HasOutputCol
from pyspark.ml.util import DefaultParamsReadable, DefaultParamsWritable
from pyspark.sql import functions as F
from synapse.ml.lightgbm import LightGBMClassifier

class SimpleTransformer(
    Transformer,
    HasInputCol,
    HasOutputCol,
    DefaultParamsReadable,
    DefaultParamsWritable,
):
    inputCol = Param(
        Params._dummy(),
        "inputCol",
        "inputCol",
    )
    outputCol = Param(
        Params._dummy(),
        "outputCol",
        "outputCol",
    )
    num = Param(
        Params._dummy(),
        "num",
        "the number to add to the input column",
    )

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, num=0):
        super().__init__()
        self._setDefault(num=0)
        kwargs = self._input_kwargs
        self.setParams(**kwargs)

    @keyword_only
    def setParams(self, inputCol=None, outputCol=None, num=0):
        kwargs = self._input_kwargs
        self._set(**kwargs)

    def getNum(self):
        return self.getOrDefault(self.num)

    def _transform(self, dataset):
        if not self.isSet("inputCol"):
            raise ValueError("No")

        input_columns = self.getInputCol()
        output_column = self.getOutputCol()
        num = self.getNum()

        return dataset.withColumn(output_column, F.col(input_columns) + num)

sdf = spark.createDataFrame(
    [
        [
            'iD-01',
            1,
            1,
            'a',
            4,
        ],
        [
            'iD-02',
            2,
            2,
            'b',
            3,
        ],
        [
            'iD-03',
            3,
            3,
            'c',
            4,
        ],
        [
            'iD-04',
            0,
            0,
            'b',
            1,
        ],
        *[
            [
                f'iD-SAMPLE{i}-label1',
                1,
                1,
                'a',
                4,
            ]
            for i in range(100)
        ],
        *[
            [
                f'iD-SAMPLE{i}-label2',
                2,
                2,
                'b',
                3,
            ]
            for i in range(100)
        ],
        *[
            [
                f'iD-SAMPLE{i}-label3',
                3,
                3,
                'c',
                4,
            ]
            for i in range(100)
        ],
        *[
            [
                f'iD-SAMPLE{i}-label0',
                0,
                0,
                'b',
                1,
            ]
            for i in range(100)
        ],
    ],
    schema=['ID', 'colA', 'colB', 'colC', 'colD'],
)

si = StringIndexer(inputCol='colC', outputCol='featured_colC')
st = SimpleTransformer(inputCol="colB", outputCol="newColB", num=1)
va = VectorAssembler(
    inputCols=['newColB', 'featured_colC', 'colD'], outputCol='features'
)

model = LightGBMClassifier(
    objective="multiclass",
    featuresCol="features",
    labelCol="colA",
    numTasks=3,
    useBarrierExecutionMode=True,
    categoricalSlotIndexes=[1],
    categoricalSlotNames=['featured_colC'],
)

pipeline = Pipeline(stages=[si, st, va, model])
model = pipeline.fit(sdf)

explain_instances = model.transform(sdf)

from pyspark.sql.functions import broadcast, rand
from synapse.ml.explainers import TabularSHAP

shap = TabularSHAP(
    inputCols=["colB", "colC", "colD"],
    outputCol="shapValues",
    numSamples=5000,
    model=model,
    targetCol="probability",
    targetClasses=[1, 2, 3],
    backgroundData=broadcast(sdf.orderBy(rand()).limit(100).cache()),
)

# This raises: AttributeError: 'SimpleTransformer' object has no attribute '_to_java'
shap_df = shap.transform(explain_instances)

Other info / logs

AttributeError: 'SimpleTransformer' object has no attribute '_to_java'

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

github-actions[bot] commented 1 year ago

Hey @TakuyaInoue-github :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

mhamilton723 commented 1 year ago

Thanks for reporting this @TakuyaInoue-github. Looks like @memoryz is already on the case.

memoryz commented 1 year ago

@TakuyaInoue-github can you show me the full error stack? I want to understand where this error is coming from.

TakuyaInoue-github commented 1 year ago

@memoryz Sure, the following is the stack trace of the error.

Traceback (most recent call last):
  File "/mnt/share/example/shap_example_not_working.py", line 172, in <module>
    shap_df = shap.transform(explain_instances)
  File "/opt/spark/python/pyspark/ml/base.py", line 217, in transform
    return self._transform(dataset)
  File "/opt/spark/python/pyspark/ml/wrapper.py", line 349, in _transform
    self._transfer_params_to_java()
  File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/schema/Utils.py", line 131, in _transfer_params_to_java
    pair = self._make_java_param_pair(param, self._paramMap[param])
  File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/serialize/java_params_patch.py", line 88, in _mml_make_java_param_pair
    java_value = _mml_py2java(sc, value)
  File "/home/user/.local/lib/python3.10/site-packages/synapse/ml/core/serialize/java_params_patch.py", line 60, in _mml_py2java
    obj = obj._to_java()
  File "/opt/spark/python/pyspark/ml/pipeline.py", line 333, in _to_java
    java_stages[idx] = stage._to_java()
AttributeError: 'SimpleTransformer' object has no attribute '_to_java'

I hope you find it useful. Thank you.
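Editor's note on the mechanism visible in the trace above: `_mml_py2java` in SynapseML's `java_params_patch.py` serializes the wrapped `PipelineModel` parameter by calling `_to_java()` on each pipeline stage. JVM-backed stages (those wrapping a Scala class, such as `StringIndexer` or `SQLTransformer`) implement `_to_java`; a pure-Python `Transformer` like `SimpleTransformer` does not, hence the `AttributeError`. A minimal sketch of that dispatch, with hypothetical class names standing in for the real stages (this is not SynapseML's actual code):

```python
# Sketch of the failing dispatch: the serializer assumes every pipeline
# stage can hand back a JVM object via _to_java().
class JvmBackedStage:
    """Stands in for a Scala-backed stage such as StringIndexer."""
    def _to_java(self):
        return "<py4j JavaObject>"

class PythonOnlyStage:
    """Stands in for SimpleTransformer: defines no _to_java at all."""

def py2java(stage):
    # Mirrors the obj._to_java() call in java_params_patch.py
    return stage._to_java()

print(py2java(JvmBackedStage()))   # JVM-backed stage serializes fine
try:
    py2java(PythonOnlyStage())
except AttributeError as e:
    print(type(e).__name__, e)     # the same failure mode as the trace above
```

This is why the error surfaces only at `shap.transform(...)` time: the pipeline itself runs fine in Python, but TabularSHAP must ship the whole model to the JVM.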

kappanful commented 1 year ago

Hi, is there any update on this? I encountered the same error when trying to compute SHAP values for a SparkXGBClassifier model. Thank you in advance for any information.

AlejandroGVC commented 1 year ago

Hi, I'm having the same problem with SparkXGBClassifier! Any updates?
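Editor's note for readers blocked on this: one possible workaround (a sketch, not an official fix) is to express simple column logic with Spark's built-in `SQLTransformer`, which is JVM-backed and therefore implements `_to_java()`. Assuming the `+ num` logic of `SimpleTransformer` from the reproduction above, the equivalent statement would be:

```python
# Build the SQL statement equivalent to
# SimpleTransformer(inputCol="colB", outputCol="newColB", num=1).
# __THIS__ is SQLTransformer's placeholder for the input DataFrame.
num = 1
statement = f"SELECT *, colB + {num} AS newColB FROM __THIS__"
print(statement)

# Using it requires a running SparkSession, so shown without executing:
# from pyspark.ml.feature import SQLTransformer
# st = SQLTransformer(statement=statement)  # drop-in for SimpleTransformer
# pipeline = Pipeline(stages=[si, st, va, model])
```

With every stage JVM-backed, the `_to_java()` call that TabularSHAP performs on the pipeline model should succeed; logic that cannot be expressed in SQL would still need a Scala-side transformer.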