microsoft / SynapseML

Simple and Distributed Machine Learning
http://aka.ms/spark
MIT License

ICEExplainer returns same feature importance #1894

Open akshat-suwalka-dream11 opened 1 year ago

akshat-suwalka-dream11 commented 1 year ago

SynapseML version

Version: 0.11.0

System information

Databricks Runtime 10.4 LTS ML (includes Apache Spark 3.2.1, Scala 2.12), com.microsoft.azure:synapseml_2.12:0.10.1, PySpark on Databricks

Describe the problem

In my RandomForestClassification model, which is a PySpark model, all of the features are numerical.

The output is shown under "Other info / logs" below.

Code to reproduce issue

from synapse.ml.explainers import ICETransformer

pdp_1 = ICETransformer(
    model=model_object_1,
    targetCol="probability",
    kind="average",
    targetClasses=[1],
    numericFeatures=[
        {"name": "pd1_amount_join", "numSplits": 50, "rangeMin": 0.0, "rangeMax": 400000.0}
    ],
    # convert -290 to -1
)

output_pdp_1 = pdp_1.transform(features_1.filter(features_1.days_inactive == 0))
display(output_pdp_1)
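
For context, model_object_1 and features_1 are not defined anywhere in the report. A minimal sketch of what they are assumed to look like (the column names, label column, and VectorAssembler pipeline are hypothetical, not taken from the issue):

# Hypothetical reconstruction of model_object_1 / features_1; names and
# pipeline structure are assumptions for illustration only.
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.feature import VectorAssembler

feature_cols = ["pd1_amount_join", "days_inactive"]  # assumed numeric feature columns
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(labelCol="label", featuresCol="features")

# features_1 is assumed to be a DataFrame holding the raw numeric columns plus a label
model_object_1 = Pipeline(stages=[assembler, rf]).fit(features_1)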

Below is the code which is showing error

df_userid_1 = get_pandas_df_from_column(output_pdp_1, "pd1_amount_join_dependence")
plot_dependence_for_numeric(df_userid_1, "pd1_amount_join")
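
Neither helper is defined in the issue; both appear to follow the SynapseML PDP/ICE example notebook. A rough sketch of get_pandas_df_from_column under that assumption (the notebook version may differ in detail):

# Assumed shape of the helper: it explodes the map column produced by the ICE
# transformer (feature value -> dependence vector) into one pandas column per
# feature value.
import pyspark.sql.functions as F

def get_pandas_df_from_column(df, col_name):
    keys_df = df.select(F.explode(F.map_keys(F.col(col_name)))).distinct()
    keys = [row[0] for row in keys_df.collect()]
    key_cols = [F.col(col_name).getItem(k).alias(str(k)) for k in keys]
    return df.select(key_cols).toPandas()

plot_dependence_for_numeric is discussed (and fixed) in the comments below.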

Other info / logs

1st display result -> {"264000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "0.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "400000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "80000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "336000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "56000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "32000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "384000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "24000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "152000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "72000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "248000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "160000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "176000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "200000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "296000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "368000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "376000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "168000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "64000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "184000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "240000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "88000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "360000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "320000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "256000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "352000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "136000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "8000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "312000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "16000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "192000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "216000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "232000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "272000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "104000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "392000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "224000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "128000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "288000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "344000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "208000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "40000.0": {"vectorType": "dense", "length": 1, "values": 
[0.34720682012802506]}, "96000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "280000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "112000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "48000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "144000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "304000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "328000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}, "120000.0": {"vectorType": "dense", "length": 1, "values": [0.34720682012802506]}}

2nd error ->

/databricks/spark/python/pyspark/sql/pandas/conversion.py:92: UserWarning: toPandas attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Unable to convert the field 104000.0. If this column is not necessary, you may consider dropping it or converting to primitive type before the conversion. Direct cause: Unsupported type in conversion to Arrow: VectorUDT
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warnings.warn(msg)

ValueError: invalid literal for int() with base 10: '104000.0'
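
The ValueError at the end comes from Python's int() being applied to the map keys, which are float-formatted strings; a quick illustration of the failure and of the workaround used in the fix further down:

# The map keys returned by the ICE transformer are strings such as "104000.0".
int("104000.0")         # ValueError: invalid literal for int() with base 10: '104000.0'
int(float("104000.0"))  # works: parse as float first, then truncate to 104000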

What component(s) does this bug affect?

What language(s) does this bug affect?

What integration(s) does this bug affect?

github-actions[bot] commented 1 year ago

Hey @akshat-suwalka-dream11 :wave:! Thank you so much for reporting the issue/feature request :rotating_light:. Someone from SynapseML Team will be looking to triage this issue soon. We appreciate your patience.

akshat-suwalka-dream11 commented 1 year ago

@mhamilton723

memoryz commented 1 year ago

I'll investigate.

memoryz commented 1 year ago

@akshat-suwalka-dream11 , can you modify the plot_dependence_for_numeric function to this and see if it works:

import matplotlib.pyplot as plt


def plot_dependence_for_numeric(df, col, col_int=True, figsize=(20, 5)):
    # Collect the single dependence value stored in each column of the pandas
    # DataFrame produced by get_pandas_df_from_column.
    dict_values = {}
    col_names = list(df.columns)

    for col_name in col_names:
        dict_values[col_name] = df[col_name][0].toArray()[0]

    # Sort the feature values numerically; int(float(...)) handles string keys
    # such as "104000.0" that a plain int() cannot parse.
    marklist = sorted(
        dict_values.items(), key=lambda x: int(float(x[0])) if col_int else x[0]
    )
    sortdict = dict(marklist)

    fig = plt.figure(figsize=figsize)

    plt.plot(list(sortdict.keys()), list(sortdict.values()))

    plt.xlabel(col, size=13)
    plt.ylabel("Dependence")
    plt.ylim(0.0)
    plt.show()

akshat-suwalka-dream11 commented 1 year ago

@memoryz Thank you for the reply. That fixes the plotting problem. However, for every column and every one of its buckets I see only a single constant value, like 0.34720682012802506 above. One might say the feature is simply not important and that is why the value is constant, but I am seeing the same value for every feature, which is the real problem.
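
One way to narrow this down (a suggestion, not something from the thread) is to bypass the explainer and score a handful of hand-edited rows directly; if the probability is flat there as well, the model itself, rather than the ICE transformer, is insensitive to the feature. A sketch reusing the names from the report:

# Hypothetical sanity check: vary one feature manually and score it with the
# same model passed to the ICE transformer. Assumes model_object_1, features_1
# and the column names from the issue.
from functools import reduce
from pyspark.sql import functions as F

base_row = features_1.filter(features_1.days_inactive == 0).limit(1).cache()

probes = [
    base_row.withColumn("pd1_amount_join", F.lit(v))
    for v in [0.0, 80000.0, 200000.0, 400000.0]
]
probe_df = reduce(lambda a, b: a.union(b), probes)

# If "probability" is identical across all four rows, a constant PDP/ICE curve
# for this feature is the expected output.
model_object_1.transform(probe_df).select("pd1_amount_join", "probability").show(truncate=False)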

memoryz commented 1 year ago

@akshat-suwalka-dream11 can you attach a screenshot of what you're seeing? I'm not sure if I understand what the problem is.

akshat-suwalka-dream11 commented 1 year ago

@memoryz

Screenshot 2023-04-14 at 3 21 09 PM
akshat-suwalka-dream11 commented 1 year ago
Screenshot 2023-04-14 at 3 22 02 PM
akshat-suwalka-dream11 commented 1 year ago

for every single column have this type of data

akshat-suwalka-dream11 commented 1 year ago

@memoryz