ing-bank / popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎
https://popmon.readthedocs.io/
MIT License

Decimal type support #252

Closed sbrugman closed 2 years ago

twalen commented 2 years ago

The following code triggers an error:

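from decimal import Decimal

import popmon  # noqa: F401 (importing popmon adds pm_stability_report to dataframes)
import pyspark.sql.types as T
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# One row with a DecimalType(10, 2) column. The Decimal is built from a
# string so that the value is exactly 2.34.
sdf = spark.createDataFrame([
    ("2022-01-01", 1, 1.12, Decimal("2.34"), "abc")
], T.StructType([
    T.StructField("dt", T.StringType(), True),
    T.StructField("col_int", T.IntegerType(), True),
    T.StructField("col_double", T.DoubleType(), True),
    T.StructField("col_decimal", T.DecimalType(10, 2), True),
    T.StructField("col_str", T.StringType(), True),
]))
sdf.printSchema()
sdf.show(10, False)
sdf.pm_stability_report(time_axis="dt")  # raises the ValueError shown below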

with stack trace:

...
~/.virtualenvs/random/lib/python3.7/site-packages/histogrammar/dfinterface/histogram_filler_base.py in _execute(self, df)
    198         # 1. check presence and data type of requested features
    199         # sort features into numerical, timestamp and category based
--> 200         cols_by_type = self.categorize_features(df)
    201 
    202         # 2. assign features to make histograms of (if not already provided)

~/.virtualenvs/random/lib/python3.7/site-packages/histogrammar/dfinterface/histogram_filler_base.py in categorize_features(self, df)
    406             for col in col_list:
    407 
--> 408                 dt = self.var_dtype.get(col, check_dtype(self.get_data_type(df, col)))
    409 
    410                 if col not in self.var_dtype:

~/.virtualenvs/random/lib/python3.7/site-packages/histogrammar/dfinterface/spark_histogrammar.py in get_data_type(self, df, col)
    171             dt = np.int64
    172 
--> 173         return np.dtype(dt)
    174 
    175     def process_features(self, df, cols_by_type):

~/.virtualenvs/random/lib/python3.7/site-packages/numpy/core/_internal.py in _commastring(astr)
    176                     raise ValueError(
    177                         'format number %d of "%s" is not recognized' %
--> 178                         (len(result)+1, astr))
    179                 startindex = mo.end()
    180 

ValueError: format number 1 of "decimal(10,2)" is not recognized
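
The traceback shows the Spark type string "decimal(10,2)" being passed to np.dtype, which does not recognize it. Until decimal types are supported, one possible workaround is to cast decimal columns to double before building the report. The sketch below is not part of popmon or histogrammar; the helper name cast_decimals_to_double is hypothetical, it assumes that double columns are handled correctly, and the cast can lose precision for large or high-scale decimals.

```python
def cast_decimals_to_double(sdf):
    """Return a Spark dataframe with every decimal column cast to double.

    Hypothetical helper, not part of popmon. It relies only on the
    DataFrame attributes `dtypes`, `withColumn` and `__getitem__`.
    """
    # sdf.dtypes is a list of (column name, type string) pairs,
    # e.g. ("col_decimal", "decimal(10,2)")
    for name, type_str in sdf.dtypes:
        if type_str.startswith("decimal"):
            sdf = sdf.withColumn(name, sdf[name].cast("double"))
    return sdf
```

With the dataframe from the report this would be used as cast_decimals_to_double(sdf).pm_stability_report(time_axis="dt").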

A similar problem occurs for Pandas dataframes with decimal values:

df = sdf.toPandas()
df.pm_stability_report(time_axis="dt")

triggers:

...
~/.virtualenvs/random/lib/python3.7/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364 
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'col_decimal'
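
For the Pandas case, toPandas() leaves the decimal column as an object column holding decimal.Decimal values. A possible workaround, assuming the KeyError is related to that object-dtype column (the traceback above does not confirm this), is to convert such columns to float first. The helper name decimals_to_float is hypothetical and not part of popmon; the conversion can lose precision.

```python
from decimal import Decimal

import pandas as pd


def decimals_to_float(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with Decimal-valued object columns as float64.

    Hypothetical helper, not part of popmon.
    """
    out = df.copy()
    for name in out.columns:
        if out[name].dtype != object:
            continue
        non_null = out[name].dropna()
        # Convert only columns whose non-null values are all Decimal
        if not non_null.empty and non_null.map(lambda v: isinstance(v, Decimal)).all():
            out[name] = out[name].astype(float)
    return out
```

With the dataframe from the report this would be used as decimals_to_float(df).pm_stability_report(time_axis="dt").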