ing-bank / popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎
https://popmon.readthedocs.io/
MIT License

KeyError: 'date' on synthetic data #281

Closed marrrcin closed 1 year ago

marrrcin commented 1 year ago

Hi, I'm exploring the use of your library and I've stumbled across an error when working with my data.

Popmon version: 1.4.5

Error:

 in <lambda>(plot)
    157             # filter out potential empty plots
    158             plots = [e for e in plots if len(e)]
--> 159             plots = sorted(plots, key=lambda plot: plot["date"])
    160 
    161             # basic checks for histograms

KeyError: 'date'
Full stack trace: ⬇️

```
KeyError                                  Traceback (most recent call last)
in ()
----> 1 report = popmon.df_stability_report(
      2     df,
      3     time_axis="time",
      4     time_width="1w",
      5 )

7 frames
/usr/local/lib/python3.10/dist-packages/popmon/pipeline/report.py in df_stability_report(df, settings, time_width, time_offset, var_dtype, reference, split, **kwargs)
    196 
    197     # generate data stability report
--> 198     return stability_report(
    199         hists=hists,
    200         settings=settings,

/usr/local/lib/python3.10/dist-packages/popmon/pipeline/report.py in stability_report(hists, settings, reference, **kwargs)
     73     # execute reporting pipeline
     74     pipeline = get_report_pipeline_class(settings.reference_type, reference)(**cfg)
---> 75     result = pipeline.transform(datastore)
     76 
     77     stability_report_result = StabilityReport(datastore=result)

/usr/local/lib/python3.10/dist-packages/popmon/base/pipeline.py in transform(self, datastore)
     65         for module in self.modules:
     66             self.logger.debug(f"transform {module.__class__.__name__}")
---> 67             datastore = module.transform(datastore)
     68         return datastore
     69 

/usr/local/lib/python3.10/dist-packages/popmon/pipeline/report_pipelines.py in transform(self, datastore)
    255     def transform(self, datastore):
    256         self.logger.info(f'Generating report "{self.store_key}".')
--> 257         return super().transform(datastore)

/usr/local/lib/python3.10/dist-packages/popmon/base/pipeline.py in transform(self, datastore)
     65         for module in self.modules:
     66             self.logger.debug(f"transform {module.__class__.__name__}")
---> 67             datastore = module.transform(datastore)
     68         return datastore
     69 

/usr/local/lib/python3.10/dist-packages/popmon/base/module.py in _transform(self, datastore)
     49 
     50         # transformation
---> 51         outputs = func(self, *list(inputs.values()))
     52 
     53         # transform returns None if no update needs to be made

/usr/local/lib/python3.10/dist-packages/popmon/visualization/histogram_section.py in transform(self, data_obj, sections)
    157             # filter out potential empty plots
    158             plots = [e for e in plots if len(e)]
--> 159             plots = sorted(plots, key=lambda plot: plot["date"])
    160 
    161             # basic checks for histograms

/usr/local/lib/python3.10/dist-packages/popmon/visualization/histogram_section.py in <lambda>(plot)
    157             # filter out potential empty plots
    158             plots = [e for e in plots if len(e)]
--> 159             plots = sorted(plots, key=lambda plot: plot["date"])
    160 
    161             # basic checks for histograms

KeyError: 'date'
```
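
The two lines at the bottom of the trace can be exercised in isolation. The sketch below is not popmon code: the `plots` entries are made-up dictionaries, assumed only for illustration. It shows that the `len(e)` filter removes empty entries but lets through a non-empty entry that has no `"date"` key, which is enough to raise the same `KeyError` during sorting.

```python
# Hypothetical plot entries (not real popmon output): the second one is
# non-empty but has no "date" key, the third one is empty.
plots = [
    {"date": "2022-01-10", "name": "feature_stable"},
    {"name": "feature_anomalies"},
    {},
]

# Same filter as line 158 of the trace: only the empty dict is dropped.
plots = [e for e in plots if len(e)]

# Same sort as line 159 of the trace: fails on the entry without "date".
try:
    plots = sorted(plots, key=lambda plot: plot["date"])
    error = None
except KeyError as exc:
    error = exc

# Diagnostic: list the entries that would break the sort.
missing = [e for e in plots if "date" not in e]
print(repr(error), missing)
```

Listing the entries without the key, as `missing` does here, is one way to see which feature produces the incomplete plot entry.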

Reproduction steps: https://colab.research.google.com/drive/1N59kn7C9LN6W9AJkfz9SougiZoOMM0bn?usp=sharing

Additional information: I'm using a function to generate synthetic data (see colab). When I generate less data (e.g. 200 days), the code works fine, but past some unknown threshold (around 360 days) it breaks. I've also tried changing the time_width parameter: it sometimes works with 2w and sometimes with 1d, but I haven't found a pattern.

Also note that it happens both for self-referencing data as well as data with a reference set (see second part of the colab).
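
One way to look for a pattern is to scan a parameter and record which values fail. The helper below is a minimal sketch: `scan` and `fake_report` are hypothetical names, and `fake_report` is only a stand-in with an arbitrary threshold so that the snippet runs without popmon installed.

```python
def scan(values, run):
    """Call run(value) for each value and record the KeyError raised, if any."""
    results = {}
    for value in values:
        try:
            run(value)
            results[value] = None
        except KeyError as exc:
            results[value] = exc
    return results


def fake_report(num_days):
    # Stand-in for the real report call; the threshold of 300 is arbitrary
    # and chosen only to make the example produce a failure.
    if num_days >= 300:
        raise KeyError("date")


outcome = scan([100, 200, 300, 360], fake_report)
failing = [n for n, err in outcome.items() if err is not None]
print(failing)
```

To use it on the real problem, `fake_report` would be replaced by a function that builds the dataframe for the given number of days and calls `popmon.df_stability_report(df, time_axis="time", time_width="1w")`.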

Expected result: Monitoring report generates properly.

sbrugman commented 1 year ago

Thanks for reporting Marcin, will look into it

marrrcin commented 1 year ago

@sbrugman an update from my side: it seems the following lines in the data generator are causing popmon to break:

    feature_anomalies = np.random.normal(loc=0.5, scale=0.05, size=num_days)
    anomaly_indices = np.random.choice(num_days, num_anomalies, replace=False)
    feature_anomalies[anomaly_indices] = np.random.uniform(low=-5, high=0.1, size=num_anomalies)
    feature_out_of_range = np.random.uniform(low=0, high=1, size=num_days)
    out_of_range_indices = np.random.choice(num_days, num_out_of_range, replace=False)
    feature_out_of_range[out_of_range_indices] = np.random.uniform(low=2, high=3, size=num_out_of_range)

Initially, I thought it had something to do with memory allocation / assignments, but it seems the range of values is the problem. If I increase num_anomalies to something close to half of my examples (which means generating more examples that are, e.g., out of range), the code proceeds normally. It should work in both cases though.
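
To illustrate the range-of-values hypothesis, here is a rough sketch using only the standard library. It does not reproduce popmon's binning: the equal-width binning and the bin count of 40 are assumptions made purely for illustration. It shows that with a handful of anomalies most bins over the full value range stay empty, while with many anomalies almost all of them are filled.

```python
import random


def make_feature(num_days, num_anomalies, seed=666):
    # Mimics feature_anomalies from the generator: values near 0.5,
    # with a few entries overwritten by draws from [-5, 0.1].
    rng = random.Random(seed)
    values = [rng.gauss(0.5, 0.05) for _ in range(num_days)]
    for i in rng.sample(range(num_days), num_anomalies):
        values[i] = rng.uniform(-5, 0.1)
    return values


def occupied_fraction(values, n_bins=40):
    # Assumed equal-width binning over [min, max]; NOT popmon's algorithm.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return sum(1 for c in counts if c) / n_bins


few = occupied_fraction(make_feature(300, 10))
many = occupied_fraction(make_feature(300, 150))
print(f"occupied bins: few anomalies={few:.2f}, many anomalies={many:.2f}")
```

Whether sparsely filled histograms are what actually triggers the error is not established here; the sketch only shows how differently the two settings fill the value range.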

sbrugman commented 1 year ago

@marrrcin Could you please provide the minimum reproducible code here as a snippet? Policy doesn't allow us to use colab...

marrrcin commented 1 year ago

Absolutely!

import pandas as pd
import popmon
import numpy as np

def generate_mock_data(num_days, num_anomalies, num_out_of_range, random_state=666, start_date='1/1/2022'):
    np.random.seed(random_state)
    time = pd.date_range(start=start_date, periods=num_days, freq='D')
    feature_increasing = np.arange(1, num_days+1)
    feature_decreasing = np.arange(1000000, 1000000-num_days, -1)
    feature_stable = np.random.normal(loc=0.5, scale=0.05, size=num_days)
    feature_unstable = np.random.normal(loc=0.5, scale=2.0, size=num_days)
    feature_anomalies = np.random.normal(loc=0.5, scale=0.05, size=num_days)
    anomaly_indices = np.random.choice(num_days, num_anomalies, replace=False)
    feature_anomalies[anomaly_indices] = np.random.uniform(low=-5, high=0.1, size=num_anomalies)
    feature_out_of_range = np.random.uniform(low=0, high=1, size=num_days)
    out_of_range_indices = np.random.choice(num_days, num_out_of_range, replace=False)
    feature_out_of_range[out_of_range_indices] = np.random.uniform(low=2, high=3, size=num_out_of_range)
    trend_change = np.concatenate([np.linspace(0, 3.0, num_days//2+(num_days % 2)), np.linspace(3.0, 0, num_days//2)]) + np.random.normal(loc=0, scale=0.01, size=num_days)
    cyclic_feature = np.sin(np.linspace(0, 4*np.pi, num_days)) + np.random.normal(loc=0, scale=0.1, size=num_days)
    data = {'time': time, 'feature_increasing': feature_increasing, 'feature_decreasing': feature_decreasing, 'feature_stable': feature_stable, 'feature_unstable': feature_unstable, 'feature_anomalies': feature_anomalies, 'feature_out_of_range': feature_out_of_range, 'trend_change': trend_change, 'cyclic_feature': cyclic_feature}
    df = pd.DataFrame(data)
    return df

df = generate_mock_data(num_days=300, num_anomalies=10, num_out_of_range=13)

report = popmon.df_stability_report(
    df,
    time_axis="time",
    time_width="1w",
)

sbrugman commented 1 year ago

Can confirm this is a bug with the histogram plotting with outliers, will release a patch soon!

sbrugman commented 1 year ago

@marrrcin Release is out, feel free to open up another issue if you encounter other problems. Thanks a lot!

marrrcin commented 1 year ago

Thanks for the quick fix, I confirm that it works now!