ing-bank / popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎
https://popmon.readthedocs.io/
MIT License

Histogram error on large floats #244

Closed twalen closed 2 years ago

twalen commented 2 years ago

Running pm_stability_report on float columns with large values triggers, in some cases, an AssertionError.

For example, running the following code:

import pandas as pd
import numpy as np
import popmon

np.random.seed(1)
n = 1000
start_date = pd.to_datetime("2022-01-01")
example = pd.DataFrame({
    "dt": [start_date + pd.DateOffset(i//100) for i in range(n)], 
    "a": (np.random.rand(n) - 0.5) * 10**4
})
# Scale one value in the middle of the frame by 10**4 to create an extreme outlier
example.loc[len(example)//2, 'a'] *= 10**4
example.pm_stability_report(time_axis="dt", time_width="1w")

Gives the following output:

% python popmon_bug.py
.../.virtualenvs/random/lib/python3.7/site-packages/histogrammar/dfinterface/make_histograms.py:172: UserWarning: time-axis "dt" already found in binning specifications. not overwriting.
  f'time-axis "{time_axis}" already found in binning specifications. not overwriting.'
2022-08-12 14:14:19,649 INFO [histogram_filler_base]: Filling 1 specified histograms. auto-binning.
100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 463.15it/s]
2022-08-12 14:14:19,652 INFO [hist_splitter]: Splitting histograms "hists" as "split_hists"
2022-08-12 14:14:19,654 INFO [hist_comparer]: Comparing "split_hists" with rolling sum of 1 previous histogram(s).
2022-08-12 14:14:19,666 INFO [hist_profiler]: Profiling histograms "split_hists" as "profiles"
2022-08-12 14:14:19,692 INFO [hist_comparer]: Comparing "split_hists" with reference "split_hists"
2022-08-12 14:14:19,702 INFO [pull_calculator]: Comparing "comparisons" with median/mad of reference "comparisons"
2022-08-12 14:14:19,713 INFO [pull_calculator]: Comparing "profiles" with median/mad of reference "profiles"
2022-08-12 14:14:19,749 INFO [apply_func]: Computing significance of (rolling) trend in means of features
2022-08-12 14:14:19,752 INFO [compute_tl_bounds]: Calculating static bounds for "profiles"
2022-08-12 14:14:19,795 INFO [compute_tl_bounds]: Calculating static bounds for "comparisons"
2022-08-12 14:14:19,806 INFO [compute_tl_bounds]: Calculating traffic light alerts for "profiles"
2022-08-12 14:14:19,819 INFO [compute_tl_bounds]: Calculating traffic light alerts for "comparisons"
2022-08-12 14:14:19,825 INFO [apply_func]: Generating traffic light alerts summary.
2022-08-12 14:14:19,828 INFO [alerts_summary]: Combining alerts into artificial variable "_AGGREGATE_"
2022-08-12 14:14:19,831 INFO [report_pipelines]: Generating report "html_report".
2022-08-12 14:14:19,831 INFO [overview_section]: Generating section "Overview". skip empty plots: True
100%|████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 276.10it/s]
2022-08-12 14:14:19,842 INFO [histogram_section]: Generating section "Histograms".
  0%|                                                                         | 0/1 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "popmon_bug.py", line 13, in <module>
    example.pm_stability_report(time_axis="dt", time_width="1w")
  File ".../python3.7/site-packages/popmon/pipeline/report.py", line 196, in df_stability_report
    reference=reference_hists,
  File ".../python3.7/site-packages/popmon/pipeline/report.py", line 71, in stability_report
    result = pipeline.transform(datastore)
  File ".../python3.7/site-packages/popmon/base/pipeline.py", line 69, in transform
    datastore = module.transform(datastore)
  File ".../python3.7/site-packages/popmon/pipeline/report_pipelines.py", line 250, in transform
    return super().transform(datastore)
  File ".../python3.7/site-packages/popmon/base/pipeline.py", line 69, in transform
    datastore = module.transform(datastore)
  File ".../python3.7/site-packages/popmon/base/module.py", line 50, in _transform
    outputs = func(self, *list(inputs.values()))
  File ".../python3.7/site-packages/popmon/visualization/histogram_section.py", line 141, in transform
    plots = parallel(_plot_histograms, args)
  File ".../python3.7/site-packages/popmon/utils.py", line 52, in parallel
    func(*args) if mode == "args" else func(**args) for args in args_list
  File ".../python3.7/site-packages/popmon/utils.py", line 52, in <listcomp>
    func(*args) if mode == "args" else func(**args) for args in args_list
  File ".../python3.7/site-packages/popmon/visualization/histogram_section.py", line 247, in _plot_histograms
    hists, feature, hist_names, y_label, is_num, is_ts
  File ".../python3.7/site-packages/popmon/visualization/utils.py", line 297, in plot_histogram_overlay
    len(bin_edges), len(bin_values), x_label
AssertionError: bin edges (+ upper edge) and bin values have inconsistent lengths: 43 vs 41. a

twalen commented 2 years ago

It seems that this might be an issue with Histogrammar. From debugging, it looks like in SparselyBin, in some cases, len(hist.bin_edges(low, high)) > len(hist.bin_entries(low, high)) + 1.

https://github.com/histogrammar/histogrammar-python/blob/master/histogrammar/primitives/sparselybin.py#L717
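
To illustrate the invariant that the assertion checks (one more edge than there are bin values), here is a minimal pure-Python sketch of a sparse histogram. It is not Histogrammar's implementation: `sparse_fill`, `edges_and_entries`, and the dict-of-counts layout are hypothetical names and assumptions made for this example. The idea is to derive both the edges and the entries from the same pair of integer bin indices, so their lengths cannot drift apart through separate floating-point computations.

```python
import math
import random
from typing import Dict, Iterable, List, Tuple


def sparse_fill(values: Iterable[float], bin_width: float,
                origin: float = 0.0) -> Dict[int, float]:
    """Hypothetical sparse histogram: map integer bin index -> count."""
    counts: Dict[int, float] = {}
    for v in values:
        idx = math.floor((v - origin) / bin_width)
        counts[idx] = counts.get(idx, 0.0) + 1.0
    return counts


def edges_and_entries(counts: Dict[int, float], bin_width: float,
                      origin: float, low: float,
                      high: float) -> Tuple[List[float], List[float]]:
    """Return (edges, entries) for the bins covering [low, high].

    Both lists are built from the same integer index range, so
    len(edges) == len(entries) + 1 holds by construction.
    """
    i_min = math.floor((low - origin) / bin_width)
    i_max = math.floor((high - origin) / bin_width)
    # One edge per bin start, plus the upper edge of the last bin.
    edges = [origin + i * bin_width for i in range(i_min, i_max + 2)]
    # Bins that were never filled are reported as zero.
    entries = [counts.get(i, 0.0) for i in range(i_min, i_max + 1)]
    return edges, entries


# Data shaped like the report above: values in roughly [-5000, 5000]
# with one value scaled by 10**4 (plain `random` instead of NumPy).
random.seed(1)
values = [(random.random() - 0.5) * 10**4 for _ in range(1000)]
values[500] *= 10**4

width = (max(values) - min(values)) / 40  # arbitrary width for the demo
counts = sparse_fill(values, width)
edges, entries = edges_and_entries(counts, width, 0.0,
                                   min(values), max(values))
assert len(edges) == len(entries) + 1
```

Whether the actual mismatch in SparselyBin comes from computing the two lengths independently is only a guess based on the debugging note above; the sketch shows the invariant, not a confirmed diagnosis.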