ing-bank / popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎
https://popmon.readthedocs.io/
MIT License

Error when I try to stitch histograms #266

Closed by drandas07 1 year ago

drandas07 commented 1 year ago

I am trying to make histograms from two different datasets and then stitch them together. When I do, I get this error: ValueError: Input histograms are not all similar

import popmon  # importing popmon adds the pm_* methods to DataFrames

features = ["datetime:prog_revenue"]
may10_hists = may10_df.pm_make_histograms(
    features=features, time_axis="datetime", time_width="1h", time_offset="2023-05-10"
)
may9_hists = may9_df.pm_make_histograms(
    features=features, time_axis="datetime", time_width="1h", time_offset="2023-05-09"
)

hist_add = popmon.stitch_histograms(
    hists_basis=may9_hists, hists_delta=may10_hists, mode="add"
)

Before the error is raised, the stitching step also emits this warning:

Input SparselyBin histograms have inconsistent origin attributes: [1.6835904e+18, 1.6836768e+18]

Can someone help me resolve this? I do not know how to fix it.

mbaak commented 1 year ago

Did you manage to resolve this? (Sorry, I forgot to reply.) It is important that the histograms have identical bin specifications.

mbaak commented 1 year ago

For completeness, see the function: get_bin_specs() in the example notebook: https://github.com/ing-bank/popmon/blob/master/popmon/notebooks/popmon_tutorial_incremental_data.ipynb
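To make the warning concrete, here is a small standard-library sketch (not popmon code) of where the two origin values appear to come from. It assumes the origin is the time_offset expressed in nanoseconds since the Unix epoch, which is what the numbers in the warning suggest; the helper names offset_to_ns and bin_index are made up for this illustration.

```python
from datetime import datetime, timezone

NS_PER_SECOND = 1_000_000_000
BIN_WIDTH_NS = 3600 * NS_PER_SECOND  # corresponds to time_width="1h"


def offset_to_ns(date_str: str) -> int:
    """Nanoseconds since the Unix epoch for a UTC date given as YYYY-MM-DD."""
    dt = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    return int(dt.timestamp()) * NS_PER_SECOND


def bin_index(timestamp_ns: int, origin_ns: int, width_ns: int = BIN_WIDTH_NS) -> int:
    """Index of the sparse bin that a timestamp falls into, relative to an origin."""
    return (timestamp_ns - origin_ns) // width_ns


# Assumption: each histogram's origin equals its time_offset in nanoseconds.
origin_may9 = offset_to_ns("2023-05-09")
origin_may10 = offset_to_ns("2023-05-10")
print(origin_may9, origin_may10)

# The same instant (noon on May 10) gets a different bin index per origin,
# so the two sets of bins cannot be added index by index.
noon_may10 = origin_may10 + 12 * BIN_WIDTH_NS
print(bin_index(noon_may10, origin_may9), bin_index(noon_may10, origin_may10))
```

Under that assumption, passing the same time_offset to both pm_make_histograms calls would give both histograms the same origin on the time axis.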

drandas07 commented 1 year ago

I have not managed to resolve the issue yet, but we now understand that the origins must be identical. Your pointer to get_bin_specs() also helped.

That said, I am still trying to work out how to set or customise the origin value in the snippet I shared above.

mbaak commented 1 year ago

Consistent binning for histograms (for stitching and comparison) can be imposed as follows:

  1. get the bin specs from one dict of histograms (the first axis, time, is skipped here): bin_specs = popmon.get_bin_specs(hists, skip_first_axis=True)

  2. impose it on the next created histograms: h = df.pm_make_histograms(features=features, bin_specs=bin_specs)

Changing the origin of a histogram can be done with: h.origin = value
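As a rough illustration of why the origin matters when adding sparsely binned histograms, here is a toy model in plain Python. It is not popmon's or histogrammar's implementation: SparseHist, rebase and add are hypothetical names invented for this sketch. In this toy model, moving a histogram to a new origin also requires shifting its bin indices, and that only works when the two origins differ by a whole number of bin widths.

```python
from dataclasses import dataclass, field


@dataclass
class SparseHist:
    """Toy sparse histogram: only filled bins are stored, keyed by bin index."""
    origin: int
    bin_width: int
    bins: dict = field(default_factory=dict)

    def fill(self, x: int) -> None:
        idx = (x - self.origin) // self.bin_width
        self.bins[idx] = self.bins.get(idx, 0) + 1


def rebase(hist: SparseHist, new_origin: int) -> SparseHist:
    """Return a copy of hist expressed relative to new_origin."""
    shift, remainder = divmod(hist.origin - new_origin, hist.bin_width)
    if remainder != 0:
        raise ValueError("origins do not differ by a whole number of bins")
    shifted = {idx + shift: count for idx, count in hist.bins.items()}
    return SparseHist(new_origin, hist.bin_width, shifted)


def add(a: SparseHist, b: SparseHist) -> SparseHist:
    """Add two histograms; both must share origin and bin width."""
    if (a.origin, a.bin_width) != (b.origin, b.bin_width):
        raise ValueError("histograms have different origin or bin width")
    merged = dict(a.bins)
    for idx, count in b.bins.items():
        merged[idx] = merged.get(idx, 0) + count
    return SparseHist(a.origin, a.bin_width, merged)


day1 = SparseHist(origin=0, bin_width=10)
for value in (5, 15, 15):
    day1.fill(value)

day2 = SparseHist(origin=100, bin_width=10)
for value in (105, 125):
    day2.fill(value)

# add(day1, day2) would raise ValueError here because the origins differ.
total = add(day1, rebase(day2, new_origin=0))
print(total.bins)
```

Creating both histograms with the same binning from the start, as in the two steps above, avoids the need for any such re-indexing.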