ing-bank / popmon

Monitor the stability of a Pandas or Spark dataframe ⚙︎
https://popmon.readthedocs.io/
MIT License

[Question] Hourly data pipelines #267

Closed qwert666 closed 1 year ago

qwert666 commented 1 year ago

Hi

I have more of a question about using the library, as all the examples build histograms from data that spans more than the defined time_width.

My setup is a complex project with many factors that can influence the metrics I want to keep an eye on. I have a data pipeline that processes the data on an hourly basis, which means the data I have access to always covers exactly one hour. I was thinking of building separate historical histograms for each hour of the day, as I want to compare apples to apples and eliminate additional noise: I have many seasonalities (within a day, week, month, etc.).

An example project could be users on a website, tracking their page views and generated revenue, where I want to detect major shifts in page views early.

In this use case, from my understanding, reference_type should be "external" and time_width would be 1h, so the reports/metrics would always contain just one hour. But then how would stitch_histograms work? The replace functionality would not work, right? And if I wanted to control the size of the stitched histograms, I would need to cap it in a different way 🤔 Does this hourly setup make sense in popmon?

Best

mbaak commented 1 year ago

I think I understand what you want: a rolling reference that is - say - a year behind the current time. Have you tried the rolling reference functionality? You can specify the window size and how far it should lag behind the current time. See: https://popmon.readthedocs.io/en/latest/reference_types.html#rolling-reference (Or let me know if you mean something else.)
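To make the window size and lag concrete, here is a minimal sketch in plain Python. It is not popmon's implementation: the function name and its parameters are hypothetical, and it assumes the reference ends `shift` slots before the current one and spans `window` consecutive slots.

```python
def rolling_reference_slots(current, window, shift):
    """Return the indices of the time slots that form the reference for `current`.

    Assumption (illustrative only): the reference ends `shift` slots before
    `current` and spans `window` consecutive slots. Slots before index 0
    do not exist and are dropped.
    """
    if window < 1 or shift < 0:
        raise ValueError("window must be >= 1 and shift must be >= 0")
    end = current - shift      # last slot included in the reference
    start = end - window + 1   # first slot included in the reference
    return [i for i in range(start, end + 1) if i >= 0]


# Slot 10 compared against the three slots just before it.
print(rolling_reference_slots(current=10, window=3, shift=1))
```

Early slots get a shorter (or empty) reference, because there is not yet enough history behind them.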

qwert666 commented 1 year ago

A rolling reference would require access to historical data when building the report, and that's not something I can do: I have a lot of data and want to fully utilize the histograms.

What I was thinking of was something like:

Calculate one-time histograms per hour:

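import pandas as pd
import popmon  # importing popmon makes pm_stability_report available on DataFrames
from popmon import Settings, get_bin_specs, make_histograms, stitch_histograms

hourly_histograms = {}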
bin_specs = {}

for hour in range(0, 24):
    pdf_hour = historical_pdf[historical_pdf.hour == hour]
    histogram = make_histograms(pdf_hour, features=features, time_axis="datetime", time_width="1h", time_offset="2023-04-27")
    hourly_histograms[hour] = histogram
    bin_specs[hour] = get_bin_specs(histogram)

Then, when the pipeline is running, I only process one hour of data (the most recent):

last_hour_pdf = pd.DataFrame(...) # containing my new data
last_hour_histogram = make_histograms(last_hour_pdf, features=features, time_axis="datetime", bin_specs=bin_specs[current_hour])

btw. can last_hour_histogram be used directly for reports/metrics?

settings = Settings(time_axis="datetime", reference_type="external")
settings.report.extended_report = True

report = last_hour_pdf.pm_stability_report(
    settings=settings,
    reference=hourly_histograms[current_hour]
)
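To illustrate why an external reference only needs the binned counts and not the raw rows, here is a minimal sketch in plain Python. It is not one of popmon's comparison functions: the function name is hypothetical, and the total variation distance is just a stand-in for whatever comparison the report actually computes.

```python
def total_variation_distance(ref_counts, new_counts):
    """Distance between two histograms given as {bin_label: count} dicts.

    Both histograms are normalised to fractions first, so a reference built
    from many hours can be compared with a single new hour.
    Returns 0.0 for identical shapes and 1.0 for fully disjoint ones.
    """
    ref_total = sum(ref_counts.values())
    new_total = sum(new_counts.values())
    if ref_total == 0 or new_total == 0:
        raise ValueError("both histograms need at least one entry")
    bins = set(ref_counts) | set(new_counts)
    return 0.5 * sum(
        abs(ref_counts.get(b, 0) / ref_total - new_counts.get(b, 0) / new_total)
        for b in bins
    )


# Hypothetical page-view buckets: reference for one hour of day vs. the newest hour.
reference = {"0-10": 400, "10-50": 500, "50+": 100}
last_hour = {"0-10": 20, "10-50": 60, "50+": 20}
print(total_variation_distance(reference, last_hour))
```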

Then I use stitch_histograms for the next day's comparison:

combined_hist = stitch_histograms(
    hists_basis=hourly_histograms[current_hour], hists_delta=last_hour_histogram
)
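One way to cap the size of the stitched reference would be to keep only the most recent N histograms per hour of day and sum them on demand. The sketch below is plain Python and does not use popmon: the class name, `max_slots`, and the dict-of-counts histogram format are all assumptions made for illustration.

```python
from collections import Counter, deque


class HourlyReference:
    """Keeps at most `max_slots` recent histograms for each hour of the day."""

    def __init__(self, max_slots):
        self.max_slots = max_slots
        # One bounded queue per hour; the oldest histogram is dropped automatically.
        self._slots = {hour: deque(maxlen=max_slots) for hour in range(24)}

    def add(self, hour, counts):
        """Store the histogram ({bin_label: count}) of the newest data for `hour`."""
        self._slots[hour].append(Counter(counts))

    def reference(self, hour):
        """Sum the stored histograms for `hour` bin by bin."""
        total = Counter()
        for counts in self._slots[hour]:
            total.update(counts)  # Counter.update adds counts, it does not replace them
        return total


ref = HourlyReference(max_slots=2)
ref.add(9, {"low": 1})
ref.add(9, {"low": 2, "high": 1})
ref.add(9, {"high": 5})  # pushes out the first histogram for hour 9
print(ref.reference(9))
```

With this layout the reference for each hour never grows beyond `max_slots` time slots, regardless of how long the pipeline runs.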

It all works; I'm just not sure whether this setup makes sense in popmon or whether it can be done differently.

qwert666 commented 1 year ago

I'll move this to the discussion space