great-expectations / great_expectations

Always know what to expect from your data.
https://docs.greatexpectations.io/
Apache License 2.0

Expectations for time series data #1183

Closed seeM closed 4 years ago

seeM commented 4 years ago

This project looks really cool, but since we work with time series data, it'd only cover about half of the checks we'd like to be able to do. Here are some rough ideas for possible improvements. I'm happy to try contributing here as well.

Expectations on datetime columns themselves

my_filehandle.expect_column_datetime_values_to_have_frequency(
    "datetime_column_name",
    freq="30s",
)

Not sure how else to namespace or specify datetime columns as distinct from other columns. Maybe expect_datetime_column?

It may also be preferable to use freq as shorthand for frequency, given that the function name is already quite long and freq is widely used in pandas. I noticed that longer, more descriptive variable names are preferred here, though.

freq can be specified using the existing pandas Offset Aliases.
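
(For concreteness, here's a rough pandas-only sketch of how such a check could be implemented; the helper name and result shape are placeholders, not a real Great Expectations API.)

import pandas as pd

def check_column_datetime_frequency(df, column, freq):
    # Hypothetical sketch: infer the column's frequency with pandas and
    # compare it against the requested offset alias (e.g. "30s", "1H").
    index = pd.DatetimeIndex(pd.to_datetime(df[column]).sort_values())
    inferred = pd.infer_freq(index)
    expected = pd.tseries.frequencies.to_offset(freq)
    success = inferred is not None and pd.tseries.frequencies.to_offset(inferred) == expected
    return {"success": success, "result": {"observed_freq": inferred}}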

We could also have a variety of checks on ordering:

my_filehandle.expect_column_datetime_values_to_be_increasing("datetime_column_name")

# This is possibly unnecessary, since you could check whether it's both increasing and unique.
my_filehandle.expect_column_datetime_values_to_be_strictly_increasing("datetime_column_name")

my_filehandle.expect_column_datetime_values_to_be_decreasing("datetime_column_name")
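
(Under the hood, these could lean on pandas' built-in monotonicity checks; a minimal sketch, with a hypothetical helper name:)

import pandas as pd

def datetime_values_increasing(series, strictly=False):
    # Hypothetical helper: pandas exposes monotonicity and uniqueness directly.
    values = pd.to_datetime(pd.Series(series))
    if strictly:
        return bool(values.is_monotonic_increasing and values.is_unique)
    return bool(values.is_monotonic_increasing)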

Expectations on a value column with datetime indexes

my_filehandle.expect_column_datetime_values_to_be_stationary(
    datetime_column="datetime_column_name",
    value_column="value_column_name",
    statsmodels_test_name="kpss",
    statsmodels_test_args={},
)

where statsmodels_test_name is from statsmodels.tsa.stattools.
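
(As a rough illustration, a pandas/statsmodels sketch of the stationarity check; the helper name, result shape, and 0.05 significance threshold are all assumptions, not an actual API:)

from statsmodels.tsa.stattools import kpss

def check_value_column_stationary(df, datetime_column, value_column, **test_kwargs):
    # Hypothetical sketch: order by the datetime column, then run the KPSS test.
    # KPSS's null hypothesis is stationarity, so a large p-value is consistent
    # with the expectation; 0.05 is an assumed significance threshold.
    values = df.sort_values(datetime_column)[value_column]
    statistic, p_value, lags, _ = kpss(values, **test_kwargs)
    return {"success": p_value > 0.05, "result": {"p_value": p_value}}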

I can't think of a good naming convention to separate when we're talking about expectations on an actual datetime column versus another column that's assumed to have a datetime index. Any ideas?

Or it could be more general, something like

my_filehandle.expect_column_datetime_values_to_pass_test(
    datetime_column="datetime_column_name",
    value_column="value_column_name",
    statsmodels_test_name="kpss",
    statsmodels_test_args={},
)

Expectations on multiple value columns with datetime indexes

my_filehandle.expect_column_datetime_pair_values_to_have_crosscorrelation(
    datetime_column="datetime_column_name",
    value_column_A="value_column_A_name",
    value_column_B="value_column_B_name",
    correlation_coefficient=0.3,
)
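
(A pandas-only sketch of what could back this; the helper name and the lag parameter are illustrative assumptions:)

import pandas as pd

def pair_crosscorrelation(df, datetime_column, value_column_a, value_column_b, lag=0):
    # Hypothetical sketch: order both value columns by the datetime column,
    # shift one series by `lag` periods, and compute the Pearson correlation.
    ordered = df.sort_values(datetime_column)
    return ordered[value_column_a].corr(ordered[value_column_b].shift(lag))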

Again, any ideas on naming conventions here would be great.

jcampbell commented 4 years ago

@seeM : Thank you so much for the thought on this. I think it's a great idea, and I love the idea of adding support for datetime here. It's something that several folks including @eugmandel @abegong have been thinking about, and we'd love to support you in building it out.

Some reflections on specifics: (1) I like expect_datetime_column_values_to_XXXXXX. I'd prefer "period" or "periodicity" instead of frequency for the "30s"-style option, as a precision measure. I think all the examples you gave are absolutely perfect.

(2) For a value column with a datetime index: in that case, I think it makes sense to allow the user to declare a time series, either at the dataset level or in the expectation. Also, in other places we've put the test name into the expectation, so I think it'd be worth considering that as an option. For example, we might say something like expect_timeseries_kpss_test_p_value_to_be_less_than(column_name, timeseries_index_column_name, p_value, regression, nlags). That approach keeps our preference for more explicit names. It also limits the total number of tests that non-pandas backends would have to implement (so that we're less tied to statsmodels in cases where we can't bring all the data back).

I think the same broad naming conventions should apply in the last case.

Two final things:

(1) I am SO EXCITED about your moving this forward and really eager to ensure we help and support you in building out the feature.

(2) Have you given any thought to cases wherein there is some raw data (e.g. an event stream) that is aperiodic, but for which we'd like to bin into either a window or fixed interval and then apply an analysis? While it is not at all necessary to fold that thinking into this to make for an amazing PR, I'd love ideas on a good API for handling that common case as well.

seeM commented 4 years ago

@jcampbell Great to hear, and thanks for the feedback! 😄

  1. I agree that period better suits the "30s" style. freq was meant to be consistent with pandas, but I guess we don't have to be coupled to pandas.

  2. I like the expect_timeseries_ naming scheme. I think we can declare time series at the expectation level (i.e., in the expectation name), given that you'd still want access to all of the other expectations in the existing datasets, and that the time series expectations could work across different types of datasets.

I agree with putting test names in expectations.

  3. Re aperiodic data: I would approach this as the time series extension to group-by tests. For example, if we want to test that there are between 5 and 8 events per one-minute interval, I would do something like:
events_per_minute = df.resample("1min").size()
# abusing notation -- this code won't actually work
expect_column_values_to_be_between(events_per_minute, 5, 8)
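
(For illustration, a runnable pandas-only version of the same idea, outside of any Great Expectations API; the helper name is hypothetical:)

import pandas as pd

def events_per_interval_in_range(df, datetime_column, low, high, interval="1min"):
    # Count events per time bin and check that every bin's count is within [low, high].
    counts = df.set_index(datetime_column).resample(interval).size()
    return bool(counts.between(low, high).all())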

In which case, I might agree with @abegong's comment:

We've avoided including transformations (including group-bys) in expectations, because the surface area of possible transformations is enormous. This might be a good avenue for implementing something along these lines: #294

So I think the API should be consistent with whatever the decided groupby API becomes. Is the current direction to do per-expectation pre-processing (#294)?

BTW, I think "transformation" may be a better name. Data pre-processing to me refers to the initial preparation step going from a more raw form of data to a more workable form, involving many transformations along the way.

On the other hand, every expectation is itself a transformation followed by a test (or a condition). For example, expect_column_values_to_be_between optionally parses strings to datetimes if parse_strings_as_datetimes == True. This makes the line difficult to draw, and how would we ensure it's drawn consistently? From the user's perspective, incorporating transformations into expectations adds convenience; from a developer's perspective, it's hard to generalize across them all. My main point is that it may be worth considering explicit APIs for common transformation-test pairs like the group-by or resample examples above. I'm not yet convinced by the quote above.

jcampbell commented 4 years ago

Sorry for the protracted delay here. World got a bit crazy. :)

I think the bottom line is that we would love to move forward with this design. I don't think we should gate on any deeper changes, and in particular don't think we should wrap preprocessing/transformation (I agree with your note that transformation is a better term) into this work.

Is there anything I can do to help you move forward?

seeM commented 4 years ago

Sorry as well, things have been pretty crazy indeed.

Unfortunately, I don't know if I'll have the time to work on this...

If anyone else is interested, please feel free to get started.

github-actions[bot] commented 4 years ago

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

cwdjankoski commented 1 year ago

@seeM Hello, and sorry to revive this old issue thread, but I was wondering what happened to this nice idea?
Is there a way to do such checks in general with Great Expectations (or with a non-core package)?
I would be curious to know, and I appreciate your time and thoughts. Thank you.

seeM commented 1 year ago

Hi @cwdjankoski, apologies, this was quite a while ago and I don't remember what my plan was at the time. Hope you find what you're looking for!

abegong commented 1 year ago

Fun coincidence: just last weekend, I dusted off some old work related to this issue. It was based on Facebook's Prophet model, so a non-core package would almost certainly be the way to go.

An ARIMA version with lighter dependencies might also be interesting. That'd get closer to the statsmodels approach that @seeM originally proposed.

From a usability perspective, I think the most interesting/challenging question is how to define Expectations to help with explainability. Most people I've talked to don't want expect_time_series_column_values_to_match_black_box_model. They want things like expect_time_series_column_values_to_fit_trend_line, expect_time_series_column_values_to_match_weekly_seasonality.
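
(As a very rough sketch of what an explainable expectation like expect_time_series_column_values_to_fit_trend_line might do internally, assuming a simple least-squares trend line and a user-chosen error threshold; the helper name and threshold are illustrative:)

import numpy as np
import pandas as pd

def values_fit_trend_line(series, max_rmse):
    # Hypothetical sketch: fit a first-degree polynomial (a trend line) and
    # require the root-mean-square residual to stay below the given threshold.
    y = pd.Series(series).to_numpy(dtype=float)
    x = np.arange(len(y))
    slope, intercept = np.polyfit(x, y, deg=1)
    rmse = float(np.sqrt(np.mean((y - (slope * x + intercept)) ** 2)))
    return rmse <= max_rmse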

@cwdjankoski, would you be up for a call in the next few weeks to talk through ideas?

cwdjankoski commented 1 year ago

@seeM OK, no worries. It's been a while indeed since you started the thread, and I can imagine this all feels completely new again now 🙂

@abegong hehe, what a nice coincidence ⭐ I think your proposals based on Prophet / ARIMA are super interesting and probably much more powerful in general. However, my initial curiosity that brought me here was around much simpler checks, à la what @seeM mentioned:

expect_column_values_to_be_between(events_per_minute, 5, 8)

I think many people would already consider this massively useful, as they already have a certain understanding of, and expectations 😉 about, their event streams, for example. The use case that had me looking for a fitting expectation was something like: we have lots of business event streams coming into our data lake, and we would like to make sure that each of them has between X and Y events within a certain time period (where X and Y are values we know or compute from our data). I thought I'd explain a bit more in case you are interested.

I would love to be helpful in any way I can. However, you should know that I have just started exploring Great Expectations (amazing tool! thank you and your team for all this great work) and can probably only contribute "conceptually" by thinking along, not so much by writing code for now, but I hope I can get to that level too. I am part of the Slack community, so I can send you a message there if you are still interested in setting up a call.

Thanks to you both for your replies, thoughts, and time.

abegong commented 1 year ago

@cwdjankoski, I'd love to set up a call. DM me in Slack?