Closed: @seeM closed this issue 4 years ago.
@seeM : Thank you so much for the thought on this. I think it's a great idea, and I love the idea of adding support for datetime here. It's something that several folks including @eugmandel @abegong have been thinking about, and we'd love to support you in building it out.
Some reflections on specifics:
(1) I like `expect_datetime_column_values_to_XXXXXX`. I'd prefer "period" or "periodicity" instead of "frequency" for the "30s"-style option as a precision measure. I think the examples you gave are absolutely perfect.
(2) For `value` columns with a `datetime` index: in that case, I think it makes sense to allow the user to declare a time series, either at the dataset level or in the expectation. Also, in other places we've put the test name into the expectation, so I think that'd be worth considering as an option. For example, we might say something like `expect_timeseries_kpss_test_p_value_to_be_less_than(column_name, timeseries_index_column_name, p_value, regression, nlags)`.
That approach keeps our preference for more explicit names. It also limits the exposure to the total number of required tests that non-pandas backends would have to implement (so that we're less totally tied to statsmodels in cases where we can't bring all data back).
I think the same broad naming conventions should apply in the last case.
Two final things:
(1) I am SO EXCITED about your moving this forward and really eager to ensure we help and support you in building out the feature.
(2) Have you given any thought to cases wherein there is some raw data (e.g. an event stream) that is aperiodic, but for which we'd like to bin into either a window or fixed interval and then apply an analysis? While it is not at all necessary to fold that thinking into this to make for an amazing PR, I'd love ideas on a good API for handling that common case as well.
@jcampbell Great to hear, and thanks for the feedback! 😄
I agree that `period` better suits the "30s" style. `freq` was chosen to be consistent with pandas, but I guess we don't have to be coupled to pandas.
I like the `expect_timeseries_` naming scheme. I think we can declare time series at the expectation level (i.e., in the expectation name), given that you'd still want access to all of the other expectations in the existing datasets, and that the time series expectations could work across different types of datasets.
I agree with putting test names in expectations.
On `groupby` tests: for example, if we want to test that there should be between 5 and 8 events per one-minute interval, I would do something like:

```python
events_per_minute = df.resample("1min").size()
# abusing notation -- this code won't actually work
expect_column_values_to_be_between(events_per_minute, 5, 8)
```
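For reference, a runnable version of that sketch using plain pandas, where a final `between` check stands in for the hypothetical expectation:

```python
import pandas as pd

# Toy event stream: 6 events in the first minute, 7 in the second.
times = (["2024-01-01 00:00:%02d" % s for s in range(0, 60, 10)]
         + ["2024-01-01 00:01:%02d" % s for s in range(0, 56, 8)])
df = pd.DataFrame({"event": 1}, index=pd.to_datetime(times))

# Bin the aperiodic events into fixed one-minute intervals and count them.
events_per_minute = df.resample("1min").size()
print(events_per_minute.tolist())  # [6, 7]

# Stand-in for expect_column_values_to_be_between(events_per_minute, 5, 8):
assert events_per_minute.between(5, 8).all()
```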
In which case, I might agree with @abegong's comment:
We've avoided including transformations (including group-bys) in expectations, because the surface area of possible transformations is enormous. This might be a good avenue for implementing something along these lines: #294
So I think the API should be consistent with whatever the decided `groupby` API becomes. Is the current direction to do per-expectation pre-processing (#294)?
BTW, I think "transformation" may be a better name. Data pre-processing to me refers to the initial preparation step going from a more raw form of data to a more workable form, involving many transformations along the way.
On the other hand, every expectation is itself a transformation followed by a test (or a condition). For example, `expect_column_values_to_be_between` optionally parses strings to datetimes if `parse_strings_as_datetimes == True`. This makes it a difficult line to draw – and how would we ensure it's drawn consistently? From the user's perspective, incorporating transformations into expectations seems to add convenience. But from a developer's perspective, it's hard to generalize across them all. My main point here is that it may be worth considering explicit APIs for common transformation-test pairs like the groupby or the resample. I'm not yet convinced by the quote above.
Sorry for the protracted delay here. World got a bit crazy. :)
I think the bottom line is that we would love to move forward with this design. I don't think we should gate on any deeper changes, and in particular don't think we should wrap preprocessing/transformation (I agree with your note that transformation is a better term) into this work.
Is there anything I can do to help you move forward?
Sorry as well, things have been pretty crazy indeed.
Unfortunately, I don't know if I'll have the time to work on this...
If anyone else is interested, please feel free to get started.
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
@seeM Hello, and sorry to revive this old issue thread - but I was wondering what happened to this nice idea?
Is there a way to do such checks in general with Great Expectations (or with a non-core package)?
Would be curious to know, and I appreciate your time and thoughts - thank you.
Hi @cwdjankoski, apologies, this was quite a while ago and I don't remember what my plan was at the time. Hope you find what you're looking for!
Fun coincidence: just last weekend, I dusted off some old work related to this issue. It was based on Facebook's Prophet model, so a non-core package would almost certainly be the way to go.
An ARIMA version with lighter dependencies might also be interesting. That'd get closer to the `statsmodels` approach that @seeM originally proposed.
From a usability perspective, I think the most interesting/challenging question is how to define Expectations to help with explainability. Most people I've talked to don't want `expect_time_series_column_values_to_match_black_box_model`. They want things like `expect_time_series_column_values_to_fit_trend_line` or `expect_time_series_column_values_to_match_weekly_seasonality`.
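As one illustration of how such an explainable check could be operationalized, here is a sketch of `expect_time_series_column_values_to_fit_trend_line`. The expectation name is hypothetical, and numpy's `polyfit` stands in for whatever trend model a real implementation would use:

```python
import numpy as np

# Sketch of a hypothetical expect_time_series_column_values_to_fit_trend_line:
# fit a linear trend and check the fraction of variance it explains.
values = np.array([10.2, 11.1, 11.9, 13.0, 14.1, 14.8, 16.1, 17.0])
t = np.arange(len(values))

slope, intercept = np.polyfit(t, values, deg=1)
fitted = slope * t + intercept
r_squared = 1 - ((values - fitted) ** 2).sum() / ((values - values.mean()) ** 2).sum()

# "Fits a trend line" operationalized as R^2 above a user-supplied threshold.
assert r_squared > 0.95
```

An advantage of this framing is that the failure message can report the fitted slope and R², which is far easier to act on than "the black-box model flagged an anomaly".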
@cwdjankoski, would you be up for a call in the next few weeks to talk through ideas?
@seeM ok no worries, it's been a while indeed since you started the thread, and I can imagine this comes as completely new now 🙂
@abegong hehe what a nice coincidence ⭐ I think your proposals based on Prophet / ARIMA are super interesting and probably much more powerful in general - however, at least the initial curiosity that brought me here was around much simpler checks, à la what @seeM mentioned:

```python
expect_column_values_to_be_between(events_per_minute, 5, 8)
```
In my opinion, many people would already consider this massively useful, as they already have a certain understanding of (and expectations 😉 for) their event streams. The use case for which I was looking for a fitting expectation is something like: we have lots of business event streams coming into our data lake, and we would like to make sure that each of them has between X and Y events within a certain time period (where X and Y are values we know or compute from our data). I thought I'd explain a bit more in case you are interested.
I would love to be helpful in any way I can - however, you should know that I have just started exploring Great Expectations (amazing tool! thank you and your team for all this great work) and can probably only contribute "conceptually", thinking along rather than writing code for now - but I hope I can get to that level too. I am part of the Slack community, so I can send you a message there if you are still interested in setting up a call.
Thanks both for your replies, thoughts, and time.
@cwdjankoski, I'd love to set up a call. DM me in Slack?
This project looks really cool, but since we work with time series data, it'd only cover about half of the checks we'd like to be able to do. Here are some rough ideas for possible improvements. I'm happy to try contributing here as well.

**Expectations on `datetime` columns themselves**

Not sure how else to namespace / specify `datetime` columns as distinct from other columns. Maybe `expect_datetime_column`?

Also, it may be preferable to use `freq` as shorthand for `frequency`, given that it's such a long function name already, and `freq` is widely used in pandas. I noticed that longer and more descriptive variable names are preferred here, though. `freq` can be specified using the existing pandas Offset Aliases.

We could also have a variety of checks on ordering:

**Expectations on a `value` column with `datetime` indexes**

where `statsmodels_test_name` is from `statsmodels.tsa.stattools`.

I can't think of a good naming convention to separate when we're talking about expectations on an actual `datetime` column versus another column that's assumed to have a `datetime` index. Any ideas?

Or it could be more general, something like:

**Expectations on multiple `value` columns with `datetime` indexes**

Again, any ideas on naming conventions here would be great.