unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Pandera strategy re-write: improve base implementation and add API for custom strategies and global schema-level override strategy #561

Open cosmicBboy opened 2 years ago

cosmicBboy commented 2 years ago

Is your feature request related to a problem? Please describe.

Currently, strategies are limited by the `hypothesis.extra.pandas` convention for defining a dataframe: the strategies used to generate data values operate at the element level. This makes it hard to create strategies for a whole column, or strategies that model dependencies between columns.

For previous context on the problem with strategies, see #1605, #1220, #1275.
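For illustration, this is roughly what the current convention looks like (a sketch using `hypothesis.extra.pandas` directly; the column names and bounds are made up):

```python
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames

# Each column is defined by an `elements` strategy that draws ONE cell
# at a time, so there is no natural hook for a whole-column strategy
# or for dependencies between columns.
df_strategy = data_frames(
    columns=[
        column("x", elements=st.integers(min_value=0, max_value=10)),
        column("y", elements=st.floats(allow_nan=False)),
    ]
)

df = df_strategy.example()  # every cell is drawn independently
```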

Describe the solution you'd like

We need a re-write! 🔥

As described in #1605, the requirements for a pandera pandas strategy rewrite are:

More context on the current state

At a high level, this is how pandera currently translates a schema to a hypothesis strategy:

ghost commented 2 years ago

Following up on the discussion in https://github.com/pandera-dev/pandera/discussions/648.

There are often use cases where it would be useful to override the base strategy for a column. The following hypothesis strategies clearly express the shape of data, but cannot be easily represented using the pandera check API.

import string

from hypothesis import strategies as st

# uuids
st.uuids().map(str)

# dictionaries
st.fixed_dictionaries(
    {
        "symbol": st.text(alphabet=string.ascii_uppercase),
        "cusip": st.text(alphabet=string.ascii_uppercase + string.digits),
    },
)

A workaround described in https://github.com/pandera-dev/pandera/discussions/648 uses custom check methods to store a strategy override for later use, then accesses it during strategy generation in a subclass of pandera.DataFrameSchema. This approach does not support column checks, since the entire column strategy is replaced by the strategy specified in the field.

As suggested by @cosmicBboy in https://github.com/pandera-dev/pandera/discussions/648, first-class support for this use case could be provided by adding a strategy or base_strategy parameter to pandera.Field and passing this user-provided strategy to the field_element_strategy method. field_element_strategy would need to be updated to accept a base strategy, rather than always creating the base strategy from the column's dtype.

This would allow for the following schema specification, while still supporting additional checks on the column (unlike the workaround described above).

import pandera as pa
from hypothesis import strategies as st
from pandera import SchemaModel
from pandera.typing import Series

class Schema(SchemaModel):
    uuids: Series[object] = pa.Field(strategy=st.uuids().map(str))  # proposed `strategy` parameter
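Since the `strategy` parameter is only a proposal, the override strategy itself can still be sanity-checked standalone (a sketch, independent of pandera):

```python
import uuid

from hypothesis import strategies as st

# the override strategy that would be passed to pa.Field(strategy=...)
uuid_strategy = st.uuids().map(str)

# each drawn example is a canonical, parseable UUID string
value = uuid_strategy.example()
assert str(uuid.UUID(value)) == value
```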
francesco086 commented 1 year ago

Following from https://github.com/unionai-oss/pandera/discussions/1088

Perhaps not exactly what you had in mind but... a rather simple brute-force approach: create a strategy with hypothesis that generates the whole dataframe, and feed it into the schema as the one to use for generating examples.

What do you think of this @cosmicBboy ? It could be something relatively simple to implement (if it fits your design choices)...? If so, I volunteer to create a PR for this.
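For concreteness, a minimal sketch of that brute-force idea (column names and bounds are made up; the open question is how a schema would accept such a strategy):

```python
import pandas as pd
from hypothesis import strategies as st

# draw a length first, then build every column at exactly that length,
# so the whole dataframe comes out of a single strategy
whole_df_strategy = st.integers(min_value=1, max_value=20).flatmap(
    lambda n: st.builds(
        pd.DataFrame,
        st.fixed_dictionaries(
            {
                "id": st.just(list(range(n))),
                "value": st.lists(st.floats(allow_nan=False), min_size=n, max_size=n),
            }
        ),
    )
)
```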

cosmicBboy commented 1 month ago

@NowanIlfideme I re-wrote this issue to encapsulate a broader re-write of the pandera pandas strategy module. Please chime in here with your thoughts on how this might work!

NowanIlfideme commented 1 month ago

Hi, this turned into quite a big comment, so I added sections. I should also note that I am quite new to Hypothesis specifically, though not to data generation in general. I see several cases that are very relevant to my day-to-day work that would be great to support in Pandera; they would let me do API contract testing with "I need this data schema as input" and generate more complex data from that schema.

Columns with dependencies within the column

The example in #1605 was for generating time series data. Here I would want to create timestamps with a particular frequency, such as pd.date_range(freq="D", start=SAMPLED_DATE, periods=DF_LENGTH), where DF_LENGTH is the total number of elements, and SAMPLED_DATE can be sampled from Hypothesis. This is the most common case in time series analysis/forecasting.

Another example would be generating monotonically increasing, but not necessarily contiguous, IDs. You need to know the length of the series, and you can generate the series with np.cumsum(RAND_INTS_GE_1) or some other method. Creating non-uniform time series is essentially the same - create a random series of time differences, then use a cumulative sum with a starting timestamp.
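The ID case can be sketched directly in hypothesis (the step-size bounds are arbitrary here):

```python
import numpy as np
from hypothesis import strategies as st

# draw positive step sizes, then cumulative-sum them: the result is
# monotonically increasing but not necessarily contiguous
ids_strategy = st.lists(
    st.integers(min_value=1, max_value=5), min_size=1, max_size=20
).map(np.cumsum)
```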

In terms of user API, the strategy itself would need to be a "vectorized" one, rather than per-element. And the dataframe size would need to be known ahead of time. Perhaps a pa.Field(strategy=..., column_strategy=...) with only one keyword argument allowed?

from datetime import date
from typing import Optional

import pandas as pd
import pandera as pa
from hypothesis import strategies as st

def _make_freq_series(start_date: date, periods: int, freq: str) -> pd.Series:
    return pd.Series(pd.date_range(start=start_date, freq=freq, periods=periods))

def freq_strategy(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,  # would you even support a base strategy?
    *,
    size: int,
    freq: str,
) -> st.SearchStrategy:  # creates a series of length `size` instead of a single element
    date_gen = st.dates(min_value=date(1800, 1, 1), max_value=date(3080, 1, 1))
    return st.builds(_make_freq_series, start_date=date_gen, periods=st.just(size), freq=st.just(freq))

One major potential issue with this is that only Pandas has a real defined ordering of elements. Other dataframes can be generated in Pandas and then converted, but that isn't useful for things like performance testing. (Though, I guess that use case is limited enough that custom generators could be made...)

Columns that depend on other columns

A totally different issue is generating columns that depend on the values of other columns. That would be valuable for all sorts of use cases. For example, hierarchical relationships can be generated this way (if the parent value is "A", you can create "A1" - "A9"; for "B" you create "B1" - "B5"; etc.).

This would be difficult to implement as a column-based strategy, since (as far as I understand) Pandera doesn't support cross-column checks except as entire dataframes. So, one way to "fix" this would be to use a custom entire-dataframe strategy; however, that means you lose out on generating the other columns using Pandera.

From the user API, you could consider pa.Field(column_strategy=func, column_strategy_depends_on=["a", "b"]) with func taking the values of the other columns. You'd need to do a bit of DAG resolution to check satisfiability.
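The DAG resolution itself is cheap with the standard library, e.g. (with a hypothetical `depends_on` mapping built from the `column_strategy_depends_on` declarations):

```python
from graphlib import TopologicalSorter

# hypothetical column -> column_strategy_depends_on mapping
depends_on = {"c": ["a", "b"], "b": ["a"], "a": []}

# raises CycleError if the dependencies are unsatisfiable; otherwise
# yields a generation order in which dependencies come first
generation_order = list(TopologicalSorter(depends_on).static_order())
```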

from typing import Optional

import pandas as pd
import pandera as pa
from hypothesis import strategies as st

def cond_strategy(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,  # would you even support a base strategy?
    *,
    base_df: pd.DataFrame,
) -> st.SearchStrategy:  # creates a series of length len(base_df) instead of a single element
    def inner(func):  # not quite sure how to make this
        return base_df.apply(func, axis="columns")
    return st.builds(inner, func=st.sampled_from(["sum", "mean", "median"]))

It's not entirely clear to me how to actually use Hypothesis to generate different elements for every column, though.
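One way I could imagine doing it in plain hypothesis is with `flatmap`: draw the base column first, then build the dependent column from its realized values (a sketch, not a proposed pandera API):

```python
import pandas as pd
from hypothesis import strategies as st

# draw the parent column first
parent_rows = st.lists(st.sampled_from(["A", "B"]), min_size=1, max_size=10)

def add_child_column(parents):
    # child ranges depend on the parent value: "A" -> A1..A9, "B" -> B1..B5
    child_strategies = [
        st.integers(min_value=1, max_value=9 if p == "A" else 5).map(
            lambda i, p=p: f"{p}{i}"
        )
        for p in parents
    ]
    return st.tuples(*child_strategies).map(
        lambda children: pd.DataFrame({"parent": parents, "child": list(children)})
    )

df_strategy = parent_rows.flatmap(add_child_column)
```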

Generating from existing dataframes (e.g. for grouped dataframes)

Another use case that is very common for me is generating dataframes that are grouped somehow. For example, I have a composite primary key that consists of IDs and timestamps, and I want an outer join of these dataframes. Here, I guess the best approach would be to just generate the individual dataframes and merge them. However, what if I also want other columns in the dataframe to be filled in?

Here, I think the example generation API could work to "complete" the example. Naming is tough, but example(base_df=df) or a separate example_from_base(df) would work quite well:

value_schema = ...  # the 'values' part of your schema
df_ids = id_schema.example()
df_timestamps = ts_schema.example()
df_index = pd.merge(df_ids, df_timestamps, how='cross')

full_schema = value_schema.add_columns(id_schema.columns).add_columns(ts_schema.columns)
df_all = full_schema.example(base_df=df_index)
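Since `example(base_df=...)` doesn't exist yet, the "complete the base" step can be sketched by hand, with an ordinary hypothesis strategy standing in for the value column (toy data; names are made up):

```python
import pandas as pd
from hypothesis import strategies as st

# build the composite index with a cross join
df_ids = pd.DataFrame({"id": [1, 2, 3]})
df_timestamps = pd.DataFrame({"ts": pd.date_range("2024-01-01", periods=2, freq="D")})
df_index = pd.merge(df_ids, df_timestamps, how="cross")  # 3 x 2 = 6 rows

# fill the remaining column(s) at exactly the index length
n = len(df_index)
values = st.lists(st.floats(allow_nan=False), min_size=n, max_size=n).example()
df_all = df_index.assign(value=values)
```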

I hope some of the above makes sense - even after going through it again it seems a bit ramble-y.