cosmicBboy opened 2 years ago
Following up on the discussion in https://github.com/pandera-dev/pandera/discussions/648.
There are often use cases where it would be useful to override the base strategy for a column. The following hypothesis strategies clearly express the shape of data, but cannot be easily represented using the pandera check API.
```python
import string
from hypothesis import strategies as st

# uuids
st.uuids().map(str)

# dictionaries
st.fixed_dictionaries(
    {
        'symbol': st.text(string.ascii_uppercase),
        'cusip': st.text(string.ascii_uppercase + string.digits),
    },
)
```
A workaround described in https://github.com/pandera-dev/pandera/discussions/648 uses custom check methods to store a strategy override for later use, and accesses it during strategy generation in a subclass of `pandera.DataFrameSchema`. This approach does not support column checks, as the entire column strategy is replaced by the strategy specified in the field.
As suggested by @cosmicBboy in https://github.com/pandera-dev/pandera/discussions/648, first-class support for this use case could be added via a `strategy` or `base_strategy` parameter on `pandera.Field`, with the user-provided strategy passed through to the `field_element_strategy` method. `field_element_strategy` would need to be updated to accept a base strategy, rather than always creating the base strategy by looking at the column's dtype.
This would allow for the following schema specification, while still supporting additional checks on the column (unlike the workaround described above).
```python
class Schema(SchemaModel):
    uuids: Series[object] = pa.Field(strategy=st.uuids().map(str))
```
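For reference, the overriding element strategy in that schema is plain hypothesis, so its shape can be sanity-checked on its own, independent of pandera (a quick sketch):

```python
import uuid
from hypothesis import strategies as st

# The proposed override: each drawn element is a stringified UUID.
uuid_strategy = st.uuids().map(str)

# Draw a concrete example to confirm the shape of the generated data.
value = uuid_strategy.example()
assert isinstance(value, str)
uuid.UUID(value)  # parses back to a UUID, so the format is valid
```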
Following from https://github.com/unionai-oss/pandera/discussions/1088
Perhaps not exactly what you had in mind, but here is a rather simple brute-force approach: create a strategy with hypothesis that generates the whole dataframe, and feed it into the schema as the one to use to generate examples.
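A minimal sketch of that brute-force idea, assuming plain `hypothesis.extra.pandas` rather than pandera's translation layer (the column names here are illustrative, not from the thread):

```python
import string
from hypothesis import strategies as st
from hypothesis.extra.pandas import column, data_frames

# One strategy generates the entire dataframe, bypassing pandera's
# per-element schema-to-strategy translation.
df_strategy = data_frames(
    columns=[
        column("symbol", elements=st.text(alphabet=string.ascii_uppercase, min_size=1)),
        column("uuid", elements=st.uuids().map(str)),
    ]
)

df = df_strategy.example()
assert list(df.columns) == ["symbol", "uuid"]
```

The trade-off is that pandera's checks no longer shape the generated values; the whole-dataframe strategy has to encode them itself.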
What do you think of this, @cosmicBboy? It could be relatively simple to implement (if it fits your design choices). If so, I volunteer to create a PR for this.
@NowanIlfideme I rewrote this issue to encapsulate a broader rewrite of the pandera pandas strategy module. Please chime in here with your thoughts on how this might work!
Hi, this turned into quite a big comment, so I added sections. I should also note that I am quite new at Hypothesis specifically, though not with data generation in general. I see several cases that are very relevant to my day-to-day work that would be great to support in Pandera; they would let me do API contract testing with "I need this data schema as input" and generate more complex data from that schema.
The example in #1605 was for generating time series data. Here I would want to create timestamps with a particular frequency, such as `pd.date_range(freq="D", start=SAMPLED_DATE, periods=DF_LENGTH)`, where `DF_LENGTH` is the total number of elements and `SAMPLED_DATE` can be sampled from Hypothesis. This is the most common case in time series analysis/forecasting.
Another example would be generating monotonically increasing, but not necessarily contiguous, IDs. You need to know the length of the series, and you can generate the series with `np.cumsum(RAND_INTS_GE_1)` or some other method. Creating non-uniform time series is essentially the same: create a random series of time differences, then take a cumulative sum from a starting timestamp.
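Both ideas can be sketched with plain numpy/pandas; in practice the length and the random draws would come from Hypothesis, and names like `DF_LENGTH` and `RAND_INTS_GE_1` below stand in for sampled values:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10  # DF_LENGTH, known ahead of time

# Monotonically increasing, non-contiguous IDs:
# cumulative sum of random positive integer gaps.
gaps = rng.integers(1, 5, size=n)  # RAND_INTS_GE_1
ids = np.cumsum(gaps)
assert (np.diff(ids) >= 1).all()  # strictly increasing

# Non-uniform time series: cumulative sum of random positive
# timedeltas, anchored at a sampled start timestamp.
start = pd.Timestamp("2020-01-01")  # SAMPLED_DATE
deltas = pd.Series(pd.to_timedelta(rng.integers(1, 60, size=n), unit="m"))
timestamps = start + deltas.cumsum()
assert timestamps.is_monotonic_increasing
```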
In terms of user API, the `strategy` itself would need to be a "vectorized" one, rather than per-element, and the dataframe size would need to be known ahead of time. Perhaps a `pa.Field(strategy=..., column_strategy=...)` with only one of the two keyword arguments allowed?
```python
def _make_freq_series(start_date: date, periods: int, freq: str) -> pd.Series:
    return pd.Series(pd.date_range(start=start_date, freq=freq, periods=periods))
```
```python
def freq_strategy(
    pandera_dtype: pa.DataType,
    size: int,  # must precede the defaulted parameter to be valid Python
    strategy: Optional[st.SearchStrategy] = None,  # would you even support a base strategy?
    *,
    freq: str,
) -> st.SearchStrategy:  # creates series/arrays/lists of length `size` instead of a single element
    date_gen = st.dates(min_value=date(1800, 1, 1), max_value=date(3080, 1, 1))
    # st.builds expects strategies for its arguments, so wrap fixed values in st.just
    return st.builds(_make_freq_series, start_date=date_gen, periods=st.just(size), freq=st.just(freq))
```
One major potential issue with this is that only Pandas has a well-defined ordering of elements. Dataframes for other libraries can be generated in Pandas and then converted, but that isn't useful for things like performance testing. (Though I guess that use case is limited enough that custom generators could be made...)
A totally different issue is generating columns that depend on the values of other columns. That would be valuable for all sorts of things. For example, hierarchical relationships can be generated like this: if the parent value is "A", you can create "A1" - "A9"; for "B" you create "B1" - "B5"; etc.
This would be difficult to implement as a column-based strategy, since (as far as I understand) Pandera doesn't support cross-column checks except as entire dataframes. So, one way to "fix" this would be to use a custom entire-dataframe strategy; however, that means you lose out on generating the other columns using Pandera.
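The hierarchical case can at least be expressed per-row with plain hypothesis via `flatmap` (a sketch only; the parent/child values are illustrative, and lifting this to whole pandera columns is exactly the open question above):

```python
from hypothesis import strategies as st

# Hierarchy: the parent letter constrains which child values are valid.
CHILDREN = {
    "A": [f"A{i}" for i in range(1, 10)],  # "A1" - "A9"
    "B": [f"B{i}" for i in range(1, 6)],   # "B1" - "B5"
}

# flatmap draws the parent first, then a child consistent with that
# parent, yielding dependent (parent, child) pairs.
pair_strategy = st.sampled_from(sorted(CHILDREN)).flatmap(
    lambda parent: st.tuples(st.just(parent), st.sampled_from(CHILDREN[parent]))
)

parent, child = pair_strategy.example()
assert child in CHILDREN[parent]
```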
From the user API, you could consider `pa.Field(column_strategy=func, column_strategy_depends_on=["a", "b"])`, with `func` taking the values of the other columns. You'd need to do a bit of DAG resolution to check satisfiability.
```python
def cond_strategy(
    pandera_dtype: pa.DataType,
    strategy: Optional[st.SearchStrategy] = None,  # would you even support a base strategy?
    *,
    base_df: pd.DataFrame,
) -> st.SearchStrategy:  # creates series/arrays/lists of length len(base_df) instead of a single element
    def inner(func):  # not quite sure how to make this
        return base_df.apply(func, axis='columns')
    return st.builds(inner, func=st.sampled_from(['sum', 'mean', 'median']))
```
It's not entirely clear to me how to actually use Hypothesis to generate different elements for every column, though.
Another use case that is very common for me is generating dataframes that are grouped somehow. For example, I have a composite primary key that consists of IDs and timestamps, and I want an outer join of these dataframes. Here, I guess the best case would be to just generate the individual dataframes and merge them. However, what if I also want other columns in the dataframe to be filled?
Here, I think the example generation API could work to "complete" the example. Naming is tough, but `example(base_df=df)` or a separate `example_from_base(df)` would work quite well:
```python
value_schema = ...  # the 'values' part of your schema

df_ids = id_schema.example()
df_timestamps = ts_schema.example()
df_index = pd.merge(df_ids, df_timestamps, how='cross')

full_schema = value_schema.add_columns(id_schema.columns).add_columns(ts_schema.columns)
df_all = full_schema.example(base_df=df_index)
```
I hope some of the above makes sense; even after going through it again it seems a bit rambly.
Is your feature request related to a problem? Please describe.
Currently, strategies are limited by the `hypothesis.extra.pandas` convention of how to define a dataframe: the strategies used to generate data values are defined at the element level. This makes it hard to create strategies for a whole column, or strategies that model dependencies between columns. For previous context on the problem with strategies, see #1605, #1220, #1275.
Describe the solution you'd like
We need a re-write! 🔥
As described in #1605, the requirements for a pandera pandas strategy rewrite are:
More context on the current state
At a high level, this is how pandera currently translates a schema to a hypothesis strategy:
1. For each `pa.Column` in the schema, collect the column's datatypes, elements, and other properties.
2. Take the column's dtype, properties (e.g. `unique`), and the first `check` in the list, and forward them to the hypothesis column. This creates an element strategy for a single value in that column.
3. For each remaining `Check` in the list, get its check stats (constraint values) and chain them onto the element strategy with `filter` (this really sucks, i.e. slows down performance).
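The cost of chained `filter` is visible even at the element level: filtering is rejection sampling, so hypothesis discards invalid draws, whereas pushing the constraint into the base strategy makes every draw valid (a toy comparison, not pandera code):

```python
from hypothesis import strategies as st

# Rejection sampling: draw any integer, then throw away every value
# <= 100. Wasted draws, and narrow predicates can exhaust retries.
filtered = st.integers().filter(lambda x: x > 100)

# Constraint encoded in the strategy itself: every draw is valid,
# with no rejection loop.
constrained = st.integers(min_value=101)

assert filtered.example() > 100
assert constrained.example() > 100
```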