unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Support Series generation with serial dependence #1605

Open NowanIlfideme opened 3 weeks ago

NowanIlfideme commented 3 weeks ago

Is your feature request related to a problem? Please describe.

I've been trying - and failing - to generate dataframes with a Pandera strategy that will create a date column with values from pd.date_range(). I can generate a series via hypothesis directly:

import hypothesis.strategies as hs
import pandas as pd

_freq_strat = hs.builds(
    _gen_daterange,
    base_date=hs.dates(
        min_value=pd.Timestamp("1980-01-01").date(),
        max_value=pd.Timestamp("2100-01-01").date(),
    ),
    freq=hs.sampled_from(["D", "W-SUN", "W-MON", "5D", "MS"]),
    periods=hs.integers(min_value=3, max_value=100),
)
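
The snippet above calls `_gen_daterange`, which the issue does not show. As a rough guess at what such a helper might look like, here is a stand-in named `gen_daterange` (the name, signature, and the choice to return a `pd.Series` are all assumptions, not the author's code):

```python
from datetime import date

import pandas as pd


def gen_daterange(base_date: date, freq: str, periods: int) -> pd.Series:
    """Hypothetical stand-in for `_gen_daterange` (not shown in the issue)."""
    # pd.date_range builds the whole sequence at once, so every value depends
    # on the one before it. That is the serial dependence the title refers to.
    index = pd.date_range(start=pd.Timestamp(base_date), periods=periods, freq=freq)
    return pd.Series(index, name="date")
```

If `base_date` does not fall on the requested frequency (for example a Wednesday with `W-MON`), `pd.date_range` rolls forward to the next valid date, so the first value may be later than `base_date`.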

However, I can't create a Pandera strategy. The best I could come up with is this:

import pandera as pa
import pandera.strategies as st

def freq_strategy(
    pandera_dtype: pa.DataType,
    strategy: st.SearchStrategy | None = None,
    *,
    freq: FreqLike,
) -> st.SearchStrategy:
    """Strategy for frequency."""
    if strategy is None:
        return st.pandas_dtype_strategy(
            pandera_dtype=pandera_dtype, strategy=_freq_strat
        )
    raise RuntimeError("The frequency strategy must be the first strategy.")

Alternatively:

def freq_strategy_alt(
    pandera_dtype: pa.DataType,
    strategy: st.SearchStrategy | None = None,
    *,
    freq: FreqLike,
) -> st.SearchStrategy:
    """Strategy for frequency."""
    if strategy is None:
        return _freq_strat
    raise RuntimeError("The frequency strategy must be the first strategy.")

Neither of the above works, because Pandera assumes that the elements of a column are generated individually.

I have also tried subclassing pa.Column to override .strategy() and .strategy_component() so that they return a custom hs.builds(...) strategy, but that fails: these are expected to be hypothesis.extra.pandas.impl.column() objects that get passed to hypothesis.extra.pandas.impl.data_frames(), and a custom strategy doesn't fit that interface. Oof.

I also kept running into https://github.com/unionai-oss/pandera/issues/1220 (on 0.18.3). I haven't checked yet whether it's fixed in 0.19.0b3.

Describe the solution you'd like

Ideally, I would like the ability to generate a whole series with a custom function, or at least with the hs.builds function. I've seen https://github.com/unionai-oss/pandera/issues/561, which might be the more proper fix. A shorter-term solution would be to allow custom generation in another code path (though with the layers of abstraction, this might be hard to accomplish...).

Since Hypothesis requires data_frames() to take columns, perhaps any custom columns could be generated alongside it? The custom generator function would have to be given the length of the series to generate. More complicated cases would then be handled by #561.
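
Outside of Pandera, the idea in the previous paragraph can be sketched with plain Hypothesis: draw the element-wise columns with `data_frames`, then build the date column for exactly that many rows. The helper name `frames_with_dates` and the single `value` column are made up for illustration, and this is not an existing Pandera feature:

```python
import datetime as dt

import pandas as pd
from hypothesis import HealthCheck, given, settings, strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes


@st.composite
def frames_with_dates(draw):
    """Hypothetical helper: element-wise columns first, then a date column."""
    # Ordinary columns are drawn element by element, as data_frames expects.
    df = draw(
        data_frames(
            columns=[column("value", dtype=float)],
            index=range_indexes(min_size=3, max_size=20),
        )
    )
    base = draw(st.dates(min_value=dt.date(1980, 1, 1), max_value=dt.date(2100, 1, 1)))
    freq = draw(st.sampled_from(["D", "W-SUN", "W-MON", "5D", "MS"]))
    # The whole-series generator is told how many rows it has to produce.
    dates = pd.date_range(pd.Timestamp(base), periods=len(df), freq=freq)
    df.insert(0, "date", dates)
    return df


@settings(max_examples=25, deadline=None, suppress_health_check=[HealthCheck.too_slow])
@given(frames_with_dates())
def check_dates_are_ordered(df):
    # Every generated frame has a strictly increasing date column.
    assert 3 <= len(df) <= 20
    assert df["date"].is_monotonic_increasing and df["date"].is_unique
```

Calling `check_dates_are_ordered()` runs the property. The drawback of this workaround is that the date column bypasses the schema entirely, so any checks on it would still have to be validated separately.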

Describe alternatives you've considered

See above in problem description.

Additional context

Currently, I have a check for data frequency (i.e. whether the data is daily, weekly, etc.) that I want to generate valid data for. However, there are more complicated cases, such as ensuring that ALL dates are contiguous within that frequency (from min to max). Without either this or #561, we can't generate such data from the schema.
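
To make the contiguity requirement concrete, here is one way such a check could be written with pandas alone. The function name `is_contiguous` is hypothetical, and the actual check used in the author's schema is not shown in the issue:

```python
import pandas as pd


def is_contiguous(dates: pd.Series, freq: str) -> bool:
    """Return True if `dates` covers every step of `freq` from its min to its max."""
    if dates.empty:
        return True
    # The full range that should be present between the observed endpoints.
    expected = pd.date_range(dates.min(), dates.max(), freq=freq)
    observed = pd.DatetimeIndex(dates.sort_values())
    # A gap or a duplicate changes the length; a misaligned date changes a value.
    return len(expected) == len(observed) and bool((expected == observed).all())
```

A property like this depends on the whole column at once, which is why it cannot be satisfied by drawing each element independently.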

https://github.com/unionai-oss/pandera/issues/1275 is also relevant - if I could generate a global column of timestamps (with or without pandera schema), and use that column to be "joinable" with other Pandera-schema-defined dataframes, that would cover most of my use cases as well.

cosmicBboy commented 1 week ago

Hi @NowanIlfideme

Pandera strategies are currently quite limited, as you've experienced. The limitation comes from the fact that they're built on the hypothesis data_frames API: https://hypothesis.readthedocs.io/en/latest/numpy.html#hypothesis.extra.pandas.data_frames. Basically, you need to specify columns and their elements, each of which is drawn from a strategy that generates a scalar.

> Ideally, I would like the ability to generate a whole series with a custom function, or at least with the hs.builds function.

Yes, #561 is the issue for improving this in pandera. I just haven't had the time to work on it, because it'll pretty much involve a re-write of the pandas_strategy module.

I consider this issue, #1220, and #1275 to be problems to be addressed by the re-write (#1275 sounds pretty hard to implement, though; I'd maybe keep that out of the design and rely on docs/recipes showing how to generate strategies with a fixed column based on the data generated from another strategy).

If you have the time/capacity, would you be able to chime in on #561 with a high-level set of requirements and (ideally) a code sketch of how this might be implemented in pandera? It would involve departing from hypothesis.extra.pandas.data_frames altogether.

From my understanding, we want: