Open NowanIlfideme opened 3 weeks ago
Hi @NowanIlfideme
Pandera strategies are currently quite limited, as you've experienced. The limitation comes from the fact that pandera leverages the hypothesis data_frames API: https://hypothesis.readthedocs.io/en/latest/numpy.html#hypothesis.extra.pandas.data_frames. Basically, you need to specify columns and their elements, each of which is drawn from a strategy that generates a scalar.
Ideally, I would like the ability to generate a whole series with a custom function, or at least with the hs.builds function.
Yes, so #561 is the issue for improving this in pandera, I just haven't had the time to work on this because it'll pretty much involve a re-write of the pandas_strategy module.
I consider this issue, #1220, and #1275 to be problems to be addressed by the re-write (#1275 sounds pretty hard to implement tho, I'd maybe keep that out of the design and rely on docs/recipes on how to generate strategies with a fixed column based on the data generated from another strategy).
If you have the time/capacity, would you be able to chime in on #561 with a high-level set of requirements and (ideally) a code sketch of how this might be implemented in pandera? It would involve departing from hypothesis.extra.pandas.data_frames altogether.
From my understanding, we want:
filter
, it should maybe override data with the new constraint.
Is your feature request related to a problem? Please describe.
I've been trying - and failing - to generate dataframes with a Pandera strategy that will create a date column with values from pd.date_range(). I can generate a series via hypothesis directly:
However, I can't create a Pandera strategy. The best I could come up with is this:
alternatively:
Neither of the above works, because Pandera assumes the elements are individually generated.
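For anyone following along, here is a minimal sketch of what generating a whole date series directly with hypothesis might look like. This is not the original snippet: the date_series name, the date bounds, and the size limits are assumptions made for illustration.

```python
from datetime import date

import pandas as pd
from hypothesis import strategies as hs


@hs.composite
def date_series(draw, freq="D"):
    """Draw a whole series of contiguous dates in one go (hypothetical helper)."""
    # Assumed bounds; pick whatever range suits the data being modelled.
    start = draw(hs.dates(min_value=date(2000, 1, 1), max_value=date(2030, 1, 1)))
    periods = draw(hs.integers(min_value=1, max_value=30))
    index = pd.date_range(start=pd.Timestamp(start), periods=periods, freq=freq)
    return pd.Series(index, name="date")
```

The point is that the series is built as a single unit from a start and a length, which is exactly what an element-by-element column strategy cannot express.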
I have also tried subclassing pa.Column to overwrite the .strategy() and .strategy_component() to return a custom hs.builds(...) strategy, but it fails because these are hypothesis.extra.pandas.impl.column() passed to hypothesis.extra.pandas.impl.dataframe() ... which a custom strategy misses. Oof.
I also ran into https://github.com/unionai-oss/pandera/issues/1220 constantly (on 0.18.3), not sure if it's fixed for 0.19.0b3 - didn't check that yet.
Describe the solution you'd like
Ideally, I would like the ability to generate a whole series with a custom function, or at least with the hs.builds function. I've seen https://github.com/unionai-oss/pandera/issues/561, which might be the more proper fix. A shorter-term solution would be to allow custom generation in another code path (though with the layers of abstraction, this might be hard to accomplish...).
Since Hypothesis requires the .dataframe() to take columns, perhaps any custom columns could be generated alongside it? The custom generator function would have to be given the length of series to generate. More complicated cases would be handled by #561 then.
Describe alternatives you've considered
See above in problem description.
Additional context
Currently, I have a check for data frequency (i.e. if data is daily, weekly, etc.) that I want to generate valid data for. However, there are more complicated cases, such as ensuring we have ALL dates being contiguous within that frequency (from min to max). Without either this or #561 we can't generate things from the schema.
https://github.com/unionai-oss/pandera/issues/1275 is also relevant - if I could generate a global column of timestamps (with or without pandera schema), and use that column to be "joinable" with other Pandera-schema-defined dataframes, that would cover most of my use cases as well.
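To make the "generate custom columns alongside the dataframe" idea a bit more concrete, here is a rough sketch using only hypothesis, without going through pandera. The names contiguous_dates and frames_with_dates, the value column, and all bounds are hypothetical. It is meant as a possible workaround shape, not as how pandera does or should implement it.

```python
from datetime import date

import pandas as pd
from hypothesis import strategies as hs
from hypothesis.extra.pandas import column, data_frames, range_indexes


def contiguous_dates(start, n, freq="D"):
    """Custom generator that is handed the length of the series to produce."""
    return pd.date_range(start=pd.Timestamp(start), periods=n, freq=freq)


@hs.composite
def frames_with_dates(draw):
    # Scalar columns are still generated element-wise by hypothesis.
    df = draw(
        data_frames(
            columns=[
                column(
                    "value",
                    elements=hs.floats(min_value=0, max_value=100),
                    dtype=float,
                )
            ],
            index=range_indexes(min_size=1, max_size=20),
        )
    )
    # The custom column is generated afterwards, sized to match the frame.
    start = draw(hs.dates(min_value=date(2000, 1, 1), max_value=date(2030, 1, 1)))
    df["date"] = contiguous_dates(start, len(df))
    return df
```

Since the date column is attached after the rest of the frame is drawn, it is guaranteed to be contiguous from min to max at the chosen frequency. The same pattern could in principle be used to share one generated timestamp column across several dataframes.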