Closed: cosmicBboy closed this issue 3 years ago.
The faker package might also be useful for generating fake data of particular kinds, like names, addresses, etc. This would probably necessitate checks for these, like `Check.is_name` and `Check.is_address`, which means that pandera would need to come up with a principled approach to distinguishing between the checking logic and the generation logic underlying a `Check`.

For things like `Check.greater_than` this is straightforward, since the `.statistics` attribute contains all the information needed to generate valid mock data. On the other hand, `Check.is_name` would need to distinguish between the logic that verifies values in real data (probably a regex expression) and the logic that generates mock data, which could also be a regex expression, but using `faker` would result in a better user experience, since it generates more plausible values.
Hi, I think pandera could delegate these responsibilities during a first development stage. faker would be a very good resource for having something automatically ready to produce synthetic data. In a second stage we could build a facade over the most relevant and basic fields, like email, names, geolocation, etc., along with pandera integration tests, like @cosmicBboy was saying (e.g. `Check.is_name`).
e.g.

```python
# early stage
from pandera.mocks import Faker

schema = DataFrameSchema(
    columns={
        "col1": Column(pa.Float),
        "col2": Column(pa.Int),
        "email": pa.Column(
            pa.Str,  # the raw type
            mock=Faker.email,
        ),
    },
)
```
```python
# after
from pandera.mocks import Mocks

schema = DataFrameSchema(
    columns={
        "col1": Column(pa.Float),
        "col2": Column(pa.Int),
        "email": pa.Column(
            pa.Str,  # the raw type
            mock=Mocks.email,  # relying on Check.is_email
            # alternative:
            # mock=lambda _: str(random_fancy_func()) + ".com",
        ),
    },
)
```
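To make the facade idea concrete, here is a toy, pandera-free stand-in for such a `Mocks` class. All names here are hypothetical, and the stdlib `random` module is used in place of faker:

```python
import random
import string


class Mocks:
    """Hypothetical facade over common fake-data fields."""

    @staticmethod
    def email():
        # random 8-letter local part plus one of two fixed domains
        user = "".join(random.choices(string.ascii_lowercase, k=8))
        domain = random.choice(["example.com", "test.org"])
        return f"{user}@{domain}"


print(Mocks.email())  # e.g. 'qzplmwek@example.com'
```

A real facade would presumably dispatch to faker providers instead of hand-rolled generators, but the user-facing surface could look like this.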
We could delegate the validation test to the user as well, via a simple lambda function. I'm thinking especially about custom user types. In other words, we'd have something like:
```python
EMAIL_PAT = r'^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$'
is_email = lambda x: re.match(EMAIL_PAT, x) is not None

schema = DataFrameSchema(
    columns={
        "col1": Column(pa.Float),
        "col2": Column(pa.Int),
        "email": pa.Column(
            pa.Str,  # the raw type
            validate=is_email,
            # alternative:
            # validate=pa.CustomValidator(is_email),
        ),
    },
)
```
I hope I was clear ^^'
What do you think?
Hi @gianfa, thanks for adding your thoughts here! There are some great concepts in your proposal, and I think there's an opportunity to further refine the user-facing API. I'd like to figure this part out first and then work backwards through the implementation details.
Re: custom types and validation logic, `pa.Check` is perfectly suited for this use case:
```python
import re

import pandera as pa
from pandera import DataFrameSchema

EMAIL_PAT = r'^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$'
is_email = lambda x: re.match(EMAIL_PAT, x) is not None

schema = DataFrameSchema(
    columns={
        "email": pa.Column(
            pa.Str, checks=pa.Check(is_email, element_wise=True)
        )
    },
)
```
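As a quick sanity check on the email pattern itself, independent of pandera, the regex behaves as expected on a couple of samples:

```python
import re

EMAIL_PAT = r'^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$'


def is_email(x):
    # re.match anchors at the start; the pattern's ^...$ anchors the rest
    return re.match(EMAIL_PAT, x) is not None


print(is_email("alice@example.com"))  # True
print(is_email("not-an-email"))       # False
```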
One question I had about your sketch code was the part about `mock=Mocks.email  # relying on Checks.is_email`. How would the `Mocks` class work with the `Checks` class?
I think the mock abstraction is a useful one when developing tests, but I'm wondering if we can abstract this in such a way that the bare minimum a user has to do is think about the statistics of a data type, based purely on the `Check` abstraction, while still supporting custom generator functions, as you've hinted at in your proposal.
For example, if you look at how built-in checks are implemented, the classmethods are wrapped with `register_check_statistics`, which makes the relevant statistics from the classmethod arguments available on a check object via `check.statistics`.
Currently, these statistics are used by the `io` module to serialize schemas into scripts/yaml files. However, they could also be used to generate hypothesis strategies that produce data. Using the email example, we could leverage `hypothesis.strategies.from_regex` to generate valid samples:
```python
# generators.py
def regex(pattern):
    # strategies have an .example() method to lazily generate data
    return hypothesis.strategies.from_regex(pattern)


# checks.py
DEFAULT_EMAIL_PAT = r'^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$'


# a sketch of the implementation of built-in checks
class Check:
    ...

    @classmethod
    # check statistics would be fed into the generator function as *args
    @register_check_statistics(["pattern"], generators.regex)
    def is_email(cls, pattern=DEFAULT_EMAIL_PAT, **kwargs):
        def _is_email(series: pd.Series) -> pd.Series:
            return series.str.match(pattern)
        return cls(_is_email, ..., **kwargs)
```
```python
# user-facing API
schema = DataFrameSchema(
    columns={
        "email": pa.Column(
            pa.Str,  # the raw type
            checks=pa.Check.is_email(),  # a custom email pattern can be supplied here
        )
    },
)
examples = schema.generate_examples(5)
```
```python
# user-facing API for custom checks and generators.
# this toy example is a name generator based on a pre-defined list of names
import hypothesis.strategies
from faker import Faker

faker = Faker()


def name_generator(name_list):
    return hypothesis.strategies.sampled_from(name_list)


# in theory, the generator function can ignore the check statistics and
# generate data independently, as long as it passes the check criteria
def name_faker(*args):
    # use the builds function and faker to create a custom hypothesis
    # strategy that uses the Faker package to generate names
    return hypothesis.strategies.builds(lambda: faker.name())
```
```python
# register_check_statistics would need to be extended to support regular functions
@register_check_statistics(statistics=["name_list"], generator=name_generator)
def is_name(name_list):
    def _is_name(series):
        return series.map(lambda x: x in name_list)
    return _is_name


# read in the name list from a txt file, stripping trailing newlines
with open("names.txt") as f:
    allowed_names: List[str] = [line.strip() for line in f]
```
```python
schema = DataFrameSchema(
    columns={
        "names": pa.Column(
            pa.Str,  # the raw type
            checks=pa.Check(
                is_name(allowed_names),
                # alternatively, a generator argument could be used here, which
                # is fed the check statistics specified in register_check_statistics.
                # under the hood, this function would be wrapped with
                # hypothesis.strategies.builds to create a hypothesis strategy.
                generator=lambda name_list: random.choice(name_list),
            )
        )
    },
)
```
Let me know if this aligns with what you were thinking!
`Column`s with multiple checks

We'll need to be able to handle columns with multiple checks (and therefore multiple statistics). A promising direction would be to leverage the hypothesis package to create adaptive/composite strategies that incorporate the statistics from multiple checks. In the simplest case, however, pandera can easily handle `Column` schemas with a single check. So far, `pandera` doesn't do anything in terms of validating the validation checks (so meta 🤯), but I can foresee the need for this to make the interaction between checks and generators more robust.
Hi @cosmicBboy, thank you so much for your exhaustive reply! Here are my thoughts, in a few points.

I. Answering your last message.
II. In the future we should consider:
III. I was wondering about an "infer" functionality, given that each column may declare at least:
We already have enough info from which a generation algorithm may infer, for example: categoricals → random choice; datetimes → random values between ranges, etc.; ints and bools → already-known random algorithms. An example of the logic would be:
```python
# schemas.SeriesSchemaBase?
def _generate(self):
    ...  # preliminary checks and stuff
    self._infer_generation = True
    if self.generator is not None:
        ...  # builtins or custom
    if self.generator is None and self._infer_generation:
        if is_boolean(self._pandas_dtype):
            return hypothesis.strategies.booleans()
        if is_int(self._pandas_dtype):
            return hypothesis.strategies.integers()
        ...
        if self._enum is not None and len(self._enum) > 0:
            return hypothesis.strategies.sampled_from(self._enum)
        elif self._pattern is not None and is_numerical(self._pandas_dtype):
            return hypothesis.strategies.from_regex(self._pattern)
        elif self._pattern is not None and self._pandas_dtype == 'DateTime':
            if isinstance(self._min_date, (datetime, str)) and isinstance(self._max_date, (datetime, str)):
                return hypothesis.strategies.dates(
                    min_value=self._min_date, max_value=self._max_date
                )
            else:
                return hypothesis.strategies.dates(
                    min_value=datetime(1970, 1, 1), max_value=datetime(2021, 1, 1)
                )
    return
```
I think this way we could have many possibilities available right away, and it might be quicker to implement as a first feature. What do you think?
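To make this inference idea concrete outside of pandera, here is a minimal runnable sketch using the stdlib `random` module in place of hypothesis; the `infer_generator` name and its signature are hypothetical:

```python
import random
from datetime import date, timedelta


def infer_generator(dtype, enum=None, min_date=None, max_date=None):
    """Pick a value generator from column metadata, falling back on dtype."""
    if enum:
        return lambda: random.choice(enum)
    if dtype == "bool":
        return lambda: random.choice([True, False])
    if dtype == "int":
        return lambda: random.randint(-10**6, 10**6)
    if dtype == "datetime":
        lo = min_date or date(1970, 1, 1)
        hi = max_date or date(2021, 1, 1)
        span = (hi - lo).days
        return lambda: lo + timedelta(days=random.randint(0, span))
    raise TypeError(f"no generator inferred for dtype {dtype!r}")


gen = infer_generator("datetime", min_date=date(2000, 1, 1), max_date=date(2000, 12, 31))
sample = [gen() for _ in range(3)]  # dates within the year 2000
```

A hypothesis-backed version would return strategies instead of plain callables, but the branching logic would be the same.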
Q: I still don't get how we could generate a specific number of examples from pandera with the syntax above. Is there any way to inject the number parameter, like a DataFrame length for the example dataset? Otherwise we could just make it an argument of the `generate()` method.
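For instance (purely a sketch, not existing pandera API), the number of examples could simply be an argument of the generation method, with each column drawing `n` values from its generator:

```python
import random

# hypothetical stand-ins for per-column generator functions
column_generators = {
    "col1": lambda: random.random(),       # float column
    "col2": lambda: random.randint(0, 9),  # int column
}


def generate(n):
    """Draw n rows, one value per column per row."""
    return [{name: gen() for name, gen in column_generators.items()} for _ in range(n)]


rows = generate(5)
print(len(rows))  # 5
```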
Hi @gianfa, thanks for keeping the momentum on this discussion going! I'm particularly excited about this feature for the ML use case: testing models with fake data at the beginning of the pipeline would be a game-changer for me.
> I'm not sure yet about what "statistics" specifically does.
Currently it doesn't do a whole lot. All the `register_check_statistics` decorator does is give access to the arguments of built-in checks for the purpose of serializing schemas into yml files or scripts, see here.
For example:

```python
check = pa.Check.gt(0)
print(check.statistics)
# {'min_value': 0}
```
What I'm suggesting is to use these statistics as metadata to plug into the data generator inference functionality that you've sketched out above.
> In the future we should consider relationships between columns.
This introduces considerable complexity to the system and for sure we'd want to tackle this after we have a working prototype.
For the initial implementation of this functionality, I'd like to propose the following assumptions:

1. Columns are independent of one another.
2. The column properties (`nullable`, `allow_duplicates`) and checks within a column fully specify the constraints of the hypothesis strategy.

Although we are assuming (1), I think we should design the inference logic to at least be flexible enough to easily loosen this assumption eventually. E.g., we'd probably want to use the `rows` argument here to enforce values across columns.
Here's a very simple example:

```python
import pandera as pa

schema = pa.DataFrameSchema({
    "column": pa.Column(int, pa.Check.gt(0))
})
```
Under the hood:

```python
class Check:
    ...

    def generator_strategy(self):
        if is_int(self._pandas_dtype):
            return hypothesis.strategies.integers(
                min_value=self.statistics.get("min_value"),
                max_value=self.statistics.get("max_value"),  # this will be None
            )
        # handle other data types
        ...


class SeriesSchemaBase:
    ...

    def generator_strategy(self):
        for check in self.checks:
            strategy = check.generator_strategy()
            ...  # ✨ magically compose multiple checks into a single strategy ✨
```
Of course, the gotcha is to be able to:

- catch checks that make no sense for the column type, e.g. `pa.Check.isin(["foo", "bar"])` in an integer column
- handle mutually contradictory checks, e.g. `pa.Check.gt(10)` and `pa.Check.lt(-10)`

We need to see to what extent composite hypothesis strategies can handle this complexity... it would be nice if we could just offload it to `hypothesis` and re-raise a `hypothesis.errors.Unsatisfiable` error if the constraints can't play well together.

Let me know if this makes sense! I think the next steps would be to start spec'ing out a few examples across different data types and see what the capabilities of `hypothesis` are for composing different check constraints.
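As a pandera-free illustration of the contradictory-statistics case, merging the statistics of several checks and failing fast when the merged bounds are empty might look like this. The `Unsatisfiable` class here is just a stand-in for `hypothesis.errors.Unsatisfiable`, and bounds are treated as inclusive for simplicity:

```python
import random


class Unsatisfiable(Exception):
    """Stand-in for hypothesis.errors.Unsatisfiable."""


def merge_statistics(stats_list):
    """Combine the statistics dicts of several checks into one."""
    merged = {}
    for stats in stats_list:
        merged.update(stats)
    return merged


def integer_generator(min_value=None, max_value=None):
    lo = -10**6 if min_value is None else min_value
    hi = 10**6 if max_value is None else max_value
    if lo > hi:
        # contradictory constraints, e.g. gt(10) combined with lt(-10)
        raise Unsatisfiable(f"no integers in [{lo}, {hi}]")
    return lambda: random.randint(lo, hi)


# statistics as two checks might contribute them
gen = integer_generator(**merge_statistics([{"min_value": 1}, {"max_value": 9}]))
print(all(1 <= gen() <= 9 for _ in range(100)))  # True
```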
This also raises a related, but separate, issue - currently the way checks work is that they have an AND relationship:
```python
schema = pa.DataFrameSchema({
    "column": pa.Column(int, [pa.Check.gt(0), pa.Check.lt(10)])  # column > 0 AND column < 10
})
```
This is the case because all boolean values must be true for all checks in order for data validation to pass. To have an OR relationship, users would have to write:
```python
schema = pa.DataFrameSchema({
    "column": pa.Column(int, pa.Check(lambda s: (s > 0) | (s < 10)))  # lambda checks don't have check statistics though
})
```
I don't think we have to tackle this issue here, but for the record it would be nice to be able to do something like this:
```python
schema = pa.DataFrameSchema({
    "column": pa.Column(int, pa.Check.gt(0) | pa.Check.lt(10))  # column > 0 OR column < 10
})
```
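As a toy illustration of what such an OR operator could look like (this is not existing pandera behavior), element-wise predicate wrappers can overload `|` and `&`:

```python
class Check:
    """Minimal stand-in: wraps an element-wise predicate."""

    def __init__(self, fn):
        self.fn = fn

    def __call__(self, x):
        return self.fn(x)

    def __or__(self, other):
        return Check(lambda x: self.fn(x) or other.fn(x))

    def __and__(self, other):
        return Check(lambda x: self.fn(x) and other.fn(x))


gt = lambda v: Check(lambda x: x > v)
lt = lambda v: Check(lambda x: x < v)

either = gt(0) | lt(10)
print(either(-5), either(5))  # True True (every value is > 0 or < 10)
```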
Hi @cosmicBboy,

Ok, I suppose I have a better understanding of `check.statistics` now. In other words, it is like an accessor to the meta-characteristics of the column, the same ones we want to inject in the schema definition. As you already said, we'll expand it.
Prototype Spec
Agree.
Catches
Ok, we're talking about data consistency here. Your suggested code looks good, IMHO, as long as it is validated at the schema (definition) level, not at the generation level.
About capturing inconsistencies, maybe this is related to your next observation about the logical coherence between checks. What about using a hierarchical approach: test checks in a sequential/incremental fashion, in order to catch lambda inconsistencies. E.g., given the checks `[gt(5), lt(10), lambda x: 4]`, data like `[4, 4, 4]` is ok, while `[4, 5, 7]` is not; the conditions are validated cumulatively from left to right. It can be run simply like (pseudo-code):
```python
from functools import reduce

df = pd.DataFrame([4, 4, 5])
checks = [gt(5), lt(10), lambda x: 4]


def do_something(df, condition: bool):  # generic func to test consistency
    return df[condition]


def perform_checks(check1, check2):
    try:
        do_something(df, (check1) and (check2))
        return (check1) & (check2)
    except Exception:
        raise InconsistentCheckException(
            f'"{check2}" check is incompatible with the previous ones!'
        )


reduce(perform_checks, checks)
```
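A runnable toy version of this left-to-right accumulation idea, using plain predicates over a range of probe values (`InconsistentCheckError` and the probe range are hypothetical choices):

```python
from functools import reduce


class InconsistentCheckError(Exception):
    """Hypothetical error for mutually incompatible checks."""


SAMPLES = range(-20, 21)  # probe values used to test check compatibility


def combine(check1, check2):
    combined = lambda x: check1(x) and check2(x)
    # if no probe value satisfies the accumulated checks, the new check
    # is incompatible with the previous ones
    if not any(combined(x) for x in SAMPLES):
        raise InconsistentCheckError("check is incompatible with the previous ones!")
    return combined


checks = [lambda x: x > 0, lambda x: x < 10]
combined = reduce(combine, checks)
print([x for x in SAMPLES if combined(x)])  # 1 through 9
```

Probing with sample values is of course weaker than reasoning symbolically over statistics, but it catches gross contradictions like `gt(10)` followed by `lt(-10)`.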
I'm a little worried about freely allowing logical operators at instantiation, since that could let users pass whatever data into columns, misunderstanding the validation operation itself, and maybe add complexity we don't really need for the goal. Please let me know your opinion about it.
Look into using hypothesis as a way for generating valid samples from a particular schema for testing code for model fitting, data visualization, etc.
Example use case:
As a user, I want to generate a small dataset for testing my machine learning code. I have an `estimator` and want to call `estimator.fit(X, y)`. The pandera API to fulfill this use case would be something like:

More generally, this would enable users to verify that any arbitrary pandas code can successfully execute.