unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

schemas can generate valid samples #200

Closed cosmicBboy closed 3 years ago

cosmicBboy commented 4 years ago

Look into using hypothesis as a way for generating valid samples from a particular schema for testing code for model fitting, data visualization, etc.

Example use case:

As a user, I want to generate a small dataset for testing my machine learning code. I have an estimator and want to call estimator.fit(X, y). The pandera API to fulfill this case would be something like:

import pandera as pa
from pandera import Column, DataFrameSchema

# train a house price regression model
schema = DataFrameSchema(
    columns={
        "square_footage": Column(pa.Float),
        "number_of_rooms": Column(pa.Int),
        "house_price": Column(pa.Float),
    },
)

dataset = schema.generate_samples(100)

features = ["square_footage", "number_of_rooms"]
target = "house_price
estimator = ...  # e.g. sklearn estimator
estimator.fit(dataset[features], dataset[target])
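
For what it's worth, something close to this is already expressible with hypothesis's pandas extra, so here's a minimal sketch of what generate_samples(100) might do under the hood (the value bounds below are illustrative assumptions, not part of the proposal):

import hypothesis.strategies as st
from hypothesis.extra.pandas import column, data_frames, range_indexes

# a strategy mirroring the schema above; pinning min_size == max_size makes
# every drawn example a DataFrame with exactly 100 rows
houses = data_frames(
    columns=[
        column("square_footage", elements=st.floats(min_value=100, max_value=10_000)),
        column("number_of_rooms", elements=st.integers(min_value=1, max_value=20)),
        column("house_price", elements=st.floats(min_value=0, max_value=5_000_000)),
    ],
    index=range_indexes(min_size=100, max_size=100),
)

dataset = houses.example()  # .example() is meant for interactive exploration only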

More generally, this would enable users to verify that any arbitrary pandas code can successfully execute.

# in data pipeline .py file
import pandera as pa
from pandera import Column, DataFrameSchema

schema = DataFrameSchema(
    columns={
        "col1": Column(pa.Float),
        "col2": Column(pa.Int),
        "col3": Column(pa.String),
    },
)

def process_data(df):
    # do a bunch of transformations
    return ...

# e.g. using pytest in a testing .py file
import pandas as pd

def test_process_data():
    assert isinstance(process_data(schema.generate_samples(100)), pd.DataFrame)

cosmicBboy commented 3 years ago

The faker package might also be useful for generating fake data of particular kinds, like names, addresses, etc.

This would probably necessitate checks for these, like Check.is_name and Check.is_address, which means that pandera would need to come up with a principled approach to distinguishing between the checking logic and the generation logic underlying a Check.

For things like Check.greater_than this is straightforward, since the .statistics attribute contains all the information needed to generate valid mock data.

On the other hand, Check.is_name would need to distinguish between the logic that verifies real data (probably a regex) and the logic that generates mock data. The generator could also be regex-based, but using faker would result in a better user experience, since it generates more plausible values.
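
To make that split concrete, here's a rough sketch where the verification logic and the generation logic live side by side (check_fn and name_strategy are hypothetical names, not pandera API, and the check itself is a naive stand-in):

import hypothesis.strategies as st
import pandas as pd
from faker import Faker

faker = Faker()

# checking logic: verifies real data (a naive stand-in for a name regex)
def check_fn(series: pd.Series) -> pd.Series:
    return series.str.len() > 0

# generation logic: faker produces plausible values, wrapped in a hypothesis
# strategy so it composes with the rest of the machinery
name_strategy = st.builds(faker.name)

mock_names = pd.Series([name_strategy.example() for _ in range(5)])
assert check_fn(mock_names).all()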

gianfa commented 3 years ago

Hi, I think Pandera could delegate these responsibilities to the user during a first development stage.

schema = DataFrameSchema(
    columns={
        "col1": Column(pa.Float),
        "col2": Column(pa.Int),
        "email": pa.Column(
            pa.Str,  # the raw type
            validate=is_email,
            # alternative:
            # validate=pa.CustomValidator(is_email),
        )
    },
)



I hope I was clear ^^' 

What do you think?
cosmicBboy commented 3 years ago

hi @gianfa thanks for adding your thoughts here! There are some great concepts in your proposal, and I think there's an opportunity to further refine the user-facing API. I'd like to figure this part out first and then work backwards through the implementation details.

Re: custom types and validation logic, pa.Check is perfectly suited for this use case:

import re

EMAIL_PAT = r'^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$'
is_email = lambda x: re.match(EMAIL_PAT, x) is not None

schema = DataFrameSchema(
    columns={
        "email": pa.Column(
            pa.Str, checks=pa.Check(is_email, element_wise=True)
        )
    },
)

One question I had about your sketch-code was the part about mock = Mocks.email # relying on Checks.is_email. How would the Mocks class work with the Checks class?

I think the mock abstraction is a useful one when developing tests, but I'm wondering if we can abstract this in such a way that the bare minimum a user has to do is think about the statistics of a data type purely through the Check abstraction, while still supporting custom generator functions, as you've hinted at in your proposal.

For example, if you look at how built-in checks are implemented, the classmethods are wrapped with register_check_statistics, which makes the relevant classmethod arguments available on a check object via check.statistics.

Currently, these statistics are used by the io module to serialize schemas into scripts/yaml files. However, they could also be used to generate hypothesis strategies that generate data. Using the email example, we could leverage hypothesis.strategies.from_regex to generate valid samples:

# generators.py
import hypothesis.strategies

def regex(pattern):
    # strategies have an .example() method to lazily generate data
    return hypothesis.strategies.from_regex(pattern)

# checks.py
DEFAULT_EMAIL_PAT = r'^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$'

# a sketch of the implementation of built-in checks
class Check:
    ...
    @classmethod
    @register_check_statistics(["pattern"], generators.regex)  # check statistics would be fed into the generator function as *args
    def is_email(cls, pattern=DEFAULT_EMAIL_PAT, **kwargs):

        def _is_email(series: pd.Series) -> pd.Series:
            return series.str.match(pattern)

        return cls(_is_email, ..., **kwargs)

# user-facing API
schema = DataFrameSchema(
    columns={
        "email": pa.Column(
            pa.Str,  # the raw type
            checks=pa.Check.is_email()  # custom email pattern can be supplied here
        )
    },
)
examples = schema.generate_examples(5)

Custom Checks and Generators

# user-facing API for custom checks and generators.
# this toy example is a name generator based on a pre-defined list of names
import random
from typing import List

import hypothesis.strategies
from faker import Faker

faker = Faker()

def name_generator(name_list):
    return hypothesis.strategies.sampled_from(name_list)

# in theory, the generator function can ignore the check statistics and
# generate data independently, as long as it passes the check criteria
def name_faker(*args):
    # use builds function and faker to create a custom hypothesis strategy
    # that uses Faker package to generate names
    return hypothesis.strategies.builds(lambda: faker.name())

# register_check_statistics would need to be extended to support regular functions
@register_check_statistics(statistics=["name_list"], generator=name_generator)
def is_name(name_list):

    def _is_name(series):
        return series.map(lambda x: x in name_list)

    return _is_name

# read in name list from a txt file
with open("names.txt") as f:
    allowed_names: List[str] = [line.strip() for line in f]

schema = DataFrameSchema(
    columns={
        "names": pa.Column(
            pa.Str,  # the raw type
            checks=pa.Check(
                is_name(allowed_names),
                # alternatively, a generator argument could be used here, which is
                # fed the check statistics specified in register_check_statistics.
                # under the hood, this function would be wrapped with hypothesis.strategies.builds
                # to create a hypothesis strategy.
                generator=lambda name_list: random.choice(name_list)
            )
        )
    },
)

Let me know if this aligns with what you were thinking!

Problem: handling Columns with multiple checks

We'll need to be able to handle columns with multiple checks (and therefore multiple statistics). A promising direction would be to leverage the hypothesis package to create adaptive/composite strategies to incorporate the statistics from multiple checks. In the simplest case, however, pandera can easily handle Column schemas with a single check. So far, pandera doesn't do anything in terms of validating the validation checks (so meta 🤯), but I can foresee the need for this for making the interaction between checks and generators more robust.
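
One possible shape for that composition: fold the statistics of well-understood checks into a single base strategy, and fall back to hypothesis's .filter for opaque ones. A minimal sketch for an integer column (compose_int_checks is hypothetical, and the strict gt/lt handling is an assumption about how their statistics would be laid out):

import hypothesis.strategies as st

def compose_int_checks(checks):
    min_value, max_value, residual = None, None, []
    for check in checks:
        stats = getattr(check, "statistics", None) or {}
        if "min_value" in stats:
            min_value = stats["min_value"] + 1  # gt is strict, so shift the integer bound
        elif "max_value" in stats:
            max_value = stats["max_value"] - 1  # likewise for lt
        else:
            residual.append(check)  # e.g. a bare lambda check with no statistics
    strategy = st.integers(min_value=min_value, max_value=max_value)
    for check in residual:
        strategy = strategy.filter(check)  # last resort: filtering can be slow
    return strategy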

gianfa commented 3 years ago

Hi @cosmicBboy, thank you so much for your exhaustive reply! Here are my thoughts, point by point.

I. Answering your last message

  1. pa.Check: yes, it is perfect to host validators.
     1.1. I'm not sure yet what "statistics" specifically does. Is it limited to declaring validation properties, such as "nullable"?
     1.2. I've never used Hypothesis strategies, but they do look promising for generation.
  2. Handling multiple checks: I love the compositional approach, so I agree with finding a way to make it easy to compose a check from many. I'll read more about Hypothesis composite strategies.
  3. Your example code for Custom Checks and Generators looks good to me at first glance.

II. In the future we should consider:

  1. relationships between columns. I think about properties of the same thing. For example, for a country I'd like to have an "ISO-alpha2" column and a "name" column, but both should come from the same country, not just be random values. Example: ['US', 'United States of America']. Maybe the best way to handle this is through nested column definitions, like for serializers (have a look here just to have a common ground: https://google.github.io/flatbuffers/md__schemas.html), but we can think about it later.

III. I was wondering about an "infer" functionality, given what each column already declares. We already have enough info for a generation algorithm to infer, for example:

    • categoricals: random choice
    • datetimes: random values within a range, etc.
    • ints and bools: already-known random algorithms

An example of the logic would be:

# schemas.SeriesSchemaBase?
from datetime import date, datetime

import hypothesis.strategies

def _generate(self):
    ...  # preliminary checks and stuff
    self._infer_generation = True
    if self.generator is not None:
        ...  # builtins or custom
    if self.generator is None and self._infer_generation:
        if is_boolean(self._pandas_dtype):
            return hypothesis.strategies.booleans()
        if is_int(self._pandas_dtype):
            return hypothesis.strategies.integers()
        ...
        if self._enum is not None and len(self._enum) > 0:
            return hypothesis.strategies.sampled_from(self._enum)
        elif self._pattern is not None and is_numerical(self._pandas_dtype):
            return hypothesis.strategies.from_regex(self._pattern)
        elif self._pandas_dtype == 'DateTime':
            if isinstance(self._min_date, (datetime, str)) and isinstance(self._max_date, (datetime, str)):
                return hypothesis.strategies.dates(min_value=self._min_date, max_value=self._max_date)
            else:
                return hypothesis.strategies.dates(min_value=date(1970, 1, 1), max_value=date(2021, 1, 1))
    return

I think this way we could have many possibilities available already, and it might be quicker to implement as a first feature. What do you think?

Q: I still don't get how we could generate a specific number of examples with the syntax above from pandera. Is there any way to inject the number parameter, like a DataFrame length for the example dataset? Otherwise we could just put it as an argument to the generate() method.
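
As a sketch of that last idea, assuming the _generate method above returns a hypothesis strategy (generate_series is a hypothetical helper, not pandera API):

import pandas as pd

def generate_series(strategy, n: int, name=None) -> pd.Series:
    # draw n values from a per-column strategy; .example() is convenient for
    # sketches, but real tests should use hypothesis's @given instead
    return pd.Series([strategy.example() for _ in range(n)], name=name)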

cosmicBboy commented 3 years ago

Hi @gianfa, thanks for keeping the momentum on this discussion going! I'm particularly excited about this feature for the ML use case: testing models with fake data at the beginning of the pipeline would be a game-changer for me.

I'm not sure yet about what "statistics" specifically does.

Currently it doesn't do a whole lot. All the register_check_statistics decorator does is give access to the arguments of built-in checks for the purpose of serializing schemas in yml files or scripts, see here.

For example

check = pa.Check.gt(0)
print(check.statistics)
# {'min_value': 0}

What I'm suggesting is to use these statistics as metadata to plug into the data generator inference functionality that you've sketched out above.
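
Concretely, a minimal sketch of that plumbing (the + 1 accounts for gt being a strict inequality on integers):

import hypothesis.strategies as st
import pandera as pa

check = pa.Check.gt(0)
strategy = st.integers(min_value=check.statistics["min_value"] + 1)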

In the future we should consider relationships between columns.

This introduces considerable complexity to the system and for sure we'd want to tackle this after we have a working prototype.

Prototype Spec

For the initial implementation of this functionality, I'd like to propose the following assumptions:

  1. columns are generated independently of other columns.
  2. the column dtype, properties (like nullable, allow_duplicates) and checks within a column fully specify the constraints of the hypothesis strategy
  3. checks within a column can be composed together to yield a hypothesis strategy

Although we are assuming (1), I think we should design the inference logic to at least be flexible enough to easily loosen this assumption eventually. E.g., we'd probably want to use the rows argument here to enforce relationships across columns.
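
For instance, a minimal sketch of how rows could tie two columns together, reusing the ISO-alpha2 / country-name example from earlier in the thread (column names and the country list are illustrative):

import hypothesis.strategies as st
from hypothesis.extra.pandas import columns, data_frames

# each row is drawn as a single unit, so the two fields always stay consistent
country_rows = st.sampled_from([
    ("US", "United States of America"),
    ("IT", "Italy"),
    ("FR", "France"),
])
linked = data_frames(
    columns=columns(["iso_alpha2", "country_name"], dtype=object),
    rows=country_rows,
)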

Here's a very simple example:

import pandera as pa

schema = pa.DataFrameSchema({
    "column": pa.Column(int, pa.Check.gt(0))
})

Under the hood:

class Check:
    ...
    def generator_strategy(self):
        if is_int(self._pandas_dtype):
            return hypothesis.strategies.integers(
                min_value=self.statistics.get("min_value"),
                max_value=self.statistics.get("max_value"),  # this will be None
            )
        # handle other data types
        ...

class SeriesSchemaBase:
    ...

    def generator_strategy(self):
        for check in self.checks:
            strategy = check.generator_strategy()
            ...  # ✨ magically compose multiple checks into a single strategy ✨

Of course, the gotcha is to be able to compose the strategies from multiple checks into a single valid strategy.

Let me know if this makes sense! I think the next steps would be to start spec'ing out a few examples across different data types and see what hypothesis can do to compose different check constraints.

Addendum: Checks are Intersectional

This also raises a related, but separate, issue - currently the way checks work is that they have an AND relationship:

schema = pa.DataFrameSchema({
    "column": pa.Column(int, [pa.Check.gt(0), pa.Check.lt(10)])  # column > 0 AND column < 10
})

This means an OR relationship can't be expressed with a list of checks, since every value must pass every check for data validation to succeed. To have an OR relationship, users would have to write:

schema = pa.DataFrameSchema({
    "column": pa.Column(int, pa.Check(lambda s: (s > 0) | (s < 10)))  # lambda checks don't have check statistics though
})

I don't think we have to tackle this issue here, but for the record it would be nice to be able to do something like this:

schema = pa.DataFrameSchema({
    "column": pa.Column(int, pa.Check.gt(0) | pa.Check.lt(10))  # column > 0 OR column < 10
})
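
For reference, a minimal sketch of what | support might look like, using a simplified stand-in for pa.Check (SimpleCheck is hypothetical, not the pandera class):

import pandas as pd

class SimpleCheck:
    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other: "SimpleCheck") -> "SimpleCheck":
        # the combined check passes wherever either operand passes
        return SimpleCheck(lambda s: self.fn(s) | other.fn(s))

    def __call__(self, series: pd.Series) -> pd.Series:
        return self.fn(series)

either = SimpleCheck(lambda s: s > 0) | SimpleCheck(lambda s: s < 10)
print(either(pd.Series([-5, 5, 15])))  # all True: each value is > 0 or < 10
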
gianfa commented 3 years ago

Hi @cosmicBboy,

Ok, I suppose I have a better understanding of check.statistics now. In other words, it's like an accessor to the meta-characteristics of the column, the same ones we want to inject in the schema definition. As you already said, we'll expand it.

Prototype Spec

Agree.

Catches

Ok, we're talking about data consistency here. Your suggested code looks good, IMHO, as long as it is validated at the schema level (definition), not at the generation level.

About capturing inconsistencies, maybe this is related to your next observation about the logical coherence between Checks. What about using a hierarchical approach like:

  1. checks accepts a list of AND-linked conditions (if it were an inclusive OR, we'd have redundant info).
  2. test checks in a sequential/incremental fashion, in order to catch lambda inconsistencies.

    • Ex: if we have [gt(5), lt(10), lambda x: 4] then [4, 4, 4] is ok, while [4, 5, 7] is not. It's validated by accumulating conditions from left to right. It can be run simply like:
# pseudo
from functools import reduce

import pandas as pd

df = pd.DataFrame([4, 4, 5])
checks = [gt(5), lt(10), lambda x: 4]

def do_something(df, condition: bool):  # generic func to test consistency
    return df[condition]

def perform_checks(check1, check2):
    try:
        do_something(df, (check1) and (check2))
        return (check1) & (check2)
    except Exception as e:
        raise InconsistentCheckException(f'"{check2}" check is incompatible with the previous ones!')

reduce(perform_checks, checks)

I'm a little worried about freely using logical operators at instantiation, since it could let users pass whatever data into columns, misunderstand the validation operation itself, and maybe add not-really-needed complexity to the goal. Please let me know your opinion about it.

cosmicBboy commented 3 years ago

fixed in https://github.com/pandera-dev/pandera/pull/344