Schema Versioning? - Githubissues

unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library

https://www.union.ai/pandera

MIT License

3.37k stars 310 forks source link

Schema Versioning? #406

Closed Voyz closed 3 years ago

Voyz commented 3 years ago

Does Pandera support Schema Versioning?

I'd like to have a way to reliably version the schema using Pandera and I couldn't find anything in the documentation. I'd imagine this would look along these lines:

#### THIS IS NOT VALID PANDERA CODE
schema = pa.DataFrameSchema({
    "1.0": {
        "column1": pa.Column(int, checks=pa.Check.le(10)),
        "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
        "column3": pa.Column(str, checks=[
            pa.Check.str_startswith("value_"),
            pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
        ]),
    },
    "2.0": {
        "column1": pa.Column(int, checks=pa.Check.le(5)),
        "column2": pa.Column(float, checks=pa.Check.lt(-5.5)),
        "column3": pa.Column(str, checks=[
            pa.Check.str_startswith("key_"),
            pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 3)
        ]),
    }
})

#### THIS IS NOT VALID PANDERA CODE
schema = pa.VersionedSchema({
    "1.0": pa.DataFrameSchema({
        "column1": pa.Column(int, checks=pa.Check.le(10)),
        "column2": pa.Column(float, checks=pa.Check.lt(-1.2)),
        "column3": pa.Column(str, checks=[
            pa.Check.str_startswith("value_"),
            pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 2)
        ]),
    }),
    "2.0": pa.DataFrameSchema({
        "column1": pa.Column(int, checks=pa.Check.le(5)),
        "column2": pa.Column(float, checks=pa.Check.lt(-5.5)),
        "column3": pa.Column(str, checks=[
            pa.Check.str_startswith("key_"),
            pa.Check(lambda s: s.str.split("_", expand=True).shape[1] == 3)
        ]),
    })
})

Then the validation along the lines of:

validated_df = schema(df, '1.0')

I know this could be achieved by storing various schemas in a dictionary and writing the versioning logic myself - I merely wonder whether there is already such a functionality in place or would you be up for extending Pandera in such a way. Thanks, Voy

abyz0123 commented 3 years ago

To my knowledge, this is not a feature. There was some talk about adding a version meta feature #331, though I'm not sure where that stands.

Fwiw, I use the name attribute for such cases where I want to have closely-related schema versions, for example at different points of a data pipeline where features may have evolved.

import pandera as pa

schema0 = pa.DataFrameSchema(
    {'col1':pa.Column(pa.String)},
    name = 'v0'
)
schema1 = schema0.update_columns({'col1':{'pandas_dtype':pa.Float}})
schema1.name = 'v1'

But yeah, you still need to persist these on your own.

cosmicBboy commented 3 years ago

thanks for this question @Voyz, and that is a good suggestion @ktroutman!

I think before digging into this particular solution I'd like to understand the use case/problem statement implied by your proposal a little better.

So far, the way I personally use pandera, I let the versioning system (i.e. git) handle versioning of a schema, so whenever I update my schema definition those changes are tracked by git and I generally don't need to maintain both the old and new versions simultaneously. One thing that would help me understand your use case @Voyz is if you can describe in a little more detail the why behind your proposed solution.

As @ktroutman suggested, dataframe schemas can be updated, so one idea to make the proposal a little more terse would be to add an update_name method to make the UX a little slicker:

import pandera as pa

schema0 = pa.DataFrameSchema({'col1': pa.Column(float, checks=pa.Check.lt(-1.2))}, name='v0')
schema1 = schema0.update_columns({'col1': {'checks': pa.Check.lt(-5.5)}}).update_name("v1")

Another approach would be:

import pandera as pa

schemas = {"v0":  pa.DataFrameSchema({'col1': pa.Column(float, checks=pa.Check.lt(-1.2))})}
schemas["v1"] = schemas["v0"].update_columns({'col1': {'checks': pa.Check.lt(-5.5)}})

There are probably other ways of implementing a more native-feeling interface for this, for example versioned updates implemented as a version kwarg for each schema method that created updated copies:

# sketch code, not currently supported by pandera
schema = pa.DataFrameSchema({'col1': pa.Column(float, checks=pa.Check.lt(-1.2))}, version="v0")
# updating the schema keeps track of older versions
schema = schema.update_columns({'col1': {'checks': pa.Check.lt(-5.5)}}, version="v1")

schema.validate(old_df, version="v0")
schema.validate(new_df, version="v1")

But I'm not really sure yet whether this syntactic sugar beats assigning versions to variables or maintaining the versions in a dictionary.

Voyz commented 3 years ago

Thanks for great answers @ktroutman and @cosmicBboy 😊

describe in a little more detail the why behind your proposed solution.

Sure thing, sorry for not being more specific before. I'm trying to tackle issue of outdated data, caused by either changes to schema definition, changing data sources or changes in underlying framework. Similarly to how the docker-compose file contains a version number that corresponds to a particular version of Docker, I'm considering that it would be useful to have a versioned schema that would correspond to a particular version of data parsing in my system. Therefore, should my schema definitions change in the future, I'd still have a reliable way of distinguishing which schema data belongs to and of ensuring that this data gets validated correctly.

While Git versioning is is a partial solution to this problem, it wouldn't allow multiple versions of a schema to live at the same time, while I would find it useful to say something like:

df = pd.read_csv('my_data') # read all the data

### Separate by schema version
old_df = df[df.version < 2]
new_df = df[df.version >= 2]

### Ensure the data is correct
old_df = schema(old_df, '1.0')
new_df = schema(new_df, '2.0')

### Process the data
# Since I validated the correct version, as long as I  keep 
# the processors I can be sure the data will get processed correctly
old_processed = my_data_processor(old_df, '1.0')
new_processed = my_data_processor(new_df, '2.0')

### Merge the data, ready to be used
my_data = pd.concat([old_processed, new_processed])

For instance, imagine that the old data - version 1.0 - doesn't contain the field parent, as it was only added in version 2.0. The my_data_processor version 1.0 can anticipate this and add that field with a default value (or infer this value from some other columns contained in the old data). The resulting my_data will have the field parent for all versions of data read from the csv file. By the end of it I have a reliable way of ensuring that all data can be brought to a usable form, no matter when it was collected and stored.

This old data appending/migration can and eventually should be written back to the database (or the csv file, or whatever). However, I believe that by keeping versioned schemas I should be able to introduce more reliability and flexibility to the system, while making data less dependant on the necessity to constantly migrate upon schema changes.

If you see any flaws in this logic though, please point them out as I'd be happy to verify these assumptions.

Naturally, the second approach you suggest is one way to go about it. I merely wanted to inquiry whether such functionality is or could be integrated into Pandera out of the box - similarly to your syntactic sugar suggestions - as to increase its reliability and make it available to other users. I understand that this may not necessarily go with the direction you may have for the library, so like I said - this is just to figure out how and if you'd see this working with Pandera.

Thanks!

cosmicBboy commented 3 years ago

Thanks for the detailed explanation @Voyz, the problem statement is much clearer now!

I do have a few questions about the overall setup and assumptions of the example code you provided.

df = pd.read_csv('my_data') # read all the data

### Separate by schema version
old_df = df[df.version < 2]
new_df = df[df.version >= 2]

The implication here is that there's a version field in the csv where each row contains values like 1, 2, 3, etc? I've never really encountered this way of versioning data before: what's the rationale behind storing all versions of data in a single file?. My mental model is that dataset versions are applied to datasets as a whole, so e.g. I'd have separate files "my_data_v1", "my_data_v2" or in a directory structure "v1/my_data" and "v2/my_data". This allows for things like hashing the dataset e.g. md5 hash to make sure it hasn't been corrupted.

### Ensure the data is correct
old_df = schema(old_df, '1.0')
new_df = schema(new_df, '2.0')

I think you should feel free to extend pandera with the syntanctic sugar you need for your use case! I'd encourage you to post your solution up in the Discussion section as a resource for other pandera users who might have the same use case, and it'd also provide valuable data to better judge if this feature should be supported in pandera out of the box.

I'm not yet entirely convinced that this syntax adds too much value compared to the solutions that @ktroutman and I suggested above, although I do think "versioned updates" to schemas might be promising.

# sketch code, not currently supported by pandera
schema = pa.DataFrameSchema({'column': pa.Column(...)}, version="0.0")
schema = schema.add_column({'parent': pa.Column(...)}, version="1.0")

schema(old_df, version="0.0")
schema(new_df, version="1.0")

With the class-based API, versions would be encoded in the class definition itself:

import pandera as pa
from pandera.typing import DataFrame, Series

class SchemaV1(pa.SchemaModel):
    column: Series = pa.Field(...)

class SchemaV2(SchemaV1):
    parent: Series = pa.Field(...)

SchemaV1.validate(old_df)
SchemaV2.validate(new_df)

# or in a function:
@pa.check_types
def process_data(old_df: DataFrame[SchemaV1], new_df: DataFrame[SchemaV2]):
    ### Merge the data, ready to be used
    return pd.concat([
        my_data_processor(old_df, '1.0'),
        my_data_processor(new_df, '2.0')
    ])

Voyz commented 3 years ago

My mental model is that dataset versions are applied to datasets as a whole, so e.g. I'd have separate files "my_data_v1", "my_data_v2" or in a directory structure "v1/my_data" and "v2/my_data".

Sure - either way the data gets versioned.

I think you should feel free to extend pandera with the syntanctic sugar you need for your use case!

Cool, thanks for the encouragement!

I'm not yet entirely convinced that this syntax adds too much value compared to the solutions that @ktroutman and I suggested above

Yupp! I'm not sure either! I wanted to start this issue to figure out if this is already possible or would make sense at all - not that I'm endorsing this addition.

The reasoning for why built-in schema versions would be advantageous over a dict-based non-native support, is that from OOP perspective the schema version is a unique property of the schema, therefore maintaining it outside of the schema object would be counter-intuitive.

But then it really needs to serve its purpose in the Pandera environment so I agree that discussing it further first would be the first good step.

Thanks! 👋

jeffzi commented 3 years ago

My mental model is that dataset versions are applied to datasets as a whole

I'm working a lot with data coming from mobile applications, sent to our endpoints via json events. We do have a version field. Sometimes, the new version of the data schema is tied to a new version of the app. Not all users will update right away, if ever. In that case, we will receive a mix of events on multiple version. The endpoint leverages the version field to know which schema it should validate against. Down the road, the database has an extended schema. We mostly add/remove columns. Removed columns can be left empty.

The version field is used to look up the correct schema in a dictionary or elsewhere.

@Voyz I'd be interested to hear about an alternative solution if you find one in the future !

cosmicBboy commented 3 years ago

Based on the discussion so far, it feels like the term "version" is overloaded, so I'll take a crack at disambiguating.

From https://github.com/pandera-dev/pandera/issues/406#issuecomment-777251536

Similarly to how the docker-compose file contains a version number that corresponds to a particular version of Docker, I'm considering that it would be useful to have a versioned schema that would correspond to a particular version of data parsing in my system

I think it may be helpful to distinguish between software version and data version. The pandera yaml schema does have a notion of pandera version, but it's not really a first-class citizen in the python schema definitions.

Since version is an actual field in the data, I wonder if it should be treated like any other field in the schema. This would be another approach to the problem: basically encoding assumptions of different data versions in the same schema.

schema = pa.DataFrameSchema(
    {
        "version": pa.Column(str),
        "field": pa.Column(int, nullable=True),
    }
)

I think pandera could improve the way it handles conditional checks, such that different checks apply to different rows based on the value of version. Currently you'd have to use the groupby check keyword arg and even then it's a little clunky for sure:

schema = pa.DataFrameSchema(
    {
        "version": pa.Column(str),
        "field": pa.Column(
            int,
            nullable=True,
            checks=[
                pa.Check(lambda groups: groups["1.0"].isna().all(), groupby="version", name="version 1.0 check"),
                pa.Check(lambda groups: groups["2.0"].notna().all(), groupby="version", name="version 2.0 check"),
            ]
        ),
    }
)

Voyz commented 3 years ago

@cosmicBboy that's some great progress there, thanks for tackling this!

I think it may be helpful to distinguish between software version and data version.

Naturally, you're right. In a sense though - in the reasoning I brought up each data version corresponds to a particular my_data_processor, so in a sense some version of the software. These are naturally disjoint from an actual version of the project itself, but it seems to me that there's both some overlap and some discrepancy. In general though I agree with your point here and I think its a good idea you brought it up.

The pandera yaml schema does have a notion of pandera version, but it's not really a first-class citizen in the python schema definitions.

Yes! I did notice that, but indeed couldn't see any correlation with the schemas itself.

Since version is an actual field in the data, I wonder if it should be treated like any other field in the schema

Interesting idea! Does this assumption make the implementation easier? On the other hand, I'm considering that just as much as in the example I brought up (and one @jeffzi kindly outlined - thanks ❤️) the version is a field, I can recognise that the mental model you proposed @cosmicBboy would be used instead frequently too. I would see a benefit of keeping the schema version not as a field, as to not enforce the model in which version is a field in each data entry. This is just me thinking out loud - would love to hear your thoughts on this point.

Currently you'd have to use the groupby check keyword arg and even then it's a little clunky for sure:

Could you expand on how this would be used? Would it split the data somehow upon calling the schema?

cosmicBboy commented 3 years ago

closing this discussion for now, @Voyz feel free to re-open if you had additional questions/ideas about this