unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License

Moving columns to multiindex #287

Closed abyz0123 closed 3 years ago

abyz0123 commented 3 years ago

I would like to be able to do input and output validation at sensible points in the pipeline. However, small changes occur in the schema, such as the index.

What would be the proper use for the following case:

You have created an input schema to validate an incoming dataset:

pd.DataFrame({'ID': [0, 1, 2, 3], 'AMOUNT': [100, 200, 300, 400]}).to_csv('sample.csv', index=False)
schema = pa.DataFrameSchema(columns={'ID': pa.Column(pa.Int), 'AMOUNT': pa.Column(pa.Int)})
schema.validate(pd.read_csv('sample.csv'))

Later in the pipeline the ID column is assigned to be an index column (in actual use case using a multiindex).

Other than that, the schema is the same.

What would be the proper way to migrate the column from column to index, or vice versa? Is it to remove the column from columns and put into index?

schema = pa.DataFrameSchema(columns=schema.remove_columns('ID').columns, index=pa.Index(pa.Int, name='ID'))

Or is this capability in one of the methods that I haven't found? Is there something like a column_to_index parameter that I could pass to update_column?

cosmicBboy commented 3 years ago

thanks @ktroutman, I think having column-to-index and index-to-column methods is a great idea!

The API currently doesn't support an elegant way of doing this.

DataFrameSchema(columns=schema.remove_columns('ID').columns,index=pa.Index(pa.Int,name='ID'))

This is pretty much how it has to be done.

The implementation for column-to-index and index-to-column methods would be fairly straightforward, though there is potentially some subtlety in the interface, and I think this would also be a good opportunity to build some more basic schema-manipulation methods into the API.

A sketch of the implementation would look like:

import copy
from typing import List, Union

def set_index(self, column_name: Union[str, List[str]]):  # should this be able to stack onto the existing index to make a MultiIndex?
    column_names = [column_name] if isinstance(column_name, str) else column_name
    schema_copy = copy.deepcopy(self)
    schema_copy.index = [
        self.columns[col_name].to_index()  # assuming Column.to_index() exists
        for col_name in column_names
    ]
    return schema_copy.remove_columns(column_names)

def reset_index(self, index_name: Union[str, List[str]], ...):  # would drop=True be useful here?
    index_names = [index_name] if isinstance(index_name, str) else index_name
    schema_copy = copy.deepcopy(self)
    indexes = [self.index] if isinstance(self.index, pa.Index) else self.index.indexes
    new_columns = {
        index.name: index.to_column()  # assuming Index.to_column() exists
        for index in indexes
        if index.name in index_names
    }
    new_indexes = [index for index in indexes if index.name not in index_names]
    if not new_indexes:
        new_indexes = None
    schema_copy.columns = {**self.columns, **new_columns}
    schema_copy.index = new_indexes
    return schema_copy

Let me know if you have the appetite to contribute these new methods!

abyz0123 commented 3 years ago

thanks for the detailed response @cosmicBboy !!

As far as building in some functionality: in using the API, I find the distinction between the Column and Index objects tough to work with sometimes. Even if the reset_index or set_index methods were built in, you'd still be doubling up on some methods, such as update, between column and index.

My preference would be something like an index parameter on the Column object (similar to SQLAlchemy), which would then allow you to just use schema.columns.dtypes for read_csv functions, as well as update_columns to cover all changes to the existing schema, including moving columns to the index and vice versa.

If the Column/Index object distinction will persist, I'd be happy to try to find some time to put something together.

cosmicBboy commented 3 years ago

I think collapsing the Column and Index abstraction into one is an interesting idea! After thinking about it a little bit, I'd be inclined to maintain the distinction between these two things, as pandera's main design goal is to mirror pandas semantics to make it as intuitive as possible to pandas users.

The main reason is that in SQL-like systems, there's no conceptual difference between an index and a column: an index is just a column with special properties, like PRIMARY KEY, FOREIGN KEY, UNIQUE, etc.

On the other hand, indexes in pandas have distinct semantics (e.g. .loc, .iloc, resample on DatetimeIndex, etc) where the index is a potentially multi-valued key in a key-value table structure.

you'd still be doubling up on some methods such as update between column and index.

I think it's worth doubling up on methods for manipulating Columns and Indexes in the DataFrameSchema class, as long as they're named well, intuitive to use, and well-documented. Based on your needs it also sounds like a DataFrameSchema.update_index would also be useful.

Out of curiosity, I'd like to better understand your use case in the code snippet that you provided. Is there a particular reason the ID column isn't being read in as an index?

schema.validate(pd.read_csv('sample.csv', index_col="ID"))
jeffzi commented 3 years ago

If you know in advance, before runtime, which columns will be transformed into indices, you could consider the upcoming class-based API (pandera 0.5):

import pandera as pa
from pandera.typing import Series, Index

class A(pa.SchemaModel):
    ID: Series[int]
    AMOUNT: Series[int]

print(A.to_schema())
#> DataFrameSchema(
#>     columns={
#>         "ID": "<Schema Column: 'ID' type=<class 'int'>>",
#>         "AMOUNT": "<Schema Column: 'AMOUNT' type=<class 'int'>>"
#>     },
#>     checks=[],
#>     index=None,
#>     transformer=None,
#>     coerce=False,
#>     strict=False
#> )

class B(A):
    ID: Index[int]

print(B.to_schema())
#> DataFrameSchema(
#>     columns={
#>         "AMOUNT": "<Schema Column: 'AMOUNT' type=<class 'int'>>"
#>     },
#>     checks=[],
#>     index=<Schema Index>,
#>     transformer=None,
#>     coerce=False,
#>     strict=False
#> )

Created on 2020-10-20 by the reprexpy package

Behind the scenes, SchemaModel has the concept of FieldInfo that abstracts both Column and Index. It has to_column(), and to_index() methods.

Regarding pandas.read_csv, to build upon @cosmicBboy's answer, this should work:

dtype = dict(schema.dtype.items())

# It's a bit inconvenient that Index and MultiIndex have different interfaces.
try:  # multiindex
    indices = {idx.name: idx.dtype for idx in schema.index.indexes}
except AttributeError:  # single index (assume named index)
    indices = {schema.index.name: schema.index.dtype}

dtype.update(indices)

pd.read_csv(..., dtype=dtype, index_col=list(indices.keys()))
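To make the pattern above concrete without depending on a particular pandera version, here is a pandas-only illustration with the two dtype mappings written out by hand (the in-memory CSV and the `int64` dtypes are assumptions for the example):

```python
import io
import pandas as pd

# hand-written stand-ins for the schema-derived mappings above
column_dtypes = {"AMOUNT": "int64"}   # regular columns
index_dtypes = {"ID": "int64"}        # columns destined for the index

# merge both so read_csv parses every field with the expected dtype,
# then promote the index columns via index_col
dtype = {**column_dtypes, **index_dtypes}
df = pd.read_csv(
    io.StringIO("ID,AMOUNT\n0,100\n1,200\n"),
    dtype=dtype,
    index_col=list(index_dtypes),
)
```
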

pandera could have its own read_csv_with_schema() function that adds bells and whistles, like using read_csv()'s parse_dates argument. Not sure @cosmicBboy wants to go down that rabbit hole now, but I've built a similar function for work.

abyz0123 commented 3 years ago

Thanks for the clarification @cosmicBboy

My last point on the index thing is that, as far as I have used it, I have felt a need to abstract the MultiIndex/Index distinction away from the user-facing part of the API. So I wrote a class internally to do basically that, so that other users can just enter all columns (or loop through them) when creating the schema. The extra parameter, index, just builds the index in pandera one layer of abstraction down. For example, moving from a single index to a MultiIndex requires a bit of work with the pandera API, instead of just changing the index parameter and having the change from Index to MultiIndex happen by itself.

But I definitely see your point on the desirability of hewing closely to the pandas API. Plus I have encountered some sticky corners with the wrapper.

As far as I have used it, update, remove, and add methods for both the column and index objects would be desirable. Also (maybe a separate issue), a plural update_columns (similar to rename) would be good to have. Maybe the index needs extra parameters to ensure the order of a MultiIndex is as expected, with a sensible default (unless this is already covered and I didn't catch it).

On the read_csv syntax, I was just trying to simulate a change in the index. But yes, if you always ingest the data from CSV, it makes sense to specify the index on read.

cosmicBboy commented 3 years ago

@ktroutman these are good points, I think I'm coming around to the simpler API you suggest where all fields are specified at once. As @jeffzi mentioned, the class-based API coming out in the next release 0.5.0 would allow for a concise way of expressing this https://github.com/pandera-dev/pandera/issues/287#issuecomment-712680779.

I'm still not entirely sure if this should be supported, but I'm wondering if we can figure out a backwards-compatible (and non-confusing way) of supporting this for the object-based API:

schema = pa.DataFrameSchema(
    columns={
        "ID": Column(...),
        "AMOUNT": Column(...),
    }
)

schema_mod = schema.update_column("ID", index=True)

The simplest case is easy, but I'm wondering how to handle the MultiIndex case. Since pandera only supports Python 3.6+, dicts are ordered (an implementation detail in CPython 3.6, guaranteed from 3.7), which provides a natural way to handle it at initialization:

# multiindex with levels in the order specified: ["idx1", "idx2", "idx3"]
schema = pa.DataFrameSchema(
    columns={
        "col": Column(...),
        "idx1": Column(..., index=True),
        "idx2": Column(..., index=True),
        "idx3": Column(..., index=True),
    }
)
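A minimal sketch of how level order would be derived from insertion order, using a plain dict with a hypothetical index flag (the flag is not current pandera API):

```python
# dicts preserve insertion order, so MultiIndex level order can be read
# directly off the columns mapping; "index" here is a hypothetical flag
column_specs = {
    "col": {"index": False},
    "idx1": {"index": True},
    "idx2": {"index": True},
    "idx3": {"index": True},
}
index_levels = [name for name, spec in column_specs.items() if spec["index"]]
```
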

But the trick might be when updating a schema with an existing index. Does the new index replace the existing one, or does the new index stack into a MultiIndex?

schema = pa.DataFrameSchema(
    columns={
        "col": Column(...),
        "bar": Column(..., index=False),
        "foo": Column(..., index=True),
    }
)

schema_mod = schema.update_column("bar", index=True)
jeffzi commented 3 years ago

As far as I have used, an update, remove, add for both the column and index objects would be desirable

I agree.

Regarding MultiIndex, you could allow index=True/False in the constructor of a Column but not in DataFrameSchema.update(). Then a reset_index and set_index combination would be the only way to modify the composition of the index after the schema is initialized. set_index defines an order and solves your problem.

abyz0123 commented 3 years ago

I'm coming around to @jeffzi idea of pandas parallel methods of re/set_index to manage the indices. It gets us around this sticky issue of ordering when updating the index and it will be immediately obvious for any pandas user how it works.

So it might look like:

schema = pa.DataFrameSchema(
    columns={
        "col0": Column(pa.String),
        "col1": Column(pa.String),
        "col2": Column(pa.String)
    }
).set_index(["col0","col1"])
print(schema)

DataFrameSchema(
    columns={'col2': <Schema Column: 'col2' type=string>},
    index=MultiIndex(
        columns={
            "col0": "<Schema Column: 'col0' type=string>",
            "col1": "<Schema Column: 'col1' type=string>"
        },
        ...
    )
)
This also alleviates the awkwardness of naming your columns with the dictionary key for the Column object, but then, for a MultiIndex (which is just a list of Index objects), having to specify name. Also, I see this as a smaller break with the current API.
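For comparison, the pandas semantics this proposal mirrors: the order of keys passed to `set_index` fixes the level order, and `reset_index` moves the levels back out as leading columns.

```python
import pandas as pd

df = pd.DataFrame({
    "col0": ["a", "b"],
    "col1": ["x", "y"],
    "col2": [1, 2],
})
indexed = df.set_index(["col0", "col1"])  # key order fixes MultiIndex level order
restored = indexed.reset_index()          # levels return as the leading columns
```
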

cosmicBboy commented 3 years ago

cool! thanks for your feedback and comments @ktroutman @jeffzi, I think the first step to take on this would be to implement the re/set_index() methods and see how that feels, and then reassess whether we still want the Column(..., index=True) functionality. I only hesitate because of this reason:

you could allow index=True/False in the constructor of a Column but not in DataFrameSchema.update()

I think introducing these kinds of exceptions in method behavior creates an inconsistent user experience, but I wouldn't write it off if it gets us better usability. Another proposal would be to have a fields kwarg in the DataFrameSchema constructor where you can specify both Column and Index objects:

DataFrameSchema(
    fields={
        "col1": pa.Column(),
        "col2": pa.Column(),
        "col3": pa.Column(),
        "index1": pa.Index(),
        "index2": pa.Index(),
    }
)

Where the behavior for how the index (and potentially MultiIndex) is handled is similar to the way it's done with the class-based SchemaModel API. This would be an alternative way to construct dataframe schemas, where a SchemaInitError would be raised if fields were specified alongside either columns or index.
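A minimal sketch of how a single fields mapping could be partitioned, using stand-in Column/Index marker classes (hypothetical here, not the actual pandera constructors):

```python
# stand-in marker classes for illustration only
class Column:
    pass

class Index:
    pass

fields = {
    "col1": Column(),
    "col2": Column(),
    "col3": Column(),
    "index1": Index(),
    "index2": Index(),
}

# split the mapping: Column entries become `columns`, Index entries become
# the (Multi)Index levels, in insertion order
columns = {name: f for name, f in fields.items() if isinstance(f, Column)}
indexes = [f for name, f in fields.items() if isinstance(f, Index)]
```
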

@ktroutman let me know if you'd still like to make a contribution, for the re/set_index() methods! I'm working on #200, which is turning out to be fairly involved, so I won't be able to get to this for a few weeks at least.

abyz0123 commented 3 years ago

Sure thing -- I will try to put something together later this week.