unionai-oss / pandera

A light-weight, flexible, and expressive statistical data testing library
https://www.union.ai/pandera
MIT License
3.17k stars 298 forks source link

Add option to return dataframe with columns in order specified in DataFrameSchema. #1317

Open Smartitect opened 11 months ago

Smartitect commented 11 months ago

Problem

When handling writing Spark dataframes to datalake storage, the order of the columns in the dataframe is important. For example if a pipeline is appending parquet files in the lake, if the columns are not written in a consistent order this can lead to corrupt data on read. We can use Pandera to check when columns are not in the right order, but it would be really useful to also use Pandera to "coerce" the columns into the correct order during validation to avoid having to fail the pipeline altogether or add additional steps into the process to ensure the columns are in a consistent order.

Solution that I'd like

Having a further option on the DataFrameSchema ordered parameter as follows:

ordered (Union[bool, Literal[‘coerce’]]) – whether or not to validate the columns order. If set to ‘coerce’, the columns validated dataframe will be returned in the order specified in the schema.

Alternatives considered

Adding an extra step to pyspark.sql("SELECT ...") all columns in the correct order from the dataframe before writing to the lake. This can use a helper function to build the pyspark.sql("SELECT ...") statement based on the column name order as they appear in the Pandera DataFrameSchema to avoid duplication of schema definition. But it would be good to avoid this step!

Additional context

Currently writing blog about using Pandera on Azure Synapse and Microsoft Fabric. This is one of the features I spotted that would be very useful.

AndreaPiccione commented 8 months ago

This would be really useful. There are also some cases in which during inference one or more columns are missing. It would be great to make sure that if add_missing_columns=True and ordered="coerce", we first add the missing ones and then reorder them.

Samreay commented 8 months ago

Yeah this would be a massive QoL improvement. I'd love to delete the section of our code which does the manual column sorting and just allow pandera to enforce it

JeremyL-01 commented 4 months ago

This would be great. We always have to add this line after validation - would be nice to just have it baked into the schema definition. df = df[schema.columns.keys()]

cosmicBboy commented 4 months ago

A contribution would have my blessing! 🙏

So the strict_filter_columns parser uses the ordered=True kwarg to raise an error. Would just have to add the ordering functionality to that method. Open to ideas on how this would behave... would simply strict="<some_option>" make sense to add? Or could we just assume we want an ordered dataframe if strict=True|"filter" + ordered=True?

JeremyL-01 commented 4 months ago

@cosmicBboy I think the OP's idea of having ordered="coerce" sounded sensible to me. strict="filter" + ordered="coerce' effectively is df = df[schema.columns.keys()]. Both of these options imply transformations: 1) filter columns to match schema, 2) coerce order of columns to match schema.

strict=False + ordered="coerce" I imagine would just order the columns in the schema, pushing the rest of the columns to the end?

cosmicBboy commented 4 months ago

I kinda don't want to re-use the term "coerce" here. coerce=True is already used to convert datatypes. Is there another verb we can use?

JeremyL-01 commented 4 months ago

@cosmicBboy... makes sense.... here are some other ideas: ordered="reorder", ordered="match", ordered="arrange", ordered="sync", ordered="conform", ordered="align"

cosmicBboy commented 4 months ago

thoughts on the proposals ^^ @Smartitect @AndreaPiccione @Samreay ?

Samreay commented 4 months ago

Any one of those synonyms makes sense to me, though in general my preference as a user would be to limit str-like kwargs going in. Type hinting and IDE completion are easier when you don't have to look up valid string inputs, so I'd also be happy with a new kwarg reorder_columns=True just like you have add_missing_columns

This also has the benefit of not requiring any changing to the simple type hint and backward compatibility of the ordered kwarg.

AndreaPiccione commented 4 months ago

I agree with @Samreay. Having a str-like kwarg would be intuitive, but would be another string to remember when creating the schema.

cosmicBboy commented 4 months ago

+1 to reorder_columns=True.

On a related note, I'd love if strict="filter" would be deprecated in favor of remove_extra_columns=True

shaoyucheng commented 2 months ago

I need this function too, hope it will be completed soon.

ahsin-s commented 1 month ago

any update on this functionality? I recently came across the need for it and it would be a big help.

cosmicBboy commented 1 month ago

I don't believe anyone's working on this. Anyone who wants to make a PR for the reorder_columns=True solution has my blessing!