Open Smartitect opened 11 months ago
This would be really useful. There are also some cases in which during inference one or more columns are missing. It would be great to make sure that if add_missing_columns=True
and ordered="coerce"
, we first add the missing ones and then reorder them.
Yeah this would be a massive QoL improvement. I'd love to delete the section of our code which does the manual column sorting and just allow pandera to enforce it
This would be great. We always have to add this line after validation - would be nice to just have it baked into the schema definition.
df = df[schema.columns.keys()]
A contribution would have my blessing! 🙏
So the strict_filter_columns
parser uses the ordered=True
kwarg to raise an error. Would just have to add the ordering functionality to that method. Open to ideas on how this would behave... would simply strict="<some_option>"
make sense to add? Or could we just assume we want an ordered dataframe if strict=True|"filter"
+ ordered=True
?
@cosmicBboy I think the OP's idea of having ordered="coerce"
sounded sensible to me.
strict="filter"
+ ordered="coerce'
effectively is df = df[schema.columns.keys()]
. Both of these options imply transformations: 1) filter columns to match schema, 2) coerce order of columns to match schema.
strict=False
+ ordered="coerce"
I imagine would just order the columns in the schema, pushing the rest of the columns to the end?
I kinda don't want to re-use the term "coerce" here. coerce=True
is already used to convert datatypes. Is there another verb we can use?
@cosmicBboy... makes sense.... here are some other ideas: ordered="reorder"
, ordered="match"
, ordered="arrange"
, ordered="sync"
, ordered="conform"
, ordered="align"
thoughts on the proposals ^^ @Smartitect @AndreaPiccione @Samreay ?
Any one of those synonyms makes sense to me, though in general my preference as a user would be to limit str-like kwargs going in. Type hinting and IDE completion are easier when you don't have to look up valid string inputs, so I'd also be happy with a new kwarg reorder_columns=True
just like you have add_missing_columns
This also has the benefit of not requiring any changing to the simple type hint and backward compatibility of the ordered
kwarg.
I agree with @Samreay. Having a str-like kwarg would be intuitive, but would be another string to remember when creating the schema.
+1 to reorder_columns=True
.
On a related note, I'd love if strict="filter"
would be deprecated in favor of remove_extra_columns=True
I need this function too, hope it will be completed soon.
any update on this functionality? I recently came across the need for it and it would be a big help.
I don't believe anyone's working on this. Anyone who wants to make a PR for the reorder_columns=True
solution has my blessing!
Problem
When handling writing Spark dataframes to datalake storage, the order of the columns in the dataframe is important. For example if a pipeline is appending parquet files in the lake, if the columns are not written in a consistent order this can lead to corrupt data on read. We can use Pandera to check when columns are not in the right order, but it would be really useful to also use Pandera to "coerce" the columns into the correct order during validation to avoid having to fail the pipeline altogether or add additional steps into the process to ensure the columns are in a consistent order.
Solution that I'd like
Having a further option on the
DataFrameSchema
ordered
parameter as follows:Alternatives considered
Adding an extra step to
pyspark.sql("SELECT ...")
all columns in the correct order from the dataframe before writing to the lake. This can use a helper function to build thepyspark.sql("SELECT ...")
statement based on the column name order as they appear in the PanderaDataFrameSchema
to avoid duplication of schema definition. But it would be good to avoid this step!Additional context
Currently writing blog about using Pandera on Azure Synapse and Microsoft Fabric. This is one of the features I spotted that would be very useful.