kaiko-ai / typedspark

Column-wise type annotations for pyspark DataFrames
Apache License 2.0

Improve the error message when a column is missing from the schema #216

Closed by nanne-aben 11 months ago

nanne-aben commented 12 months ago

The message now reads something like:

Data contains the following columns not present in schema B: {'c', 'd', 'a'}.

If you believe these columns should be part of the schema, consider adding the following lines to it.

class B(Schema):
    c: Column[ArrayType[StringType]]
    d: Column[StructType[D]]
    a: Column[IntegerType]

class D(Schema):
    e: Column[IntegerType]

This is particularly useful in automated pipelines: when new columns are added to the underlying data, the error message shows exactly which lines need to be added to the schema.
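To illustrate the idea, here is a minimal pure-Python sketch of how a message like this could be assembled. It is not typedspark's actual implementation: the function `missing_columns_message` and its parameters are hypothetical, and column types are passed in as plain strings rather than derived from a real DataFrame. Unlike the message above, it does not generate nested schemas such as `D`.

```python
from typing import Dict, Set


def missing_columns_message(
    schema_name: str,
    schema_columns: Set[str],
    data_columns: Dict[str, str],
) -> str:
    """Build a message listing data columns that are absent from the schema.

    Hypothetical helper for illustration only. `data_columns` maps each
    column name found in the data to a type annotation string, for example
    {"a": "IntegerType"}.
    """
    # Columns present in the data but not in the schema, sorted for stable output.
    extra = sorted(set(data_columns) - schema_columns)
    if not extra:
        return ""

    # One suggested annotation line per missing column.
    suggestions = "\n".join(
        f"    {name}: Column[{data_columns[name]}]" for name in extra
    )
    return (
        f"Data contains the following columns not present in schema "
        f"{schema_name}: {extra}.\n\n"
        "If you believe these columns should be part of the schema, "
        "consider adding the following lines to it.\n\n"
        f"class {schema_name}(Schema):\n{suggestions}"
    )


message = missing_columns_message(
    schema_name="B",
    schema_columns={"b"},
    data_columns={
        "a": "IntegerType",
        "b": "StringType",
        "c": "ArrayType[StringType]",
    },
)
print(message)
```

In a pipeline, a message like this could be logged or raised as an exception, so that whoever maintains the schema can copy the suggested lines directly.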