mitchelllisle / sparkdantic

✨ A Pydantic to PySpark schema library
https://mitchelllisle.github.io/sparkdantic/
MIT License
57 stars 10 forks source link

mypy complains about spark_type kwarg for Field #534

Open boblannon opened 2 days ago

boblannon commented 2 days ago

Love sparkdantic, and continue to rely on it. Thank you for your work!

One minor annoyance is that mypy always yells at me for using the spark_type kwarg to be explicit about the native spark type that should be used. For example:

class Product(SparkModel):
    name: Field(description="the name of the product", spark_type=StringType)
    quantity: Field(description="number available", spark_type=LongType)

This raises two warnings, one for each property:

Unexpected keyword argument "spark_type" for "Field"
Unexpected keyword argument "spark_type" for "Field"

This is because, of course, pydantic.Field isn't expecting this kwarg. Currently it seems like sparkdantic relies on pydantic's behavior, which throws any unexpected kwarg into FieldInfo.json_extra_schema. https://github.com/mitchelllisle/sparkdantic/blob/083c4f32212da4de76725b5f6d321a6dcfa4d113/src/sparkdantic/model.py#L150

I'm not necessarily arguing against this choice, but I wonder if we could find a way that makes the type checker happy as well. Perhaps a subclass of pydantic.Field like pydantic.SparkField that handles sparkdantic-specific kwargs like this?

I'm happy to lend a hand in implementation, too.

boblannon commented 2 days ago

for my specific problems with spark_type, I find that I can satisfy mypy with a really thin shim:

edit: Turns out pydantic.Field is a function with a lot of complex overloading. had to change my shim to reflect this

from typing import Optional, Type
from pyspark.sql.types import DataType

def SparkField(*args, spark_type: Optional[Type[DataType]] = None, **kwargs) -> Any:
    if spark_type is not None:
        kwargs["spark_type"] = spark_type
    return Field(*args, **kwargs)
mitchelllisle commented 1 day ago

Hi @boblannon Thanks for the suggestion. This makes sense to me and does feel a bit clearer than using Field directly.

I'm happy to implement this suggestion but if you'd like to contribute it directly send me the PR and I'll happily cut a new release.

Thanks again for the suggestion!