kevinleahy-switchdin opened 1 month ago
Hi @kevinleahy-switchdin !
Hey - really like this library! Unfortunately we have migrated to polars, so I've been looking at implementing something similar for my use-case.
I'm happy you like the library so much that you're planning to implement something similar for Polars!
the docs discourage the use of __class_getitem__ outside of type hinting
Yeah, that's true, I agree it's not best practice. I also don't know how to do it otherwise. I decided that it was acceptable as long as I'd write an extensive set of unit tests that is run frequently (thanks to Dependabot). I just have to make sure to add new Python versions to the CI when they are released, and keep an eye out for errors.
It feels weird setting a generic type (specific to an instance) at the class level (even though it later gets reset to the default value)
Agree. We actually had a different implementation before, but then we ran into this issue, so we changed it to the version you see now. Linking both here, because you might find them interesting.
Related to the above, I get pyright complaining about "if DataSet._schema_annotations is None:": Access to generic instance variable through class is ambiguous
Hmmm, we might need to add a # type: ignore there, although I'd need to think a little bit more about what's going on there.
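For context, here is a stripped-down sketch of the pattern in question (not the actual strictly_typed_pandas code; the class body is simplified): the attribute is annotated with the class's own TypeVar, so pyright flags access through the class itself, and a targeted suppression quiets it.

from typing import Generic, Optional, Type, TypeVar

T = TypeVar("T")


class DataSet(Generic[T]):
    # Declared on the class but parameterised per instance, so pyright considers
    # class-level access ambiguous.
    _schema_annotations: Optional[Type[T]] = None

    def __init__(self) -> None:
        if DataSet._schema_annotations is None:  # type: ignore
            # fall back to "no schema" behaviour
            pass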
I actually also developed another package that is very similar, but for pyspark, called typedspark. Over there, we do check with pyright (I'm actually a bit surprised we don't do that for strictly_typed_pandas, but that's for another time). I remember we had this issue at some point as well, linking it here in case it's relevant.
Lemme think a bit on it.
using DataSet[Schema] as a type hint anywhere in your code sets the _schema_annotations column to be Schema, e.g. see below code snippet (maybe this isn't an issue since it gets reset every time a new class is initialised, and that's the only context that attribute is being used)
Riiiight, that's not good. I remember thinking about these kinds of cases and trying to specify a test, but it seems I missed your case. Need to think a bit on that too.
Of note, I see the test defined in typedspark is slightly different.
Off topic: typedspark is a little bit more extensive than strictly_typed_pandas, so it might be a good source of inspiration if you're planning to make something for Polars!
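To make that last concern concrete, here is a contrived reproduction (not the real library code; the reset-in-__init__ step is deliberately omitted) of how a __class_getitem__ that writes straight onto the class leaks state through a mere type hint:

class Schema: ...


class SchemaA(Schema): ...


class DataSet:
    _schema_annotations = None

    def __class_getitem__(cls, item):
        cls._schema_annotations = item  # written onto the class itself
        return cls


def func(df: DataSet[SchemaA]) -> None:  # evaluating the annotation calls __class_getitem__
    ...


print(DataSet._schema_annotations)  # <class '__main__.SchemaA'> -- leaked by the annotation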
Alright, I looked into it a bit further. The problem you describe actually doesn't seem to happen in typedspark:
from pyspark.sql.types import BooleanType, DoubleType, IntegerType, StringType
from pyspark.sql import SparkSession
from typedspark import DataSet, Column, Schema, create_partially_filled_dataset
class SchemaA(Schema):
    column_a: Column[IntegerType]
    column_b: Column[StringType]


class SchemaB(Schema):
    column_c: Column[DoubleType]
    column_d: Column[BooleanType]


spark = SparkSession.Builder().getOrCreate()

df_a = create_partially_filled_dataset(
    spark,
    SchemaA,
    {
        SchemaA.column_a: [1, 2, 3],
        SchemaA.column_b: ["A", "B", "C"],
    },
)

print("before func def is", df_a._schema_annotations.get_schema_name())


def function_for_df_b(df: DataSet[SchemaB]):
    print(df._schema_annotations)


print("after func def is", df_a._schema_annotations.get_schema_name())

# before func def is SchemaA
# after func def is SchemaA
This is probably because in typedspark we actually subclass the schema when we create a new class instance (source):
def __class_getitem__(cls, item):
    """Allows us to define a schema for the ``DataSet``.

    To make sure that the DataSet._schema_annotations variable isn't reused globally, we
    generate a subclass of the ``DataSet`` with the schema annotations as a class variable.
    """
    subclass_name = f"{cls.__name__}[{item.__name__}]"
    subclass = type(subclass_name, (cls,), {"_schema_annotations": item})
    return subclass
It's a bit convoluted, but it makes sure that the original DataSet._schema_annotations is never overwritten.
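As a standalone illustration of why that works (the class names here are made up, not typedspark's internals), the parametrised subclass carries the schema while the base class stays untouched:

class Schema: ...


class SchemaA(Schema): ...


class DataSet:
    _schema_annotations = None

    def __class_getitem__(cls, item):
        subclass_name = f"{cls.__name__}[{item.__name__}]"
        return type(subclass_name, (cls,), {"_schema_annotations": item})


parametrised = DataSet[SchemaA]
print(parametrised.__name__)             # DataSet[SchemaA]
print(parametrised._schema_annotations)  # <class '__main__.SchemaA'>
print(DataSet._schema_annotations)       # None -- the base class is never touched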
In typedspark this is specifically important, because we actually expose _schema_annotations through the typedspark_schema property.
@property
def typedspark_schema(self) -> Type[_Implementation]:
    """Returns the ``Schema`` of the ``DataSet``."""
    return self._schema_annotations
In strictly_typed_pandas, we don't expose _schema_annotations anywhere; it's only used when initialising the class. So you could argue that it's just a private attribute that people should only touch at their own risk. That said, I think we should probably also implement this subclass behaviour here.
Hope this is informative. Lemme know if you have any other questions!
Hey @nanne-aben - really appreciate the detailed reply. Apologies for delayed response - got lost in the sea of github notifications.
Ah - the typedspark approach is a clever way of getting around that problem! I agree it's convoluted, but, as you said, there doesn't seem to be another way to do it. It's a pity there isn't a better way of doing this in Python at the moment - from googling, it seems to be something that has popped up elsewhere as well.
I had started digging into the typedspark code and sketching out a few ideas. As a stop-gap I settled on the following approach, where we create different subclasses for differently schema'd dataframes. If I can get it working smoothly as a parameterised Generic using the typedspark approach, I'll let you know!
from typing import ClassVar
import polars as pl
from polars._typing import SchemaDefinition
class _TypedDataFrameBase(pl.DataFrame):
    _expected_schema: ClassVar[SchemaDefinition]

    def __init__(self, *args, **kwargs):
        if (len(args) > 1 and args[1] is not None) or ("schema" in kwargs and kwargs["schema"] is not None):
            raise ValueError("schema must not be passed as an arg - default value set at class level")
        kwargs["schema"] = self._expected_schema
        super().__init__(*args, **kwargs)

    # TODO: override _from_pydf to make sure we always run the validation checks

    @classmethod
    def deserialize(cls, *args, **kwargs):
        raise NotImplementedError("This method is not supported for TypedDataFrame objects")

    @classmethod
    def _from_pandas(cls, *args, **kwargs):
        raise NotImplementedError("This method is not supported for TypedDataFrame objects")

    @classmethod
    def _from_arrow(cls, *args, **kwargs):
        raise NotImplementedError("This method is not supported for TypedDataFrame objects")


schema_a = pl.Schema({"a_1": pl.Int64(), "a_2": pl.Float64()})


class TypedDataFrameA(_TypedDataFrameBase):
    _expected_schema: ClassVar[SchemaDefinition] = schema_a
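A quick usage sketch of the stop-gap classes above (the column values here are invented to match schema_a, they're not from the original post):

# Constructing with plain data uses the class-level schema.
df = TypedDataFrameA({"a_1": [1, 2, 3], "a_2": [0.1, 0.2, 0.3]})
print(dict(df.schema))  # {'a_1': Int64, 'a_2': Float64}

# Passing a schema explicitly is rejected, because the class pins it via _expected_schema.
try:
    TypedDataFrameA({"a_1": [1], "a_2": [0.1]}, schema={"a_1": pl.Int64()})
except ValueError as exc:
    print(exc)  # schema must not be passed as an arg - default value set at class level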
Cool, please keep me updated, curious to know more!
Hey @nanne-aben, thought I'd share an update! We decided to avoid the approach of hooking into __class_getitem__, and instead use a purely generic-based approach, passing the schema object itself. The bones of it are as follows:
from typing import Generic, TypeVar

import attrs
import polars as pl

DataTypeT = TypeVar("DataTypeT", bound=pl.DataType)
TypedFrameSchemaT = TypeVar("TypedFrameSchemaT", bound="TypedFrameSchema")


# ColumnSpecs are essentially dataclasses
@attrs.frozen
class ColumnSpec(Generic[DataTypeT]):
    """Used in TypedFrameSchema to describe datatypes of columns. Child classes may contain
    other properties such as timeseries aggregation properties."""

    dtype: DataTypeT


class TypedFrameSchema:
    """
    TypedFrame schema definitions must inherit from this and have class attributes mapping
    column names to ColumnSpec instances.

    Example:
        >>> class Person(TypedFrameSchema):
        ...     name: ColumnSpec[pl.String] = ColumnSpec(pl.String())
    """

    @classmethod
    def polars_schema(cls) -> pl.Schema:
        schema_dict = {key: val.dtype for key, val in cls._get_classvar_val_mapping().items()}
        return pl.Schema(schema_dict)

    # I've left out a bunch of methods here for validating the schema itself. We hand over the
    # runtime validation against the dataframe names/types to polars. ColumnSpec subclasses can
    # have additional metadata which can also be used in runtime validation
    # (e.g. timeseries-specific stuff).


class TypedFrame(pl.DataFrame, Generic[TypedFrameSchemaT]):
    """
    A polars.DataFrame subclass that enforces a schema defined by a TypedFrameSchema subclass.

    This allows both static type checking (e.g. in function annotations) and runtime validation
    of the underlying data. The runtime validation includes:

    * Checking that the column names and dtypes match the schema in the same way that polars
      does (i.e. attempting to cast to the desired type if needed)
    * Checking that the schema itself is a valid schema with polars dtypes

    Example:
        >>> class Person(TypedFrameSchema):
        ...     name: ColumnSpec[pl.String] = ColumnSpec(pl.String())
        >>> df = pl.DataFrame({"name": ["Alice", "Bob"]})
        >>> person_tdf = TypedFrame(df, Person)

        Pyright correctly infers that `person_tdf` is a TypedFrame[Person].

        >>> def show_people(person_tdf: TypedFrame[Person]) -> None:
        ...     print(person_tdf)
        >>> show_people(person_tdf)

        Shows no type errors.

        >>> class PersonWithAge(Person):
        ...     age: ColumnSpec[pl.Int64] = ColumnSpec(pl.Int64())
        >>> person_w_age_tdf = TypedFrame(df, PersonWithAge)
        KeyError: 'age'
        >>> person_w_age_tdf = TypedFrame(pl.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}), PersonWithAge)
        >>> show_people(person_w_age_tdf)
            ^^^^^^^^^^^^^^^^
        Pyright: Argument of type "TypedFrame[PersonWithAge]" cannot be assigned to parameter
        "person_tdf" of type "TypedFrame[Person]" in function "show_people"
    """

    def __init__(self, df: pl.DataFrame, tf_schema: type[TypedFrameSchemaT]) -> None:
        tf_schema.validate_schema(df)
        self._tf_schema = tf_schema
        # pl.DataFrame.__init__ has an unknown type in its arguments according to pyright
        # (pl._typing.FrameInitTypes)
        super().__init__(data=df, schema=tf_schema.polars_schema())  # type: ignore
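For anyone reading along: the snippet above refers to validate_schema and _get_classvar_val_mapping without showing them. Purely as a guess at their shape (not the author's actual implementation, which presumably also validates the dtypes themselves), they might look something like this, sketched as a hypothetical subclass of the TypedFrameSchema defined above:

class _TypedFrameSchemaSketch(TypedFrameSchema):
    """Hypothetical fill-ins for the two methods elided above; in the real code they would
    sit on TypedFrameSchema itself."""

    @classmethod
    def _get_classvar_val_mapping(cls) -> dict[str, ColumnSpec]:
        # Collect the ColumnSpec class attributes, walking the MRO so that subclasses such as
        # PersonWithAge also pick up Person's columns.
        mapping: dict[str, ColumnSpec] = {}
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, ColumnSpec):
                    mapping[name] = value
        return mapping

    @classmethod
    def validate_schema(cls, df: pl.DataFrame) -> None:
        # Fail fast if a declared column is missing; dtype coercion itself is left to polars
        # when TypedFrame.__init__ rebuilds the frame against polars_schema().
        missing = [name for name in cls._get_classvar_val_mapping() if name not in df.columns]
        if missing:
            raise KeyError(missing[0])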
Hey - really like this library! Unfortunately we have migrated to polars, so I've been looking at implementing something similar for my use-case.

I had a couple of queries around this code snippet in DataSet. To summarise:

- _schema_annotations (the generic instance type) is set at class level in __class_getitem__
- __init__ then uses it and resets it to the default value

My queries/concerns are:

- the docs discourage the use of __class_getitem__ outside of type hinting
- It feels weird setting a generic type (specific to an instance) at the class level (even though it later gets reset to the default value)
- Related to the above, I get pyright complaining about "if DataSet._schema_annotations is None:": Access to generic instance variable through class is ambiguous
- using DataSet[Schema] as a type hint anywhere in your code sets the _schema_annotations column to be Schema, e.g. see below code snippet (maybe this isn't an issue since it gets reset every time a new class is initialised, and that's the only context that attribute is being used)

I'm curious to hear your thought processes around this design @nanne-aben? I don't necessarily disagree with it - I've tried to find an alternative method and I think the way you've done it is probably the cleanest possible way to do it. But curious all the same to hear if you had those same concerns or if I'm overthinking it a bit!

Cheers!