kevinleahy-switchdin opened 1 month ago
Hi @kevinleahy-switchdin !
Hey - really like this library! Unfortunately we have migrated to polars, so I've been looking at implementing something similar for my use-case.
I'm happy you like the library so much that you're planning to implement something similar for Polars!
the docs discourage the use of __class_getitem__ outside of type hinting
Yeah, that's true, I agree it's not best practice. I also don't know how to do it otherwise. I decided that it was acceptable as long as I'd write an extensive set of unit tests that is run frequently (thanks to Dependabot). I just have to make sure to add new Python versions to the CI when they are released, and keep an eye out for errors.
It feels weird setting a generic type (specific to an instance) at the class level (even though it later gets reset to the default value)
Agree. We actually had a different implementation before, but then we ran into this issue, so we changed it to the version you see now. Linking both here, because you might find them interesting.
Related to the above, I get pyright complaining about "if DataSet._schema_annotations is None:": Access to generic instance variable through class is ambiguous
Hmmm, we might need to add a # type: ignore there, although I'd need to think a little bit more about what's going on there.
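For context, here is a stripped-down sketch of the pattern in question (not the actual strictly_typed_pandas code; the class body is simplified): the attribute is annotated with the class's own TypeVar, so pyright flags access through the class itself, and a targeted suppression quiets it.

from typing import Generic, Optional, Type, TypeVar

T = TypeVar("T")


class DataSet(Generic[T]):
    # Declared on the class but parameterised per instance, so pyright considers
    # class-level access ambiguous.
    _schema_annotations: Optional[Type[T]] = None

    def __init__(self) -> None:
        if DataSet._schema_annotations is None:  # type: ignore
            # fall back to "no schema" behaviour
            pass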
I actually also developed another package that is very similar, but for pyspark, called typedspark. Over there, we do check with pyright (I'm actually a bit surprised we don't do that for strictly_typed_pandas, but that's for another time). I remember we had this issue at some point as well, linking it here in case it's relevant.
Lemme think a bit on it.
using DataSet[Schema] as a type hint anywhere in your code sets the _schema_annotations column to be Schema, e.g. see below code snippet (maybe this isn't an issue since it gets reset every time a new class is initialised, and that's the only context that attribute is being used)
Riiiight, that's not good. I remember thinking about these kinds of cases and trying to specify a test, but it seems I missed your case. Need to think a bit on that too.
Of note, I see the test defined in typedspark is slightly different.
Off topic: typedspark is a little bit more extensive than strictly_typed_pandas, so it might be a good source of inspiration if you're planning to make something for Polars!
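To make that last concern concrete, here is a contrived reproduction (not the real library code; the reset-in-__init__ step is deliberately omitted) of how a __class_getitem__ that writes straight onto the class leaks state through a mere type hint:

class Schema: ...


class SchemaA(Schema): ...


class DataSet:
    _schema_annotations = None

    def __class_getitem__(cls, item):
        cls._schema_annotations = item  # written onto the class itself
        return cls


def func(df: DataSet[SchemaA]) -> None:  # evaluating the annotation calls __class_getitem__
    ...


print(DataSet._schema_annotations)  # <class '__main__.SchemaA'> -- leaked by the annotation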
Alright, I looked into it a bit further. The problem you describe actually doesn't seem to happen in typedspark:
from pyspark.sql.types import BooleanType, DoubleType, IntegerType, StringType
from pyspark.sql import SparkSession
from typedspark import DataSet, Column, Schema, create_partially_filled_dataset
class SchemaA(Schema):
    column_a: Column[IntegerType]
    column_b: Column[StringType]


class SchemaB(Schema):
    column_c: Column[DoubleType]
    column_d: Column[BooleanType]


spark = SparkSession.Builder().getOrCreate()

df_a = create_partially_filled_dataset(
    spark,
    SchemaA,
    {
        SchemaA.column_a: [1, 2, 3],
        SchemaA.column_b: ["A", "B", "C"],
    },
)

print("before func def is", df_a._schema_annotations.get_schema_name())


def function_for_df_b(df: DataSet[SchemaB]):
    print(df._schema_annotations)


print("after func def is", df_a._schema_annotations.get_schema_name())

# before func def is SchemaA
# after func def is SchemaA
This is probably because in typedspark we actually subclass the schema when we create a new class instance (source):
def __class_getitem__(cls, item):
    """Allows us to define a schema for the ``DataSet``.

    To make sure that the DataSet._schema_annotations variable isn't reused globally, we
    generate a subclass of the ``DataSet`` with the schema annotations as a class variable.
    """
    subclass_name = f"{cls.__name__}[{item.__name__}]"
    subclass = type(subclass_name, (cls,), {"_schema_annotations": item})
    return subclass
It's a bit convoluted, but it makes sure that the original DataSet._schema_annotations is never overwritten.
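As a standalone illustration of why that works (the class names here are made up, not typedspark's internals), the parametrised subclass carries the schema while the base class stays untouched:

class Schema: ...


class SchemaA(Schema): ...


class DataSet:
    _schema_annotations = None

    def __class_getitem__(cls, item):
        subclass_name = f"{cls.__name__}[{item.__name__}]"
        return type(subclass_name, (cls,), {"_schema_annotations": item})


parametrised = DataSet[SchemaA]
print(parametrised.__name__)             # DataSet[SchemaA]
print(parametrised._schema_annotations)  # <class '__main__.SchemaA'>
print(DataSet._schema_annotations)       # None -- the base class is never touched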
In typedspark this is specifically important, because we actually expose _schema_annotations through the typedspark_schema property.
@property
def typedspark_schema(self) -> Type[_Implementation]:
    """Returns the ``Schema`` of the ``DataSet``."""
    return self._schema_annotations
In strictly_typed_pandas, we don't expose _schema_annotations anywhere; it's only used when initialising the class. So you could argue that it's just a private attribute that people should only touch at their own risk. That said, I think we should probably also implement this subclass behaviour here.
Hope this is informative. Lemme know if you have any other questions!
Hey @nanne-aben - really appreciate the detailed reply. Apologies for delayed response - got lost in the sea of github notifications.
Ah - the typedspark approach is a clever way of getting around that problem! I agree it's convoluted, but, as you said, there doesn't seem to be another way to do it. It's a pity there isn't a better way of doing this in Python at the moment - from googling, it seems to be something that has popped up elsewhere as well.
I had started digging into the typedspark code and sketching out a few ideas. As a stop-gap I settled on the following approach, where we create different subclasses for differently schema'd dataframes. If I can get it working smoothly as a parameterised Generic using the typedspark approach, I'll let you know!
from typing import ClassVar
import polars as pl
from polars._typing import SchemaDefinition
class _TypedDataFrameBase(pl.DataFrame):
    _expected_schema: ClassVar[SchemaDefinition]

    def __init__(self, *args, **kwargs):
        if (len(args) > 1 and args[1] is not None) or ("schema" in kwargs and kwargs["schema"] is not None):
            raise ValueError("schema must not be passed as an arg - default value set at class level")
        kwargs["schema"] = self._expected_schema
        super().__init__(*args, **kwargs)

    # TODO: override _from_pydf to make sure we always run the validation checks

    @classmethod
    def deserialize(cls, *args, **kwargs):
        raise NotImplementedError("This method is not supported for TypedDataFrame objects")

    @classmethod
    def _from_pandas(cls, *args, **kwargs):
        raise NotImplementedError("This method is not supported for TypedDataFrame objects")

    @classmethod
    def _from_arrow(cls, *args, **kwargs):
        raise NotImplementedError("This method is not supported for TypedDataFrame objects")


schema_a = pl.Schema({"a_1": pl.Int64(), "a_2": pl.Float64()})


class TypedDataFrameA(_TypedDataFrameBase):
    _expected_schema: ClassVar[SchemaDefinition] = schema_a
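A quick usage sketch of the stop-gap classes above (the column values here are invented to match schema_a, they're not from the original post):

# Constructing with plain data uses the class-level schema.
df = TypedDataFrameA({"a_1": [1, 2, 3], "a_2": [0.1, 0.2, 0.3]})
print(dict(df.schema))  # {'a_1': Int64, 'a_2': Float64}

# Passing a schema explicitly is rejected, because the class pins it via _expected_schema.
try:
    TypedDataFrameA({"a_1": [1], "a_2": [0.1]}, schema={"a_1": pl.Int64()})
except ValueError as exc:
    print(exc)  # schema must not be passed as an arg - default value set at class level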
Cool, please keep me updated, curious to know more!
Hey @nanne-aben, thought I'd share an update! We decided to avoid the approach of hooking into __class_getitem__, and instead use a purely generic-based approach, passing the schema object itself. The bones of it are as follows:
from typing import Generic, TypeVar

import attrs
import polars as pl

DataTypeT = TypeVar("DataTypeT", bound=pl.DataType)
TypedFrameSchemaT = TypeVar("TypedFrameSchemaT", bound="TypedFrameSchema")


# ColumnSpecs are essentially dataclasses
@attrs.frozen
class ColumnSpec(Generic[DataTypeT]):
    """Used in TypedFrameSchema to describe datatypes of columns. Child classes may contain
    other properties such as timeseries aggregation properties."""

    dtype: DataTypeT


class TypedFrameSchema:
    """
    TypedFrame schema definitions must inherit from this and have class attributes mapping
    column names to ColumnSpec instances.

    Example:
        >>> class Person(TypedFrameSchema):
        ...     name: ColumnSpec[pl.String] = ColumnSpec(pl.String())
    """

    @classmethod
    def polars_schema(cls) -> pl.Schema:
        schema_dict = {key: val.dtype for key, val in cls._get_classvar_val_mapping().items()}
        return pl.Schema(schema_dict)

    # I've left out a bunch of methods here for validating the schema itself. We hand over the
    # runtime validation against the dataframe names/types to polars. ColumnSpec subclasses can
    # have additional metadata which can also be used in runtime validation
    # (e.g. timeseries-specific stuff).


class TypedFrame(pl.DataFrame, Generic[TypedFrameSchemaT]):
    """
    A polars.DataFrame subclass that enforces a schema defined by a TypedFrameSchema subclass.

    This allows both static type checking (e.g. in function annotations) and runtime validation
    of the underlying data. The runtime validation includes:

    * Checking that the column names and dtypes match the schema in the same way that polars
      does (i.e. attempting to cast to the desired type if needed)
    * Checking that the schema itself is a valid schema with polars dtypes

    Example:
        >>> class Person(TypedFrameSchema):
        ...     name: ColumnSpec[pl.String] = ColumnSpec(pl.String())
        >>> df = pl.DataFrame({"name": ["Alice", "Bob"]})
        >>> person_tdf = TypedFrame(df, Person)

        Pyright correctly infers that `person_tdf` is a TypedFrame[Person].

        >>> def show_people(person_tdf: TypedFrame[Person]) -> None:
        ...     print(person_tdf)
        >>> show_people(person_tdf)

        Shows no type errors.

        >>> class PersonWithAge(Person):
        ...     age: ColumnSpec[pl.Int64] = ColumnSpec(pl.Int64())
        >>> person_w_age_tdf = TypedFrame(df, PersonWithAge)
        KeyError: 'age'
        >>> person_w_age_tdf = TypedFrame(pl.DataFrame({"name": ["Alice", "Bob"], "age": [25, 30]}), PersonWithAge)
        >>> show_people(person_w_age_tdf)
            ^^^^^^^^^^^^^^^^
        Pyright: Argument of type "TypedFrame[PersonWithAge]" cannot be assigned to parameter
        "person_tdf" of type "TypedFrame[Person]" in function "show_people"
    """

    def __init__(self, df: pl.DataFrame, tf_schema: type[TypedFrameSchemaT]) -> None:
        tf_schema.validate_schema(df)
        self._tf_schema = tf_schema
        # pl.DataFrame.__init__ has an unknown type in its arguments according to pyright
        # (pl._typing.FrameInitTypes)
        super().__init__(data=df, schema=tf_schema.polars_schema())  # type: ignore
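For anyone reading along: the snippet above refers to validate_schema and _get_classvar_val_mapping without showing them. Purely as a guess at their shape (not the author's actual implementation, which presumably also validates the dtypes themselves), they might look something like this, sketched as a hypothetical subclass of the TypedFrameSchema defined above:

class _TypedFrameSchemaSketch(TypedFrameSchema):
    """Hypothetical fill-ins for the two methods elided above; in the real code they would
    sit on TypedFrameSchema itself."""

    @classmethod
    def _get_classvar_val_mapping(cls) -> dict[str, ColumnSpec]:
        # Collect the ColumnSpec class attributes, walking the MRO so that subclasses such as
        # PersonWithAge also pick up Person's columns.
        mapping: dict[str, ColumnSpec] = {}
        for klass in reversed(cls.__mro__):
            for name, value in vars(klass).items():
                if isinstance(value, ColumnSpec):
                    mapping[name] = value
        return mapping

    @classmethod
    def validate_schema(cls, df: pl.DataFrame) -> None:
        # Fail fast if a declared column is missing; dtype coercion itself is left to polars
        # when TypedFrame.__init__ rebuilds the frame against polars_schema().
        missing = [name for name in cls._get_classvar_val_mapping() if name not in df.columns]
        if missing:
            raise KeyError(missing[0])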
Hey - really like this library! Unfortunately we have migrated to polars, so I've been looking at implementing something similar for my use-case.

I had a couple of queries around this code snippet in DataSet. To summarise:

- _schema_annotations (the generic instance type) is set at class level in __class_getitem__
- __init__ then uses it and resets it to the default value

My queries/concerns are:

- the docs discourage the use of __class_getitem__ outside of type hinting
- It feels weird setting a generic type (specific to an instance) at the class level (even though it later gets reset to the default value)
- Related to the above, I get pyright complaining about "if DataSet._schema_annotations is None:": Access to generic instance variable through class is ambiguous
- using DataSet[Schema] as a type hint anywhere in your code sets the _schema_annotations column to be Schema, e.g. see below code snippet (maybe this isn't an issue since it gets reset every time a new class is initialised, and that's the only context that attribute is being used)

I'm curious to hear your thought processes around this design @nanne-aben? I don't necessarily disagree with it - I've tried to find an alternative method and I think the way you've done it is probably the cleanest possible way to do it. But curious all the same to hear if you had those same concerns or if I'm overthinking it a bit!

Cheers!