kaiko-ai / typedspark

Column-wise type annotations for pyspark DataFrames
Apache License 2.0

Using Generics with typed DataFrame #266

Closed Eldar1205 closed 9 months ago

Eldar1205 commented 10 months ago

Hi,

Thank you for this project, it's really helpful for people using type hints! I'd like to know if there's a way to annotate a Struct column that can have a varying schema. All the examples I've seen in the docs indicate that a Struct column needs a fixed schema.

For example, I'd like to have a Resource[T] dataframe with a struct column resource_properties of type T, where T is a Python TypeVar, or at the very least be able to have a column typed as Any so that the type checker ignores it and developers know they are responsible for handling the values.
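Roughly the annotation I have in mind (hypothetical, this isn't actual typedspark API):

from typing import Generic, TypeVar
from typedspark import Column, Schema

T = TypeVar("T")

class Resource(Schema, Generic[T]):
    # A struct column whose layout varies with T; Column[Any]
    # would be the escape hatch that the type checker ignores.
    resource_properties: Column[T]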

nanne-aben commented 10 months ago

Thanks, glad you like the package!

Intriguing question! I think you mean something like this, right?

from typing import Generic, TypeVar
from pyspark.sql.types import DoubleType
from typedspark import Column, Schema, StructType

class Wood(Schema):
    flexibility: Column[DoubleType]

class Stone(Schema):
    hardness: Column[DoubleType]

T = TypeVar("T", bound=Schema)

class Resource(Schema, Generic[T]):
    resource: Column[StructType[T]]

Resource[Wood].resource.dtype.schema.flexibility

In terms of linting, this is indeed possible. It even autocompletes the column names of the correct subschema!

[screenshot: the type checker autocompleting the subschema's column names]
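For instance, parametrizing with Stone resolves to its own subschema in the same way:

Resource[Stone].resource.dtype.schema.hardness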

However, if we run this code there are several problems... The runtime typecheck won't work with it, and the code above currently doesn't recognize Generic[T] (maybe something to do with the MetaSchema metaclass?).

We could implement something in typedspark such that it does work. I'm currently on vacation, however, so it will take a while before I can get to it. If you're willing to make a contribution, it could land faster :) Let me know if you're up for that, happy to help with some pointers, of course!

Eldar1205 commented 9 months ago

Thank you, enjoy your vacation!

What I'm wondering is how to even represent such a DataFrame in Spark to begin with. To my knowledge, all rows need to share a single schema; there's no way for different rows to have different schemas.
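For example (a minimal plain-pyspark sketch of that constraint):

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType, StructField, StructType

spark = SparkSession.builder.getOrCreate()

# The struct layout is declared once and applies to every row;
# there is no per-row variation of the schema.
schema = StructType([
    StructField("resource", StructType([
        StructField("flexibility", DoubleType())
    ]))
])

df = spark.createDataFrame([((1.0,),)], schema=schema)
df.printSchema()  # every row has resource: struct<flexibility: double>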

nanne-aben commented 9 months ago

Oh, right, if different rows were to have different StructType[T] schemas, then that indeed wouldn't work. I think you're right: Spark doesn't support it. I thought you wanted to define different DataSet[Resource[T]] at the DataSet level, for example as an input to a function

def foo(df: DataSet[Resource[T]]) -> DataSet[Resource[T]]:
   ...

That way, if we pass a DataSet[Resource[Stone]] into foo(), the type checker knows that the return type is also DataSet[Resource[Stone]] (and ditto for DataSet[Resource[Wood]]).
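In usage, that would look something like this (a sketch; stone_df stands in for however you'd obtain the dataset):

from typedspark import DataSet

stone_df: DataSet[Resource[Stone]]  # loaded elsewhere
out = foo(stone_df)                 # inferred: DataSet[Resource[Stone]]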

Anyway, if that's not what you were looking for... mind if I ask what you had in mind, with a small example? Maybe we can think of a different way to solve it.

nanne-aben commented 9 months ago

Hi @Eldar1205 ! Are you still interested in the above? Otherwise I'll close the ticket.

Eldar1205 commented 9 months ago

Hi, you may close the ticket; I realized this use case isn't really applicable as-is.

Thanks for the ping :)

nanne-aben commented 9 months ago

You're welcome! Feel free to reach out again!