Closed Eldar1205 closed 9 months ago
Thanks, glad you like the package!
Intriguing question! I think you mean something like this, right?
from typing import Generic, TypeVar

from pyspark.sql.types import DoubleType

from typedspark import Column, Schema, StructType


class Wood(Schema):
    flexibility: Column[DoubleType]


class Stone(Schema):
    hardness: Column[DoubleType]


T = TypeVar("T", bound=Schema)


class Resource(Schema, Generic[T]):
    resource: Column[StructType[T]]


Resource[Wood].resource.dtype.schema.flexibility
In terms of linting this is indeed possible. It even autocompletes the column name of the correct subschema!
However, if we run this code there are several problems... The runtime typecheck won't work with this. And currently the above code somehow doesn't recognize the Generic[T] (maybe something to do with the MetaSchema metaclass?).
We could implement something in typedspark such that it does work. I'm currently on vacation however, so it will take a while before I could do it. If you're willing to make a contribution, it can be there faster :) Lemme know if you're up for that, happy to help with some pointers, of course!
Thank you, enjoy your vacation!
The thing I'm wondering is how to even represent such a DataFrame in Spark to begin with? To my knowledge, all rows need to have a unified schema, and there's no way for different rows to have different schemas.
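Since every row must share one schema, a common workaround is a superset struct with nullable fields, one per variant. A minimal Spark-free sketch of that idea, using a hypothetical ResourceProperties class (not part of typedspark):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical sketch: model heterogeneous structs as one superset "schema"
# with nullable fields, so all rows fit a single unified type.
@dataclass
class ResourceProperties:
    flexibility: Optional[float] = None  # populated for Wood rows
    hardness: Optional[float] = None     # populated for Stone rows

wood = ResourceProperties(flexibility=0.8)
stone = ResourceProperties(hardness=9.5)
rows = [wood, stone]  # one unified row type for all rows
```

The same shape carries over to a Spark StructType with nullable fields, at the cost of losing the per-variant typing the question asks about.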
Oh, right, if different rows would have a different StructType[T], then that wouldn't work indeed. I think you're right: Spark doesn't support it. I thought you wanted to be able to define different DataSet[Resource[T]] on a DataSet level, for example as an input to a function:

def foo(df: DataSet[Resource[T]]) -> DataSet[Resource[T]]:
    ...

Such that if we'd input DataSet[Resource[Stone]] into foo(), the type checker would know that the return type would also be DataSet[Resource[Stone]] (and ditto for DataSet[Resource[Wood]]).
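The type-preserving signature above can be sketched with plain Python generics, no Spark needed. Schema and DataSet here are hypothetical stand-ins for the typedspark classes, just to show how the TypeVar flows from input to output:

```python
from typing import Generic, Type, TypeVar

# Hypothetical stand-ins for typedspark's Schema and DataSet.
class Schema:
    ...

class Wood(Schema):
    ...

class Stone(Schema):
    ...

T = TypeVar("T", bound=Schema)

class DataSet(Generic[T]):
    def __init__(self, schema: Type[T]) -> None:
        self.schema = schema

def foo(df: DataSet[T]) -> DataSet[T]:
    # Identity transform: the checker binds the same T for input and output,
    # so DataSet[Stone] in means DataSet[Stone] out.
    return df

out = foo(DataSet(Stone))
```

A checker such as mypy infers `out` as DataSet[Stone] here, which is exactly the behaviour described for foo() above.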
Anyway, if that's not what you were looking for... Mind if I ask you what you had in mind? With a small example? Maybe we can think of a different way to solve it.
Hi @Eldar1205 ! Are you still interested in the above? Otherwise I'll close the ticket.
Hi, you may close the ticket, I realized this use case isn't really applicable as is.
Thanks for the ping :)
You're welcome! Feel free to reach out again!
Hi,
Thank you for this project, really helpful for people using type hints! I'd like to know if there's a way to annotate a Struct column that can have a varying schema? All examples I've seen in the docs indicate a Struct column needs to have a particular schema.
For example, I'd like to have a Resource[T] dataframe, with a struct column resource_properties of type T, such that T is a Python TypeVar, or at the very least be able to have a column with type Any so that the type linter ignores it and the developers will know how to treat the values.
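The Any fallback mentioned here can be sketched without Spark; ResourceRow is a hypothetical stand-in for a typed row, not a typedspark API:

```python
from dataclasses import dataclass
from typing import Any

# Hypothetical sketch: an Any-typed field is the escape hatch described
# above -- the type checker accepts whatever value lands there, and the
# developers are responsible for knowing its actual shape.
@dataclass
class ResourceRow:
    name: str
    resource_properties: Any

wood = ResourceRow("wood", {"flexibility": 0.8})
stone = ResourceRow("stone", {"hardness": 9.5})
```

This trades away all static checking on resource_properties, which is why the generic Resource[T] approach discussed in this thread would be preferable when the type checker can support it.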