kaiko-ai / typedspark

Column-wise type annotations for pyspark DataFrames
Apache License 2.0
65 stars 4 forks source link

Contribution request #60

Closed mohitCodepy closed 1 year ago

mohitCodepy commented 1 year ago

I would like to contribute on this organisation, If you have any bugs or want the implementation of the feature then I'm available.

Thank you

nanne-aben commented 1 year ago

Ah, yes, that would be great!

Sorry for my late response btw, I had misconfigured my email notification settings for this repo.

Right now, we only have one open task. Consider the following code:

from typedspark import transform_to_schema
from typedspark import register_schema_to_dataset
from typedspark import Column, Schema, create_partially_filled_dataset
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql import SparkSession

spark = SparkSession.Builder().getOrCreate()

class Person(Schema):
    id: Column[IntegerType]
    name: Column[StringType]
    age: Column[IntegerType]

class Job(Schema):
    id: Column[IntegerType]
    salary: Column[IntegerType]

df_a = create_partially_filled_dataset(spark, Person, {Person.id: [1, 2, 3]})
df_b = create_partially_filled_dataset(spark, Job, {Job.id: [1, 2, 3]})

person = register_schema_to_dataset(df_a, Person)
job = register_schema_to_dataset(df_b, Job)

class PersonWithJob(Person, Job):
    pass

(
    transform_to_schema(
        df_a
        .join(df_b, person.id == job.id),
        PersonWithJob
    )
    .show()
)

This currently yields an AnalysisException. The problem here is that transform_to_schema() tries to subset df_a.join(df_b, person.id == job.id) to all columns defined in PersonWithJob, but it then finds that the id column is not uniquely defined. And it's indeed unclear whether it should be PersonWithJob.id = person.id or PersonWithJob.id = job.id.

The current solution is to drop one of the columns.

(
    transform_to_schema(
        df_a
        .join(df_b, person.id == job.id)
        .drop(person.id),
        PersonWithJob
    )
    .show()
)

That works, but it has some downsides:

I think the solution is two-fold.

First: can we detect ourselves whether any of the columns from the provided schema are ambiguous? And if so, raise a more informative error message? (Rather than the generic AnalysisException?

Second: can we define a more intuitive solution, for example:

(
    transform_to_schema(
        df_a
        .join(df_b, person.id == job.id),
        PersonWithJob,
        {
            PersonWithJob.id: job.id,
        }
    )
    .show()
)

It would be great if we could use something like this to signal that we want PersonWithJob.id = job.id rather than PersonWithJob.id = person.id.

Would you be interested in working on (one of) these? Lemme know if you have any questions!

mohitCodepy commented 1 year ago

@nanne-aben ,

Yes, I'm highly interested, I would love to work on this, for clarification, I haven't worked on this, but I have a better understanding of Python and after reading and understanding the above detail I am able to understand it. I'm ready to explore it more also Is there any public channel or group so we can communicate easily from there?

I'm a beginner to open source, but I have the potential to do complex stuff, also I'll need some help on this. BTW thanks for your explanation, I'll try to reproduce it.

Thanks

nanne-aben commented 1 year ago

Great! Sure, happy to help!

I think Github Discussions would be a good place for this. I've enabled it for this repo, it should show up as a tab now. Does that work for you?

mohitCodepy commented 1 year ago

Yes, thank you, It is visible :smile:

nanne-aben commented 1 year ago

Hi mohitCodepy! How's it going? Do you need any help?

mohitCodepy commented 1 year ago

@nanne-aben , apologise for the inactivity, was busy in college work but I'm available now

nanne-aben commented 1 year ago

No worries, just checking!

Lemme know if I can help you with anything!

nanne-aben commented 1 year ago

Hi mohitCodepy! I needed to have this functionality for another project, so I implemented it in #172. Hope you don't mind me taking it of your hands!

mohitCodepy commented 1 year ago

Hi mohitCodepy! I needed to have this functionality for another project, so I implemented it in #172. Hope you don't mind me taking it of your hands!

@nanne-aben

I'm really sorry about that, can you please add me as a reviewer in #172 ?, so that I can also understand about the code. Thanks

nanne-aben commented 1 year ago

I think you should be able to comment on the PR already, right?

Lemme know if it doesn't work, then I'll dive into the settings a bit more.

mohitCodepy commented 1 year ago

@nanne-aben , Sure