Closed mohitCodepy closed 1 year ago
Ah, yes, that would be great!
Sorry for my late response btw, I had misconfigured my email notification settings for this repo.
Right now, we only have one open task. Consider the following code:
from typedspark import transform_to_schema
from typedspark import register_schema_to_dataset
from typedspark import Column, Schema, create_partially_filled_dataset
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql import SparkSession
spark = SparkSession.Builder().getOrCreate()
class Person(Schema):
id: Column[IntegerType]
name: Column[StringType]
age: Column[IntegerType]
class Job(Schema):
id: Column[IntegerType]
salary: Column[IntegerType]
df_a = create_partially_filled_dataset(spark, Person, {Person.id: [1, 2, 3]})
df_b = create_partially_filled_dataset(spark, Job, {Job.id: [1, 2, 3]})
person = register_schema_to_dataset(df_a, Person)
job = register_schema_to_dataset(df_b, Job)
class PersonWithJob(Person, Job):
pass
(
transform_to_schema(
df_a
.join(df_b, person.id == job.id),
PersonWithJob
)
.show()
)
This currently yields an AnalysisException
. The problem here is that transform_to_schema()
tries to subset df_a.join(df_b, person.id == job.id)
to all columns defined in PersonWithJob
, but it then finds that the id column is not uniquely defined. And it's indeed unclear whether it should be PersonWithJob.id = person.id
or PersonWithJob.id = job.id
.
The current solution is to drop one of the columns.
(
transform_to_schema(
df_a
.join(df_b, person.id == job.id)
.drop(person.id),
PersonWithJob
)
.show()
)
That works, but it has some downsides:
I think the solution is two-fold.
First: can we detect ourselves whether any of the columns from the provided schema are ambiguous? And if so, raise a more informative error message? (Rather than the generic AnalysisException
?
Second: can we define a more intuitive solution, for example:
(
transform_to_schema(
df_a
.join(df_b, person.id == job.id),
PersonWithJob,
{
PersonWithJob.id: job.id,
}
)
.show()
)
It would be great if we could use something like this to signal that we want PersonWithJob.id = job.id
rather than PersonWithJob.id = person.id
.
Would you be interested in working on (one of) these? Lemme know if you have any questions!
@nanne-aben ,
Yes, I'm highly interested, I would love to work on this, for clarification, I haven't worked on this, but I have a better understanding of Python and after reading and understanding the above detail I am able to understand it. I'm ready to explore it more also Is there any public channel or group so we can communicate easily from there?
I'm a beginner to open source, but I have the potential to do complex stuff, also I'll need some help on this. BTW thanks for your explanation, I'll try to reproduce it.
Thanks
Great! Sure, happy to help!
I think Github Discussions would be a good place for this. I've enabled it for this repo, it should show up as a tab now. Does that work for you?
Yes, thank you, It is visible :smile:
Hi mohitCodepy! How's it going? Do you need any help?
@nanne-aben , apologise for the inactivity, was busy in college work but I'm available now
No worries, just checking!
Lemme know if I can help you with anything!
Hi mohitCodepy! I needed to have this functionality for another project, so I implemented it in #172. Hope you don't mind me taking it of your hands!
Hi mohitCodepy! I needed to have this functionality for another project, so I implemented it in #172. Hope you don't mind me taking it of your hands!
@nanne-aben
I'm really sorry about that, can you please add me as a reviewer in #172 ?, so that I can also understand about the code. Thanks
I think you should be able to comment on the PR already, right?
Lemme know if it doesn't work, then I'll dive into the settings a bit more.
@nanne-aben , Sure
I would like to contribute on this organisation, If you have any bugs or want the implementation of the feature then I'm available.
Thank you