mrpowers-io / quinn

pyspark methods to enhance developer productivity 📣 👯 🎉
https://mrpowers-io.github.io/quinn/
Apache License 2.0

Broadcast variable helper #94

Open MrPowers opened 1 year ago

MrPowers commented 1 year ago

From a Redditor on this thread:

I’d love to see a function to check if your df is small enough to use a broadcast join. At the moment I take a 10% sample, convert that to pandas and then estimate memory size from that using a pd function. Then if the df is small enough I’ll use a broadcast join to improve speed.
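The workflow described in that quote could be sketched roughly as follows. This is only an illustration under stated assumptions: `estimate_size_from_sample` and `join_with_optional_broadcast` are hypothetical names (not part of quinn), the "pd function" is assumed to be pandas' `DataFrame.memory_usage(deep=True)`, and the 10 MB default mirrors Spark's default `spark.sql.autoBroadcastJoinThreshold`. The pandas in-memory footprint is not the same as Spark's internal size, so the result is a rough guide at best.

```python
def estimate_size_from_sample(df, fraction=0.1):
    """Rough size estimate in bytes, extrapolated from a sample.

    Hypothetical helper: `df` is expected to be a PySpark DataFrame.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Pull only the sample to the driver as a pandas DataFrame.
    sample_pdf = df.sample(fraction=fraction).toPandas()
    # Assumption: pandas' deep memory usage stands in for "size".
    sample_bytes = int(sample_pdf.memory_usage(deep=True).sum())
    # Scale the sample's footprint back up to the full DataFrame.
    return round(sample_bytes / fraction)


def join_with_optional_broadcast(left, right, on, threshold_bytes=10 * 1024 * 1024):
    """Join `left` and `right`, hinting a broadcast if `right` looks small."""
    # Imported lazily so the estimator above can be used without PySpark.
    from pyspark.sql.functions import broadcast

    if estimate_size_from_sample(right) <= threshold_bytes:
        right = broadcast(right)
    return left.join(right, on=on)
```

Note that this approach runs a Spark job and collects the sample to the driver, so it has a real cost on large inputs.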

puneetsharma04 commented 1 year ago

@MrPowers: I would like to contribute to this. Could you please assign this issue to me?

kunaljubce commented 8 months ago

@MrPowers @SemyonSinchenko Did we ever brainstorm this? I have lost count of the number of times I would have loved functionality like this. I would love to take this up.

SemyonSinchenko commented 8 months ago

@kunaljubce Because of Spark Connect, we cannot use _jvm here, so the only option I know of is to parse the plan. However, @MrPowers does not like this idea (see the arguments here: https://github.com/MrPowers/quinn/pull/159).

JFYI: This function does exactly this job: it estimates the size of a DF in bytes (megabytes) without running any computation.
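For readers unfamiliar with the plan-parsing idea, here is a minimal sketch of what it might look like. It is not the function or the PR referenced in this comment; `parse_size_in_bytes` and `explain_cost_string` are hypothetical names. It assumes that `df.explain(mode="cost")` prints the plan through Python's `print`, and that the first `Statistics(sizeInBytes=...)` entry belongs to the root of the optimized plan, formatted like `sizeInBytes=2.5 KiB`. That text format is not a stable API, which is exactly the objection raised against this approach.

```python
import contextlib
import io
import re

# Binary unit suffixes as they are assumed to appear in the plan text.
_UNITS = {
    "B": 1,
    "KiB": 1024,
    "MiB": 1024**2,
    "GiB": 1024**3,
    "TiB": 1024**4,
    "PiB": 1024**5,
    "EiB": 1024**6,
}
_SIZE_RE = re.compile(r"sizeInBytes=(\d+(?:\.\d+)?)\s*((?:[KMGTPE]i)?B)")


def parse_size_in_bytes(plan_text):
    """Return the first sizeInBytes statistic as an int, or None if absent."""
    match = _SIZE_RE.search(plan_text)
    if match is None:
        return None
    value, unit = match.groups()
    return int(float(value) * _UNITS[unit])


def explain_cost_string(df):
    """Capture what df.explain(mode="cost") prints, without using _jvm."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain(mode="cost")
    return buffer.getvalue()


# Illustrative input only; real plan output varies across Spark versions.
sample_plan = (
    "== Optimized Logical Plan ==\n"
    "Relation [id#0L] parquet, Statistics(sizeInBytes=2.5 KiB)\n"
)
estimated = parse_size_in_bytes(sample_plan)
```

Returning `None` when nothing matches lets a caller fall back to a plain join instead of failing when the plan format changes.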

SemyonSinchenko commented 8 months ago

So, in my opinion, there is no way to do it (other than collecting to the driver, which is a terrible option). @kunaljubce If you have a different idea of how it could be implemented, or new arguments for my discussion with @MrPowers (his argument was that the plan representation is very unstable), we can raise this topic again!