vincenzobaz / spark-scala3

Apache License 2.0
84 stars 15 forks

Possibly use Spark Connect to build the Scala 3 implementation #59

Open MrPowers opened 2 months ago

MrPowers commented 2 months ago

Spark Connect decouples the client and the Spark driver, so the client doesn't need the same software versions as the driver.

A Spark Connect Scala 3 client should provide a similar user experience.

spark-connect-go, spark-connect-rs, and Spark Connect C# already exist.

Would it make sense to use Spark Connect instead?
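For illustration, here is a sketch of what a Scala 3 Connect client session could look like. It mirrors the surface of the existing Spark Connect JVM client (`spark-connect-client-jvm`, currently built for Scala 2.13); the `sc://host:port` scheme and the builder calls are from that client, but treating this as a Scala 3 program is an assumption, and it needs a running Spark Connect server.

```scala
// Sketch only: assumes a Spark Connect server is listening on localhost:15002
// and that a Scala 3 client exposes the same API as the current JVM client.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

@main def connectDemo(): Unit =
  // "sc://host:port" is the Spark Connect connection scheme. The client only
  // speaks gRPC to the server, so its Scala version is independent of the
  // Scala version the driver was built with.
  val spark = SparkSession.builder()
    .remote("sc://localhost:15002")
    .getOrCreate()

  // Column expressions are shipped as unresolved plans, not JVM bytecode.
  spark.range(5).select((col("id") * 2).as("doubled")).show()
  spark.stop()
```

Because the wire protocol carries logical plans rather than serialized JVM objects, the client/driver version coupling that blocks spark-scala3 today largely disappears.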

michael72 commented 2 months ago

Hey @MrPowers - that sounds like a cool project!

However, I am not sure how much effort we would have to put into that. On the other hand, it is frustrating not to be able to use Scala 3 on the Spark client side. I am not sure whether this could be part of spark-scala3, or whether spark-scala3 could somehow be rewritten to create DataFrame objects and use them via Spark Connect. The catch is that Spark Connect does not seem to support all APIs. :pensive:

Unfortunately, I think we've hit some limits with spark-scala3. For instance, there are still problems serializing Scala 3 closures, and the serialization code is called inside the Spark libraries. When it fails, the exception doesn't even show up in the logs; the job just hangs without a message. So even a simple call to map doesn't work with spark-scala3, and there are probably other functions that cannot simply be called. This project lacks unit tests for the available Spark functions, so we don't actually know.
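To make the failure mode concrete, this is the kind of call being described (a sketch, assuming a classic non-Connect `SparkSession` named `spark` is in scope):

```scala
// Dataset.map must serialize the Scala 3 lambda and ship it to the
// executors; that serialization happens deep inside the Spark libraries.
import spark.implicits._

val ds = spark.range(5).as[Long]

// Per the comment above, with spark-scala3 this can hang with no exception
// in the logs: the closure-serialization failure never surfaces.
val mapped = ds.map(x => x + 1)
mapped.show()
```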

MrPowers commented 2 months ago

@michael72 - Yeah, it might make sense to just create another repo. The Spark core devs are definitely excited about a Spark Connect + Scala 3 project. You're right that Spark Connect doesn't support all the Spark APIs, but I think it's a better architecture, and I'm hopeful we could build something that's production-grade.

grundprinzip commented 2 months ago

As you've already pointed out, closure serialization is an issue because the closures are injected into the target JVM. I'm no Scala expert and thus don't know whether there is any chance of loading the compiled Scala 3 code without rebuilding all of Spark in Scala 3. Maybe some clever class loader isolation would be enough?

On the client side, however, you have full freedom to use Scala 3. An interesting approach might be to simply copy / reuse the existing JVM client and see how far you get without UDFs and closures.
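A sketch of that first milestone: stay on the declarative API, where transformations are encoded as plans sent over gRPC, so nothing needs closure serialization. (The path and column names below are hypothetical; `spark` is assumed to be a Connect session as above.)

```scala
// Connect-friendly: Column expressions travel as unresolved plans,
// not as serialized JVM closures.
import org.apache.spark.sql.functions.{col, upper}

val df = spark.read.parquet("/data/events")  // hypothetical dataset

df.filter(col("status") === "ok")
  .withColumn("name_uc", upper(col("name")))
  .groupBy(col("name_uc"))
  .count()

// Deliberately out of scope for a first pass, since both require shipping
// compiled Scala 3 code to the server:
//   df.map(row => ...)
//   udf((s: String) => ...)
```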