Closed andrew-coleman closed 5 months ago
FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.
It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.
There are likely some annoyances it may cause (like deciding/aligning the code formatting for both), but as long as this is a separate subproject it should not affect the Java-only parts in any way really afaik.
Side question: If this code was magically available in Java would it still be usable by existing callers?
Probably yes, but given Spark (esp. Catalyst) is written mostly in Scala, it's just much easier to traverse the Spark plans in Scala than in Java. The current Scala implementation is directly usable from within Java consumers, so that's not an issue.
FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.
Awesome! Then we should combine our efforts.
Probably worth adding that I've got several examples of using this library with Spark; need to get these approved to be able to share them, currently writing up docs for this at present.
Pleased to see the positive discussion in the 2024-06-19 Substrait Community Sync Meeting regarding this PR. Is there anything I can do to help progress this?
I've rebased this on latest commit in main
to fix the last build break (getDfsNames() was removed recently)
As discussed in the chain of comments here, I’m offering up this PR for your consideration.
I can take no credit for its implementation, it is a copy of the
substrait-spark
module that was part of the gluten repo before it was removed, with a few minor corrections and additions. We are using this as part of a project here (IBM) and would like to see it remain available under the Apache ecosystem.It is useful for converting spark query plans to and from substrait.