Port of substrait-spark module from Gluten

andrew-coleman commented 2 weeks ago

As discussed in the chain of comments here, I’m offering up this PR for your consideration.

I can take no credit for its implementation, it is a copy of the substrait-spark module that was part of the gluten repo before it was removed, with a few minor corrections and additions. We are using this as part of a project here (IBM) and would like to see it remain available under the Apache ecosystem.

It is useful for converting spark query plans to and from substrait.

CLAassistant commented 2 weeks ago

All committers have signed the CLA.

Blizzara commented 2 weeks ago

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.

There are likely some annoyances it may cause (like deciding/aligning the code formatting for both), but as long as this is a separate subproject it should not affect the Java-only parts in any way really afaik.

Side question: If this code was magically available in Java would it still be usable by existing callers?

Probably yes, but given Spark (esp. Catalyst) is written mostly in Scala, it's just much easier to traverse the Spark plans in Scala than in Java. The current Scala implementation is directly usable from within Java consumers, so that's not an issue.

andrew-coleman commented 2 weeks ago

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

Awesome! Then we should combine our efforts.

mbwhite commented 2 weeks ago

Probably worth adding that I've got several examples of using this library with Spark; need to get these approved to be able to share them, currently writing up docs for this at present.

andrew-coleman commented 1 week ago

Pleased to see the positive discussion in the 2024-06-19 Substrait Community Sync Meeting regarding this PR. Is there anything I can do to help progress this?

andrew-coleman commented 1 week ago

I've rebased this on latest commit in main to fix the last build break (getDfsNames() was removed recently)

substrait-io / substrait-java

Port of substrait-spark module from Gluten #271