substrait-io / substrait-java

Apache License 2.0
77 stars 72 forks source link

Port of substrait-spark module from Gluten #271

Closed andrew-coleman closed 5 months ago

andrew-coleman commented 5 months ago

As discussed in the chain of comments here, I’m offering up this PR for your consideration.

I can take no credit for its implementation, it is a copy of the substrait-spark module that was part of the gluten repo before it was removed, with a few minor corrections and additions. We are using this as part of a project here (IBM) and would like to see it remain available under the Apache ecosystem.

It is useful for converting spark query plans to and from substrait.

CLAassistant commented 5 months ago

CLA assistant check
All committers have signed the CLA.

Blizzara commented 5 months ago

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

It's worth noting that this will be the first instance of Scala in substrait-java. I'll let those that use this code more often decide if that's a bad thing or not.

There are likely some annoyances it may cause (like deciding/aligning the code formatting for both), but as long as this is a separate subproject it should not affect the Java-only parts in any way really afaik.

Side question: If this code was magically available in Java would it still be usable by existing callers?

Probably yes, but given Spark (esp. Catalyst) is written mostly in Scala, it's just much easier to traverse the Spark plans in Scala than in Java. The current Scala implementation is directly usable from within Java consumers, so that's not an issue.

andrew-coleman commented 5 months ago

FWIW, we also have an internal fork of this, and I've done some amount of work on it. If this gets included here, I'll be happy to add in my changes as relevant; otherwise I'll look at some point at opensourcing our fork.

Awesome! Then we should combine our efforts.

mbwhite commented 5 months ago

Probably worth adding that I've got several examples of using this library with Spark; need to get these approved to be able to share them, currently writing up docs for this at present.

andrew-coleman commented 5 months ago

Pleased to see the positive discussion in the 2024-06-19 Substrait Community Sync Meeting regarding this PR. Is there anything I can do to help progress this?

andrew-coleman commented 5 months ago

I've rebased this on latest commit in main to fix the last build break (getDfsNames() was removed recently)