substrait-io / substrait-java

Apache License 2.0
72 stars 70 forks source link

StructTypes in a VirtualTableScan #254

Closed Blizzara closed 1 month ago

Blizzara commented 2 months ago

Hey! I've been working on a Spark to Substrait (and back) converter, forked from https://github.com/apache/incubator-gluten/tree/v1.1.0/substrait/substrait-spark / https://github.com/substrait-io/substrait-java/pull/90/files (currently the fork is private but I hope to open source it too)

While aiming to support struct types in LocalRelations/VirtualTables, I ran into the check here: https://github.com/substrait-io/substrait-java/blob/72bab63edf6c4ffb12c3c4b0e4f49d066e0c5524/core/src/main/java/io/substrait/relation/VirtualTableScan.java#L33

If I've understood right, the "names" here should be a flattened list of field names, including column names but also recursively all names from struct types. This also aligns with the code in Isthmus. Overall, that means the check r.fields().size() != names.size() will trigger since there will be more names than top-level fields.

I'm pretty new to Substrait still so I may also be mistaken, but if my understanding is right, would it make sense to either: a) remove the check, b) change the check to confirm that names.size() >= r.fields().size(), or c) iterate through fields to count the sub-fields as well before comparing the sizes?

vbarua commented 2 months ago

If I've understood right, the "names" here should be a flattened list of field names, including column names but also recursively all names from struct types.

I think that's correct, and

Overall, that means the check r.fields().size() != names.size() will trigger since there will be more names than top-level fields.

this is probably a bug.

Weakening the check like you suggest in b makes sense or making it more comprehensive like in c both sound reasonable to me. It would be good to add a test for this case as well to avoid regressing to this behaviour.