substrait-io / substrait-java

Apache License 2.0
75 stars 72 forks source link

Roundtrip of Substrait Plan doesn't provide the same output. #203

Open vibhatha opened 10 months ago

vibhatha commented 10 months ago

I wrote a test case to evaluate the following and once we do the roundrip we don't get the same plan. I observed the diff and seems like the order of extension_uris comes in the opposite order

INPUT

    "extensionUris": [
        {
            "extensionUriAnchor": 2,
            "uri": "/functions_arithmetic.yaml"
        },
        {
            "extensionUriAnchor": 1,
            "uri": "/functions_string.yaml"
        }
    ],

OUTPUT

    "extensionUris": [
        {
            "extensionUriAnchor": 1,
            "uri": "/functions_string.yaml"
        },
        {
            "extensionUriAnchor": 2,
            "uri": "/functions_arithmetic.yaml"
        }
Plan.Builder builder = io.substrait.proto.Plan.newBuilder();
String jsonPlan = readJsonFile(("my_substrait_plan_in.json"));
JsonFormat.parser().merge(jsonPlan, builder);
var inputPlan =  builder.build();
io.substrait.plan.Plan plan = new ProtoPlanConverter(EXTENSION_COLLECTION).from(inputPlan);
PlanProtoConverter planProtoConverter = new PlanProtoConverter();
io.substrait.proto.Plan roundtripPlan = planProtoConverter.toProto(plan);
var roundtripJson = JsonFormat.printer().print(roundtripPlan);
ObjectMapper objectMapper = new ObjectMapper();
JsonNode originalTree = objectMapper.readTree(jsonPlan);
JsonNode roundTripTree = objectMapper.readTree(roundtripJson);
assert !originalTree.equals(roundTripTree);

I don't think this would do any harm execution wise, but one would expect it to be similar in property order.

EpsilonPrime commented 10 months ago

The original source likely ordered the functions in alphabetical order and the Java code is likely ordering based on the order that the functions were discovered.

vibhatha commented 10 months ago

Also I did check, the io.substrait.plan.Plan of corresponding plans would yields to false when we use the equals method.

vibhatha commented 10 months ago

@EpsilonPrime seems like it. Would you recommend that this needs fixing?

EpsilonPrime commented 10 months ago

One way to address this would be to implement a "normalize plan" method. It could sort the functions in place, populate a map with the old and new values, and then walk the protobuf replacing function references with the updated value. It could also do other normalization. I have something like that in substrait-cpp:

https://github.com/substrait-io/substrait-cpp/blob/main/src/substrait/textplan/converter/ReferenceNormalizer.cpp#L279

vbarua commented 10 months ago

There's nothing in the spec that constrains that extensions need to be ordered in a specific way. I'm hesitant to officially encode any behaviour or expectations of ordering. That's not to say we can't provide a utility to normalize plans, it just wouldn't be based on anything in the spec.

Also I did check, the io.substrait.plan.Plan of corresponding plans would yields to false when we use the equals method.

If we consider these plans to be equal, this would be a good argument for a smarter equals method.

vibhatha commented 10 months ago

@vbarua I agree with the standard, this was a mere observation. I think for this case what we should provide is a smarter equals method. Do you think this would be a good feature to be added to a future release?

danepitkin commented 10 months ago

I think a smarter equality comparison would be a good approach.