Open vibhatha opened 1 year ago
The visitor.py
we have a visit_project
method for RelUpdateVisitor
do we have an accurate logic there?
To test this do the following;
Step 1:
Append a new column, run the module
Step 2:
Renamed the appended column, run the module
Step 3:
Drop the renamed column, run the module.
Since there is no direct method to convert a PyArrow Schema into a Substrait schema, we would use a simple hack by creating an empty table with the physical schema of the file with no data. Then we use this table to generate a Substrait plan and from that plan we extract the base_schema.
In extracting the accurate field names, we would compare what is the empty_table's schema vs Iceberg tables schema (current);
This scenario occurs when we have a rename-based schema evolution. The file would have a different name compared to what is in the evolved schema. Since the field name in the evolved schema is not in the reference table's schema, we would find the field from the Iceberg table's schema. This would give us a field in Iceberg. Then use that field to extract the field_id to obtain the relative location in the reference table. This reference table contains the base_schema.
I think this logic would work as long as, we use it with renamed fields and not update new fields. Because if we rename a newly added field.
Let's take the following Scenario
State 1:
State 2:
State 3:
Now let's evaluate our algorithm
This would yield us the value in index 3 of
reference_table
, which would beD
We must validate this logic and check for edge cases.