vibhatha / pyiceberg_substrait

Apache License 2.0
0 stars 0 forks source link

Baseschema creation logic validatiob #29

Open vibhatha opened 1 year ago

vibhatha commented 1 year ago

Since there is no direct method to convert a PyArrow Schema into a Substrait schema, we would use a simple hack by creating an empty table with the physical schema of the file with no data. Then we use this table to generate a Substrait plan and from that plan we extract the base_schema.

In extracting the accurate field names, we would compare what is the empty_table's schema vs Iceberg tables schema (current);

This scenario occurs when we have a rename-based schema evolution. The file would have a different name compared to what is in the evolved schema. Since the field name in the evolved schema is not in the reference table's schema, we would find the field from the Iceberg table's schema. This would give us a field in Iceberg. Then use that field to extract the field_id to obtain the relative location in the reference table. This reference table contains the base_schema.

I think this logic would work as long as, we use it with renamed fields and not update new fields. Because if we rename a newly added field.

Let's take the following Scenario

State 1:

Change: N/A
File Schema : A, B, C
Iceberg Schema: A, B, C
Field Ids: [A: 1, B:2, C: 3]

State 2:

Change: Adding a new column and update values for that column and place it as the second column
File Schema : A, B, C, D
Iceberg Schema: A, D, B, C
Field Ids: [A: 1, B:2, C: 3, D: 4]

State 3:

Change: Rename `D` to `K`
File Schema : A, B, C, D
Iceberg Schema: A, K, B, C
Field Ids: [A: 1, B:2, C: 3, K: 4]

Now let's evaluate our algorithm

SELECT K FROM TABLE1;
reference_table ={ schema:  [A,  B, C, D], data : [...] } # File Schema
current_table_schema = { names : [A, K, B, C], ids: [1, 4, 2, 3] } # Iceberg Schema
field = K
if field not in reference_table:
  ref_field = current_table_schema.find_field(field) #
  field_id = ref_field.field_id # 4
  index = field_id - 1 # 3
  field = reference_table.column_names[ref_field.field_id - 1]

This would yield us the value in index 3 of reference_table, which would be D

We must validate this logic and check for edge cases.

vibhatha commented 1 year ago

The visitor.py we have a visit_project method for RelUpdateVisitor do we have an accurate logic there? To test this do the following;

Step 1:

Append a new column, run the module

Step 2:

Renamed the appended column, run the module

Step 3:

Drop the renamed column, run the module.