Closed: jankatins closed this issue 4 years ago
> IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.
That actually makes sense, let's do it
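For illustration, a minimal sketch of that behaviour, assuming a made-up `run_file_dependency` helper and an in-memory `checksum_cache`; this is not the project's actual checksum storage or API, just the idea under those assumptions:

```python
import hashlib
from pathlib import Path

checksum_cache = {}  # stand-in for the persisted file-dependency checksums


def run_file_dependency(path: Path, execute) -> None:
    """Run `execute(path)` only if the file changed since the last successful run."""
    checksum = hashlib.md5(path.read_bytes()).hexdigest()
    if checksum_cache.get(str(path)) == checksum:
        return  # unchanged since the last successful run: skip

    try:
        execute(path)
    except Exception:
        # The run failed half-way (e.g. after the DROP, before the CREATE TABLE).
        # Forget the old checksum so that even a revert to the previous
        # file content triggers a re-run next time.
        checksum_cache.pop(str(path), None)
        raise

    checksum_cache[str(path)] = checksum  # only record the checksum on success
```

Without the `pop` in the error path, a revert restores the old file content, it matches the stale cached checksum, and the half-finished DROP from the failed run is never repaired.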
Another interesting case for load: a change to the times/table leads to a drop+create and then a full load because the table is empty. If that load fails in the middle, half of the data is in the table. The last_modified_value is still set and the table is not empty, so the next run does an incremental load and the table is missing a lot of items. If we instead delete the last_modified_value, the table is not dropped (as the files did not change) and still contains data, so the reload results in duplicate data. So for a load job:
For file dependencies, I would do this:
We just ran into this scenario: an incremental load job with `create_xxx_data_table.sql` + `read_xxx.sql`. `create_xxx_data_table.sql` depends on the schema + both involved SQL files. The problem happened on a bad merge which resulted in `create_xxx_data_table.sql` erroring after the `DROP` and before the `CREATE TABLE` was run successfully. This merge got reverted, so the file afterwards had the same checksum as before the merge. The problem was that the table got `DROP`ed, but the checksum cache made it look as if the file hadn't changed and therefore did not need to be rerun.
IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.
I can send a PR if this is the right approach (or any other approach that might be better).
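For illustration only, a minimal, self-contained simulation of this failure mode; all names here (`run_if_changed`, `checksum_cache`, the fake `tables` dict) are made up and not the project's actual API:

```python
import hashlib

checksum_cache = {}          # stand-in for the persisted per-file checksums
tables = {"xxx_data": []}    # stand-in for the database


def run_ddl(content: str):
    """Pretend to run create_xxx_data_table.sql: DROP first, then CREATE."""
    tables.pop("xxx_data", None)            # the DROP TABLE succeeds
    if "bad merge" in content:
        raise RuntimeError("syntax error")  # ...but it errors before CREATE TABLE
    tables["xxx_data"] = []                 # CREATE TABLE


def run_if_changed(file_name: str, content: str):
    """Skip the file if its checksum matches the last successful run."""
    checksum = hashlib.md5(content.encode()).hexdigest()
    if checksum_cache.get(file_name) == checksum:
        return                              # "unchanged" -> not rerun
    run_ddl(content)                        # may raise
    checksum_cache[file_name] = checksum    # only stored on success


good_sql = "DROP TABLE xxx_data; CREATE TABLE xxx_data (...);"

run_if_changed("create_xxx_data_table.sql", good_sql)            # initial run: ok
try:
    run_if_changed("create_xxx_data_table.sql", good_sql + " bad merge")
except RuntimeError:
    pass                                                          # table is gone now
run_if_changed("create_xxx_data_table.sql", good_sql)             # after the revert
assert "xxx_data" not in tables  # skipped: old checksum still matches, table never recreated
```

With the proposed change (dropping the cached checksum when the run errors), the last call would not be skipped and the table would be recreated.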