Reverting a bad create_xxx_data_table with file dependencies leads to inconsistent state

jankatins commented 5 years ago

We just ran into this scenario: an incremental load job with create_xxx_data_table.sql + read_xxx.sql. create_xxx_data_table.sql depends on the schema + both involved SQL files.

The problem happened on a bad merge, which made create_xxx_data_table.sql fail after the DROP had run but before the CREATE TABLE completed successfully. This merge got reverted, so the file afterwards had the same checksum as before the merge.

The problem was that the table got dropped, but the checksum comparison concluded that the file hadn't changed, so the task was not rerun.

IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.

I can send a PR if this is the right approach (or any other which might be better).
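
A minimal sketch of the proposed behaviour, to make the revert scenario concrete. This is not the mara-pipelines implementation: `checksum_cache`, `checksum_of` and `run_if_changed` are hypothetical names, and the cache is a plain in-memory dict.

```python
import hashlib
from typing import Callable, Dict, List

# Hypothetical in-memory stand-in for the checksum cache (assumption, not the real API).
checksum_cache: Dict[str, str] = {}


def checksum_of(contents: List[str]) -> str:
    """Hash the contents of all files a task depends on."""
    digest = hashlib.sha256()
    for content in contents:
        digest.update(content.encode("utf-8"))
        digest.update(b"\0")  # separator, so ["ab", "c"] differs from ["a", "bc"]
    return digest.hexdigest()


def run_if_changed(task_id: str, contents: List[str], command: Callable[[], None]) -> bool:
    """Run `command` unless the cached checksum matches; return True if it ran."""
    current = checksum_of(contents)
    if checksum_cache.get(task_id) == current:
        return False  # nothing changed since the last successful run
    try:
        command()
    except Exception:
        # The command may be half-applied (DROP done, CREATE failed), so forget
        # the old checksum: a later revert must not look "unchanged".
        checksum_cache.pop(task_id, None)
        raise
    checksum_cache[task_id] = current
    return True
```

With this, the sequence "good run, bad merge fails, revert" reruns the task on the revert, because the failed attempt removed the cached checksum of the old file version.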

martin-loetzsch commented 4 years ago

> IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.

That actually makes sense, let's do it

jankatins commented 4 years ago

Another interesting case for the load: changing times/table, which leads to a drop+create and a full load because the table is empty. If the load fails in the middle in that case, half of the data is in there. The last_modified_value is still set and the table is not empty, so the next run will do an incremental load and the table will miss a lot of items. If we delete the last_modified_value instead, the table is not dropped (as the files have not changed), so it still contains data, and a full load will result in duplicate data. So for the load job:

If the table is empty, remove any existing last_modified_value
If the last_modified_value is empty, but the table is not: truncate the table before doing a full load
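
The two load rules above can be sketched as a small decision function. This is only an illustration under assumptions: `LoadPlan` and `plan_load` are hypothetical names, and how the table emptiness and the stored value are actually read is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoadPlan:
    clear_last_modified_value: bool  # forget the stored incremental marker
    truncate_table: bool             # empty the table before loading
    full_load: bool                  # load everything instead of only new rows


def plan_load(table_is_empty: bool, last_modified_value: Optional[str]) -> LoadPlan:
    """Decide how to load, given the table state and the stored marker."""
    if table_is_empty:
        # Rule 1: an empty table makes any stored marker stale, so drop it
        # and load everything.
        return LoadPlan(
            clear_last_modified_value=last_modified_value is not None,
            truncate_table=False,
            full_load=True,
        )
    if last_modified_value is None:
        # Rule 2: data without a marker is of unknown completeness; truncate
        # first so the full load cannot create duplicates.
        return LoadPlan(clear_last_modified_value=False, truncate_table=True, full_load=True)
    # Table has data and a marker: the normal incremental case.
    return LoadPlan(clear_last_modified_value=False, truncate_table=False, full_load=False)
```

Note that these rules only help with a half-finished load if the last_modified_value is missing afterwards; a non-empty table with a marker still set is treated as a normal incremental load.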

For file dependencies, I would do this:

If a file changed, delete any checksum value for that task before running it (it would be set to the new value on success, so this is now a second write). As we are attempting something, we can't be sure whether the previous result of that task is still there, so even on a revert we need to run it completely.
If the process results in an error (for whatever reason): also drop the checksum value for that task
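
A rough sketch of the "delete first, write on success" ordering described above, assuming a dict-like checksum store. `run_with_invalidation` and its parameters are hypothetical names, not part of mara-pipelines.

```python
from typing import Callable, Dict


def run_with_invalidation(
    store: Dict[str, str],
    task_id: str,
    new_checksum: str,
    command: Callable[[], None],
) -> bool:
    """Run `command` if the checksum changed; return True if it ran."""
    if store.get(task_id) == new_checksum:
        return False  # files unchanged and the last run succeeded
    # First write: drop the checksum before attempting anything, so a failed
    # or interrupted run never leaves a stale "known good" value behind.
    store.pop(task_id, None)
    command()  # if this raises, the checksum simply stays absent
    # Second write: record the new checksum only after success.
    store[task_id] = new_checksum
    return True
```

Because the checksum is removed before the command starts, no error handler has to run for the state to stay consistent. That also covers interruptions such as a `KeyboardInterrupt`, which an `except Exception` clause would not catch.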

mara / mara-pipelines

Reverting a bad create_xxx_data_table with file dependencies leads to inconsistent state #18