mara / mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
MIT License

Reverting a bad create_xxx_data_table with file dependencies leads to inconsistent state #18

Closed jankatins closed 4 years ago

jankatins commented 5 years ago

We just ran into this scenario: an incremental load job with create_xxx_data_table.sql + read_xxx.sql. create_xxx_data_table.sql depends on the schema and on both involved SQL files.

The problem was caused by a bad merge, after which create_xxx_data_table.sql failed after the DROP had run but before the CREATE TABLE completed successfully. The merge was then reverted, so the file ended up with the same checksum it had before the merge.

As a result, the table had been dropped, but the checksum comparison concluded that the file had not changed, so the create step was not rerun.

IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.

I can send a PR if this is the right approach (or any other which might be better).
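
A minimal sketch of the proposed behaviour, to make the failure mode concrete. It does not use the real mara-pipelines API: `ChecksumCache`, `checksum_of` and `run_if_changed` are hypothetical names, and the cache is a plain in-memory dict standing in for wherever the checksums are actually stored. The only point is the `except` branch, which forgets the cached checksum when the task fails.

```python
import hashlib
from typing import Callable, Dict, List, Optional


class ChecksumCache:
    """Hypothetical in-memory stand-in for the stored file checksums."""

    def __init__(self) -> None:
        self._checksums: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._checksums.get(key)

    def set(self, key: str, checksum: str) -> None:
        self._checksums[key] = checksum

    def remove(self, key: str) -> None:
        self._checksums.pop(key, None)


def checksum_of(file_contents: List[str]) -> str:
    """Hash the contents of all files the task depends on."""
    digest = hashlib.sha256()
    for content in file_contents:
        digest.update(content.encode("utf-8"))
        digest.update(b"\0")  # separator so file boundaries affect the hash
    return digest.hexdigest()


def run_if_changed(cache: ChecksumCache, key: str,
                   file_contents: List[str],
                   task: Callable[[], None]) -> bool:
    """Run `task` only if the dependencies changed; return True if it ran.

    If the task fails, the cached checksum is removed so that the next
    run executes the task again, even if the files were reverted to a
    version that had been cached before.
    """
    new_checksum = checksum_of(file_contents)
    if cache.get(key) == new_checksum:
        return False  # nothing changed, skip

    try:
        task()
    except Exception:
        cache.remove(key)  # the proposed change: forget the old checksum
        raise

    cache.set(key, new_checksum)
    return True
```

With this, the sequence from the report (good file, bad merge that fails, revert to the good file) reruns the create step after the revert, because the failed run cleared the cached checksum.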

martin-loetzsch commented 4 years ago

> IMO the logic should be changed so that the old checksum is removed from the cache if the process resulted in an error.

That actually makes sense, let's do it

jankatins commented 4 years ago

Another interesting case for load jobs: changing times/table, which leads to a drop + create and then a full load because the table is empty. If the load fails midway in that case, only half of the data is in the table. The last_modified_value is still set and the table is not empty, so the next run does an incremental load and the table ends up missing a lot of items. If we delete the last_modified_value instead, the table is not dropped (as the files have not changed), so it still contains data and the next load results in duplicate data. So for load job:

For file dependencies, I would do this: