Open timmysuh opened 3 years ago
I have the same doubt.
I agree, it should be ">". I've tested this change and am no longer getting the duplicates. Hoping we can get this change made to master.
In fact, it's normal to have ">=" to avoid loss of data. If 2 row are written on disk sequentially but have the same "updated_at" value, and the extraction is made between the 2 writes, you will lose data.
What's the best way to avoid duplicated rows then? I haven't run into this issue with any of the other taps I've worked with.
The target must manage that and upsert or merge the New data.
@NicolasRisi Any recommends when original table doesn't have the PK but target has PK?
so WHERE {} >= '{}'::{}
this makes duplicated key for the last row as @timmysuh mentioned when rerun the query in this case.
I cannot think of a reason why we have ">=" instead of just ">" for INCREMENTAL replication logic?
It seems that we always end up updating the last row. Why is this a desired behavior?
if replication_key_value: select_sql = """SELECT {} FROM {} WHERE {} >= '{}'::{} ORDER BY {} ASC""".format(','.join(escaped_columns), post_db.fully_qualified_table_name(schema_name, stream['table_name']), post_db.prepare_columns_sql(replication_key), replication_key_value, replication_key_sql_datatype, post_db.prepare_columns_sql(replication_key))