singer-io / tap-postgres

tap-postgres
GNU Affero General Public License v3.0
67 stars 63 forks source link

During incremental copy, the last row is always updated #104

Open timmysuh opened 3 years ago

timmysuh commented 3 years ago

I cannot think of a reason why we have ">=" instead of just ">" for INCREMENTAL replication logic?

It seems that we always end up updating the last row. Why is this a desired behavior?

if replication_key_value: select_sql = """SELECT {} FROM {} WHERE {} >= '{}'::{} ORDER BY {} ASC""".format(','.join(escaped_columns), post_db.fully_qualified_table_name(schema_name, stream['table_name']), post_db.prepare_columns_sql(replication_key), replication_key_value, replication_key_sql_datatype, post_db.prepare_columns_sql(replication_key))

ingcrengifo commented 3 years ago

I have the same doubt.

tmonks commented 3 years ago

I agree, it should be ">". I've tested this change and am no longer getting the duplicates. Hoping we can get this change made to master.

NicolasRisi commented 3 years ago

In fact, it's normal to have ">=" to avoid loss of data. If 2 row are written on disk sequentially but have the same "updated_at" value, and the extraction is made between the 2 writes, you will lose data.

tmonks commented 3 years ago

What's the best way to avoid duplicated rows then? I haven't run into this issue with any of the other taps I've worked with.

NicolasRisi commented 3 years ago

The target must manage that and upsert or merge the New data.

bback99 commented 3 years ago

@NicolasRisi Any recommends when original table doesn't have the PK but target has PK?

so WHERE {} >= '{}'::{}

this makes duplicated key for the last row as @timmysuh mentioned when rerun the query in this case.