medic / cht-sync

Data synchronization between CouchDB and PostgreSQL for the purpose of analytics.
GNU General Public License v3.0

Figure out what to do with edits and deletes #112

Closed dianabarsan closed 4 months ago

dianabarsan commented 5 months ago

@garethbowen and I have been talking about how to handle edits and deletes in cht-sync -> postgres.

  1. Old couch2pg would always overwrite previous versions of docs, and postgres would only have access to the latest version. Deleted docs from CouchDB were also deleted from postgres.
  2. Old cht-sync (logstash) would never edit records, instead keeping a copy of every document revision; deleted docs were deleted from postgres (theoretically; haven't tested)
  3. New cht-sync (couch2pg++) follows logstash in keeping copies of every rev, but does not handle deletes.

I would like to discuss and reach a conclusion about two things:

  1. What data do we store in postgres? Do we keep every rev or do we overwrite?
  2. What do we do with deletes in postgres?

There are a couple of issues (of varying degrees of complexity and impact to code and views). When we save a copy of every rev, we need to disambiguate these in our main queries and make sure we always get the "correct" one. Aidan's original suggestion is naive and doesn't even work: https://github.com/medic/cht-sync/pull/1

Ordering by _rev, which is a string, won't yield the correct result, because rev 183-fjdsklfjhdsklfs will sort higher than rev 1012-fgdshjfsfks. In couch2pg++ I've changed the stored rev to strip the hash, and made the rev field an integer to allow ordering, but we still need to be very careful and intentional about which version of the doc we send further into DBT. An additional question here is how incremental tables handle these changes: are they detected as changes? I think a round of comprehensive behavior testing is necessary here, to make an informed decision about: how complex this gets when we do it right, how much of a performance impact edits have on incremental models/views/whatever we decide is most performant, and whether we know how to select the correct version of the doc.
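
For reference, with an integer rev the "latest version" query would look something like this (table and column names are illustrative, not the actual cht-sync schema):

```sql
-- Pick the highest rev per document; relies on rev being stored as an
-- integer with the hash suffix stripped, as described above.
SELECT DISTINCT ON (doc_id) doc_id, rev, doc
FROM couchdb_docs
ORDER BY doc_id, rev DESC;
```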

The other thing I want to discuss is what we do with deletes. If we indeed keep a copy of every rev, and we have this ledger approach, then we should not be deleting any records from postgres, but rather mark them as deleted and propagate them somehow further in the models, to be either ignored or taken into account.
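
With that approach a delete would look something like this (again illustrative names; assumes a boolean deleted column):

```sql
-- Instead of removing the row, flag it; every downstream model then has
-- to filter on (or explicitly handle) the flag.
UPDATE couchdb_docs SET deleted = TRUE WHERE doc_id = 'some-doc-id';
```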

I think the simplest thing we could do is decide that we won't keep a copy of every rev, make edits to postgres records, and delete postgres records when docs get deleted from CouchDB. It's unclear to me what significant advantages we get from keeping old copies of records in postgres, while we clearly have a lot more complexity and overhead in code and queries to maintain this.

garethbowen commented 5 months ago

I think we should go back to the couch2pg approach of only keeping the latest change. Nobody has asked for all revs to be available, so having to find the latest adds unnecessary complexity to the queries. If it becomes a requirement later we can worry about it then, once we know the actual use case.

It's likely that this will make write performance worse because instead of always just appending, now we have to UPSERT, but write performance is less important than read performance because reading will happen far more than writing over the lifetime of the record.
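
Something like this, assuming a unique constraint on the document id (names illustrative):

```sql
-- Insert a new doc, or overwrite the stored copy if the doc id exists.
INSERT INTO couchdb_docs (doc_id, rev, doc)
VALUES ($1, $2, $3)
ON CONFLICT (doc_id)
DO UPDATE SET rev = EXCLUDED.rev, doc = EXCLUDED.doc;
```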

I'm not sure what the limitations of DBT are here so we'll just have to make sure it can handle this scenario before going too far down this road.

witash commented 5 months ago

For DBT, the strategy for incremental tables that it's currently using, and the one we probably want, is delete+insert: it queries the source table for changed rows ("changed" defined however you want; for cht-pipeline it's @timestamp), then deletes each of those rows from the target table (using the unique_key column, which for cht-pipeline is currently the document id from couchdb), and then inserts all changed rows into the target table.
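
Roughly what a model using that strategy looks like (source and column names are just for illustration):

```sql
-- Illustrative DBT incremental model using the delete+insert strategy.
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key='_id'
) }}

SELECT _id, "@timestamp", doc
FROM {{ source('couchdb', 'docs') }}
{% if is_incremental() %}
  -- Only rows that changed since the last run are picked up.
  WHERE "@timestamp" > (SELECT MAX("@timestamp") FROM {{ this }})
{% endif %}
```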

So keeping old revs adds complexity there also; as it's currently written it won't quite work, because document id by itself wouldn't uniquely identify a row, and adding the rev to the unique key would mean pushing this update/insert logic down into the models, which I don't think we want.

About deletes: this strategy doesn't support hard deletes at all; deleting rows from the source table wouldn't change anything in the target table, and there's also no way to delete things from the target table.

dianabarsan commented 5 months ago

According to this, we could enable deletes as post hooks: https://medium.com/@uhuynh/execute-sql-delete-statement-in-dbt-2fe64bca8df5
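
i.e. something along these lines (illustrative; assumes the raw table keeps CouchDB's deleted flag):

```sql
-- Illustrative post hook: after the model builds, drop rows whose source
-- doc was deleted.
{{ config(
    post_hook="DELETE FROM {{ this }} WHERE _id IN (SELECT _id FROM couchdb_docs WHERE _deleted = TRUE)"
) }}
```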

dianabarsan commented 5 months ago

After numerous discussions, we have had no one advocating for the necessity of keeping a copy of every rev. This means we're going ahead with:

  1. Overwriting records on edits
  2. Deleting records

njuguna-n commented 4 months ago

PR merged so closing this issue.