toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License
1.1k stars 172 forks source link

Provide Optimized Reindexing #517

Open jvanderen1 opened 6 months ago

jvanderen1 commented 6 months ago

I was wondering if it would be possible to optimize how PGSync updates Elasticsearch records. I put 2 different thoughts I had into these dropdown menu's. Please let me know what you think.

Only Update When Specified Columns Changed
I notice that when an unspecified column updates, it fires a re-index. ### Example Given the following schema for table `my_table` ```ruby # Table: my_table # --------------------------------------------------------------------------------------------------------- # Columns: # id | integer | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY # foo | text | # bar | text | # --------------------------------------------------------------------------------------------------------- ``` And a `schema.json` defined as: ```json [ { "database": "default", "index": "my_index", "nodes": { "table": "my_table", "columns": [ "foo" ] } } ] ``` ### What is happening? `my_index` is getting updated when column `bar` updates. ### What do I expect to happen instead? `my_index` would not be updated when `bar` updates because it is not specified in `schema.json`.
Batch Updates Together
I notice that when a record updates multiple times, it gets re-indexed the same number of times it was updated. ### Example Given the following schema for table `my_table` ```ruby # Table: my_table # --------------------------------------------------------------------------------------------------------- # Columns: # id | integer | PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY # foo | text | # --------------------------------------------------------------------------------------------------------- ``` And a `schema.json` defined as: ```json [ { "database": "default", "index": "my_index", "nodes": { "table": "my_table", "columns": [ "foo" ] } } ] ``` And I perform the following query: ```sql UPDATE my_table SET foo = "hello" WHERE id = 1; ``` Followed by another query (before `pgsync` runs again): ```sql UPDATE my_table SET foo = "world" WHERE id = 1; ``` ### What is happening? 2 insert requests are sent to `my_index` the next time `pgsync` runs. ### What do I expect to happen instead? 1 insert request is sent to `my_index` the next time `pgsync` runs. Because we are performing full document updates in the index, there is no reason to send the same document to Elasticsearch twice.