toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License
1.11k stars 174 forks source link

High Latency - only sync when specific columns are changed in child table #422

Closed qbzenker closed 1 year ago

qbzenker commented 1 year ago

PGSync version: 2.5

Postgres version: 14.4

Elasticsearch version: Opensearch 2.3

Redis version: 6.2

Python version: 3.10

Problem Description:

Hi - we are seeing high latency numbers from pg -> es (over 30 mins in some cases), even after tuning the performance variables (e.g., USE_ASYNC, NTHREADS_POLLDB and POLL_TIMEOUT). We think this is because we have one child table that has one or two columns that change very frequently. However, we are not indexing these columns (they don't appear in the pgsync schema).

It appears that the PG triggers are set up in such a way such that they'll fire and cause pgsync to run after any column in that child table has changed. Is this correct? If so, do you know of any way around this with PGSync?

Thanks in advance, and thanks for all of your work on this library!

toluaina commented 1 year ago

30 mins seems excessively high. Can you give me an idea of the database schema layout? Can you ensure you have the correct indices first ? i.e run pgsync -c schema.json -a

Yes pgsync triggeres will fire events when any child entity has changed. That said there might be possibility of some optimisation here if you are not concerned about the child column.

qbzenker commented 1 year ago

Agreed - thanks for taking a look. I did pgsync -c schema.json -a and and got all OK - cols had an appropriate index.

here is our table layout

[
  {
    "database": "db",
    "index": "index-v1",
    "mapping": {},
    "plugins": ["P1", "P2", "P3"],
    "nodes": {
      "table": "J",
      "schema": "public",
      "columns": [
        ...
      ],
      "children": [
        {
          "table": "B",
          "label": "b",
          "columns": ["id", "type", "etc"],
          "relationship": {
            "variant": "object",
            "type": "one_to_many"
          }
        },
        {
          "table": "JC",
          "label": "jc",
          "columns": ["id", "package_id"],
          "relationship": {
            "variant": "object",
            "type": "one_to_many"
          },
          "children": [
            {
              "table": "T",
              "label": "t",
              "columns": [
                ...
              ],
              "relationship": {
                "variant": "object",
                "type": "one_to_many",
                "foreign_key": {
                  "child": ["jcid"],
                  "parent": ["id"]
                }
              },
              "children": [
                {
                  "table": "P",
                  "label": "p",
                  "columns": ["name"],
                  "relationship": {
                    "variant": "scalar",
                    "type": "one_to_one",
                    "foreign_key": {
                      "child": ["id"],
                      "parent": ["pid"]
                    }
                  }
                },
                {
                  "table": "D", // <-- this is the table that FREQUENTLY changes!
                  "label": "d",
                  "columns": [
                    ...
                  ],
                  "relationship": {
                    "variant": "object",
                    "type": "one_to_one",
                    "foreign_key": {
                      "child": ["tid"],
                      "parent": ["id"]
                    }
                  }
                }
              ]
            },
            {
              "table": "P",
              "label": "p",
              "columns": [
                ...
              ],
              "relationship": {
                "variant": "object",
                "type": "one_to_one"
              },
              "children": [
                {
                  "table": "L",
                  "label": "pl",
                  "columns": [
                    ...
                  ],
                  "relationship": {
                    "variant": "object",
                    "type": "one_to_one",
                    "foreign_key": {
                      "child": ["id"],
                      "parent": ["plid"]
                    }
                  }
                },
                {
                  "table": "L",
                  "label": "dl",
                  "columns": [
                  ...
                  ],
                  "relationship": {
                    "variant": "object",
                    "type": "one_to_one",
                    "foreign_key": {
                      "child": ["id"],
                      "parent": ["dlid"]
                    }
                  }
                }
              ]
            }
          ]
        }
      ]
    }
  }
]
qbzenker commented 1 year ago

Is there a way to horizontally or vertically scale the deployment? If so, would that alleviate some of the latency problems? I'm hoping to be able to use this in a production environment where this solution will scale 10 to 100x what it is today :)

toluaina commented 1 year ago

Apologies for the delay. It's hard to know where the bottleneck is. It might be work running with -v to see if some of the queries are taking a long time to complete. Another suggestion would be to incrementally add child nodes and see where the latency starts to happen. Also please ensure you have the latest version which address some performance concerns,

qbzenker commented 1 year ago

Hey - thanks for the reply! I am using the latest version. I have a curious observation - when I restart the container running pgsync in daemon mode, the lagging records get indexed immediately, but then after a few minutes (to hrs, depending on volume) the latency increases to the point where created records in PG take ~1hr+ to get indexed into ES. Any thoughts here?

Seems like in our case it'd be faster to just periodically (every few seconds) call the pull() method in sync.py instead of doing the continuous sync. I believe this is happening because when we create a top-level "B" we also create a ton of records in children tables too.

qbzenker commented 1 year ago

The fix here was to switch to the polling mode instead of streaming.