toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License

Memory problem #356

Open eluizbr opened 1 year ago

eluizbr commented 1 year ago

PGSync version: 2.3.2

Postgres version: 12.11

Elasticsearch version: 7.10

Redis version: 6.2.6

Python version: 3.10.8

Problem Description: Friends, I have a PGSync deployment on Kubernetes and I have been facing constant problems with memory overflow.

pgsync keeps consuming memory until it runs out. Do you have any configuration or guidance you can give me?

Error Message (if any):

toluaina commented 1 year ago

Can you share more details? How much RAM is PGSync using? What is the overall size of your DB, including row count? You can run this query to produce some stats:

SELECT nspname || '.' || relname AS "relation",
    pg_size_pretty(pg_total_relation_size(C.oid)) AS "total_size"
  FROM pg_class C
  LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
  WHERE nspname NOT IN ('pg_catalog', 'information_schema')
    AND C.relkind <> 'i'
    AND nspname !~ '^pg_toast'
  ORDER BY pg_total_relation_size(C.oid) DESC
  LIMIT 10;
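
For the row counts, an approximate query along these lines is usually enough (a sketch that reads the planner's statistics instead of running an exact COUNT(*)):

-- n_live_tup is an estimate maintained by autovacuum/ANALYZE, not an exact count
SELECT schemaname || '.' || relname AS "relation",
    n_live_tup AS "approx_row_count"
  FROM pg_stat_user_tables
  ORDER BY n_live_tup DESC
  LIMIT 10;
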
eluizbr commented 1 year ago

Hi, Thanks for your reply:

PGSync memory: 2 GB

public.refurbishItems, 2596 MB
public.sessions, 1804 MB
public.task, 1662 MB
public.searches, 1235 MB
public.activities, 382 MB
public.refurbishes, 334 MB
public.files, 308 MB
public.UserPermissions, 197 MB
public.items, 160 MB
public.refurbishFiles, 109 MB

toluaina commented 1 year ago
mmbfreitas commented 1 year ago

Hi @toluaina, I work with Emerson; we solved that, thanks... However, one of our replication slots occasionally stops syncing and we can't figure out why. What happens is that at some point the slot stops syncing and its size starts growing indefinitely, and it is happening with just one table. Do you have any suggestions?

thanks

eluizbr commented 1 year ago

Hi @toluaina, just to add: it occurs once or twice a week. We need to kill the PGSync pod to get it working normally again. I don't know if it's useful information, but it runs on Kubernetes.

Postgres: 12.11 (AWS RDS)
Python: 3.10.8 (Docker / Kubernetes)

this is our ENV today:

CONSOLE_LOGGING_HANDLER_MIN_LEVEL=WARNING
ELASTICSEARCH_SSL_SHOW_WARN=False
ELASTICSEARCH_USE_SSL=False
QUERY_CHUNK_SIZE=20000
REDIS_HOST=HOST_HERE
ELASTICSEARCH_AWS_HOSTED=true
ELASTICSEARCH_CHUNK_SIZE=6000
REDIS_READ_CHUNK_SIZE=3000
REDIS_WRITE_CHUNK_SIZE=3000
ELASTICSEARCH_STREAMING_BULK=True

LOGS:

Sync homehero Xlog: [16] => Db: [746] => Redis: [total = 746 pending = 0] => Elastic: [18,248] ...
Sync homehero Xlog: [17] => Db: [744] => Redis: [total = 744 pending = 0] => Elastic: [64,152] ...
Sync homehero Xlog: [16] => Db: [746] => Redis: [total = 746 pending = 0] => Elastic: [18,248] ...
Sync homehero Xlog: [17] => Db: [744] => Redis: [total = 744 pending = 0] => Elastic: [64,152] ...
Sync homehero Xlog: [17] => Db: [744] => Redis: [total = 744 pending = 0] => Elastic: [64,152] ...
Sync homehero Xlog: [16] => Db: [746] => Redis: [total = 746 pending = 0] => Elastic: [18,248] ...

mmbfreitas commented 1 year ago

@toluaina Do you have any idea? We are hitting this problem a lot in production.

toluaina commented 1 year ago

Hi,

Sorry about the delay.

toluaina commented 1 year ago

Also, the replication slot is tied to the database and not to individual tables. Do you mean pgsync stops syncing and the replication slot starts growing?
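
You can see which slots exist and which database they belong to with a query like this on the source database (a sketch; it just reads the pg_replication_slots view):

-- One row per replication slot; "active" shows whether something is currently consuming it.
SELECT slot_name, plugin, slot_type, database, active
  FROM pg_replication_slots;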

eluizbr commented 1 year ago

Hi @toluaina, how many replication slots do you have? Does this mean that the replication slots are continually growing?

toluaina commented 1 year ago

So the default is 180 secs (3 mins), as described here. We just want to confirm the slots are getting truncated each time it runs. So set it to a low value like 10 secs to see if/when the replication slot gets truncated.
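
You can also check this directly on the Postgres side; a query roughly like the following (a sketch, not something pgsync runs for you) shows how much WAL each slot is still holding back:

-- Retained WAL per slot; a value that only grows means the slot is never consumed/truncated.
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
  FROM pg_replication_slots
  ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;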

eluizbr commented 1 year ago

@toluaina We only have the problem with one specific slot... today we have about 10 slots...

mmbfreitas commented 1 year ago

Hey, thanks for helping.

how many replication slots do you have?

R: We have 2; pgsync creates one for each mapped table in the schema.

does that mean the replication slot(s) is/are growing continuously?

R: Yes; looking at the AWS metrics, both the oldest replication lag and the transaction log disk usage keep growing (charts below).

Transaction log disk usage: [chart]

Oldest replication lag: [chart]

eluizbr commented 1 year ago

Also, the replication slot is tied to the database and not to individual tables. Do you mean pgsync stops syncing and the replication slot starts growing?

That is exactly what happens...

toluaina commented 1 year ago

So after 10 secs you should see this line in the logs, assuming REPLICATION_SLOT_CLEANUP_INTERVAL is 10: "Truncating replication slot: "

eluizbr commented 1 year ago

So after 10 secs you should see this line in the logs, assuming REPLICATION_SLOT_CLEANUP_INTERVAL is 10: "Truncating replication slot: "

We'll try 15 seconds... If the problem persists, we'll let you know here... Thanks for your help.

mmbfreitas commented 1 year ago

@toluaina First of all, it is important to say that we have a Lambda function that cleans our replication slots every day at 8:45 UTC. We created it because we are on Postgres 12, which does not provide the max_slot_wal_keep_size option, and sometimes the size of the slots was growing indefinitely.
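
Roughly, the cleanup does something like this (a simplified sketch only; it is not the actual Lambda code):

-- Hypothetical sketch of the daily cleanup: drop every slot that is not currently
-- being consumed, which frees the WAL it was retaining. If one of pgsync's slots
-- is dropped here, pgsync fails with "replication slot ... does not exist" until
-- the slot is created again.
SELECT pg_drop_replication_slot(slot_name)
  FROM pg_replication_slots
  WHERE NOT active;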

So I have just noticed that the oldest replication lag always starts to grow at this time, 8:45; however, it does not happen every day, and there is no clear pattern to it.

Looking into the logs I have seen this:

2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_installments', NULL, NULL)
2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:LOG: logical decoding found consistent point at C6F/140008C8
2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:DETAIL: There are no running transactions.
2023-01-30 08:42:06 UTC:172.17.3.182(33648):homehero@homehero:[29390]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_installments', NULL, NULL)
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:LOG: starting logical decoding for slot "homehero_payments"
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:DETAIL: Streaming transactions committing after C6F/18000110, reading WAL from C6F/18000028.
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_payments', NULL, NULL)
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:LOG: logical decoding found consistent point at C6F/18000028
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:DETAIL: There are no running transactions.
2023-01-30 08:44:38 UTC:172.17.3.182(57322):homehero@homehero:[11817]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_payments', NULL, NULL)
2023-01-30 08:45:07 UTC:172.17.3.182(43982):homehero@homehero:[11864]:ERROR: replication slot "homehero_installments" does not exist
2023-01-30 08:45:07 UTC:172.17.3.182(43982):homehero@homehero:[11864]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_GET_CHANGES('homehero_installments', NULL, NULL)
2023-01-30 08:45:15 UTC:172.17.3.182(53812):homehero@homehero:[22735]:LOG: logical decoding found consistent point at C6F/1C04E8A0
2023-01-30 08:45:15 UTC:172.17.3.182(53812):homehero@homehero:[22735]:DETAIL: There are no running transactions.
2023-01-30 08:45:15 UTC:172.17.3.182(53812):homehero@homehero:[22735]:STATEMENT: SELECT * FROM PG_CREATE_LOGICAL_REPLICATION_SLOT('homehero_payments', 'test_decoding')
2023-01-30 08:45:22 UTC:172.17.3.182(53814):homehero@homehero:[22751]:LOG: logical decoding found consistent point at C6F/1C079E38
2023-01-30 08:45:22 UTC:172.17.3.182(53814):homehero@homehero:[22751]:DETAIL: There are no running transactions.
2023-01-30 08:45:22 UTC:172.17.3.182(53814):homehero@homehero:[22751]:STATEMENT: SELECT * FROM PG_CREATE_LOGICAL_REPLICATION_SLOT('homehero_installments', 'test_decoding')
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: starting logical decoding for slot "homehero_payments"
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: Streaming transactions committing after C6F/1C04E8D8, reading WAL from C6F/1C04E8A0.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT count(*) AS count_1 FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: logical decoding found consistent point at C6F/1C04E8A0
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: There are no running transactions.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT count(*) AS count_1 FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: starting logical decoding for slot "homehero_payments"
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: Streaming transactions committing after C6F/1C04E8D8, reading WAL from C6F/1C04E8A0.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942 LIMIT 5000 OFFSET 0
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:LOG: logical decoding found consistent point at C6F/1C04E8A0
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:DETAIL: There are no running transactions.
2023-01-30 08:45:30 UTC:172.17.3.182(43538):homehero@homehero:[22761]:STATEMENT: SELECT xid, data FROM PG_LOGICAL_SLOT_PEEK_CHANGES('homehero_payments', NULL, NULL) WHERE CAST(CAST(xid AS TEXT) AS BIGINT) < 162715942 LIMIT 5000 OFFSET 0

At 8:45 the error ERROR: replication slot "homehero_installments" does not exist happened. After that, pgsync created the replication slots again; payments went back to working, but installments didn't.
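
If a slot has been dropped like this, it can also be recreated by hand before restarting pgsync (a sketch; the slot name and plugin are taken from the logs above). Note that a recreated slot only sees changes committed after its creation point, so rows that changed while the slot was missing may need a full re-sync:

-- Recreate the logical replication slot that was dropped so pgsync can resume consuming changes.
SELECT * FROM PG_CREATE_LOGICAL_REPLICATION_SLOT('homehero_installments', 'test_decoding');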

toluaina commented 1 year ago

  • It seems you have a number of replication slots.
  • I'm trying to understand which one is the culprit.
  • Replication slot homehero_installments should already exist before running pgsync.
  • The fact that it does not exist implies that the bootstrap was not run yet.
  • bootstrap -c schema.json should be run just once to set things up, including ALL your replication slots.
  • then run pgsync -c schema.json to sync

PS: I'm happy to move this discussion to discord if you feel necessary.

eluizbr commented 1 year ago
  • It seems you have a number of replication slots.
  • I'm trying to understand which one is the culprit.
  • Replication slot homehero_installments should already exist before running pgsync.
  • The fact that it does not exist implies that the bootstrap was not run yet.
  • bootstrap -c schema.json should be run just once to set things up, including ALL your replication slots.
  • then run pgsync -c schema.json to sync

PS: I'm happy to move this discussion to discord if you feel necessary.

I run bootstrap -c schema.json every time the container is recreated... Could this be the problem then? As for moving to Discord, that would be great...

toluaina commented 1 year ago