toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License

Resource usage climbing #315

Open bhalonen opened 2 years ago

bhalonen commented 2 years ago

PGSync version: 2.3.1
Postgres version: 13.7
Elasticsearch version: 8.3.2
Redis version: redis:7-alpine
Python version: 3.10.5

Problem Description: We are seeing memory usage grow steadily in production. How can we limit pgsync's memory usage? Does limiting query complexity help with that? What about the ngram count (though I would expect that to be an Elasticsearch-side setting)?

Our db size is 750 MB and pgsync is using 40 GB of memory.

toluaina commented 2 years ago
bhalonen commented 2 years ago

@toluaina thanks for the response!

We are definitely past the initial sync phase; the resource usage creeps up over hours.

Restarting the daemon definitely drops memory; our holdover solution is to run pgsync in run-once mode in a loop from bash, as sketched below.
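Roughly, the wrapper looks like this (a sketch, not our exact script; I'm assuming the schema file is called schema.json, that connection settings come from environment variables, and that the flags match our pgsync version):

while true; do
    # One full sync pass without --daemon; the process exits afterwards,
    # which releases whatever memory it accumulated.
    pgsync --config schema.json
    # Short pause before the next pass.
    sleep 10
done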

Thanks for the SQLAlchemy tip; I suspected it was something to do with the queries.

zacherkkila commented 2 years ago

Interesting. For some context on timing (Brent and I are working on the same issue): after a restart we go from negligible memory usage to 44% of our machine in under 2 hours (approximately 28 GB of memory used).

[screenshot: memory usage]
toluaina commented 2 years ago

My guess is that the initial sync is still not complete, even though it may appear so.

bhalonen commented 2 years ago

@toluaina what would you look for to indicate that initial sync is complete? We have one very simple plugin:

from pgsync import plugin

class IdInDomainPlugin(plugin.Plugin):
    name = 'IdInDomain'

    def transform(self, doc, **kwargs):
        # Cast id_in_domain to a string before the document is indexed.
        if 'id_in_domain' in doc:
            doc['id_in_domain'] = str(doc['id_in_domain'])
        return doc
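For completeness, the plugin is referenced from the schema by name, roughly like this (a simplified sketch; the database, index, table, and column names here are placeholders, not our real schema):

[
    {
        "database": "mydb",
        "index": "myindex",
        "nodes": {
            "table": "mytable",
            "columns": ["id", "id_in_domain"],
            "plugins": ["IdInDomain"]
        }
    }
]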

Right now we see a 10-30 s sync time and roughly a 400x decrease in memory usage by looping in run-once mode.

toluaina commented 2 years ago
zacherkkila commented 2 years ago

From what I can see, the latest logs look pretty healthy and everything seems to work well besides the memory increase.

[screenshot: pgsync logs]
zacherkkila commented 2 years ago

Here is a frame where data was being synced

[screenshot: logs during a sync]

and then back to 0 in the next set of logs

[screenshot: subsequent logs with counts back at 0]
loren commented 2 years ago

I have a similar issue and may have some more data to share to help shine a light on this. I run top on the pgsync container, which is running on a GCP VM instance (Google Cloud OS) with 4 GB of RAM. After the initial sync, I start the -d process and it logs like this:

Sync foo Xlog: [0] => Db: [0] => Redis: [total = 0 pending = 0] => Elastic: [0] ...

Looking at the RSS in top, pgsync is using ~95MB RAM.
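(In case it helps with comparing numbers: the same RSS figure can be sampled outside of top with plain ps/pgrep, e.g. by logging it once a minute as below; the pgrep pattern is just a guess at how pgsync shows up in your process list.)

while true; do
    # Resident set size of the oldest matching pgsync process, in kilobytes.
    ps -o rss= -p "$(pgrep -of pgsync)"
    sleep 60
done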

I have a Postgres table with 600,000 rows that I sync to ES with a simple mapping and no child nodes. If I update the table all at once (e.g., update mytable set x=x+1), the pgsync RSS spikes from 95 MB to 3.0 GB within seconds, and within a minute the OOM killer shuts down pgsync.

[84675.994166] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=docker.service,mems_allowed=0,global_oom,task_memcg=/system.slice/docker-8ed4d7f0dbab30e509c77e6a1ceaa1ea1c6e8262ad691c10a72ac0218e51d876.scope,task=pgsync,pid=2339,uid=1000
[84676.016132] Out of memory: Killed process 2339 (pgsync) total-vm:4378844kB, anon-rss:3435444kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:7036kB oom_score_adj:0

If I repeat that whole process but only update 300K of the 600K rows, RSS shoots to 1.6 GB but everything keeps running. The Redis queue gets worked through in a few minutes and the CPU stabilizes, but the RSS stays at 1.6 GB indefinitely. I understand there are several non-pgsync reasons why RSS may not decrease. What's causing problems is the almost immediate ballooning of RAM when the daemon receives a lot of updates.

In the comment above, I see "Once the initial sync is complete, memory usage should not continue to grow", but this is not at all the case for me, and apparently not for the OP either.