toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License

Redis bulk issue #481


erry-takumi commented 11 months ago

PGSync version: 2.5.0

Postgres version: PostgreSQL 13.10 on aarch64-unknown-linux-gnu, compiled by gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-6), 64-bit (AWS RDS)

Elasticsearch version: 6.8.0

Redis version: 6.2.4

Python version: python:3.8.16-alpine3.17

Problem Description:

When pgsync processes items, it starts by reading the number of items specified by REDIS_READ_CHUNK_SIZE. If there are more items left in the queue after this first chunk, it should continue processing in chunks of REDIS_READ_CHUNK_SIZE until the queue is empty, but it doesn't. Instead, it waits until the queue has grown by an additional REDIS_WRITE_CHUNK_SIZE items before processing again, and even then it processes only the newest 1000 before going back to sleep.

In simpler terms: the program should keep processing items in chunks until the queue is empty, but if items are left after the first chunk, it waits for even more items to be added before it processes again.
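
To make this concrete, here is a minimal sketch of the two behaviors. This is not pgsync's actual code: the queue is assumed to be a Redis list, and `pop_chunk` and `handle` are hypothetical names.

```python
import redis

r = redis.Redis()
READ_CHUNK_SIZE = 1000  # i.e. REDIS_READ_CHUNK_SIZE

def pop_chunk(queue):
    """Atomically pop up to READ_CHUNK_SIZE items off the head of the list."""
    with r.pipeline() as pipe:
        pipe.lrange(queue, 0, READ_CHUNK_SIZE - 1)
        pipe.ltrim(queue, READ_CHUNK_SIZE, -1)
        items, _ = pipe.execute()
    return items

def handle(items):
    """Placeholder for applying the changes to Elasticsearch."""

# What we observe: one chunk per wakeup, then back to waiting for the queue
# to grow by REDIS_WRITE_CHUNK_SIZE, so a large backlog is never drained.
def process_once(queue):
    handle(pop_chunk(queue))

# What we expect: keep popping chunks until the queue is empty, and only
# then go back to sleep.
def process_until_empty(queue):
    while True:
        items = pop_chunk(queue)
        if not items:
            break
        handle(items)
```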

We have some cronjobs that can edit a lot of records in one go, so there can be thousands of changes all at once. This caused our Redis queue to exceed 120K records.

However, the process never fully caught up: it processed 1000 records even when there were 10,000 changes waiting, then just waited for AN ADDITIONAL 1000 changes.

I've addressed this by setting REDIS_READ_CHUNK_SIZE to a value large enough that it consumed the whole queue at once, which didn't seem to be a problem in terms of system resources, but I think we can now set it back. As for REDIS_WRITE_CHUNK_SIZE, I've set it to 1 for now.
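
For reference, the workaround is just two environment overrides before starting pgsync. A minimal sketch, assuming the standard pgsync CLI (the exact chunk value is arbitrary; it only has to exceed the backlog):

```python
import os
import subprocess

env = dict(os.environ)
# Large enough to swallow our whole ~120K backlog in a single read.
env["REDIS_READ_CHUNK_SIZE"] = "200000"
# Re-trigger a read after every single new change instead of every 1000.
env["REDIS_WRITE_CHUNK_SIZE"] = "1"

# Invocation may differ per setup; ours runs pgsync as a daemon.
subprocess.run(["pgsync", "--config", "schema.json", "--daemon"], env=env)
```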

We also see messages like this:

 Xlog: [132] => Db: [19,084] => Redis: [total = 9,000 pending = 116,532] => Elasticsearch: [2,950,833]...

These messages keep repeating. The Redis queue will not get smaller until another 1000 records change in the database, so it cannot catch up.

Basically, steps to reproduce:

1. Run pgsync with the default REDIS_READ_CHUNK_SIZE and REDIS_WRITE_CHUNK_SIZE (1000 each in our case).
2. Change more than REDIS_READ_CHUNK_SIZE records in one go (e.g. a bulk update from a cronjob).
3. Observe that only the first chunk is processed, and the rest of the queue sits untouched until another REDIS_WRITE_CHUNK_SIZE changes arrive.

I hope this makes sense; let me know if you want a script to reproduce, as I am fairly certain of how it can happen.
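
For illustration, here is a minimal sketch of such a script (the connection details and table name are hypothetical; any single statement touching more than REDIS_READ_CHUNK_SIZE rows should do):

```python
import psycopg2

# Hypothetical connection details.
conn = psycopg2.connect("dbname=mydb user=me host=localhost")

with conn, conn.cursor() as cur:
    # One statement that changes well over REDIS_READ_CHUNK_SIZE (1000)
    # rows in a single transaction, like our cronjobs do.
    cur.execute("UPDATE book SET updated_at = now() WHERE id <= 10000")

conn.close()
```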

Error Message (if any):

None, but pgsync 'falls behind' the db.

toluaina commented 10 months ago

Thanks for the very detailed analysis. This gives me enough to understand what is going on. I will take a look at this right away.

nowfred commented 3 months ago

Same thing happening to us!

@erry-takumi thank you for the writeup and helpful temp solve