jpughcs opened this issue 2 years ago
Seems to work fine with a single schema.
Scratch that, the problem remains even with a single schema.
Syncing db Xlog: [22] => Db: [3,142] => Redis: [total = 1,001 pending = 2,141] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [3,142] => Redis: [total = 1,001 pending = 2,141] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [3,142] => Redis: [total = 1,001 pending = 2,141] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [3,177] => Redis: [total = 1,001 pending = 2,177] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [3,250] => Redis: [total = 1,001 pending = 2,250] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [3,311] => Redis: [total = 1,001 pending = 2,311] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [3,392] => Redis: [total = 1,001 pending = 2,393] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [3,741] => Redis: [total = 1,001 pending = 2,741] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [4,573] => Redis: [total = 1,001 pending = 3,573] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [5,139] => Redis: [total = 1,001 pending = 4,138] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [5,605] => Redis: [total = 1,001 pending = 4,604] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [6,178] => Redis: [total = 1,001 pending = 5,178] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [6,674] => Redis: [total = 1,001 pending = 5,674] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,612] => Redis: [total = 1,001 pending = 6,611] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [7,631] => Redis: [total = 1,001 pending = 6,631] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [8,366] => Redis: [total = 1,001 pending = 7,366] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [9,037] => Redis: [total = 1,001 pending = 8,036] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [9,734] => Redis: [total = 1,001 pending = 8,733] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [10,185] => Redis: [total = 1,001 pending = 9,184] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [10,585] => Redis: [total = 1,001 pending = 9,584] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [10,962] => Redis: [total = 1,001 pending = 9,961] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [11,239] => Redis: [total = 1,001 pending = 10,239] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [11,295] => Redis: [total = 1,001 pending = 10,295] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [11,330] => Redis: [total = 1,001 pending = 10,329] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [11,330] => Redis: [total = 1,001 pending = 10,329] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [11,330] => Redis: [total = 1,001 pending = 10,329] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [11,448] => Redis: [total = 1,001 pending = 10,447] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [11,807] => Redis: [total = 1,001 pending = 10,807] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [12,171] => Redis: [total = 1,001 pending = 11,170] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [12,262] => Redis: [total = 1,001 pending = 11,262] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [12,907] => Redis: [total = 1,001 pending = 11,906] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [13,268] => Redis: [total = 1,001 pending = 12,267] => Elastic: [26,680] ...
Syncing db Xlog: [22] => Db: [13,633] => Redis: [total = 1,001 pending = 12,632] => Elastic: [26,680] ...
2022-01-26 19:14:02.229:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:0.957s]
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [27,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [27,680] ...
2022-01-26 19:14:04.716:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:0.944s]
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [28,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [28,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [28,680] ...
2022-01-26 19:14:07.480:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:0.954s]
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [29,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [29,680] ...
2022-01-26 19:14:09.802:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.008s]
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [30,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [30,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [30,680] ...
2022-01-26 19:14:12.506:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:0.966s]
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [31,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [31,680] ...
2022-01-26 19:14:14.868:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:0.975s]
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [32,680] ...
Syncing db Xlog: [22] => Db: [13,927] => Redis: [total = 1,001 pending = 12,926] => Elastic: [32,680] ...
Syncing db Xlog: [22] => Db: [14,006] => Redis: [total = 1,001 pending = 13,005] => Elastic: [32,680] ...
2022-01-26 19:14:17.468:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.125s]
Syncing db Xlog: [22] => Db: [14,139] => Redis: [total = 1,001 pending = 13,139] => Elastic: [33,680] ...
Syncing db Xlog: [22] => Db: [14,676] => Redis: [total = 1,001 pending = 13,675] => Elastic: [33,680] ...
Syncing db Xlog: [22] => Db: [15,004] => Redis: [total = 1,001 pending = 14,004] => Elastic: [33,680] ...
Syncing db Xlog: [22] => Db: [15,413] => Redis: [total = 1,001 pending = 14,412] => Elastic: [33,680] ...
2022-01-26 19:14:21.607:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.373s]
Syncing db Xlog: [22] => Db: [15,697] => Redis: [total = 1,001 pending = 14,697] => Elastic: [34,680] ...
Syncing db Xlog: [22] => Db: [16,005] => Redis: [total = 1,001 pending = 15,005] => Elastic: [34,680] ...
Syncing db Xlog: [22] => Db: [16,484] => Redis: [total = 1,001 pending = 15,483] => Elastic: [34,680] ...
2022-01-26 19:14:24.462:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.183s]
Syncing db Xlog: [22] => Db: [16,484] => Redis: [total = 1,001 pending = 15,483] => Elastic: [35,680] ...
Syncing db Xlog: [22] => Db: [16,484] => Redis: [total = 1,001 pending = 15,483] => Elastic: [35,680] ...
2022-01-26 19:14:26.794:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.085s]
Syncing db Xlog: [22] => Db: [16,484] => Redis: [total = 1,001 pending = 15,483] => Elastic: [36,680] ...
Syncing db Xlog: [22] => Db: [16,484] => Redis: [total = 1,001 pending = 15,483] => Elastic: [36,680] ...
Syncing db Xlog: [22] => Db: [16,548] => Redis: [total = 1,001 pending = 15,547] => Elastic: [36,680] ...
2022-01-26 19:14:29.603:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.005s]
Syncing db Xlog: [22] => Db: [16,826] => Redis: [total = 1,001 pending = 15,826] => Elastic: [37,680] ...
Syncing db Xlog: [22] => Db: [17,201] => Redis: [total = 1,001 pending = 16,200] => Elastic: [37,680] ...
Syncing db Xlog: [22] => Db: [17,529] => Redis: [total = 1,001 pending = 16,528] => Elastic: [37,680] ...
2022-01-26 19:14:33.289:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.013s]
Syncing db Xlog: [22] => Db: [17,796] => Redis: [total = 1,001 pending = 16,796] => Elastic: [38,680] ...
Syncing db Xlog: [22] => Db: [18,153] => Redis: [total = 1,001 pending = 17,153] => Elastic: [38,680] ...
Syncing db Xlog: [22] => Db: [18,482] => Redis: [total = 1,001 pending = 17,483] => Elastic: [38,680] ...
Syncing db Xlog: [22] => Db: [18,689] => Redis: [total = 1,001 pending = 17,689] => Elastic: [38,680] ...
Syncing db Xlog: [22] => Db: [18,975] => Redis: [total = 1,001 pending = 17,974] => Elastic: [38,680] ...
2022-01-26 19:14:38.202:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:0.991s]
Syncing db Xlog: [22] => Db: [19,378] => Redis: [total = 1,001 pending = 18,379] => Elastic: [39,680] ...
Syncing db Xlog: [22] => Db: [19,970] => Redis: [total = 1,001 pending = 18,970] => Elastic: [39,680] ...
Syncing db Xlog: [22] => Db: [20,731] => Redis: [total = 1,001 pending = 19,730] => Elastic: [39,680] ...
Syncing db Xlog: [22] => Db: [21,255] => Redis: [total = 1,001 pending = 20,255] => Elastic: [39,680] ...
2022-01-26 19:14:41.999:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:0.979s]
Syncing db Xlog: [22] => Db: [21,647] => Redis: [total = 1,001 pending = 20,647] => Elastic: [40,680] ...
Syncing db Xlog: [22] => Db: [22,286] => Redis: [total = 1,001 pending = 21,286] => Elastic: [40,680] ...
Syncing db Xlog: [22] => Db: [23,198] => Redis: [total = 1,001 pending = 22,198] => Elastic: [40,680] ...
Syncing db Xlog: [22] => Db: [24,012] => Redis: [total = 1,001 pending = 23,012] => Elastic: [40,680] ...
2022-01-26 19:14:46.546:INFO:elasticsearch: POST https://es:9200/db-idx/_bulk?refresh=false [status:200 request:1.035s]
Syncing db Xlog: [22] => Db: [24,493] => Redis: [total = 1,001 pending = 23,493] => Elastic: [41,680] ...
Syncing db Xlog: [22] => Db: [25,334] => Redis: [total = 1,001 pending = 24,334] => Elastic: [41,680] ...
Is this issue easily reproducible from your end? Perhaps we can introduce some logging to narrow down the cause.
Yes, this happens always. Let me know what you'd like to patch in for logging.
Can you add some logging into this method such as:
def bulk_pop(self, chunk_size: Optional[int] = None) -> List[dict]:
"""Remove and return multiple items from the queue."""
chunk_size: int = chunk_size or REDIS_CHUNK_SIZE
pipeline = self.__db.pipeline()
print(f'pipe {chunk_size}')
pipeline.lrange(self.key, 0, chunk_size - 1)
print(f'range: {chunk_size - 1}')
pipeline.ltrim(self.key, chunk_size, -1)
print(f'ltrim: {chunk_size}')
    items: List = pipeline.execute()  # [lrange result, ltrim result]
logger.debug(f"bulk_pop nsize: {len(items[0])}")
print(f"bulk_pop nsize: {len(items[0])}"}
return list(map(lambda x: json.loads(x), items[0]))
and send the output?
Absolutely.
I will get this for you ASAP, likely tomorrow morning (evening for you).
Thank you!
Attached is a full run.
You can see there's one tick with bulk_pop nsize: 1000, and then bulk_pop never seems to be called again.
Scratch that, bulk_pop is being called again... but at almost exactly 15-minute intervals...
Here is my env config:
ELASTICSEARCH_TIMEOUT=60
ELASTICSEARCH_TIMEOUT=600
ELASTICSEARCH_QUEUE_SIZE=4
ELASTICSEARCH_THREAD_COUNT=4
QUERY_CHUNK_SIZE=5000
ELASTICSEARCH_CHUNK_SIZE=1000
ELASTICSEARCH_STREAMING_BULK=False
REDIS_POLL_INTERVAL=0.01
Just to note, I've experimented a bit with the configuration without luck. Even with REDIS_CHUNK_SIZE=10000, the bulk_pop, processing of the records, and insertion of the documents into ES happen near-instantaneously. I am unable to figure out what is happening between ticks.
Are you running this in Docker?
Are we sure there is no env var defined somewhere in the config?
The fact that this is being called at 15-minute intervals leads me to suspect REDIS_POLL_INTERVAL is in use.
Can you experiment with changing REDIS_POLL_INTERVAL to a low value like 10 secs and see the impact?
Perhaps even print out the value of REDIS_POLL_INTERVAL on startup.
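A minimal sketch of that startup logging, assuming the values are read from environment variables (the fallback "<unset>" markers below are placeholders, not pgsync's actual defaults):

import logging
import os

logger = logging.getLogger("pgsync")

def log_effective_redis_settings() -> None:
    """Log the Redis settings actually in effect inside the container."""
    poll_interval = os.environ.get("REDIS_POLL_INTERVAL", "<unset>")
    chunk_size = os.environ.get("REDIS_CHUNK_SIZE", "<unset>")
    logger.info(f"REDIS_POLL_INTERVAL={poll_interval} REDIS_CHUNK_SIZE={chunk_size}")

# Call once on startup, before entering the main sync loop:
# log_effective_redis_settings()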
This is being run in Docker. The interval appears to change proportionally with REDIS_CHUNK_SIZE. I will experiment with changing REDIS_POLL_INTERVAL.
Changing REDIS_POLL_INTERVAL does not seem to make a difference. I am seeing a large number of these POST and DELETE requests (seen below) that happen in between bulk_pop calls. I am not familiar with the ES library, but from a quick glance, these appear to be internal functions of the ES library you're using. I've tried ELASTICSEARCH_STREAMING_BULK with both True and False, and many different values for ELASTICSEARCH_CHUNK_SIZE (between 1 and 100000), without any luck. All of the said POST and DELETE requests complete in a fraction of a second, and it appears there is no other activity between bulk_pop calls.
2022-02-01 19:24:26.965:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.003s]
2022-02-01 19:24:26.967:INFO:elasticsearch: POST https://es:9200/_search/scroll [status:200 request:0.002s]
2022-02-01 19:24:26.968:INFO:elasticsearch: DELETE https://es:9200/_search/scroll [status:200 request:0.001s]
2022-02-01 19:24:26.971:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.002s]
2022-02-01 19:24:26.973:INFO:elasticsearch: DELETE https://es:9200/_search/scroll [status:200 request:0.001s]
2022-02-01 19:24:26.976:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.003s]
2022-02-01 19:24:26.978:INFO:elasticsearch: POST https://es:9200/_search/scroll [status:200 request:0.002s]
2022-02-01 19:24:26.979:INFO:elasticsearch: DELETE https://es:9200/_search/scroll [status:200 request:0.001s]
2022-02-01 19:24:26.982:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.002s]
2022-02-01 19:24:26.983:INFO:elasticsearch: DELETE https://es:9200/_search/scroll [status:200 request:0.001s]
2022-02-01 19:24:26.987:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.003s]
2022-02-01 19:24:26.990:INFO:elasticsearch: POST https://es:9200/_search/scroll [status:200 request:0.003s]
2022-02-01 19:24:26.991:INFO:elasticsearch: DELETE https://es:9200/_search/scroll [status:200 request:0.001s]
2022-02-01 19:24:26.996:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.004s]
2022-02-01 19:24:26.998:INFO:elasticsearch: DELETE https://es:9200/_search/scroll [status:200 request:0.002s]
2022-02-01 19:24:27.001:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.003s]
2022-02-01 19:24:27.005:INFO:elasticsearch: POST https://es:9200/_search/scroll [status:200 request:0.003s]
2022-02-01 19:24:27.007:INFO:elasticsearch: DELETE https://es:9200/_search/scroll [status:200 request:0.002s]
2022-02-01 19:24:27.013:INFO:elasticsearch: POST https://es:9200/idx/_search?scroll=5m&size=1000 [status:200 request:0.003s]
@toluaina https://github.com/toluaina/pgsync/blob/master/pgsync/elastichelper.py#L215
It seems that if records/events come in too fast from PG, these search scans back up and hold up the queues. They will eventually process, but it may take 10 minutes or longer for each full scan to complete before the next tick processes the queue.
Is the scan actually blocking the redis poll?
Can this process be entirely disabled if running in standalone mode? If so, I can let the process gracefully exit and just run it on a cron in the meantime.
Are there any optimizations that can be made to reduce the amount of data scanned?
Can we make the scan page size configurable?
Can we add extra concurrency at any of these blocking processes?
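For context on those questions: the POST .../_search?scroll=5m&size=1000 followed by POST _search/scroll and DELETE _search/scroll pattern in the log above looks like elasticsearch-py's scan helper. A minimal sketch of that helper in isolation (the index name and query are placeholders), showing that the size argument is the page size one could imagine making configurable:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("https://es:9200")

# Each page issues POST <index>/_search?scroll=5m&size=<size>, then
# POST _search/scroll for subsequent pages and a DELETE of the scroll
# context once the iterator is exhausted.
for hit in scan(
    es,
    index="idx",                          # placeholder index name
    query={"query": {"match_all": {}}},
    scroll="5m",
    size=1000,                            # page size per scroll request
):
    print(hit["_id"])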
Sorta the same problem on my side as well. During the initial data loading/snapshot, with debug logging enabled, there are only logs for the Elasticsearch POST API calls, nothing about Redis. Then, after the document count reaches the expected count (~236 million), there is just continuous spam of 2022-04-26 05:40:24.669:DEBUG:pgsync.redisqueue: bulk_pop nsize: 0, hundreds of times per second.
@jpughcs @toluaina could the issue be due to the Redis version? You are using 6.2 and I am trying with 5.0.5, while the documentation says 3.1.0.
What I noticed now is that my Redis is basically blank; there are no keys at all. This is during the initial loading. I don't get what is wrong. There are no errors related to Redis in the debug log at all, beyond 2022-04-26 11:23:25.939:DEBUG:pgsync.urls: Connecting to Redis without password.
Never mind what I said above; that was a misunderstanding on my side. The logs were simply debug-level output from the Redis polling every 0.1s. 🤦
@jpughcs @toluaina could the issue be due to the Redis version? You are using 6.2 and I am trying with 5.0.5, while the documentation says 3.1.0.
I've tried 3.x, 5.x, 6.x
I don't think the root of my issue is necessarily related to Redis. I've been using a workaround successfully, removing these three lines (the first one here is the culprit: the post-batch, catch-up sync), which allows me to essentially schedule a full sync (without checkpoints or even the need for pg triggers and redis, which I've also stripped out) every five minutes or so (it completes in about 40 seconds with my dataset). This hammers ES pretty hard and locks up the indices, but it works for me in the meantime as my document count is relatively low and there's no need (yet) to have near-realtime updates. I don't think this will work for you with ~236 million records :( unless you just need to batch load your data and call it a day.
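A rough sketch of that scheduled full-sync workaround, assuming pgsync does a single pass and exits when not run as a daemon (the invocation, config path, and interval here are assumptions, not a recommended setup):

import subprocess
import time

SYNC_INTERVAL_SECONDS = 300  # roughly every five minutes

while True:
    started = time.monotonic()
    # Hypothetical one-shot invocation; adjust the command and config path to your setup.
    subprocess.run(["pgsync", "--config", "schema.json"], check=False)
    elapsed = time.monotonic() - started
    time.sleep(max(0.0, SYNC_INTERVAL_SECONDS - elapsed))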
PGSync version: 2.1.10
Postgres version: 13.4
Elasticsearch version: 7.15.0
Redis version: 6.2
Python version: 3.7
Problem Description:
Using multiple schemas in a single config. Initial load works great. Normal sync runs fine for a while until a large influx of records comes through and pgsync enters a broken state. Once in this state, the queue stops being processed and the Redis pending count grows continuously. The bulk pop pipeline does not pull any records from the queue (nor does it trim). However, executing an lrange manually will return results.
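For illustration, a minimal redis-py sketch of that kind of manual check (the connection details and queue key name are assumptions about this environment):

import redis

r = redis.Redis(host="localhost", port=6379)

key = "db"                  # assumed queue key name
print(r.llen(key))          # number of pending items in the queue
print(r.lrange(key, 0, 9))  # peek at the first ten raw entries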
Error Message (if any):
Strange: bulk_pop nsize: 0. Could be a red herring... could the Redis poller thread handling the index in question have died?
Querying queue directly on Redis server: