Open bxcodec opened 8 months ago
The `query_chunk_size` is not applied to the Postgres cursor, so SQLAlchemy keeps pulling rows for as long as it can. https://github.com/toluaina/pgsync/blob/95116702c4b314d8b97696ef857cfe116241e236/pgsync/base.py#L867
You can add a new env variable to control it: https://github.com/accelq/pgsync/blob/f1d7caa95cf8edb30da03e05172e90bf7775b666/pgsync/base.py#L869
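For illustration, here is a minimal sketch of the idea: read a chunk size from an env variable and force a server-side (named) cursor via SQLAlchemy's `stream_results` and `max_row_buffer` execution options, so the driver never buffers the whole result set client-side. The variable name `QUERY_CHUNK_SIZE` and the function are hypothetical, not part of pgsync.

```python
import os

import sqlalchemy as sa

# Hypothetical env variable name -- pgsync does not define this one.
QUERY_CHUNK_SIZE = int(os.environ.get("QUERY_CHUNK_SIZE", "5000"))


def fetch_in_chunks(engine, statement, chunk_size=None):
    """Stream query results in fixed-size chunks.

    On Postgres/psycopg2, stream_results=True switches to a named
    (server-side) cursor, and max_row_buffer caps how many rows are
    buffered on the client at any time.
    """
    chunk_size = chunk_size or QUERY_CHUNK_SIZE
    with engine.connect() as conn:
        result = conn.execution_options(
            stream_results=True,        # server-side cursor where supported
            max_row_buffer=chunk_size,  # cap client-side row buffering
        ).execute(statement)
        while True:
            rows = result.fetchmany(chunk_size)
            if not rows:
                break
            yield rows
```

With this, memory usage is bounded by the chunk size rather than by the total result set.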
This worked for me, though.
Any plan to add this functionality as a core feature? cc @toluaina
After debugging an initial load of 26MM records, I found the following: pgsync uses the ES parallel bulk helper by default (the streaming helper would exit on the first error). When failures or exceptions occur (e.g., records whose JSON structure differs between rows) and the options ELASTICSEARCH_RAISE_ON_ERROR and ELASTICSEARCH_RAISE_ON_EXCEPTION are both true (the default), every failed record stays resident in memory because the error/exception is unhandled; processing continues and the failures pile up until the process runs out of memory. Disable both options and memory stays stable; also check that the data structure of every record matches what is being created in ES. Hope this helps ;) I was able to copy the 26MM records (250 GB) to ES in 3h35m with a peak usage of 1.6 GB of RAM.
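The pattern described above can be sketched as follows: with `raise_on_error=False` and `raise_on_exception=False` (the helper parameters those two env options map to in elasticsearch-py's `parallel_bulk`), each failure can be reported and discarded instead of accumulating. The `drain_bulk_results` helper and the client/action names in the commented usage are assumptions for illustration, not pgsync's actual code.

```python
def drain_bulk_results(results, on_error=print):
    """Consume (ok, item) pairs from a streaming bulk helper.

    Failed items are handed to `on_error` and then dropped, so they do
    not pile up in memory; returns the number of failures.
    """
    failures = 0
    for ok, item in results:
        if not ok:
            failures += 1
            on_error(item)  # item carries the per-document error details
    return failures


# Sketch of wiring this into elasticsearch-py (hypothetical client/actions):
#
#     from elasticsearch import Elasticsearch
#     from elasticsearch.helpers import parallel_bulk
#
#     client = Elasticsearch("http://localhost:9200")
#     results = parallel_bulk(
#         client,
#         actions,                   # generator of bulk index actions
#         raise_on_error=False,      # cf. ELASTICSEARCH_RAISE_ON_ERROR
#         raise_on_exception=False,  # cf. ELASTICSEARCH_RAISE_ON_EXCEPTION
#     )
#     print(drain_bulk_results(results), "documents failed")
```

Logging and dropping each failed item keeps memory flat even when many records are rejected by the mapping.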
PGSync version: master branch, commit 95116702c4b314d8b97696ef857cfe116241e236
Postgres version: 14.5
Elasticsearch version: latest version of OpenSearch
Redis version: latest
Python version: 3.8
Problem Description:
Hi, I tried to run this locally.
I have this env set:
Questions:
Why does the memory grow exponentially on my initial load? Is there any way to make the memory consumption stable?
I have tried lowering the chunk size, e.g., to 1K and even 500, but I am still facing the same issue on the initial load.
Error Message (if any): The application was killed because of OOM.