toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License

Very slow operation on very large tables #130

Open · timbarkerSE opened this issue 3 years ago

timbarkerSE commented 3 years ago

PGSync version: 2.0.0
Postgres version: 13
Elasticsearch version: 7.10.2
Redis version: 5.0.3
Python version: 3.7.9

Problem Description: I'm running pgsync on AWS using EC2, hosted Elasticsearch, and hosted Postgres. My tables are huge: the primary table has 160M rows and one of the child tables has >2B rows.

I've set `QUERY_CHUNK_SIZE` to 1,000 because at 10,000 the process was locking up the server and I couldn't SSH onto it. I'm monitoring the process on the database side and can see it's on the second chunk from the SQL being run: `FETCH FORWARD 1000 FROM "c_7fec50320ac8_1"`. The first chunk took about 4 hours, so clearly it won't finish the whole table in a viable amount of time. I suspect this is partly due to the size of the tables and therefore the joins being performed.
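For context, the cursor fetches can be watched from the database side with a query along these lines (a minimal sketch against `pg_stat_activity`):

```sql
-- Show long-running cursor fetches and how long each has been executing
SELECT pid, state, now() - query_start AS running_for, query
FROM pg_stat_activity
WHERE query LIKE 'FETCH FORWARD%'
ORDER BY query_start;
```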

Is there something I can change to improve this, or am I better off trying a different tool like Logstash?

Thanks for your help

toluaina commented 3 years ago

Really sorry about the delay. 2 billion is a lot! `QUERY_CHUNK_SIZE` is the only db-related variable; I would suggest dropping it even further. Have you considered tuning any Postgres parameters? Also, is the database normalised and with the required indices?
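For illustration only, this is the kind of index that usually matters for a parent/child join (the table and column names here are hypothetical stand-ins):

```sql
-- The child side of each join should be indexed on its foreign-key
-- column; CONCURRENTLY avoids blocking writes while the index builds.
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_child_table_parent_id
    ON child_table (parent_id);
```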

Maybe drop `QUERY_CHUNK_SIZE` to a very small number and then examine the resulting query. You can also send me the db schema and pgsync schema as well.
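One way to examine it, sketched with placeholder names (substitute the statement pgsync actually emits):

```sql
-- Run the generated statement through EXPLAIN to see whether the
-- parent/child joins use indexes or fall back to sequential scans.
EXPLAIN (ANALYZE, BUFFERS)
SELECT p.id, c.value
FROM parent_table AS p
JOIN child_table  AS c ON c.parent_id = p.id
LIMIT 1000;
```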

@timbarkerSE

toluaina commented 3 years ago

There is an edge case where this could be a problem: if you have a parent table that has a relationship to a large number of child tables, then the chunk size only applies to the parent.

I would be keen to understand the structure of your database and the size of each row. Also, can you give an indication of the data distribution, e.g. the ratio of parent to child records, and whether each parent can have a large number of children?
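For example, a quick way to measure that distribution (table and column names are hypothetical):

```sql
-- Average number of child rows per parent, plus the raw counts
SELECT count(*) AS child_rows,
       count(DISTINCT parent_id) AS parents,
       round(count(*)::numeric / NULLIF(count(DISTINCT parent_id), 0), 1)
           AS avg_children_per_parent
FROM child_table;
```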

timbarkerSE commented 3 years ago

Thanks for replying. The process did speed up in the end: the chunks were taking about 10 minutes each, with the whole process taking about 10 days. Still slightly long for what I need, but better.

In the end I decided to write my own script to do the syncing, because I could build a really bespoke and efficient query that runs very quickly. My use case is a primary table of Instagram posts and a child table of hashtags (via a huge join table). The hashtags are contained within the caption in the primary table, so it's easier to derive them from there than to handle the huge join! Many of my other joins are modelled as many-to-many relations, but in reality there's only ever one record, so they're effectively many-to-one and I could write fast SQL that handles that situation.
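As an illustration of the caption-derivation idea (a minimal sketch; the `posts` table and `caption` column are hypothetical stand-ins, not my actual schema):

```sql
-- Pull every #hashtag out of the caption text directly, instead of
-- joining through the multi-billion-row hashtag join table.
SELECT id,
       array(SELECT m[1]
             FROM regexp_matches(caption, '#(\w+)', 'g') AS m) AS hashtags
FROM posts;
```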

Anyway, I appreciate your work on pgsync and taking the time to answer issues.

toluaina commented 3 years ago

Thanks for this. I am keen to learn from what you did to optimise the query path.

dorin-musteata commented 1 year ago

> In the end I decided to write my own script to do the syncing, because I could build a really bespoke and efficient query that runs very quickly. [...]

Hi, could you maybe share some pieces of code with the community? Maybe we can improve the pgsync library with your insights.