toluaina / pgsync

Postgres to Elasticsearch/OpenSearch sync
https://pgsync.com
MIT License
1.18k stars 182 forks source link

Postgres queries are inefficient #164

Open jdposthuma opened 3 years ago

jdposthuma commented 3 years ago

I have a simple, 3 tier table like this:

Here's my schema:

[
    {
        "database": "main",
        "index": "conversation",
        "nodes": {
            "table": "conversation",
            "schema": "staging",
            "columns": [
                "id",
                "messages_count",
                "mode"
            ],
            "children": [
                {
                    "table": "message",
                    "schema": "staging",
                    "label": "messages",
                    "columns": [ 
                        "id",
                        "body",
                        "modes",
                        "tags"
                    ],
                    "relationship": {
                        "variant": "object",
                        "type": "one_to_many",
                        "foreign_key": {
                            "child": ["conversation_id"],
                            "parent": ["id"]
                        }
                    },
                    "children": [
                        {
                            "table": "user",
                            "schema": "staging",
                            "columns": [ 
                                "first_name",
                                "last_name"
                            ],
                            "relationship": {
                                "variant": "object",
                                "type": "one_to_one",
                                "foreign_key": {
                                    "child": ["id"],
                                    "parent": ["user_id"]
                                }
                            }
                        }
                    ]
                }
            ]
        }
    }
]

On an update to the user object, the SQL generated by sqlalchemy produces the following inefficient query (see explain analyze). This runs in about 57 seconds on my dataset:

SELECT anon_1."JSON_BUILD_ARRAY_1",
    anon_1."JSON_BUILD_OBJECT_1",
    anon_1.id
FROM (
        SELECT JSON_BUILD_ARRAY(anon_2._keys) AS "JSON_BUILD_ARRAY_1",
            JSON_BUILD_OBJECT(
                'id',
                conversation_1.id,
                'messages_count',
                conversation_1.messages_count,
                'mode',
                conversation_1.mode,
                'messages',
                anon_2.messages
            ) AS "JSON_BUILD_OBJECT_1",
            conversation_1.id AS id
        FROM staging.conversation AS conversation_1
            LEFT OUTER JOIN (
                SELECT CAST(
                        JSON_BUILD_OBJECT(
                            'message',
                            JSON_AGG(
                                JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(message_1.id))
                            )
                        ) AS JSONB
                    ) || CAST(JSON_AGG(anon_3._keys) AS JSONB) AS _keys,
                    JSON_AGG(
                        JSON_BUILD_OBJECT(
                            'id',
                            message_1.id,
                            'body',
                            message_1.body,
                            'modes',
                            message_1.modes,
                            'tags',
                            message_1.tags,
                            'user',
                            anon_3."user"
                        )
                    ) AS messages,
                    message_1.conversation_id AS conversation_id
                FROM staging.message AS message_1
                    JOIN (
                        SELECT CAST(
                                JSON_BUILD_OBJECT('user', JSON_BUILD_OBJECT('id', user_1.id)) AS JSONB
                            ) AS _keys,
                            JSON_BUILD_OBJECT(
                                'first_name',
                                user_1.first_name,
                                'last_name',
                                user_1.last_name
                            ) AS "user",
                            user_1.id AS id
                        FROM staging."user" AS user_1
                    ) AS anon_3 ON anon_3.id = message_1.user_id
                GROUP BY message_1.conversation_id
            ) AS anon_2 ON anon_2.conversation_id = conversation_1.id
        WHERE conversation_1.id = 'eX...'
            OR conversation_1.id = 'OE...'
            OR conversation_1.id = '6f...'
            OR conversation_1.id = 'PH...'
            OR conversation_1.id = 'EG...'
            OR conversation_1.id = 'CB...'
            OR conversation_1.id = 'V4...'
            OR conversation_1.id = 'gx...'
            OR conversation_1.id = '2Q...'
            OR conversation_1.id = '1P...'
            OR conversation_1.id = 'A7...'
            OR conversation_1.id = 'Mk...'
            OR conversation_1.id = 'nq...'
            OR conversation_1.id = 'de...'
            OR conversation_1.id = 'FB...'
            OR conversation_1.id = 'iO...'
            OR conversation_1.id = 'RB...'
            OR conversation_1.id = 'M4...'
            OR conversation_1.id = 'mL...'
            OR conversation_1.id = 'UW...'
            OR conversation_1.id = 'Yc...'
            OR conversation_1.id = 'wQ...'
            OR conversation_1.id = 'XK...'
            OR conversation_1.id = 'Zr...'
    ) AS anon_1

I can optimize this query to produce the same result. It's much more efficient with memory and runs on about 300ms (see the explain analyze):

select 
    conversation_1.id,
    conversation_1.messages_count,
    conversation_1.mode,
    JSON_AGG(row_to_json(message_1_out)::jsonb || jsonb_build_object('user', row_to_json(user_1_out))) as messages
from staging.conversation conversation_1
left join staging.message message_1 on message_1.conversation_id = conversation_1.id
left join lateral (select message_1.id, message_1.body) message_1_out on true 
left join staging.user user_1 on user_1.id = message_1.user_id
left join lateral (select user_1.first_name, user_1.last_name) as user_1_out on true 
WHERE conversation_1.id in (
    'eX...',
    'OE...',
    '6f...',
    'PH...',
    'EG...',
    'CB...',
    'V4...',
    'gx...',
    '2Q...',
    '1P...',
    'A7...',
    'Mk...',
    'nq...',
    'de...',
    'FB...',
    'iO...',
    'RB...',
    'M4...',
    'mL...',
    'UW...',
    'Yc...',
    'wQ...',
    'XK...',
    'Zr...'
)
group by 
    conversation_1.id,
    conversation_1.messages_count,
    conversation_1.mode;

This design also resolves issue #149.

@toluaina - Can you think of any problems with this approach? Can you provide input on the design?

toluaina commented 3 years ago

Yes I thought about this and it is certainly possible. Although large IN clause's have limitations of their own. I will assess how to go about this.

jdposthuma commented 3 years ago

The IN clause doesn't affect performance either way in this case, just syntax preference.

I took a look at this today and was completely lost trying get sqlalchemy to form a LEFT JOIN LATERAL, which is a neat feature of my proposed query. Hopefully, this won't be a big deal for you.

I assume maintaining this project isn't your full-time job, so no pressure, but do you have an estimate on when this is achievable by?

jdposthuma commented 3 years ago

Also, I almost forgot - here's the keys query I'm using. Hopefully sqlalchemy can do window functions easily:

select jsonb_build_object(
        'message',
        jsonb_build_object(
            'id',
            JSON_AGG(message_1.id) filter (
                where message_1.id IS NOT NULL
            )
        ),
        'user',
        jsonb_build_object(
            'id',
            JSON_AGG(user_1.id) filter (
                where user_1.id IS NOT NULL
            )
        )
    ) as _meta,
    row_to_json(conversation_1_out)::jsonb || jsonb_build_object(
        'messages',
        JSON_AGG(
            row_to_json(message_1_out)::jsonb || jsonb_build_object('user', row_to_json(user_1_out))
        )
    ) as obj,
    conversation_1.id as id
from staging.conversation conversation_1
...
toluaina commented 3 years ago

That query might not be generic enough. Amongst many reasons, pgsync supports compound foreign and primary keys. So this would break that functionality. I have added an optimization strategy to the master branch for your use case. Can you try and let me know the verdict?

jdposthuma commented 3 years ago

To clarify - the use of the IN query did not significantly affect performance in my scenario.

The biggest performance boost came from the use of joins in the main query (not subqueries). Therefore, this commit doesn't significantly affect performance.

toluaina commented 3 years ago

So this did not help in your case but it does provide some level of optimization in certain cases.. I'm trying to find a window to explore your query a bit more.

voyc-jean commented 2 years ago

@toluaina Have you had a chance to look into this? I have a schema very similar to that of @jdposthuma and returning a mere 5 results can take up to a 60 seconds.

In another instance, the query took 53 minutes to return a single record.

toluaina commented 2 years ago

Will take another look at this

toluaina commented 2 years ago

Can either of you provide a more concrete/complete working example, please? I have so far failed to reproduce this myself Additionally, jdposthuma's idea is good but non-trivial to generalize. Although, I believe it can be done. I noticed that jdposthuma is using explicit foreign keys. This usually means there is no foreign key relationship. Are there appropriate indexes defined that could be contributing to the inefficiency?

@voyc-jean Do you have a minimal working example to examine in isolation? i.e both pgsync schema.json and database schema

voyc-jean commented 2 years ago

@toluaina

Things only really start becoming problematic when reaching tens of millions of records in the joined tables. The schema looks like this:

 - book [500k rows]
    - paragraph [3.7 mil rows]
       - sentence [75 mil rows]

The initial sync runs in a reasonable amount of time. The continuous sync with daemon mode, however, is where things get really slow.

If I make a change to a row (id 1) in the sentence table, the query that PGSync generates looks like this:

SELECT JSON_BUILD_ARRAY(anon_1._keys) AS "JSON_BUILD_ARRAY_1",
       JSON_BUILD_OBJECT('id', book_1.id, 'title', book_1.title, 'description', book_1.description, 'paragraph', anon_1.paragraph) AS "JSON_BUILD_OBJECT_1",
       book_1.id
FROM public.book AS book_1
LEFT OUTER JOIN
  (SELECT CAST(JSON_BUILD_OBJECT('paragraph', JSON_AGG(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(paragraph_1.id)))) AS JSONB) || CAST(JSON_AGG(anon_2._keys) AS JSONB) AS _keys,
          JSON_AGG(JSON_BUILD_OBJECT('id', paragraph_1.id, 'sentence', anon_2.sentence)) AS paragraph,
          paragraph_1.book_id AS book_id
   FROM public.paragraph AS paragraph_1
   JOIN
     (SELECT CAST(JSON_BUILD_OBJECT('sentence', JSON_AGG(JSON_BUILD_OBJECT('id', sentence_1.id))) AS JSONB) AS _keys,
             JSON_AGG(JSON_BUILD_OBJECT('id', sentence_1.id, 'text', sentence_1.text)) AS sentence,
             sentence_1.paragraph_id AS paragraph_id
      FROM public.sentence AS sentence_1
      GROUP BY sentence_1.paragraph_id) AS anon_2 ON anon_2.paragraph_id = paragraph_1.id
   GROUP BY paragraph_1.book_id) AS anon_1 ON anon_1.book_id = book_1.id
WHERE book_1.id = '1'

This query takes 101 seconds to execute and returns one result. The explain analyze looks like this.


To replicate this I've had to write a script which populates the database with millions of records. Please see the example in my fork here. The script can be run similarly to the other examples: ./examples/bootstrap.sh book_test

It might obviously take some time for the records to generate as it should generate 500k book rows and upwards of 70 million sentence rows.


On a side note, I believe that this might also be the cause of #211, #130 and #88

toluaina commented 2 years ago
voyc-jean commented 2 years ago

Hi @toluaina

Thanks for the detailed reply and for adding --analyze. You are right about the index, I'm used to the Django ORM which automatically creates foreign key indices.

However, even with the indices added, the query still takes upward of 100s to complete. Please see the new explain.

As per the ouput of --analyze, the indices exist:

Found index "paragraph_book_id_idx" for table "paragraph" for columns: ['book_id']: OK 
Found index "sentence_paragraph_id_idx" for table "sentence" for columns: ['paragraph_id']: OK 

Unless I am still missing something, it doesn't seem like the indices are the main problem?


To add to the above, if I make use of the query from @jdposthuma, the result returns in 16ms.

toluaina commented 2 years ago

Hello @voyc-jean . Can you please check this again? Once I created, the index the updates are near-instantaneous. Can you run this query and time how long it takes?

SELECT JSON_BUILD_ARRAY(anon_1._keys) AS "JSON_BUILD_ARRAY_1",
       CAST(JSON_BUILD_OBJECT('id', book_1.id, 'title', book_1.title, 'description', book_1.description, 'paragraph', anon_1.paragraph) AS JSONB) AS "JSON_BUILD_OBJECT_1",
       book_1.id
FROM public.book AS book_1
LEFT OUTER JOIN
    (
        SELECT CAST(JSON_BUILD_OBJECT('paragraph', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(paragraph_1.id)) AS JSONB))) AS JSONB) || CAST(JSON_AGG(anon_2._keys) AS JSONB) AS _keys,
            JSON_AGG(CAST(JSON_BUILD_OBJECT('id', paragraph_1.id, 'sentence', anon_2.sentence) AS JSONB)) AS paragraph,
            paragraph_1.book_id AS book_id
        FROM public.paragraph AS paragraph_1 
        JOIN
        (
            SELECT CAST(JSON_BUILD_OBJECT('sentence', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', sentence_1.id) AS JSONB))) AS JSONB) AS _keys,
                JSON_AGG(CAST(JSON_BUILD_OBJECT('id', sentence_1.id, 'text', sentence_1.text) AS JSONB)) AS sentence,
                sentence_1.paragraph_id AS paragraph_id
            FROM public.sentence AS sentence_1
            GROUP BY sentence_1.paragraph_id
        ) AS anon_2 ON anon_2.paragraph_id = paragraph_1.id
        where  paragraph_1.book_id = 1
        GROUP BY paragraph_1.book_id
    ) AS anon_1 ON anon_1.book_id = book_1.id
WHERE book_1.id = '1'
voyc-jean commented 2 years ago

@toluaina Your query executed in 375 seconds. Here is the explain. Note that Index scans area being used.

Are you certain that you're testing with roughly the same amount of records I mentioned previously? With smaller tables and fewer nested queries, this issue does not seem to occur.

toluaina commented 2 years ago

@voyc-jean I made some changes to the master branch that should hopefully improve proformance. Can you please try that and let me know.

voyc-jean commented 2 years ago

Hi there @toluaina,

Apologies for the long delay, I've had to find some time to look further into this.

The good news is that the change you made seems to have had a huge positive effect on performance. My former query that took upwards of 100s to return now returns in ~700ms, a great improvement!

The bad news is that this has lead to another issue; After the initial sync is done (after about 8 hours), the following query runs:

SELECT xid,
       DATA
FROM PG_LOGICAL_SLOT_PEEK_CHANGES('database_index', NULL, NULL)
WHERE CAST(CAST(xid AS TEXT) AS BIGINT) >= 152020106
  AND CAST(CAST(xid AS TEXT) AS BIGINT) < 152306901

Once complete after about 8 minutes, the data waiting in the replication slot is synced and and the checkpoint is updated. A new query then runs:

SELECT JSON_BUILD_ARRAY(anon_1._keys, anon_2._keys, anon_3._keys, anon_4._keys, anon_5._keys, anon_6._keys, anon_7._keys, anon_8._keys, anon_9._keys) AS "JSON_BUILD_ARRAY_1",
       CAST(JSON_BUILD_OBJECT('id', voycapi_conversations_1.id, 'name', voycapi_conversations_1.name, 'project_id', voycapi_conversations_1.project_id, 'conversation_date', voycapi_conversations_1.conversation_date, 'conversation_time', voycapi_conversations_1.conversation_time, 'createdate', voycapi_conversations_1.createdate, 'status', voycapi_conversations_1.status, 'note', voycapi_conversations_1.note, 'audiolength', voycapi_conversations_1.audiolength, 'agents', anon_1.agents, 'attribute', anon_2.attribute, 'grading', anon_3.grading, 'hold_time', anon_4.hold_time, 'review', anon_5.review, 'score', anon_6.score, 'tags', anon_7.tags, 'transcript', anon_8.transcript, 'workflow_session', anon_9.workflow_session) AS JSONB) AS "JSON_BUILD_OBJECT_1",
       voycapi_conversations_1.id
FROM public.voycapi_conversations AS voycapi_conversations_1
LEFT OUTER JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_agent_agentconversation', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_agent_agentconversation_1.id)) AS JSONB))) AS JSONB) || CAST(JSON_AGG(anon_10._keys) AS JSONB) AS _keys,
          JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_agent_agentconversation_1.id, 'agent', anon_10.agent) AS JSONB)) AS agents,
          voyc_agent_agentconversation_1.conversation_id AS conversation_id
   FROM public.voyc_agent_agentconversation AS voyc_agent_agentconversation_1
   JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('voyc_agent_agent', CAST(JSON_BUILD_OBJECT('id', voyc_agent_agent_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', voyc_agent_agent_1.id, 'uid', voyc_agent_agent_1.uid) AS JSONB) AS agent,
             voyc_agent_agent_1.id AS id
      FROM public.voyc_agent_agent AS voyc_agent_agent_1
      WHERE voyc_agent_agent_1.id = voyc_agent_agentconversation_1.agent_id) AS anon_10 ON anon_10.id = voyc_agent_agentconversation_1.agent_id
   WHERE voyc_agent_agentconversation_1.conversation_id = voycapi_conversations_1.id
   GROUP BY voyc_agent_agentconversation_1.conversation_id) AS anon_1 ON anon_1.conversation_id = voycapi_conversations_1.id
LEFT OUTER JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_attribute_attributevalue', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_attribute_attributevalue_1.id)) AS JSONB))) AS JSONB) || CAST(JSON_AGG(anon_11._keys) AS JSONB) AS _keys,
          JSON_AGG(CAST(JSON_BUILD_OBJECT('value', voyc_attribute_attributevalue_1.value, 'attribute', anon_11.attribute) AS JSONB)) AS attribute,
          voyc_attribute_attributevalue_1.conversation_id AS conversation_id
   FROM public.voyc_attribute_attributevalue AS voyc_attribute_attributevalue_1
   JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('voyc_attribute_attribute', CAST(JSON_BUILD_OBJECT('id', voyc_attribute_attribute_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', voyc_attribute_attribute_1.id, 'name', voyc_attribute_attribute_1.name) AS JSONB) AS attribute,
             voyc_attribute_attribute_1.id AS id
      FROM public.voyc_attribute_attribute AS voyc_attribute_attribute_1
      WHERE voyc_attribute_attribute_1.id = voyc_attribute_attributevalue_1.attribute_id) AS anon_11 ON anon_11.id = voyc_attribute_attributevalue_1.attribute_id
   WHERE voyc_attribute_attributevalue_1.conversation_id = voycapi_conversations_1.id
   GROUP BY voyc_attribute_attributevalue_1.conversation_id) AS anon_2 ON anon_2.conversation_id = voycapi_conversations_1.id
LEFT OUTER JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_grading_conversationscore', CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_grading_conversationscore_1.id)) AS JSONB)) AS JSONB) AS _keys,
          CAST(JSON_BUILD_OBJECT('auto_compliance_resolution', voyc_grading_conversationscore_1.auto_compliance_resolution, 'manual_compliance_resolution', voyc_grading_conversationscore_1.manual_compliance_resolution) AS JSONB) AS grading,
          voyc_grading_conversationscore_1.conversation_id AS conversation_id
   FROM public.voyc_grading_conversationscore AS voyc_grading_conversationscore_1
   WHERE voyc_grading_conversationscore_1.conversation_id = voycapi_conversations_1.id) AS anon_3 ON anon_3.conversation_id = voycapi_conversations_1.id
LEFT OUTER JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voycapi_conversationsilence', CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voycapi_conversationsilence_1.id)) AS JSONB)) AS JSONB) AS _keys,
          CAST(JSON_BUILD_OBJECT('duration', voycapi_conversationsilence_1.duration) AS JSONB) AS hold_time,
          voycapi_conversationsilence_1.conversation_id AS conversation_id
   FROM public.voycapi_conversationsilence AS voycapi_conversationsilence_1
   WHERE voycapi_conversationsilence_1.conversation_id = voycapi_conversations_1.id) AS anon_4 ON anon_4.conversation_id = voycapi_conversations_1.id
LEFT OUTER JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_review_conversationreview', CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_review_conversationreview_1.id)) AS JSONB)) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_12._keys) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_13._keys) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_14._keys) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_15._keys) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_16._keys) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_17._keys) AS JSONB) AS _keys,
          CAST(JSON_BUILD_OBJECT('reviewed_status', voyc_review_conversationreview_1.reviewed_status, 'resolved_status', voyc_review_conversationreview_1.resolved_status, 'coached_status', voyc_review_conversationreview_1.coached_status, 'graded_status', voyc_review_conversationreview_1.graded_status, 'date_reviewed', voyc_review_conversationreview_1.date_reviewed, 'date_resolved', voyc_review_conversationreview_1.date_resolved, 'date_coached', voyc_review_conversationreview_1.date_coached, 'date_graded', voyc_review_conversationreview_1.date_graded, 'date_assigned', voyc_review_conversationreview_1.date_assigned, 'assigned_by', anon_12.assigned_by, 'assigned_to', anon_13.assigned_to, 'coached_by', anon_14.coached_by, 'graded_by', anon_15.graded_by, 'resolved_by', anon_16.resolved_by, 'reviewed_by', anon_17.reviewed_by) AS JSONB) AS review,
          voyc_review_conversationreview_1.conversation_id AS conversation_id
   FROM public.voyc_review_conversationreview AS voyc_review_conversationreview_1
   LEFT OUTER JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS assigned_by,
             users_user_1.id AS id
      FROM public.users_user AS users_user_1
      WHERE users_user_1.id = voyc_review_conversationreview_1.assigned_by_id) AS anon_12 ON anon_12.id = voyc_review_conversationreview_1.assigned_by_id
   LEFT OUTER JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS assigned_to,
             users_user_1.id AS id
      FROM public.users_user AS users_user_1
      WHERE users_user_1.id = voyc_review_conversationreview_1.assigned_to_id) AS anon_13 ON anon_13.id = voyc_review_conversationreview_1.assigned_to_id
   LEFT OUTER JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS coached_by,
             users_user_1.id AS id
      FROM public.users_user AS users_user_1
      WHERE users_user_1.id = voyc_review_conversationreview_1.coached_by_id) AS anon_14 ON anon_14.id = voyc_review_conversationreview_1.coached_by_id
   LEFT OUTER JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS graded_by,
             users_user_1.id AS id
      FROM public.users_user AS users_user_1
      WHERE users_user_1.id = voyc_review_conversationreview_1.graded_by_id) AS anon_15 ON anon_15.id = voyc_review_conversationreview_1.graded_by_id
   LEFT OUTER JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS resolved_by,
             users_user_1.id AS id
      FROM public.users_user AS users_user_1
      WHERE users_user_1.id = voyc_review_conversationreview_1.resolved_by_id) AS anon_16 ON anon_16.id = voyc_review_conversationreview_1.resolved_by_id
   LEFT OUTER JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS reviewed_by,
             users_user_1.id AS id
      FROM public.users_user AS users_user_1
      WHERE users_user_1.id = voyc_review_conversationreview_1.reviewed_by_id) AS anon_17 ON anon_17.id = voyc_review_conversationreview_1.reviewed_by_id
   WHERE voyc_review_conversationreview_1.conversation_id = voycapi_conversations_1.id) AS anon_5 ON anon_5.conversation_id = voycapi_conversations_1.id
LEFT OUTER JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_grading_conversationscore', CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_grading_conversationscore_1.id)) AS JSONB)) AS JSONB) AS _keys,
          CAST(JSON_BUILD_OBJECT('manual_score', voyc_grading_conversationscore_1.manual_score, 'auto_score', voyc_grading_conversationscore_1.auto_score, 'date_graded', voyc_grading_conversationscore_1.date_graded, 'note_grader', voyc_grading_conversationscore_1.note_grader) AS JSONB) AS score,
          voyc_grading_conversationscore_1.conversation_id AS conversation_id
   FROM public.voyc_grading_conversationscore AS voyc_grading_conversationscore_1
   WHERE voyc_grading_conversationscore_1.conversation_id = voycapi_conversations_1.id) AS anon_6 ON anon_6.conversation_id = voycapi_conversations_1.id
LEFT OUTER JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_tag_taggedconversation', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_tag_taggedconversation_1.id)) AS JSONB))) AS JSONB) || CAST(JSON_AGG(anon_18._keys) AS JSONB) AS _keys,
          JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_tag_taggedconversation_1.id, 'tag', anon_18.tag) AS JSONB)) AS tags,
          voyc_tag_taggedconversation_1.conversation_id AS conversation_id
   FROM public.voyc_tag_taggedconversation AS voyc_tag_taggedconversation_1
   JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('voyc_tag_tag', CAST(JSON_BUILD_OBJECT('id', voyc_tag_tag_1.id) AS JSONB)) AS JSONB) AS _keys,
             CAST(JSON_BUILD_OBJECT('id', voyc_tag_tag_1.id, 'name', voyc_tag_tag_1.name) AS JSONB) AS tag,
             voyc_tag_tag_1.id AS id
      FROM public.voyc_tag_tag AS voyc_tag_tag_1
      WHERE voyc_tag_tag_1.id = voyc_tag_taggedconversation_1.tag_id) AS anon_18 ON anon_18.id = voyc_tag_taggedconversation_1.tag_id
   WHERE voyc_tag_taggedconversation_1.conversation_id = voycapi_conversations_1.id
   GROUP BY voyc_tag_taggedconversation_1.conversation_id) AS anon_7 ON anon_7.conversation_id = voycapi_conversations_1.id
JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_transcript_transcript', CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_transcript_transcript_1.id)) AS JSONB)) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_19._keys) AS JSONB) AS _keys,
          CAST(JSON_BUILD_OBJECT('id', voyc_transcript_transcript_1.id, 'confidence', voyc_transcript_transcript_1.confidence, 'utterance', anon_19.utterance) AS JSONB) AS transcript,
          voyc_transcript_transcript_1.conversation_id AS conversation_id
   FROM public.voyc_transcript_transcript AS voyc_transcript_transcript_1
   JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('voyc_transcript_utterance', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_transcript_utterance_1.id) AS JSONB))) AS JSONB) || CAST(JSON_AGG(anon_20._keys) AS JSONB) AS _keys,
             JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_transcript_utterance_1.id, 'emotion', voyc_transcript_utterance_1.emotion, 'sentence', anon_20.sentence) AS JSONB)) AS utterance,
             voyc_transcript_utterance_1.transcript_id AS transcript_id
      FROM public.voyc_transcript_utterance AS voyc_transcript_utterance_1
      JOIN LATERAL
        (SELECT CAST(JSON_BUILD_OBJECT('voyc_transcript_sentence', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_transcript_sentence_1.id) AS JSONB))) AS JSONB) AS _keys,
                JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_transcript_sentence_1.id, 'text', voyc_transcript_sentence_1.text, 'words_{0,start}', voyc_transcript_sentence_1.words #> '{0,start}') AS JSONB)) AS sentence,
                voyc_transcript_sentence_1.utterance_id AS utterance_id
         FROM public.voyc_transcript_sentence AS voyc_transcript_sentence_1
         WHERE voyc_transcript_sentence_1.utterance_id = voyc_transcript_utterance_1.id
         GROUP BY voyc_transcript_sentence_1.utterance_id) AS anon_20 ON anon_20.utterance_id = voyc_transcript_utterance_1.id
      WHERE voyc_transcript_utterance_1.transcript_id = voyc_transcript_transcript_1.id
      GROUP BY voyc_transcript_utterance_1.transcript_id) AS anon_19 ON anon_19.transcript_id = voyc_transcript_transcript_1.id
   WHERE voyc_transcript_transcript_1.conversation_id = voycapi_conversations_1.id
     AND (voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419
          OR voyc_transcript_transcript_1.id = 4273419)) AS anon_8 ON anon_8.conversation_id = voycapi_conversations_1.id
JOIN LATERAL
  (SELECT CAST(JSON_BUILD_OBJECT('voyc_workflow_workflowsession', CAST(JSON_BUILD_OBJECT('id', JSON_BUILD_ARRAY(voyc_workflow_workflowsession_1.id)) AS JSONB)) AS JSONB) || CAST(JSON_BUILD_ARRAY(anon_21._keys) AS JSONB) AS _keys,
          CAST(JSON_BUILD_OBJECT('id', voyc_workflow_workflowsession_1.id, 'workflow_id', voyc_workflow_workflowsession_1.workflow_id, 'workflowsessiontask_session', anon_21.workflowsessiontask_session) AS JSONB) AS workflow_session,
          voyc_workflow_workflowsession_1.conversation_id AS conversation_id
   FROM public.voyc_workflow_workflowsession AS voyc_workflow_workflowsession_1
   JOIN LATERAL
     (SELECT CAST(JSON_BUILD_OBJECT('voyc_workflow_workflowsessiontask', JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_workflow_workflowsessiontask_1.id) AS JSONB))) AS JSONB) || CAST(JSON_AGG(anon_22._keys) AS JSONB) || CAST(JSON_AGG(anon_23._keys) AS JSONB) || CAST(JSON_AGG(anon_24._keys) AS JSONB) || CAST(JSON_AGG(anon_25._keys) AS JSONB) AS _keys,
             JSON_AGG(CAST(JSON_BUILD_OBJECT('id', voyc_workflow_workflowsessiontask_1.id, 'task_id', voyc_workflow_workflowsessiontask_1.task_id, 'status', voyc_workflow_workflowsessiontask_1.status, 'date_assigned', voyc_workflow_workflowsessiontask_1.date_assigned, 'date_completed', voyc_workflow_workflowsessiontask_1.date_completed, 'assigned_by', anon_22.assigned_by, 'assigned_to', anon_23.assigned_to, 'completed_by', anon_24.completed_by, 'task', anon_25.task) AS JSONB)) AS workflowsessiontask_session,
             voyc_workflow_workflowsessiontask_1.workflow_session_id AS workflow_session_id
      FROM public.voyc_workflow_workflowsessiontask AS voyc_workflow_workflowsessiontask_1
      LEFT OUTER JOIN LATERAL
        (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
                CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS assigned_by,
                users_user_1.id AS id
         FROM public.users_user AS users_user_1
         WHERE users_user_1.id = voyc_workflow_workflowsessiontask_1.assigned_by_id) AS anon_22 ON anon_22.id = voyc_workflow_workflowsessiontask_1.assigned_by_id
      LEFT OUTER JOIN LATERAL
        (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
                CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS assigned_to,
                users_user_1.id AS id
         FROM public.users_user AS users_user_1
         WHERE users_user_1.id = voyc_workflow_workflowsessiontask_1.assigned_to_id) AS anon_23 ON anon_23.id = voyc_workflow_workflowsessiontask_1.assigned_to_id
      LEFT OUTER JOIN LATERAL
        (SELECT CAST(JSON_BUILD_OBJECT('users_user', CAST(JSON_BUILD_OBJECT('id', users_user_1.id) AS JSONB)) AS JSONB) AS _keys,
                CAST(JSON_BUILD_OBJECT('id', users_user_1.id, 'username', users_user_1.username) AS JSONB) AS completed_by,
                users_user_1.id AS id
         FROM public.users_user AS users_user_1
         WHERE users_user_1.id = voyc_workflow_workflowsessiontask_1.completed_by_id) AS anon_24 ON anon_24.id = voyc_workflow_workflowsessiontask_1.completed_by_id
      LEFT OUTER JOIN LATERAL
        (SELECT CAST(JSON_BUILD_OBJECT('voyc_workflow_task', CAST(JSON_BUILD_OBJECT('id', voyc_workflow_task_1.id) AS JSONB)) AS JSONB) AS _keys,
                CAST(JSON_BUILD_OBJECT('name', voyc_workflow_task_1.name, 'is_grading', voyc_workflow_task_1.is_grading) AS JSONB) AS task,
                voyc_workflow_task_1.id AS id
         FROM public.voyc_workflow_task AS voyc_workflow_task_1
         WHERE voyc_workflow_task_1.id = voyc_workflow_workflowsessiontask_1.task_id) AS anon_25 ON anon_25.id = voyc_workflow_workflowsessiontask_1.task_id
      WHERE voyc_workflow_workflowsessiontask_1.workflow_session_id = voyc_workflow_workflowsession_1.id
      GROUP BY voyc_workflow_workflowsessiontask_1.workflow_session_id) AS anon_21 ON anon_21.workflow_session_id = voyc_workflow_workflowsession_1.id
   WHERE voyc_workflow_workflowsession_1.conversation_id = voycapi_conversations_1.id) AS anon_9 ON anon_9.conversation_id = voycapi_conversations_1.id

This query takes about 23 minutes to complete and appears to return no data. Due to this long query time, the backlog of items to sync also quickly grows.

The stdout looks like this:

  [--------------------------------------------------]  0/0    0%

Another side effect of this problem is that the replication slot isn't advancing, nor is the checkpoint file updating. This causes the DB disk usage to consistently grow and also slows down subsequent syncs due to the PG_LOGICAL_SLOT_PEEK_CHANGES query having to search through a larger amount of data.

toluaina commented 2 years ago

Thanks for this again @voyc-jean. I spent some time trying to narrow down the cause. It would really help to have access to your db schema and also psgync schema as well. The duplicated values in the query don't seem right and I would like to fix that first.

voyc-jean commented 2 years ago

Hi @toluaina

After some more investigating, I discovered that there are multiple problems here.

I've decided to rather create separate issues for each of these:

268

267

269

In addition, is there any way that I can directly communicate with you? Perhaps an email address?

Thanks.

dorin-musteata commented 1 year ago

Hi @toluaina

After some more investigating, I discovered that there are multiple problems here.

I've decided to rather create separate issues for each of these:

268 #267 #269

In addition, is there any way that I can directly communicate with you? Perhaps an email address?

Thanks.

Any updates?