linagora / james-project

Mirror of Apache James Project
Apache License 2.0
70 stars 63 forks source link

[Postgresql] Investigate why PopulateEmailQueryViewTask gets stuck while running #5137

Closed Arsnael closed 3 months ago

Arsnael commented 4 months ago

On a platform running james postgres-app, I forgot to enable the listener to populate the email query view on new messages added.

Then, when I tried to run PopulateEmailQueryViewTask to repopulate the view, after 15h was still stuck on the first user, with no message processed:

{
  "additionalInformation": {                        
    "type": "PopulateEmailQueryViewTask",                                                                
    "timestamp": "2024-03-22T01:17:08.992833704Z",
    "processedUserCount": 1,
    "processedMessageCount": 0,
    "failedUserCount": 0,
    "failedMessageCount": 0,
    "runningOptions": {
      "messagesPerSecond": 10
    }
  },
  "submittedFrom": "tmail-imap-smtp-7f4bfdccb7-dsrtq",
  "executedOn": "tmail-imap-smtp-7f4bfdccb7-dsrtq",
  "cancelledFrom": null,
  "submitDate": "2024-03-21T10:13:15.131+0000",
  "type": "PopulateEmailQueryViewTask",
  "status": "inProgress",
  "taskId": "127f8b74-6557-46ec-929f-21f10031743e",
  "startedDate": "2024-03-21T10:13:15.136+0000"
}

The first user has 540 messages in total for example.

Postgres relies on memory task manager. Maybe the issue is within the pg implem.

The same task launched on a distributed deployment seems running fine for example.

Need to investigate.

quantranhong1999 commented 4 months ago

Maybe we can try to write an integration test and try to reproduce this?

Arsnael commented 4 months ago

Thought about this... a bigger one though. The IT test that exists it's maybe just on user with one message :)

Arsnael commented 3 months ago

After trying to debug this without success, I finished by writing a test => https://github.com/apache/james-project/pull/2163

Except the test works.

I tried with 3 users, that have 16 mailboxes each (for example first user on staging has 15) with 800 messages each total (first user on stg has between 500~600 messages)

Somehow on staging it seems to still hang forever on the first user processing, I thought I could be able to reproduce it.

I tried to debug it running the task with logs a bit everywhere on staging, seems it hangs when listing the private mailboxes of the users, and then nothing. I really don't get it...

If somebody else has an idea or wants to take a crack at it feel free :(

Arsnael commented 3 months ago

Unsure of what's going on... The IT test I did write I did run it with the rabbitmq implem and it still passes.

But on staging no way... Maybe issue postgresql side? I noticed as well that after I launch the task, all the health checks against postgres fail, with a timeout, like the connection was dead or frozen.

Maybe with the number of data we have the request is too much for it to handle? The best would be to try to check logs on postgres side but well... the logs and metrics you can only check them via the ovh console to which I dont have access so....

Arsnael commented 3 months ago

Free to take if somebody else wants to take a shot at it :)

quantranhong1999 commented 3 months ago

So far I see it hangs while searching mailboxes of a user: https://github.com/quantranhong1999/james-project/commit/f578e1f34af9d9cb613a28a0e8707834215befb6#diff-6f0ecd9a68cbe0ff1c827afe177cef3dfa39c876d3388c07a313217d77570096R200

It was never returned doing that search.

I needed to put a timeout to force the code to return.

That search code seems to search on Postgres DB and not OpenSearch (OpenSearch healthcheck seems OK too).

My guess is jooq hanging issue. I just not sure what is the exact hang code yet and the pattern, I will try to dig more.

Arsnael commented 3 months ago

That's the point where I got too as well, just could not explain why it happened.

Maybe ask @ducnm0711 if he has some spare time to help check logs on postgresql side when running the task? not sure what we can get or if it can help

quantranhong1999 commented 3 months ago

It hangs at PostgresMailboxDAO::findMailboxWithPathLike.

Funny thing: for this method in the past we had a jOOQ hang issue and fixed it cf https://github.com/apache/james-project/commit/03e4b363409adf197ae3af667996b537a69e934f

Likely something else is sneaking...

quantranhong1999 commented 3 months ago

The issue is likely not on Postgres side.

SELECT query, min_exec_time, mean_exec_time, max_exec_time, stddev_exec_time, calls, total_exec_time, rows
FROM pg_stat_statements
WHERE dbid = (SELECT oid FROM pg_database WHERE datname = 'tmail-stg')
    AND
query LIKE 'select *
from mailbox
where (
  mailbox_name like $1
  and user_name = $2
  and mailbox_namespace = $3
)';

After triggering the task, the calls in rows (returned records) increased, meaning OVH Postgres executes the query normally.

It is just James does not handle the result well.

Arsnael commented 3 months ago

Maybe some timeouts or something hanging in the reactive pipeline yes

What I find strange is that healthchecks against postgres after launching the tasks are all failing on a timeout.

Or maybe trying to do a thread dump to try to see if threads allocated to postgres connection are waiting on something indefinitely?

quantranhong1999 commented 3 months ago

Or maybe trying to do a thread dump to try to see if threads allocated to postgres connection are waiting on something indefinitely?

thread_dump.txt Seems nothing special to me.

What I find strange is that healthchecks against postgres after launching the tasks are all failing on a timeout.

I think the Postgres thread/connection froze and became a zombie doing that task, and can not do anything else.

Maybe some timeouts or something hanging in the reactive pipeline yes

FYI https://github.com/jOOQ/jOOQ/issues/14490 jOOQ implements its own reactive streams implementation. I won't be surprised if jOOQ somehow can break the reactive pipeline (which by reactive streams specs is supposed to work...)

quantranhong1999 commented 3 months ago

Fix: https://github.com/apache/james-project/pull/2179

(More of a mitigated fix. .collectList is not always good in case collecting a big collection in memory. But for a task and given jOOQ reactive that would not be stable soon, I think it is more than acceptable).

The task id cebddfb9-501d-47b8-b6d2-539bdd7b0061 is running well so far on the postgres env.

{
    "additionalInformation": {
        "type": "PopulateEmailQueryViewTask",
        "timestamp": "2024-04-04T10:16:04.378224500Z",
        "processedUserCount": 272,
        "processedMessageCount": 11400,
        "failedUserCount": 0,
        "failedMessageCount": 0,
        "runningOptions": {
            "messagesPerSecond": 10
        }
    },
    "submittedFrom": "tmail-jmap-6f58f7d4b4-cszvz",
    "executedOn": "tmail-jmap-6f58f7d4b4-cszvz",
    "cancelledFrom": null,
    "submitDate": "2024-04-04T09:56:52.587+0000",
    "type": "PopulateEmailQueryViewTask",
    "status": "inProgress",
    "taskId": "cebddfb9-501d-47b8-b6d2-539bdd7b0061",
    "startedDate": "2024-04-04T09:56:52.591+0000"
}

How I debugged this: connect local James to Postgres OVH DB (need Linagora VPN) to ease reproducing and debugging, then reproduce it in IT.

Arsnael commented 3 months ago

Good catch, thanks a lot @quantranhong1999

quantranhong1999 commented 3 months ago

The task finished:

{
    "additionalInformation": {
        "type": "PopulateEmailQueryViewTask",
        "timestamp": "2024-04-04T22:52:43.384046554Z",
        "processedUserCount": 20000,
        "processedMessageCount": 463186,
        "failedUserCount": 0,
        "failedMessageCount": 0,
        "runningOptions": {
            "messagesPerSecond": 10
        }
    },
    "submittedFrom": "tmail-jmap-6f58f7d4b4-cszvz",
    "executedOn": "tmail-jmap-6f58f7d4b4-cszvz",
    "cancelledFrom": null,
    "submitDate": "2024-04-04T09:56:52.587+0000",
    "type": "PopulateEmailQueryViewTask",
    "status": "completed",
    "taskId": "cebddfb9-501d-47b8-b6d2-539bdd7b0061",
    "startedDate": "2024-04-04T09:56:52.591+0000",
    "completedDate": "2024-04-04T22:52:43.386+0000"
}
quantranhong1999 commented 3 months ago

While we are at it: https://github.com/apache/james-project/pull/2185

chibenwa commented 3 months ago

I have a stupid question: does this bug happens when we have a few items? Or only if the have Many?

Could this be a bug in the paging?

Arsnael commented 3 months ago

Written integration test was working fine without the fix, and just 3 users, so I guess it does not happen when having a few items?

quantranhong1999 commented 3 months ago

Written integration test was working fine without the fix, and just 3 users, so I guess it does not happen when having a few items?

Yes. It does not happen with little records.

quantranhong1999 commented 3 months ago

I reported the bug on the jOOQ side: https://github.com/jOOQ/jOOQ/issues/16556

The scenario to reproduce is straightforward and basic actually...

quantranhong1999 commented 3 months ago

Maybe I would need to force .collectList in PostgresExecutor::executeRows.

Otherwise for now we would not be able to scale data :/