packit / packit-service

Packit provided as a service
https://packit.dev
MIT License

Hard time limit (300s) exceeded for task.steve_jobs.process_message #2512

Closed · majamassarini closed this issue 2 weeks ago

majamassarini commented 2 months ago

I realized that, starting from the 2nd of June 2024, this exception has been raised quite often, with 5-10 occurrences on the same day.

It makes no sense for the process_message function, unless it is somehow related to the communication with the pushgateway.

If we can't find the reason for the slowness, we should at least increase the hard_time_limit for this task again.
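For the record, a minimal sketch of how a Celery task's hard time limit can be raised; the app name and the 900s/850s values here are just assumptions, not the actual packit-service configuration:

```python
# Hypothetical sketch of raising the hard time limit for the task; the
# concrete values and the app name are assumptions, not Packit's real setup.
from celery import Celery

app = Celery("packit_service_worker")

@app.task(
    name="task.steve_jobs.process_message",
    time_limit=900,       # hard limit: the worker process is killed after 900s
    soft_time_limit=850,  # soft limit: SoftTimeLimitExceeded is raised first
)
def process_message(event: dict):
    ...
```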

See also #2519 for follow-ups.

majamassarini commented 1 month ago

Since I am not able to comment here (the comment is too long), I am linking a public gist with a summary of my findings.

mfocko commented 1 month ago

Can't comment on the gist :eyes: thx GitHub :D anyways

Hard time limit (900s)

"2024-09-26T10:51:51.546708534+00:00 stderr F [2024-09-26 10:51:51,546: INFO/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Cleaning up the mess.","2024-09-26T10:51:51.546+0000"
"2024-09-26T10:36:52.820075409+00:00 stderr F [2024-09-26 10:36:52,820: DEBUG/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Setting Github status check 'success' for check 'rpm-build:fedora-40-aarch64:fedora-rawhide': RPMs were built successfully.","2024-09-26T10:36:52.820+0000"

I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the cleanup and then the timeout exception. OTOH the time difference between these two messages is ~15 minutes :eyes:

https://github.com/packit/packit-service/blob/b58b3b71db4ed46f78da465978348fa58c68a631/packit_service/worker/handlers/abstract.py#L285-L287

into

https://github.com/packit/packit-service/blob/b58b3b71db4ed46f78da465978348fa58c68a631/packit_service/worker/handlers/abstract.py#L225-L228

Hard time limit (300s)

> Redis connection problems have already been resolved after switching from Redict to Valkey

Just to clarify, I've adjusted the timeout; the issue itself is not related to either Redict or Valkey, the short-running pods leak memory from the concurrent threads :/

Ideally this should be only a temporary solution, because the workers (whether caused by Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (idle for a long time) connections, so we don't run out.


At this point, from what I see in your investigation, there are 3 occurrences where I see a GitHub-related action before a gap and then a timeout right away. There is also one more that is related to TF, which could point to a network issue.

All in all, a timeout with requests sounds good, but we should be able to retry (not sure whether we could somehow spam the external services that way) :/
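As a rough sketch of the "timeout with requests plus retries" idea, using the standard urllib3 Retry mounted on an HTTPAdapter (the URL, the concrete numbers, and the session setup are only assumptions, not the actual packit-service code):

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=5,                               # give up after 5 attempts
    backoff_factor=1,                      # exponential backoff between attempts
    status_forcelist=(500, 502, 503, 504), # also retry on these server errors
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# a finite (connect, read) timeout so a stuck connection cannot eat
# the task's hard time limit
response = session.get("https://api.github.com", timeout=(3.05, 30))
```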

SLO1 related problems

From the last log :eyes: how many times do we need to parse the comment? It is repeated a lot IMO.

majamassarini commented 1 month ago

> I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the cleanup and then the timeout exception. OTOH the time difference between these two messages is ~15 minutes 👀

Good point, I didn't see the time difference between the status check action and the cleaning. I think an exception is thrown there and we are not logging it. I will add some code to log it.
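Roughly something like this, assuming the status reporting goes through a reporter-like helper; `set_status` and its parameters are hypothetical placeholders, not the actual packit-service reporting API:

```python
import logging

logger = logging.getLogger(__name__)

def report_status_safely(reporter, state, check_name, description):
    # "reporter.set_status" and its signature are hypothetical placeholders
    try:
        reporter.set_status(state=state, check_name=check_name, description=description)
    except Exception:
        # log the full traceback instead of silently swallowing the error
        logger.exception("Setting status %r for check %r failed", state, check_name)
        raise
```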

> Hard time limit (300s)
>
> Redis connection problems have already been resolved after switching from Redict to Valkey
>
> Just to clarify, I've adjusted the timeout; the issue itself is not related to either Redict or Valkey, the short-running pods leak memory from the concurrent threads :/
>
> Ideally this should be only a temporary solution, because the workers (whether caused by Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (idle for a long time) connections, so we don't run out.

I just noticed that after the switch to Valkey the "Hard time limit (300s)..." exceptions are 10 times less frequent on average. So if we cannot find the root cause, I will keep it ;)

> At this point, from what I see in your investigation, there are 3 occurrences where I see a GitHub-related action before a gap and then a timeout right away. There is also one more that is related to TF, which could point to a network issue.
>
> All in all, a timeout with requests sounds good, but we should be able to retry (not sure whether we could somehow spam the external services that way) :/

There are already 5 max_retries set on the HTTPAdapter class, so we should be good to go.

> SLO1 related problems
>
> From the last log 👀 how many times do we need to parse the comment? It is repeated a lot IMO.

I agree, there are two different kinds of problems here.

majamassarini commented 1 month ago

We agreed on keeping this open for a while and seeing later what is fixed and what is not, and on opening new follow-up issues if needed.

majamassarini commented 2 weeks ago

Closing this in favour of #2603