Closed: majamassarini closed this issue 2 weeks ago.
Since I am not able to comment here (the comment is too long), I'm linking a public gist with a summary of my findings.
Can't comment on gist :eyes: thx GitHub :D anyways
"2024-09-26T10:51:51.546708534+00:00 stderr F [2024-09-26 10:51:51,546: INFO/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Cleaning up the mess.","2024-09-26T10:51:51.546+0000"
"2024-09-26T10:36:52.820075409+00:00 stderr F [2024-09-26 10:36:52,820: DEBUG/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Setting Github status check 'success' for check 'rpm-build:fedora-40-aarch64:fedora-rawhide': RPMs were built successfully.","2024-09-26T10:36:52.820+0000"
I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the clean up and then timeout exception. OTOH the time difference between these two messages is ~15 minutes :eyes:
> Redis connection problems have already been resolved after switching from Redict to Valkey.
Just to clarify: I've adjusted the timeout, but the issue itself is not related to either Redict or Valkey; the short-running pods leak memory from the concurrent threads :/
Ideally this should be only a temporary solution, because the workers (whether the cause is Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (long-idle) connections, so we don't run out.
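For reference, the server-side cleanup of idle connections is driven by the generic Redis/Valkey `timeout` directive; a minimal sketch of how it could be inspected or tuned with redis-py (host, port, and the 300s value are placeholders, not our actual deployment config):

```python
import redis  # redis-py also talks to Valkey, which is protocol-compatible

# Host/port are placeholders for the deployment's Valkey instance.
client = redis.Redis(host="valkey", port=6379)

# The server-side `timeout` directive closes client connections that have been
# idle for more than N seconds; 0 disables the cleanup entirely.
print(client.config_get("timeout"))   # inspect the current value
client.config_set("timeout", 300)     # e.g. drop connections idle for 5+ minutes
```

Whether this is exactly what our Valkey deployment does differently from Redict is an assumption on my side; the sketch only shows where the knob lives.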
At this point, from what I see in your investigation, there are 3 occurrences where a GitHub-related action happens right before a gap and then the timeout. There is also one more related to TF, which could point to a network issue.
All in all, a `timeout` with requests sounds good, but we should be able to retry (not sure if we could, by any chance, end up spamming the external services that way) :/
From the last log :eyes: how many times do we need to parse the comment? It is repeated a lot, IMO.
> I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the clean up and then timeout exception. OTOH the time difference between these two messages is ~15 minutes 👀
Good point, I didn't notice the time difference between the status check action and the cleanup. I think an exception is thrown there and we are not logging it. I will add some code to log it.
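Something along these lines, just as a sketch; `report_status` and `set_commit_status` are illustrative stand-ins here, not the actual packit code:

```python
import logging

logger = logging.getLogger(__name__)

def report_status(project, check_name: str, state: str, description: str) -> None:
    """Set the GitHub status check and make sure any failure shows up in the logs."""
    try:
        # Hypothetical call standing in for the real status-reporting code.
        project.set_commit_status(state=state, context=check_name, description=description)
    except Exception:
        # logger.exception includes the full traceback, so a swallowed API error
        # (or a hang that ends in an exception) at least leaves a trace in the worker logs.
        logger.exception("Setting GitHub status check %r to %r failed", check_name, state)
        raise
```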
**Hard time limit (300s)**
> > Redis connection problems have already been resolved after switching from Redict to Valkey.
>
> Just to clarify: I've adjusted the timeout, but the issue itself is not related to either Redict or Valkey; the short-running pods leak memory from the concurrent threads :/
> Ideally this should be only a temporary solution, because the workers (whether the cause is Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (long-idle) connections, so we don't run out.
I just noticed that after the switch to Valkey the "Hard time limit (300s)" exceptions are 10 times less frequent on average. So if we cannot find the root cause, I will keep it ;)
> At this point, from what I see in your investigation, there are 3 occurrences where a GitHub-related action happens right before a gap and then the timeout. There is also one more related to TF, which could point to a network issue.
> All in all, a `timeout` with requests sounds good, but we should be able to retry (not sure if we could, by any chance, end up spamming the external services that way) :/
`max_retries` is already set to 5 for the `HTTPAdapter`, so we should be good to go.
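For context, a minimal sketch of how the existing retries and a per-request timeout fit together with `requests`; the exact numbers and the URL are illustrative, not what the service actually configures:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()

# Retry up to 5 times with exponential backoff on connection errors
# and on typical transient HTTP status codes.
retries = Retry(total=5, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries)
session.mount("https://", adapter)
session.mount("http://", adapter)

# The timeout is passed per request, it is not part of the adapter:
# (connect timeout, read timeout) in seconds.
response = session.get("https://api.github.com", timeout=(3.05, 30))
```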
**SLO1-related problems**
> From the last log 👀 how many times do we need to parse the comment? It is repeated a lot, IMO.
I agree, there are two different kinds of problems here.
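For the repeated parsing, caching the parsed result would be one option; a minimal sketch with `functools.cached_property` (the class and method names are made up, not the actual packit ones):

```python
from functools import cached_property

class CommentEvent:
    """Illustrative container for a comment; not the actual packit class."""

    def __init__(self, raw_comment: str):
        self.raw_comment = raw_comment

    @cached_property
    def parsed_command(self) -> list[str]:
        # Parse once; subsequent accesses return the cached value instead of
        # re-running the parsing (and re-emitting the log line every time).
        print("parsing comment")  # stands in for the repeated log message
        return self.raw_comment.strip().split()

event = CommentEvent("/packit build")
event.parsed_command  # parses (and "logs") once
event.parsed_command  # served from the cache, no second parse
```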
We agreed to keep this open for a while and see later what is fixed and what is not, and to open new follow-up issues if needed.
Closing this in favour of #2603
I realized that starting from the 2nd of June 2024 we have this exception raised quite often within a single day (5-10 occurrences).
It makes no sense for the `process_message` function, unless it is somehow related to the communication with the `pushgateway`.
If we can't find the reason for the slowness, we should at least increase the `hard_time_limit` for this task again. See also #2519 for follow-ups.
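For reference, a minimal sketch of where the hard and soft limits live in Celery (broker URL, task name, and the limit values are illustrative, not the actual worker config):

```python
from celery import Celery

app = Celery("worker", broker="redis://valkey:6379/0")

# time_limit is the hard limit: the worker kills the task once it is hit and
# produces the "Hard time limit (...) exceeded" error seen in the logs.
# soft_time_limit fires SoftTimeLimitExceeded a bit earlier, which the task
# can catch to clean up and log before being killed.
@app.task(name="task.process_message", time_limit=600, soft_time_limit=540)
def process_message(event: dict):
    ...
```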