Omrigan closed this issue 4 weeks ago
@Omrigan how long has it been flaky for? Can you link some example runs?
@Omrigan how long has it been flaky for?
I'd assume from the introduction.
Can you link some example runs?
Your PR: https://github.com/neondatabase/autoscaling/actions/runs/9878750381/job/27283615758#step:14:1034
My PR: https://github.com/neondatabase/autoscaling/actions/runs/9889976645/job/27317397731#step:14:948
I had 3 or 4 such cases, I think.
However, maybe it is not that particular test, here is the other test failing: https://github.com/neondatabase/autoscaling/actions/runs/9890395470/job/27319200362#step:14:987
We should check if this is only timeouts or if we suspect an actual problem. Until then marking as P1.
Haven't yet had a chance to look at it. However - @Omrigan if you're truly observing it fail 50% of the time, it may be due to your PR. The occurrence you flagged was the first time I'd seen it fail on one of my PRs.
A couple recent runs:
Both cases look like they timed out after 5 minutes on downscaling.
Reproduces locally in about 2 hours. From the logs, it seems the issue is that the VM sometimes uses too much memory, which prevents downscaling. I think the next step for debugging would be to print memory usage in the logs periodically. For now I'll put this issue in selected, but I want to get back to it later.
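A rough sketch of what periodic memory logging could look like, assuming it runs inside the Linux guest and reads /proc/meminfo. The helper names (parseMeminfo, usedKB, logPeriodically), the 5-second interval, and the sample count are all hypothetical and not taken from the autoscaling codebase.

```go
package main

import (
	"bufio"
	"fmt"
	"log"
	"os"
	"strconv"
	"strings"
	"time"
)

// parseMeminfo parses /proc/meminfo-style text into a map of field name
// to value. Most fields are reported in kB; the unit suffix is ignored.
func parseMeminfo(content string) (map[string]uint64, error) {
	fields := map[string]uint64{}
	sc := bufio.NewScanner(strings.NewReader(content))
	for sc.Scan() {
		name, rest, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue // not a "Name: value" line
		}
		parts := strings.Fields(rest)
		if len(parts) == 0 {
			continue
		}
		v, err := strconv.ParseUint(parts[0], 10, 64)
		if err != nil {
			return nil, fmt.Errorf("parsing %q: %w", name, err)
		}
		fields[name] = v
	}
	return fields, sc.Err()
}

// usedKB estimates used memory as MemTotal minus MemAvailable.
func usedKB(fields map[string]uint64) uint64 {
	total, avail := fields["MemTotal"], fields["MemAvailable"]
	if avail > total {
		return 0
	}
	return total - avail
}

// logPeriodically logs memory usage `samples` times, `interval` apart.
func logPeriodically(interval time.Duration, samples int) {
	for i := 0; i < samples; i++ {
		if i > 0 {
			time.Sleep(interval)
		}
		raw, err := os.ReadFile("/proc/meminfo")
		if err != nil {
			log.Printf("memory: cannot read /proc/meminfo: %v", err)
			continue
		}
		fields, err := parseMeminfo(string(raw))
		if err != nil {
			log.Printf("memory: %v", err)
			continue
		}
		log.Printf("memory: total=%d kB available=%d kB used=%d kB",
			fields["MemTotal"], fields["MemAvailable"], usedKB(fields))
	}
}

func main() {
	// Hypothetical settings: three samples, five seconds apart.
	logPeriodically(5*time.Second, 3)
}
```

With timestamps from the standard logger, lines like these could be lined up against the downscaling timeout to see whether memory usage was high at the moment the test gave up.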
This might be fixed by now. Arthur will reproduce and see if it is fixed.
https://gist.github.com/petuhovskiy/226522b34bd85a3a8d2d8ee88fa43dbd
Tried to reproduce, it failed only once in 360 runs. Let's consider it fixed for now.
Fails roughly 50% of the time.