runtimeverification / kontrol

Jobs occasionally time out since updating to 0.1.247 #526

Closed mds1 closed 6 months ago

mds1 commented 7 months ago

In https://github.com/ethereum-optimism/optimism/pull/10159 we updated from kontrol version 0.1.196 to 0.1.247. We now occasionally get "Too long with no output (exceeded 10m0s): context deadline exceeded" job failures in CircleCI, such as this one.

This is a CircleCI feature that causes jobs to time out if there has been no output for over 10 minutes. You can read more here: https://support.circleci.com/hc/en-us/articles/360045268074-Build-Fails-with-Too-long-with-no-output-exceeded-10m0s-context-deadline-exceeded.

I'm unsure whether the job is still running as expected but just not producing output (in which case the solution might be to add more logs), or whether kontrol is actually hanging.
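For the first case, a minimal sketch of the "more logs" option: a generic wrapper that prints a heartbeat line while a long-running command is silent, so the CI runner keeps seeing output. This is not part of kontrol; `run_with_heartbeat`, its `interval` parameter, and the `[heartbeat]` prefix are hypothetical names chosen for illustration.

```python
import subprocess
import sys
import threading
import time


def run_with_heartbeat(cmd, interval=60.0, out=sys.stdout):
    """Run `cmd` (a list of arguments) and print a heartbeat line every
    `interval` seconds until it exits. Returns the command's exit code.

    Hypothetical helper for illustration; not part of kontrol or CircleCI.
    """
    start = time.monotonic()
    done = threading.Event()

    def beat():
        # Event.wait returns False on timeout and True once `done` is set,
        # so the loop ends promptly when the command finishes.
        while not done.wait(interval):
            elapsed = int(time.monotonic() - start)
            print(f"[heartbeat] still running after {elapsed}s", file=out, flush=True)

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        # The child inherits stdout/stderr, so its own logs still appear.
        return subprocess.run(cmd).returncode
    finally:
        done.set()
        thread.join()
```

Calling `run_with_heartbeat([...], interval=300)` with the proof command as the argument list would keep a job under the 10-minute limit. Note that a heartbeat only hides the symptom: if the tool is truly hanging, the job would then run until the overall job timeout instead. Raising `no_output_timeout` on the CircleCI `run` step is the other knob, with the same caveat.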

JuanCoRo commented 6 months ago

So far, I've not been able to reproduce this locally. From what I can see in the logs, if it's hanging, it seems to be hanging at an execute request: [image]. I've never seen the backend hang on this type of request before. To move this forward, besides removing the big test being executed and implementing better logging, we'll shortly be opening a PR to offload the compute. Hopefully, all of this sheds more light on what might be happening here. I'll keep posting updates here as I find more!

mds1 commented 6 months ago

We have not seen this since I opened the issue, so it's possible it was a flake in CircleCI infra. I will optimistically close this for now and reopen if it surfaces again.