ocurrent / ocaml-ci

A CI for OCaml projects
https://ocaml.ci.dev
MIT License

0s in queue, ran for 0s #847

Open art-w opened 1 year ago

art-w commented 1 year ago

Context

I don't understand the 0s timings displayed by the CI header on some jobs. The example link comes from https://github.com/ocurrent/current-bench/pull/438/checks?check_run_id=13718052712, which has other jobs with understandable timings.

Steps to reproduce

2023-07-31 15:44.43: New job: test ocurrent/current-bench https://github.com/ocurrent/current-bench.git#refs/heads/schema-infix-op (43c5f8cf5f7c5dfa7e1be4a50b904561c9af462c) (linux-arm64:debian-12-4.14_arm64_opam-2.1)
...
2023-07-31 15:44.43: Waiting for resource in pool OCluster
2023-07-31 17:06.50: Waiting for worker…
2023-07-31 17:22.45: Got resource from pool OCluster
...
2023-07-31 17:38.24: Job succeeded

Expected behaviour

16min in queue (or 1h32min?), ran for 16min
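
For reference, a small sketch of the arithmetic (my own reading of the log above, not CI code); the timestamp values are copied from the log:

```ocaml
(* Sketch: durations implied by the log timestamps above (my arithmetic,
   not ocaml-ci code). *)
let seconds ~h ~m ~s = (h * 3600) + (m * 60) + s

let () =
  let new_job            = seconds ~h:15 ~m:44 ~s:43 in  (* New job *)
  let waiting_for_worker = seconds ~h:17 ~m:6  ~s:50 in  (* Waiting for worker *)
  let got_resource       = seconds ~h:17 ~m:22 ~s:45 in  (* Got resource from pool *)
  let succeeded          = seconds ~h:17 ~m:38 ~s:24 in  (* Job succeeded *)
  Printf.printf "in OCluster pool:     ~%d min\n" ((got_resource - new_job) / 60);
  Printf.printf "waiting for a worker: ~%d min\n" ((got_resource - waiting_for_worker) / 60);
  Printf.printf "ran for:              ~%d min\n" ((succeeded - got_resource) / 60)
```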

benmandrew commented 1 year ago

Thanks for the report. IIRC these are cached results (and thus cached logs), so the queue and run times in the header are correct, but it would be ideal to mark them as cached to prevent confusion. I seem to remember this was a surprisingly hard problem to solve, @novemberkilo?
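
(Not ocaml-ci's actual types, just a minimal sketch of what labelling a replayed result could look like, assuming the cached entry kept its original timestamps:)

```ocaml
(* Hypothetical sketch: if a job result carried its original timestamps plus
   a [cached] flag, the header could label replayed results instead of
   showing a confusing "0s". *)
type job_timing = {
  queued_at : float;    (* when the job entered the queue *)
  started_at : float;   (* when a worker picked it up *)
  finished_at : float;  (* when it completed *)
  cached : bool;        (* true when the result was replayed from the cache *)
}

let header t =
  let queue = int_of_float (t.started_at -. t.queued_at) in
  let run = int_of_float (t.finished_at -. t.started_at) in
  if t.cached then
    Printf.sprintf "cached: originally %ds in queue, ran for %ds" queue run
  else
    Printf.sprintf "%ds in queue, ran for %ds" queue run
```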

novemberkilo commented 1 year ago

Thanks @art-w -- as @benmandrew points out, unfortunately this is a known issue. Will add it to our list of fixes/enhancements. // @tmcgilchrist

moyodiallo commented 1 year ago

This issue is related to the work I did when connecting ocaml-ci and the solver-service. Because we send two different requests to OCluster, we start the job immediately; otherwise, going through the Cluster connection, the job would be started twice and end up failing.

Combining those two different requests into a single type of request sent to the solver-service could solve this issue.
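
A minimal sketch of that idea (the constructor names and payloads are mine, not the actual solver-service API): one variant type covering the different lookups, so only one kind of request is sent.

```ocaml
(* Hypothetical sketch, not the real solver-service API: a single request
   type covering the different lookups the analysis step needs. *)
type solver_request =
  | Opam_selections of { opam_files : string list; platforms : string list }
  | Ocamlformat_version of { source_dir : string }
  | Opam_dune_lint of { opam_files : string list }

let describe = function
  | Opam_selections { opam_files; platforms } ->
      Printf.sprintf "solve %d opam files on %d platforms"
        (List.length opam_files) (List.length platforms)
  | Ocamlformat_version _ -> "resolve the ocamlformat version"
  | Opam_dune_lint _ -> "resolve opam-dune-lint"
```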

moyodiallo commented 12 months ago

Combining those two different requests into a single type of request sent to the solver-service could solve this issue.

Instead of having to upgrade the solver-service API each time we add a different kind of request, it is preferable to have a pool for analysis, in which all the different requests are sent at different times to the solver-service using the same API. Some selections, like ocamlformat and opam-dune-lint, were obtained with separate requests to the solver-service during the analysis job.

This PR, https://github.com/ocurrent/ocaml-ci/pull/888, fixes the issue at line https://github.com/ocurrent/ocaml-ci/blame/b3c3facfe0e1e1e18dfd0389827f555908c1ee0b/lib/analyse.ml#L253, where the pool had been removed at some point.
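
The gist of the pool idea, sketched with Lwt_pool purely for illustration (the PR itself restores an OCurrent pool in analyse.ml rather than using this code): a bounded pool so the various solver-service requests made during analysis don't all fire at once.

```ocaml
(* Illustration only (Lwt_pool, not the OCurrent pool the PR restores):
   cap how many analysis requests talk to the solver-service at a time. *)
let solver_slots : unit Lwt_pool.t =
  Lwt_pool.create 4 (fun () -> Lwt.return_unit)

(* Run [request] once a slot is free; other requests wait their turn. *)
let with_solver (request : unit -> 'a Lwt.t) : 'a Lwt.t =
  Lwt_pool.use solver_slots request
```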

moyodiallo commented 12 months ago

@art-w would you like to confirm the fix?

moyodiallo commented 11 months ago

@benmandrew this could be closed, I think.

art-w commented 11 months ago

Oh sorry @moyodiallo, I didn't see your message! I'm not sure I understand the technical details, besides the 0s timings being related to the cache (and so it's obviously hard to fix)... So without digging into the code, I had a look at the latest commit on ocurrent, which shows a bunch of tasks with 0s duration: https://ocaml.ci.dev/github/ocurrent/ocurrent/commit/8e0b9d4bb348b13df8696fe63feba303b9a476fd (I don't know if the CI is running your fix, though!)

(Also, I understand that there were other issues related to cluster jobs which were higher priority. I don't think the run duration is critical for end-users; it's a bit confusing, but otherwise a minor issue.)

benmandrew commented 11 months ago

@art-w you are correct, the issue is related to the ocurrent cache, not the cluster connection. This issue still exists, as you saw.

moyodiallo commented 11 months ago

Sorry (@benmandrew, @art-w), I mixed things up: I solved another issue, thinking it was related to this one. The issue I solved is the one where all the analysis jobs start with 0s in queue and a lot of them keep waiting at some point.