Open kit-ty-kate opened 1 year ago
@art-w so far, to me it seems that the pattern is: 1) some PR gets merge 2) the analysis restarts 3) it fails for whatever reason
i can see some of these PRs change to a github status with a red cross instead of "in progress"
Yes you are absolutely right!
When a PR gets merged, master
is updated: In practice only the analysis is strict about using the latest HEAD, but master
is still an ocurrent dependency of all the other jobs (lint / build) (master
gets past to those Current_cache
as "extra context" meaning that previous runs with an older master are reused for free, but ocurrent still has to traverse the whole graph whenever master
changes -- at the scale of opam-repo-ci this might explains the freezes following a merge?)
I think we could cheat a bit here by gating master such that the latest master
value would only be propagated if the analysis' result had also changed (.. and this should produce the same outcome as today ^^') Using https://github.com/ocurrent/ocurrent/pull/330 this ugly code might work:
let analysis = Analyse.examine ~master src in (* same as before *)
let master_analysis =
Current.cutoff
(* keep the previous master if the analysis hasn't changed *)
~eq:(fun (_old_master, old_analysis) (_new_master, new_analysis) ->
Analyse.Analysis.equal old_analysis new_analysis)
@@ Current.pair master analysis
in
(* see below: protect from analysis failures here with [latch]? *)
let master = Current.map fst master_analysis in
let analysis = Current.map snd master_analysis in
Now the situation is even worse when the analysis fails, because then the whole PR subgraph gets invalidated.. When the next update to master
re-triggers the analysis and it eventually succeeds, ocurrent has to rebuild the subgraph from scratch and ask the Current_cache
sqlite database to recover each previous job result. I'm guessing it would be nicer to use Pipeline.latch
to hide those analysis failures on re-run:
let master_analysis = latch ~label:"analysis" master_analysis in
But this would hide the "second re-run" analysis failures from the github status, so perhaps it needs to be sent separately like you did for the lint pass? :)
Bingo! I managed to catch one of the failures in (analysis)
:
2023-01-27 15:31.55: New job: Analyse
2023-01-27 15:31.55: Waiting for resource in pool analyse
2023-01-27 15:39.14: Got resource from pool analyse
2023-01-27 15:39.14: Checking out commit ae1eaa1e. To reproduce:
git clone --recursive "https://github.com/ocaml/opam-repository.git" && cd "opam-repository" && git fetch origin "refs/pull/23140/head" && git reset --hard ae1eaa1e
2023-01-27 15:39.14: Exec: "cp" "-a" "--" "/var/lib/ocurrent/var/git/opam-repository.git-a0bc4b41ddb868605dc1500002c27cfa79faf4389fe53b6754817ea16b48d458/.git"
"/tmp/git-checkout19f801a9"
2023-01-27 15:39.16: Exec: "git" "-C" "/tmp/git-checkout19f801a9" "submodule"
"deinit" "--force" "--all"
2023-01-27 15:39.17: Exec: "git" "-C" "/tmp/git-checkout19f801a9" "reset"
"--hard" "-q" "ae1eaa1ecb788f8ede49c45f775f509d3462bf20"
fatal: Unable to create '/tmp/git-checkout19f801a9/.git/index.lock': File exists.
Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
2023-01-27 15:39.18: Exec: "git" "-C" "/var/lib/ocurrent/var/git/opam-repository.git-a0bc4b41ddb868605dc1500002c27cfa79faf4389fe53b6754817ea16b48d458"
"branch" "-f" "fetch-ae1eaa1ecb788f8ede49c45f775f509d3462bf20"
"ae1eaa1ecb788f8ede49c45f775f509d3462bf20"
2023-01-27 15:39.24: Job failed: Command "git" "-C" "/tmp/git-checkout19f801a9" "reset" "--hard" "-q"
"ae1eaa1ecb788f8ede49c45f775f509d3462bf20" exited with status 128
Nice! /tmp/git-checkoutXYZ
appears to be created specifically for that job with no concurrent git processes accessing it, so my best guess is that the cp -a /var/lib/...
triggers during an update to the reference repo and copies along the lock? (but then git submodule deinit
ignores the lock? >_>)
Happened in https://github.com/ocaml/opam-repository/pull/23047#issuecomment-1403837851 (noticed live)
1) all the jobs for the PR finished and was all green 2) i merged some PRs (many in a row) 3)
(analysis)
is thus recalculated 4) but all the other jobs disappeared from the web-ui and the github status moved back from green to active 5) waited for the(analysis)
job to finish and all the previous jobs came back directly without any rebuild4)
should not happen as this job is latched