ocurrent / opam-repo-ci

An OCurrent pipeline for testing submissions to opam-repository
Apache License 2.0
20 stars 22 forks source link

BUG: jobs temporary disappeared while recalculating (analysis) #202

Open kit-ty-kate opened 1 year ago

kit-ty-kate commented 1 year ago

Happened in https://github.com/ocaml/opam-repository/pull/23047#issuecomment-1403837851 (noticed live)

1) all the jobs for the PR finished and was all green 2) i merged some PRs (many in a row) 3) (analysis) is thus recalculated 4) but all the other jobs disappeared from the web-ui and the github status moved back from green to active 5) waited for the (analysis) job to finish and all the previous jobs came back directly without any rebuild

4) should not happen as this job is latched

kit-ty-kate commented 1 year ago

@art-w so far, to me it seems that the pattern is: 1) some PR gets merge 2) the analysis restarts 3) it fails for whatever reason

i can see some of these PRs change to a github status with a red cross instead of "in progress"

art-w commented 1 year ago

Yes you are absolutely right!

When a PR gets merged, master is updated: In practice only the analysis is strict about using the latest HEAD, but master is still an ocurrent dependency of all the other jobs (lint / build) (master gets past to those Current_cache as "extra context" meaning that previous runs with an older master are reused for free, but ocurrent still has to traverse the whole graph whenever master changes -- at the scale of opam-repo-ci this might explains the freezes following a merge?)

I think we could cheat a bit here by gating master such that the latest master value would only be propagated if the analysis' result had also changed (.. and this should produce the same outcome as today ^^') Using https://github.com/ocurrent/ocurrent/pull/330 this ugly code might work:

let analysis = Analyse.examine ~master src in (* same as before *)
let master_analysis =
  Current.cutoff
     (* keep the previous master if the analysis hasn't changed *)
     ~eq:(fun (_old_master, old_analysis) (_new_master, new_analysis) ->
            Analyse.Analysis.equal old_analysis new_analysis)
  @@ Current.pair master analysis
in
(* see below: protect from analysis failures here with [latch]? *)
let master = Current.map fst master_analysis in
let analysis = Current.map snd master_analysis in

Now the situation is even worse when the analysis fails, because then the whole PR subgraph gets invalidated.. When the next update to master re-triggers the analysis and it eventually succeeds, ocurrent has to rebuild the subgraph from scratch and ask the Current_cache sqlite database to recover each previous job result. I'm guessing it would be nicer to use Pipeline.latch to hide those analysis failures on re-run:

let master_analysis = latch ~label:"analysis" master_analysis in

But this would hide the "second re-run" analysis failures from the github status, so perhaps it needs to be sent separately like you did for the lint pass? :)

kit-ty-kate commented 1 year ago

Bingo! I managed to catch one of the failures in (analysis):

2023-01-27 15:31.55: New job: Analyse
2023-01-27 15:31.55: Waiting for resource in pool analyse
2023-01-27 15:39.14: Got resource from pool analyse
2023-01-27 15:39.14: Checking out commit ae1eaa1e. To reproduce:
                       git clone --recursive "https://github.com/ocaml/opam-repository.git" && cd "opam-repository" && git fetch origin "refs/pull/23140/head" && git reset --hard ae1eaa1e
2023-01-27 15:39.14: Exec: "cp" "-a" "--" "/var/lib/ocurrent/var/git/opam-repository.git-a0bc4b41ddb868605dc1500002c27cfa79faf4389fe53b6754817ea16b48d458/.git" 
                           "/tmp/git-checkout19f801a9"
2023-01-27 15:39.16: Exec: "git" "-C" "/tmp/git-checkout19f801a9" "submodule" 
                           "deinit" "--force" "--all"
2023-01-27 15:39.17: Exec: "git" "-C" "/tmp/git-checkout19f801a9" "reset" 
                           "--hard" "-q" "ae1eaa1ecb788f8ede49c45f775f509d3462bf20"
fatal: Unable to create '/tmp/git-checkout19f801a9/.git/index.lock': File exists.

Another git process seems to be running in this repository, e.g.
an editor opened by 'git commit'. Please make sure all processes
are terminated then try again. If it still fails, a git process
may have crashed in this repository earlier:
remove the file manually to continue.
2023-01-27 15:39.18: Exec: "git" "-C" "/var/lib/ocurrent/var/git/opam-repository.git-a0bc4b41ddb868605dc1500002c27cfa79faf4389fe53b6754817ea16b48d458" 
                           "branch" "-f" "fetch-ae1eaa1ecb788f8ede49c45f775f509d3462bf20" 
                           "ae1eaa1ecb788f8ede49c45f775f509d3462bf20"
2023-01-27 15:39.24: Job failed: Command "git" "-C" "/tmp/git-checkout19f801a9" "reset" "--hard" "-q" 
"ae1eaa1ecb788f8ede49c45f775f509d3462bf20" exited with status 128
art-w commented 1 year ago

Nice! /tmp/git-checkoutXYZ appears to be created specifically for that job with no concurrent git processes accessing it, so my best guess is that the cp -a /var/lib/... triggers during an update to the reference repo and copies along the lock? (but then git submodule deinit ignores the lock? >_>)