ocurrent / opam-repo-ci

An OCurrent pipeline for testing submissions to opam-repository
Apache License 2.0
20 stars 22 forks source link

Freeze caused by `opam lint` inside `Lwt_preemptive` thread. #300

Closed benmandrew closed 4 months ago

benmandrew commented 5 months ago

When running opam-repo-ci-service locally, once all of the jobs have finished the service goes into a state where the OCurrent graph does not render in the web-ui, and the process cannot be Ctrl-C'd. This may be related to https://github.com/ocurrent/opam-repo-ci/issues/293.

The issue occurs because in the linting step, we call OpamFileTools.lint inside a pre-emptive thread.

https://github.com/ocurrent/opam-repo-ci/blob/18a5dd1db585c0003ee538eaad09e206fc3049fe/lib/lint.ml#L300-L309

If I remove the threading as shown below, then I no longer see the issues on my machine.

let opam_lint ~check_extra_files ~errors ~pkg opam =
    OpamFileTools.lint ~check_extra_files ~check_upstream:true opam |>
    List.fold_left begin fun errors -> function
      | (67, _, _) -> errors (* TODO: Disable error 67 (Checksum specified with a non archive url) while discussing a fix:
                                - https://github.com/ocaml/opam/pull/4834
                                - https://github.com/ocaml/opam/pull/4960 *)
      | x -> (pkg, OpamLint x) :: errors
    end errors
    |> Lwt.return

The threading was introduced in This was introduced in https://github.com/ocurrent/opam-repo-ci/pull/98 to avoid 504s when checking "large PRs" (I assume this means PRs with many packages), as OpamFileTools.lint blocks the main thread until it finishes. Note that Lwt_preemptive is also used for downloading the opam-repository tree with no issue, so this may be a problem with opam lint specifically.

https://github.com/ocurrent/opam-repo-ci/blob/18a5dd1db585c0003ee538eaad09e206fc3049fe/lib/lint.ml#L101-L109

(I'm running on an ARM64 Mac.)

benmandrew commented 5 months ago

I have been unable to create a minimal example using only opam lint inside an Lwt_preemptive thread — I suspect that there is some interaction with the OCurrent pipeline that is causing the issue.

If others see this issue, then I propose removing the preemptive threading. I'd rather correct code that's sometimes a little unresponsive.

shonfeder commented 5 months ago

I propose removing the preemptive threading. I'd rather correct code that's sometimes a little unresponsive.

I am in favor of this change. It would be nice if we could first test it out on some large input just to safety check whether it may have unexpectedly high cost. But from what I understand we have no really strong reason to think so.

talex5 commented 5 months ago

It would be nice to know why this happens though. A debugger will show what all the threads are doing when it's stuck, which might be useful. If a thread is stuck in the kernel, /proc/PID/stack will show what it's doing (on Linux).

benmandrew commented 4 months ago

I'm unfamiliar with kernel-level stuff so that's a bit out of my scope for now.

Some added context for anyone who's investigating is that this issue does not seem to appear on Linux, only MacOS.