oscar-system / Oscar.jl

A comprehensive open source computer algebra system for computations in algebra, geometry, and number theory.
https://www.oscar-system.org
Other
339 stars 120 forks source link

Error `#E component 'PDFFile' must be bound to a string denoting a relative path to a readable file` in macOS CI #3967

Open lgoettgens opened 1 month ago

lgoettgens commented 1 month ago

Seen on a recent master, e.g. in https://github.com/oscar-system/Oscar.jl/actions/runs/10066028282/job/27826455858#step:9:239. It seems to only occur on the macOS runners.

The only reference to PDFFile I found in this repo is https://github.com/oscar-system/Oscar.jl/blob/8163ba8856c7c5d7643c4f7c78a082eabbe09575/gap/OscarInterface/PackageInfo.g#L71 so I suspect the GAP interface to be somehow the culprit of this. Maybe @fingolfin or @ThomasBreuer have some insight into this?

More breadcrumbs:

fingolfin commented 1 month ago

The comment could in principle refer to any of the 150 packages shipped with GAP and being loaded. But they all should have their manual as a PDF bundled, except for OscarInterface and JuliaInterface.

This message shows up when something calls ValidatePackageInfo on a package's metadata. I'll submit a PR to avoid this warning for OscarInterface

But I wonder: why is it being shown at all? Why / what is validating GAP package metadata here? And is it really OscarInterface it complains about, or something else? Perhaps @ThomasBreuer has an idea?

lgoettgens commented 1 month ago

maybe doing something similar to https://github.com/oscar-system/Oscar.jl/pull/3968 for JuliaInterface (and JuliaExperimental) gets rid of it? (But that of course needs https://github.com/oscar-system/Oscar.jl/pull/3688 to be merged before we can see any changes here in CI)

ThomasBreuer commented 1 month ago

Code in GAP's PackageManager calls ValidatePackageInfo in several places, I guess this is the source of the output (without having verified this).

One of the problems is that ValidatePackageInfo calls Print if it finds something to complain about; this function should have a quiet mode, and PackageManager should pass its own quiet flag on to ValidatePackageInfo.

(I do not understand why we did not get these messages earlier.)

lgoettgens commented 1 month ago

After some digging around I now have the following additional information:

So @ThomasBreuer and I think that this is somehow due to the way that the RPTU runner re-uses things from previous runs.

@aaruni96 will look into properly separating and cleaning up the Julia Homes on the RPTU runner when he is back in KL. Let's hope that this then resolves this issue

fingolfin commented 1 month ago

@lgoettgens @ThomasBreuer so which things are re-used? Is this about a git clone of a dev version of recog sticking around?

If so it could probably be solved by doing rm -rf ~/.gap ~/.julia/gaproot at the start of our CI workflows (at least when they are running on the self-hosted runners).

Or is there anything else you suspect might be relevant?

lgoettgens commented 1 month ago

This unfortunately won't work, as it will introduce race conditions between the different runners on the self-hosted machine. But combining this together with separating the JULIA_HOMEs of the different runners is exactly what @aaruni96 wanted to try

fingolfin commented 1 month ago

@aaruni96 already separated them now

lgoettgens commented 1 month ago

I just checked the latest CI run on master (which was 3 days ago), and this doesn't have these prints anymore. See https://github.com/oscar-system/Oscar.jl/actions/runs/10130857450/job/28149030285#step:9:352. Thus I think this issue can be closed

aaruni96 commented 1 month ago

which was 3 days ago

I made the change last evening, and re ran just the short test, to make sure nothing is broken as a side effect.

But the long test appears to have failed "one hour ago" with a similar (but not the same) error ( https://github.com/oscar-system/Oscar.jl/actions/runs/10130857450/job/28149030653#step:9:240 ). You can see that its using artefacts from the runner specific depot

Error, file "/Users/aaruni/Desktop/oscar-runners/runner-4/julia-depot/scratchspaces/c863536\
a-3901-11e9-33e7-d5cd0df7b904/gap_14920021487703909718_1.10/pkg/gapdoc/lib/GAP\
Doc2HTML.gi" must exist and be readable

This might be related to the same problem ?

aaruni96 commented 1 month ago

Note, that we have just done a separation of the depots, there is still no "cleanup" phase which removes stuff for a fresh start.

aaruni96 commented 1 month ago

Github runners do allow Pre / Post hook scripts ( https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/running-scripts-before-or-after-a-job ), so, a decision needs to be made about what we should do in the cleanup phase, and then I can work on making it happen.

fingolfin commented 1 month ago

I am surprised to see this error about GAPDoc2HTML.gi missing -- the suspected cause for this (race condition) should be eliminated now that the runner have separate depots. I just had a look at the machine and the file is there now...

But this suggests that in addition to the two directories I already suggested, perhaps we should also rm -rf JULIADEPOT/scratchspaces ?

lgoettgens commented 1 month ago

I think we should just clear the complete JULIA_DEPOT and instead use the julia-actions/cache action again. This should cache exactly those parts that are safe to keep between runs.

fingolfin commented 1 month ago

I disagree. That wastes many dozens of gigabytes of bandwidth every day.

lgoettgens commented 1 month ago

I disagree. That wastes many dozens of gigabytes of bandwidth every day.

True...

So removing .gap, .julia/gaproot, and .julia/scratchspaces (or whatever they are called in the multi-depot setup) in a job hook seems to be a good compromise