oxidecomputer / buildomat

a software build labour-saving device
Mozilla Public License 2.0

ignore ENOENT when uploading best-effort log files #16

Closed davepacheco closed 2 months ago

davepacheco commented 1 year ago

An Omicron build failed after everything appeared to work, but buildomat failed to upload a log file because it was gone by the time it got to it:

381 2023-02-23T05:33:13.697Z    uploading: /zone/oxz_oximeter/root/var/svc/log/system-illumos-oximeter:default.log (67780 bytes)
382 2023-02-23T05:33:13.723Z    upload warning: file "/zone/oxz_oximeter/root/var/svc/log/system-illumos-oximeter:default.log" changed size mid upload: 67780 -> 69689
383 2023-02-23T05:33:13.749Z    uploaded: /zone/oxz_oximeter/root/var/svc/log/system-illumos-oximeter:default.log
384 2023-02-23T05:33:13.776Z    uploading: /zone/oxz_propolis-server_a2bb383d-7f58-48a0-82fd-cd0dc525b49c/root/var/svc/log/system-illumos-propolis-server:vm-a2bb383d-7f58-48a0-82fd-cd0dc525b49c.log (31419 bytes)
385 2023-02-23T05:33:13.802Z    upload error: open "/zone/oxz_propolis-server_a2bb383d-7f58-48a0-82fd-cd0dc525b49c/root/var/svc/log/system-illumos-propolis-server:vm-a2bb383d-7f58-48a0-82fd-cd0dc525b49c.log" failed: Os { code: 2, kind: NotFound, message: "No such file or directory" }

These were log files specified with the % prefix that's documented in the README like this:

By default, the system attempts to ensure that a job has not accidentally left background processes running that continue to modify the output artefacts. If the size or modified time of a file changes while it is being uploaded, the job will fail. To relax this restriction, the % prefix may be used to signify that "this file is allowed to change while it is being uploaded". This is used to make best effort uploads of diagnostic log files for background processes which may continue running even though the job is nominally complete.
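To make the rule in that README passage concrete, here is a minimal sketch of the check it describes: compare a file's size and modified time before and after the upload, and treat a difference as fatal unless the file was marked with `%`. The names `Snapshot`, `snapshot`, `Verdict`, and `judge` are hypothetical and are not taken from the buildomat source.

```rust
use std::fs;
use std::io;
use std::path::Path;
use std::time::SystemTime;

/// Size and modified time of a file at one point in time (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Snapshot {
    size: u64,
    modified: SystemTime,
}

/// Record the current size and modified time of `path`.
fn snapshot(path: &Path) -> io::Result<Snapshot> {
    let md = fs::metadata(path)?;
    Ok(Snapshot { size: md.len(), modified: md.modified()? })
}

/// What to do about a file after comparing its before/after snapshots.
#[derive(Debug, PartialEq, Eq)]
enum Verdict {
    /// Nothing changed during the upload.
    Unchanged,
    /// The file changed, but it was marked with `%`, so only warn.
    ChangedAllowed,
    /// The file changed and was not marked with `%`, so fail the job.
    ChangedFatal,
}

/// Decide the outcome; `allow_change` stands in for the `%` prefix.
fn judge(before: Snapshot, after: Snapshot, allow_change: bool) -> Verdict {
    if before == after {
        Verdict::Unchanged
    } else if allow_change {
        Verdict::ChangedAllowed
    } else {
        Verdict::ChangedFatal
    }
}
```

Under this sketch, the oximeter log above (67780 bytes growing to 69689) would yield `ChangedAllowed`, which matches the "upload warning" in the job output rather than a failure.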

It seems like we could ignore this error for these best-effort log uploads. (In this particular case, I did not dig too deep, but I imagine the test deployed a VM and then destroyed it, but did not wait for the destroy to complete. Thus when buildomat started cleaning up, it found the log file, but something was still in the process of removing it. You could also argue that the test should wait for the destroy, or that buildomat should kill everything first, but I'm not sure either of those is better here or in general.)
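A rough sketch of what that could look like, assuming the uploader knows whether a given path was marked best-effort. The names `OpenOutcome` and `open_for_upload` are made up for illustration and do not come from the buildomat source. The only behaviour change is that a `NotFound` error on open is treated as "skip this file" when the file is best-effort; every other error, and `NotFound` on a regular output, still fails.

```rust
use std::fs::File;
use std::io::{self, ErrorKind};
use std::path::Path;

/// Result of trying to open an output file for upload (hypothetical type).
#[derive(Debug)]
enum OpenOutcome {
    /// The file was opened and can be uploaded.
    Opened(File),
    /// The file disappeared before it could be opened; skipped because it
    /// was a best-effort (`%`) upload.
    Skipped,
}

/// Open `path` for upload. If `best_effort` is true and the file no longer
/// exists (ENOENT), report `Skipped` instead of an error.
fn open_for_upload(path: &Path, best_effort: bool) -> io::Result<OpenOutcome> {
    match File::open(path) {
        Ok(f) => Ok(OpenOutcome::Opened(f)),
        Err(e) if best_effort && e.kind() == ErrorKind::NotFound => {
            Ok(OpenOutcome::Skipped)
        }
        Err(e) => Err(e),
    }
}
```

A caller would presumably log a warning on `Skipped` and move on to the next file. Note that this only covers the file vanishing before the open; a file removed partway through the upload would need separate handling.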

jclulow commented 2 months ago

This should be fixed as of 12e1d0b27f5c759e52fcd20f509c82f7a435cde4 which is now live. Please let me know if you see it again!