runfinch / infrastructure

Infrastructure to build, test, and release Finch
Apache License 2.0

ci: Build action has been consistently failing macos-arm64-build in finch-core #367

Open ginglis13 opened 1 year ago

ginglis13 commented 1 year ago

The Build action has been consistently failing for the last month: https://github.com/runfinch/finch-core/actions/workflows/release.yaml

GitHub runners use passwordless sudo. However, runners provisioned via finch infra don't allow passwordless sudo. (EDIT: this is observed consistently on the macOS 12 runner for arm64: https://github.com/runfinch/finch-core/actions/runs/6010202363/job/16301161154)

The last log lines in the Make and release deps step before the timeout have been:

if [ "Darwin" != "Linux" -a ! -e "/opt/homebrew/bin/nerdctl" ]; then ln -sf nerdctl.lima "/opt/homebrew/bin/nerdctl"; fi
if [ "Darwin" != "Linux" -a ! -e "/opt/homebrew/bin/apptainer" ]; then ln -sf apptainer.lima "/opt/homebrew/bin/apptainer"; fi
sudo may prompt for password to run FileMonitor
Error: The operation was canceled. # <-- timeout, cancelled workflow

This message is coming from https://github.com/runfinch/finch-core/blob/08a4ca2a9285f1dd2fac3bd4701087b1b2fdec87/bin/lima-and-qemu.pl#L46

Still looking to verify, but the smoking gun is that the script appears to hang on a password prompt.
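One way to test the password-prompt hypothesis on a runner is a non-interactive sudo call. This is a minimal sketch, not part of the finch-core workflow: the function name can_sudo_without_prompt is hypothetical, and it relies only on sudo's -n flag, which makes sudo fail instead of prompting.

```shell
#!/bin/sh
# Hypothetical helper (not from finch-core): succeeds only if sudo can run
# a command without asking for a password.
# `sudo -n` is non-interactive: it exits with an error rather than prompting.
can_sudo_without_prompt() {
    sudo -n true 2>/dev/null
}

if can_sudo_without_prompt; then
    echo "passwordless sudo is available"
else
    echo "sudo would prompt for a password (or is unavailable)"
fi
```

Running this as an early workflow step would fail fast with a clear message instead of waiting for the job timeout.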

On my machine (macOS Ventura 13.4, M1 chip):

./bin/lima-and-qemu.pl                                            
ls: /opt/homebrew/bin/limactl: No such file or directory
Missing argument in sprintf at ./bin/lima-and-qemu.pl line 213.
sudo may prompt for password to run FileMonitor
Password:

weikequ commented 1 year ago

Thanks for bringing this up! I am slightly confused: why does the x86 one work, but not the arm64 one? Shouldn't they be based on the same underlying infra? In addition, our e2e tests on runfinch/finch work fine with sudo commands. See this example. I wonder if it has anything to do with this being a perl script that's run instead of a normal bash/zsh script.

ginglis13 commented 1 year ago

I am slightly confused, why does the x86 one work, but the arm64 one not work? Shouldn't they be based off the same underlying infra?

Yes, they should be, from what I can tell. That discrepancy is the crux of the issue, and I do not have a root cause for it yet. Take a look at this recent execution of the Build action: https://github.com/runfinch/finch-core/actions/runs/6010202363/job/16301161154

You can see the prompt for sudo is blocking. This has been consistent over the last 3 months (at least from what I can see).

I wonder if it has anything to do with this being a perl script that's run instead of a normal bash/zsh script

Maybe... but this is observed only on a specific macOS version/architecture; the perl script works fine on the others.

weikequ commented 1 year ago

Did a quick test on this runner by changing the workflow to the following:

...
          sudo echo hi
          ./bin/lima-and-qemu.pl
...

The runner gets stuck on ./bin/lima-and-qemu.pl, not on sudo echo hi.

weikequ commented 1 year ago

Hmm, also not a perl thing:


Run echo '#!/usr/bin/env perl' >> test.pl
  echo '#!/usr/bin/env perl' >> test.pl
  echo 'system("sudo echo sudoed")' >> test.pl
  chmod u+x test.pl
  ./test.pl
  shell: /bin/zsh {0}
  env:
    GO111MODULE: on
sudoed

weikequ commented 1 year ago

The workflow hangs on the line sleep(1) until -s $filemonitor; and not on sudo's password entry. @vsiravar can you take a look at why it's not correctly evaluating the size changes? From lima-and-qemu.pl:

...
END { system("sudo pkill FileMonitor") }
system("sudo echo this-should-show");                # this shows up
print "sudo may prompt for password to run FileMonitor\n";
system("sudo -b /Applications/FileMonitor.app/Contents/MacOS/FileMonitor >$filemonitor 2>/dev/null");
system("sudo echo this-should-show");                # this shows up
sleep(1) until -s $filemonitor;
system("sudo echo this-probably-wont-show");         # this does not show up
...

vsiravar commented 1 year ago

The weird thing though is that it does not hang on a self-hosted runner provisioned manually. Log from a previous run. Does anything show up in the filemonitor.log when the workflow hangs?

weikequ commented 1 year ago

No. I tried inserting system("sudo cat $filemonitor"); right before the sleep, but nothing is displayed.

vsiravar commented 1 year ago

can you take a look at why it's not correctly evaluating the size changes?

sleep(1) until -s $filemonitor; is behaving as expected since $filemonitor is empty.
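Because the loop only ends once the file is non-empty, a FileMonitor that never writes anything makes it spin until the job is cancelled. A bounded version of the same wait would surface that case with an error. The sketch below is a shell rendering of the idea, not a patch to lima-and-qemu.pl; the function name wait_for_nonempty and the timeout argument are assumptions made for illustration.

```shell
#!/bin/sh
# wait_for_nonempty FILE TIMEOUT_SECONDS  (hypothetical helper)
# Polls once per second until FILE exists with a size greater than zero,
# which is the same condition Perl's `-s` tests, and gives up with a
# non-zero status once TIMEOUT_SECONDS have elapsed.
wait_for_nonempty() {
    file=$1
    timeout=$2
    elapsed=0
    until [ -s "$file" ]; do
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "timed out after ${timeout}s waiting for output in $file" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

A call such as wait_for_nonempty "$filemonitor" 60 || exit 1 (the 60-second limit is an arbitrary choice) would turn the silent hang into an explicit failure in the log.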

Did you also check whether the /Applications/FileMonitor.app/Contents/MacOS/FileMonitor process is running after system("sudo -b /Applications/FileMonitor.app/Contents/MacOS/FileMonitor >$filemonitor 2>/dev/null");?
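For a check like this, pgrep -f matches against the full command line and does not report itself, so the output is not cluttered by the grep process the way ps -ax | grep is. A minimal sketch follows; the helper name is_running is hypothetical.

```shell
#!/bin/sh
# is_running PATTERN  (hypothetical helper)
# Succeeds if at least one process has a command line matching PATTERN.
# pgrep exits 0 when it finds a match and non-zero otherwise.
is_running() {
    pgrep -f "$1" >/dev/null 2>&1
}
```

For example, is_running "FileMonitor.app/Contents/MacOS/FileMonitor" && echo "FileMonitor is running" could be placed right after the sudo -b call.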

weikequ commented 1 year ago

96674 ??         0:00.00 sudo -b /Applications/FileMonitor.app/Contents/MacOS/FileMonitor
96675 ??         0:00.00 /Applications/FileMonitor.app/Contents/MacOS/FileMonitor
96676 ??         0:00.00 sh -c ps -ax | grep FileMonitor
96678 ??         0:00.00 grep FileMonitor

weikequ commented 1 year ago

Update after troubleshooting: FileMonitor requires (or makes Terminal require) Full Disk Access. It is unclear why macOS 11 for x86 works.