runfinch / finch

The Finch CLI is an open source client for container development
https://www.runfinch.com
Apache License 2.0
3.5k stars 91 forks source link

e2e tests failing intermittently #140

Closed sam-berning closed 3 weeks ago

sam-berning commented 1 year ago

Describe the bug During the process of reviewing and merging https://github.com/runfinch/finch/pull/133, I noticed intermittent and unpredictable failures in the e2e tests. I addressed some of these issues in the PR itself, but some of the failures persist. In the repository's current state, I've seen two types of failures:

  1. finch vm stop timeouts Occasionally, finch vm stop takes longer than 60 seconds to execute. I see such a timeout happen in about every other run of the e2e tests. The tests run this command ~20 times, and we run the e2e tests on 4 different hosts each test run. So it seems to happen on about 1% of stop commands. Because the tests assume that the vm is in a particular state before the test starts, this can also cause carry-over effects to other tests, which results in multiple tests failing.
  2. issues with BuildKit starting up I've only seen this one time so far, but it didn't seem caused by a finch vm stop timeout. Below are the relevant logs:
    time="2023-01-07T01:17:37Z" level=error msg="`buildctl` needs to be installed and `buildkitd` needs to be running, see https://github.com/moby/buildkit , and `containerd-rootless-setuptool.sh install-buildkit` for OCI worker or `containerd-rootless-setuptool.sh install-buildkit-containerd` for containerd worker" error="3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"
    time="2023-01-07T01:17:37Z" level=fatal msg="no buildkit host is available, tried 3 candidates: 3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"

The test failures seem to happen more often on ARM hosts than x86 hosts. The ARM hosts running OSX 11.7 in particular seem to have issues with finch vm stop timeouts.

Steps to reproduce Run e2e tests

Expected behavior Tests should succeed consistently

Screenshots or logs If applicable, add screenshots or logs to help explain your problem.

Additional context Add any other context about the problem here.

ahsan-z-khan commented 1 year ago

Failure captured in my PR:

Finch CLI e2e Tests Run a container image when running a container that echos dummy output should not echo dummy output if running with -d flag
/Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/tests/run.go:59
  d64e113339f9d2b97d87dea6eeedc776d1bc5679bbccc52f75f876584ec586f9
  No containers to be removed
  ce14a6258f37
  No images to be removed
  9bb07d084c6f3455a2e739b9d4c62e7171fe4312db73846dab17f46da769c919
  bridge
  host
  none
  No networks to be removed
  time="2023-01-11T03:20:37Z" level=error msg="`buildctl` needs to be installed and `buildkitd` needs to be running, see https://github.com/moby/buildkit , and `containerd-rootless-setuptool.sh install-buildkit` for OCI worker or `containerd-rootless-setuptool.sh install-buildkit-containerd` for containerd worker" error="3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"
  time="2023-01-11T03:20:37Z" level=fatal msg="no buildkit host is available, tried 3 candidates: 3 errors occurred:\n\t* failed to ping to host unix:///run/user/501/buildkit-finch/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit-default/buildkitd.sock: exit status 1\n\t* failed to ping to host unix:///run/user/501/buildkit/buildkitd.sock: exit status 1\n\n"
  time="2023-01-11T03:20:37Z" level=fatal msg="exit status 1"
  [FAILED] in [BeforeEach] - /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/command/command.go:122 @ 01/11/23 03:20:37.807
  d64e113339f9d2b97d87dea6eeedc776d1bc5679bbccc52f75f876584ec586f9
  No containers to be removed
  ce14a6258f37
  No images to be removed
  9bb07d084c6f3455a2e739b9d4c62e7171fe4312db73846dab17f46da769c919
  bridge
  host
  none
  No networks to be removed
• [FAILED] [1.103 seconds]
Finch CLI e2e Tests Run a container image when running a container that echos dummy output [BeforeEach] should not echo dummy output if running with -d flag
  [BeforeEach] /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/tests/run.go:45
  [It] /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/tests/run.go:59

  [FAILED] Expected
      <int>: 1
  to match exit code:
      <int>: 0
  In [BeforeEach] at: /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/command/command.go:122 @ 01/11/23 03:20:37.807
davidhsingyuchen commented 1 year ago

This set of 4 failures also often appear together (e.g., also in some dependabot PRs). One of them is caused by vm stop timeout, and the remaining 3 are caused by buildkit.

  [FAIL] Finch CLI e2e Tests Run a container image when running a container that echos dummy output [BeforeEach] should echo dummy output
  /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/command/command.go:122
  [FAIL] Finch CLI e2e Tests Run a container image when running a container that echos dummy output [BeforeEach] should not echo dummy output if running with -d flag
  /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/command/command.go:122
  [FAIL] Finch CLI e2e Tests Run a container image when running a container with environment related flags [It] with --entrypoint flag, ENTRYPOINT in dockerfile should not be executed
  /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/command/command.go:122
  [FAIL] Finch CLI e2e Tests Config [It] updates config values when a config file is present
  /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/command/command.go:115
weikequ commented 1 year ago

Interesting, from https://github.com/runfinch/finch/pull/154, it seems like a host encountered a vm stop timeout command and then proceeded to fail more with a buildkit related error in a "refreshed" run.

sam-berning commented 1 year ago

A couple of strange failures I've run into on #154, both on the ARM 11.7 runners:

  1. Stop-start-stop: looks like the stop command registers failure, start command starts up, and somehow the stop command notices the VM has changed state and stops it immediately?
    [FAILED] in [It] - /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/command/command.go:115 @ 01/13/23 00:09:08.278
    time="2023-01-13T00:09:08Z" level=fatal msg="the instance \"finch\" is already stopped"
    time="2023-01-13T00:09:08Z" level=info msg="Starting existing Finch virtual machine..."
    time="2023-01-13T00:09:09Z" level=info msg="Finch virtual machine stopped successfully"
    time="2023-01-13T00:09:10Z" level=error msg="Finch virtual machine failed to start, debug logs: time=\"2023-01-13T00:09:08Z\" level=info msg=\"Using the existing instance \\\"finch\\\"\"\ntime=\"2023-01-13T00:09:08Z\" level=info msg=\"Attempting to download the nerdctl archive from \\\"https://github.com/containerd/nerdctl/releases/download/v1.1.0/nerdctl-full-1.1.0-linux-arm64.tar.gz\\\"\" digest=\"sha256:3b613a1be5a24460c44bb93a3609b790ada94e06efd1a86467d45bec7da8b449\"\ntime=\"2023-01-13T00:09:08Z\" level=info msg=\"Using cache \\\"/Users/ec2-user/Library/Caches/lima/download/by-url-sha256/c69be86b3e48430e3d687d54361cf15b90e2c067fae6a294ca6292d41f42bf0e/data\\\"\"\ntime=\"2023-01-13T00:09:10Z\" level=warning msg=\"[hostagent] Reducing the guest memory from 4GiB to 3GiB, to avoid host kernel panic on macOS <= 12.3 with QEMU >= 7.0; Please update macOS to 12.4 or later, or downgrade QEMU to 6.2; See https://github.com/lima-vm/lima/issues/795 for the further background.\"\ntime=\"2023-01-13T00:09:10Z\" level=info msg=\"[hostagent] Mounting disk \\\"finch\\\" on \\\"/mnt/lima-finch\\\"\"\ntime=\"2023-01-13T00:09:10Z\" level=fatal msg=\"exiting, status={Running:false Degraded:false Exiting:true Errors:[] SSHLocalPort:0} (hint: see \\\"/Users/ec2-user/ar/_work/finch/finch/_output/lima/data/finch/ha.stderr.log\\\")\"\n"
    time="2023-01-13T00:09:10Z" level=fatal msg="exit status 1"
  2. Finch compose down failure:

    [FAILED] [2.657 seconds]
    Finch Container Development E2E Tests Compose down command [It] should stop compose services and delete volumes by specifying --volumes flag
    /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/tests/compose_down.go:45
    
    [FAILED] Expected
      <[]uint8 | len:13, cap:13>: [100, 53, 98, 54, 98, 102, 52, 52, 100, 52, 49, 97, 10]
    to be empty
    In [It] at: /Users/ec2-user/go/pkg/mod/github.com/runfinch/common-tests@v0.2.0/tests/tests.go:134