veertuinc / gitlab-runner

Establishing an SSH connection doesn't retry if it's slow to start #14

Closed NorseGaud closed 2 years ago

NorseGaud commented 3 years ago

I noticed that only one SSH attempt is made. Sometimes it happens so soon after the VM starts that it fails, even though a retry would have succeeded.

  "=== JOB LOG ===",
        "\u001b[0KRunning with gitlab-runner 13.11.0/1.4.0 (059848ac)",
        "\u001b[0;m\u001b[0K  on localhost project specific runner KyFZ35N-",
        "\u001b[0;msection_start:1634035346:prepare_executor",
        "\u001b[0K\u001b[0K\u001b[36;1mPreparing the \"anka\" executor\u001b[0;m",
        "\u001b[0;m\u001b[0KOpening a connection to the Anka Cloud Controller: http://19X2.XX",
        "\u001b[0;m\u001b[0K\u001b[36;1mStarting Anka VM using:\u001b[0;m",
        "\u001b[0;m\u001b[0K  - Template UUID: 5d1b40b9-7e68-4807-a290-c59c66e926b4",
        "\u001b[0;m\u001b[0K  - Tag: v1-with-file",
        "\u001b[0;m\u001b[0K  - Node Group: gitlab-test-group-env",
        "\u001b[0;m\u001b[0KPlease be patient...",
        "\u001b[0;m\u001b[0KYou can check the status of starting your Instance on the Anka Cloud Controller: http://192.XX/#/instances",
        "\u001b[0;m\u001b[0KVerifying connectivity to the VM - Host: 10.8.1.12 Port: 10000 Instance UUID: 2af9c1ed-f3b0-499b-49ad-5acb35d4c680 ",
        "\u001b[0;m\u001b[0KSSH Error to VM:ssh Dial() error: ssh: handshake failed: read tcp 1XXX9:41514->10.8.1.12:10000: read: connection reset by peer &{2af9c1ed-f3b0-499b-49ad-5acb35d4c680 10.8.1.12 10000 veertus-MacBook-Pro.local 10.8.1.12}",
        "\u001b[0;m\u001b[0KTerminating Anka VM  2af9c1ed-f3b0-499b-49ad-5acb35d4c680",
        "\u001b[0;msection_end:1634035526:prepare_executor",
        "\u001b[0K\u001b[31;1mERROR: Job failed (system failure): ssh Dial() error: ssh: handshake failed: read tcp 192.168.122.249:41514->10.8.1.12:10000: read: connection reset by peer",
        "\u001b[0;m",
        "================================================================",
        "sequential-variables-example, PIPELINE_ID: 2, JOB_ID: 4 FAILED!"

It doesn't look like we have control over this in gitlab-runner. We should find a way to address it with a retry mechanism whose attempt count and delay can be set in the runner config.