tailscale / github-action

A GitHub Action to connect your workflow to your Tailscale network.
BSD 3-Clause "New" or "Revised" License

Network failure after tailscale connection is established #39

Closed paulpet closed 2 years ago

paulpet commented 2 years ago

Sometime within the last few days (without any changes to the workflow or Tailscale config) we've started seeing DNS lookup failures when our Ubuntu 20.04 runner is connected to Tailscale. Other clients on Tailscale continue to work as expected, and I have confirmed that the runner establishes a connection to Tailscale. I was using the v1 version of the Tailscale action, but tried the main version with the same results. Any thoughts on how to troubleshoot further?

EDIT: After further examination (see below) it appears all network functionality breaks during a workflow once the tailscale client connects.

paulpet commented 2 years ago

Looking deeper the runner seems to lose all network connectivity when tailscale initializes. It cannot even ping 1.1.1.1.

paulpet commented 2 years ago

  ping-pong:
    runs-on: ubuntu-20.04
    steps:
      - uses: actions/checkout@v2
      - name: Tailscale connection
        uses: tailscale/github-action@main
        with:
          authkey: ${{ secrets.TAILSCALE_AUTHKEY }}
      - name: Ping
        shell: bash
        run: |
           ping -c 5 1.1.1.1

This will fail to even ping 1.1.1.1.

Same failure occurs on Ubuntu 18.04 and Ubuntu 20.04 runner instances. I might try recreating the authkey - although the original had a no-expiry set.

paulpet commented 2 years ago

I generated a new auth key but it made no difference. A connection was established to Tailscale but then no further network activity occurred. I think I'm running out of ideas.


DentonGentry commented 2 years ago

I don't immediately know of a change made 3 days ago.

Approximately 11 days ago we updated the Tailscale version of the runner to 1.24.2: https://github.com/tailscale/github-action/pull/38, is there a chance the problem started longer ago than 3 days?

Never mind, you were using v1 explicitly, so you didn't automatically get the 1.24.2 Tailscale client when that change was submitted.

paulpet commented 2 years ago

I don't think it's related to the Tailscale version number, as I believe that if I run the Tailscale action at v1 (rather than main) it connects with v1.14 and still shows the same issues.

paulpet commented 2 years ago

Hoping you or someone can confirm it's not just me having these issues, as I'm out of ideas of what else to look into.

DentonGentry commented 2 years ago

Unfortunately so far as I know, it is just you. There have been no reports to support@tailscale.com about this Action recently, and no-one else has reported an issue here.

Are any of your runners still around, present in the admin panel and not deleted yet? I can look up their telemetry. Since they get an IP address, they must have managed to contact the coordination server at least briefly.

paulpet commented 2 years ago

github-fv-az241-70 is still connected.

Let me modify an action to keep a runner connected for the next 20m or so.

DentonGentry commented 2 years ago

> I might try recreating the authkey - although the original had a no-expiry set.

Ah:

We're expecting to provide a way to renew API keys: https://tailscale.com/kb/1101/api/ and then one can use an API key to create authkeys as needed.

paulpet commented 2 years ago

I noticed that earlier too, but our key wasn't due to expire until Jul 6th. I ended up creating a new one to be certain and it made no difference :(.

github-fv-az90-773 and github-fv-az316-221 should remain connected for the next 20 or so minutes.

Also, wouldn't the behavior be different if the auth key had expired? Even then, I would expect to still be able to ping 1.1.1.1.

justin-pierce commented 2 years ago

>   ping-pong:
>     runs-on: ubuntu-20.04
>     steps:
>       - uses: actions/checkout@v2
>       - name: Tailscale connection
>         uses: tailscale/github-action@main
>         with:
>           authkey: ${{ secrets.TAILSCALE_AUTHKEY }}
>       - name: Ping
>         shell: bash
>         run: |
>            ping -c 5 1.1.1.1
>
> This will fail to even ping 1.1.1.1.
>
> Same failure occurs on Ubuntu 18.04 and Ubuntu 20.04 runner instances. I might try recreating the authkey - although the original had a no-expiry set.

Does ping work before Tailscale connects? I read that GitHub runners run on Azure and that ping is disabled by design (and that seems to be the case when I try it).

I'm having an issue myself SSHing into a local machine through Tailscale from GitHub Actions (I'm not sure exactly when it broke or whether it's related to Tailscale, but I have regenerated some keys since it last worked).

edit: from docs:

GitHub hosts Linux and Windows runners on Standard_DS2_v2 virtual machines in Microsoft Azure with the GitHub Actions runner application installed. The GitHub-hosted runner application is a fork of the Azure Pipelines Agent. Inbound ICMP packets are blocked for all Azure virtual machines, so ping or traceroute commands might not work.
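
Given that ICMP is blocked on these runners, a TCP probe is a more telling connectivity check than ping. A minimal sketch, assuming bash (for its `/dev/tcp` feature) and the GNU coreutils `timeout` command are available on the runner. The function names `check_tcp` and `report` are illustrative only and are not part of the Tailscale action:

```shell
#!/usr/bin/env bash
# Sketch: probe outbound connectivity over TCP instead of ICMP.
# check_tcp and report are hypothetical helper names.

# check_tcp HOST PORT [TIMEOUT_SECONDS]
# Returns 0 if a TCP connection can be opened, non-zero otherwise.
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  # /dev/tcp/HOST/PORT is handled by bash itself in redirections.
  timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}

# report HOST PORT
# Prints a one-line result suitable for a workflow log.
report() {
  if check_tcp "$1" "$2"; then
    echo "ok: $1:$2 reachable over TCP"
  else
    echo "FAIL: $1:$2 not reachable over TCP"
  fi
}

# Example use in a workflow step, in place of `ping -c 5 1.1.1.1`:
#   report 1.1.1.1 443
```

Running `report` once before and once after the Tailscale step would show whether connectivity actually changes when the client connects.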

paulpet commented 2 years ago

You're right, it doesn't. I discovered that yesterday after more troubleshooting. My issue appears to be purely DNS related. The Tailscale client DNS configuration points at one of our servers running dnsmasq. While DNS resolution works for all other clients connected to Tailscale, it suddenly stopped working for the GitHub Actions runners, with no change on our side that I'm aware of. I wasn't able to replicate the issue on a completely different Tailscale account, so it's likely something I need to figure out how to resolve (or work around) rather than an issue with Tailscale or the Tailscale action. :-(
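
For anyone narrowing down a similar DNS problem, it can help to log which resolvers the runner is actually using after the Tailscale step. A rough sketch, assuming a Linux runner with `awk`, `grep` and `getent`. The helper names are made up for illustration. 100.100.100.100 is the address Tailscale uses for its MagicDNS resolver; on hosts using systemd-resolved, `/etc/resolv.conf` may list 127.0.0.53 instead, in which case `resolvectl status` is the better place to look:

```shell
#!/usr/bin/env bash
# Sketch: inspect the resolver configuration on a runner.
# All function names here are hypothetical helpers.

# list_nameservers [FILE]
# Prints the nameserver addresses from a resolv.conf-style file.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# uses_magicdns [FILE]
# Succeeds if the Tailscale resolver address is listed directly.
uses_magicdns() {
  list_nameservers "${1:-/etc/resolv.conf}" | grep -qx '100\.100\.100\.100'
}

# resolves NAME
# Succeeds if the system resolver returns an address for NAME.
resolves() {
  getent hosts "$1" >/dev/null
}

# Example use in a workflow step after the Tailscale connection:
#   list_nameservers
#   uses_magicdns && echo "resolv.conf points at Tailscale DNS"
#   resolves github.com || echo "public DNS lookup failed"
```

Comparing the output of the same step with and without the Tailscale action in the job would show whether the resolver list changes and which lookups start failing.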

DentonGentry commented 2 years ago

Just because it came up recently, another issue which has come up about Magic DNS in container environments is when ip6tables is missing: https://github.com/gitpod-io/gitpod/issues/8049

I didn't catch the previous updates in time, but if you have a github-action runner which recently completed and the ephemeral node hasn't been cleaned up I can look at what it says about errors in setting up DNS.
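
A quick way to rule out the missing-ip6tables case mentioned above is to check for the binaries in a workflow step. A small sketch, assuming a POSIX-style shell on the runner; `have_cmd` and `report_tools` are illustrative names, and the presence of the binary does not by itself prove that the kernel modules it needs are loaded:

```shell
#!/usr/bin/env bash
# Sketch: report whether firewall tools are on PATH.
# have_cmd and report_tools are hypothetical helper names.

# have_cmd NAME: succeeds if NAME is found on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# report_tools NAME...: prints one status line per tool.
report_tools() {
  local tool
  for tool in "$@"; do
    if have_cmd "$tool"; then
      echo "$tool: present"
    else
      echo "$tool: MISSING"
    fi
  done
}

# Example use in a workflow step:
#   report_tools iptables ip6tables
```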

justin-pierce commented 2 years ago

@paulpet my issue seems to be https://github.com/tailscale/github-action/issues/40 -- disabling manual device auth worked for me as a temporary workaround.

paulpet commented 2 years ago

> @paulpet my issue seems to be #40 -- disabling manual device auth worked for me as a temporary workaround.

Fantastic @justin-pierce, this worked for me as well. I've spent a ridiculous amount of hours trying to figure this out.

@DentonGentry looks to be a tailscale backend issue which I hope can be resolved soon, so I am going to close this issue

paulpet commented 2 years ago

> @DentonGentry looks to be a tailscale backend issue which I hope can be resolved soon, so I am going to close this issue

I think I spoke too soon. I can't seem to replicate the issue on stand-alone VMs, only on GitHub runners using the Tailscale action, so I will re-open this until things are confirmed.

jwhited commented 2 years ago

Thanks for reporting, this was a duplicate of #40, which is now resolved.