How to avoid the auto-generated numbered suffix in the node name (e.g. `mynode-1`)

thenbe commented 3 weeks ago

Hi. I'm using this plugin in a CI pipeline, where it creates nodes in the tailnet for per-PR preview environments. Sometimes, while iterating on the project, the underlying VPS is destroyed and immediately recreated. This results in undesirable behavior where node names are automatically suffixed with an integer:

Tailnet (before):
  - myvps       (online)
  - mycaddynode (online)

Tailnet (after):
  - myvps         (offline)
  - myvps-1       (online)
  - mycaddynode   (offline)
  - mycaddynode-1 (online)

This breaks the project's naming scheme. To fix it, I log in to the Tailscale admin UI, remove mycaddynode from the tailnet, then rename mycaddynode-1 to mycaddynode. The fix is simple, but it gets tedious because I first have to notice that this has occurred at all.

Therefore, I'd like to automate the fix. After searching around for a bit, I found that tailscaled's --state=mem: flag essentially works around this auto-suffix naming issue.

After trying it, --state=mem: does indeed fix the issue, but only for "regular" tailscale nodes and not for nodes that are generated by this caddy-tailscale plugin:

Tailnet (before):
  - myvps       (online)
  - mycaddynode (online)

Tailnet (after):
  - myvps         (online)
  - mycaddynode   (offline)
  - mycaddynode-1 (online)

Is there a way to get the same behavior for the "caddy nodes" too?

Workarounds

For now, I think my best bet is to write a script that uses Tailscale's REST API to remove the stale offline nodes and rename the replacement nodes (POST /device/{deviceId}/name), e.g. mycaddynode-1 -> mycaddynode.
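
The decision step of such a script can be sketched as below. This is only a minimal illustration of the planning logic, not a tested integration: the function name `plan_fixes` and the record keys `id`, `name`, and `online` are assumptions made for the sketch, so the real API response would need to be mapped into this shape first, and the network calls themselves are left out.

```python
import re

# Hypothetical device records: the keys "id", "name" and "online" are
# assumptions for this sketch, not the exact fields of the Tailscale API.
SUFFIX = re.compile(r"^(?P<base>.+)-(?P<n>\d+)$")


def plan_fixes(devices):
    """Return (ids_to_delete, renames) for auto-suffixed replacement nodes.

    An online node such as "mycaddynode-1" is renamed to its base name only
    if a node currently holding that base name exists and is offline.
    """
    by_name = {d["name"]: d for d in devices}
    to_delete, renames = [], []
    for dev in devices:
        match = SUFFIX.match(dev["name"])
        if not match or not dev["online"]:
            continue
        stale = by_name.get(match["base"])
        if stale is None or stale["online"]:
            continue  # base name is unused or still in use: leave it alone
        to_delete.append(stale["id"])
        renames.append((dev["id"], match["base"]))
    return to_delete, renames


devices = [
    {"id": "a", "name": "myvps", "online": True},
    {"id": "b", "name": "mycaddynode", "online": False},
    {"id": "c", "name": "mycaddynode-1", "online": True},
]
print(plan_fixes(devices))  # (['b'], [('c', 'mycaddynode')])
```

The stale node would have to be deleted before the rename is attempted, since the base name is otherwise still taken. Note that a node whose intended name legitimately ends in a number (for example one derived from a branch called fix-bug-2) also matches the suffix pattern, so a real script would probably want to restrict itself to a known list of expected names.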

willnorris commented 3 weeks ago

We have a state_dir config option, but it may not work with a mem: value... I've never tried. But I think that would be a reasonable feature to add.

That said, I'm curious about why the specific device name matters. If it's related to tailnet ACLs, have you looked at tagging the nodes and using those tags for ACLs? Or is there some other reason the specific device name matters, particularly in a CI environment?

thenbe commented 3 weeks ago

> We have a state_dir config option, but it may not work with a mem: value

I'll give it a shot and report back with the results.

> why the specific device name matters

The tailscale machine names are significant because of the declarative approach (nix) I'm following in my project. One prerequisite for this approach is that we need to know, at build time (typically in CI), what the final tailscale node machine names will be.

Example

To illustrate with an example, say the project exposes 3 services that need to be consumed by end-users (other tailnet members).

{
  tailscale {
    ephemeral
  }
}

https://{$SITE_TS_NODE}.tail123.ts.net {
  bind tailscale/{$SITE_TS_NODE}
  respond "hello from site"
}

https://{$GRAFANA_TS_NODE}.tail123.ts.net {
  bind tailscale/{$GRAFANA_TS_NODE}
  respond "hello from grafana"
}

https://{$LOKI_TS_NODE}.tail123.ts.net {
  bind tailscale/{$LOKI_TS_NODE}
  respond "hello from loki"
}

For each PR, a separate instance is deployed. To namespace the tailscale nodes, every *_TS_NODE environment variable is suffixed with the git branch name.

So if a PR is opened for a git branch called fix-some-bug, the placeholders in the Caddyfile will be populated with these values:

$LOKI_TS_NODE: "loki-fix-some-bug"
$SITE_TS_NODE: "site-fix-some-bug"
$GRAFANA_TS_NODE: "grafana-fix-some-bug"
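
As a rough sketch of how these values are derived from the branch name (the helper name `node_names` and its default service list are hypothetical and only illustrate the scheme; the actual project does this in nix):

```python
def node_names(branch, services=("site", "grafana", "loki")):
    """Map each *_TS_NODE variable to '<service>-<branch>'."""
    return {f"{s.upper()}_TS_NODE": f"{s}-{branch}" for s in services}


names = node_names("fix-some-bug")
print(names["GRAFANA_TS_NODE"])  # grafana-fix-some-bug
```

Because the mapping depends only on the branch name, every consumer can compute the same names at build time, which is exactly what breaks when a -1 suffix is added later.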

Now that we can rely on a deterministic naming scheme, we can use nix to define our development environments and infrastructure. This then unlocks some pretty cool stuff, including the ability to:

  - Enter shells that automatically point towards the correct target environment, where the environment variables are appended with the correct suffix (e.g. -fix-some-bug).
  - Define our VPS machine configuration(s) with all the dependencies they require (such as Grafana or Loki), then breathe life into them with nixos-anywhere.

Issue

Say we wanted our site to display a clickable link to our grafana service. The source code would look like this (using svelte template or similar):

<a href="https://{$GRAFANA_TS_NODE}.{$TAILNET_NAME}">
  Go to Grafana
</a>

And the compiled html would look like this:

<a href="https://grafana-fix-some-bug.tail123.ts.net">
  Go to Grafana
</a>

When our tailscale node name has -1 appended to it (as described in the OP), this anchor will not navigate to our grafana instance until we manually change the grafana tailscale node's machine name from grafana-fix-some-bug-1 to grafana-fix-some-bug.

In a similar fashion, the VPS machine we deploy may include a system environment variable LOKI_TS_NODE=loki-fix-some-bug, where it may be consumed by other services. Those services will break if a -1 suffix is added to the machine name later on by tailscale.

nixosConfigurations = {
  vps = nixpkgs.lib.nixosSystem {
    modules = [
      {

        environment.variables = {
          SITE_TS_NODE = "site-${GIT_BRANCH}";
          GRAFANA_TS_NODE = "grafana-${GIT_BRANCH}";
          # ...
        };

        # This grafana systemd service will not be able to reach loki if
        # the loki machine name ends up being `loki-fix-some-bug-1`
        systemd.user.services.grafana = {
          wantedBy = [ "multi-user.target" ];
          serviceConfig = {
            Restart = "on-failure";
            ExecStart = "start-grafana.sh";
          };
          environment = {
            LOKI_TS_NODE = "loki-${GIT_BRANCH}";
          };
        };

      }
    ];
  };
};

thenbe commented 3 weeks ago

> We have a state_dir config option, but it may not work with a mem: value

> I'll give it a shot and report back with the results.

It indeed does not work. When setting state_dir to mem:, it tries (and fails) to create a directory called mem:.

# Caddyfile
{
    tailscale {
        ephemeral
        state_dir mem:
    }
}

caddy[1354]: {"level":"debug","msg":"getting listener from plugin","network":"tailscale"}
caddy[1354]: {"level":"info","logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance"}
caddy[1354]: Error: loading initial config: loading new config: http app module: start: listening on tailscale/loki-ts-state-mem:443: mkdir mem:: permission denied
systemd[1]: caddy.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: caddy.service: Failed with result 'exit-code'.

rhardih commented 1 week ago

I have the exact same issue, although for a slightly different use-case. I'm self-hosting a number of services on a NAS device, and whenever I re-deploy a new configuration to caddy-tailscale (e.g. when a new service needs to be reverse proxied), all the tailnet DNS names change and I have to go and rename or remove the ones that no longer work.

Being able to re-up caddy-tailscale and have it attach to existing machines for known entries, creating new ones only for first-time entries, would definitely solve this problem.

rhardih commented 6 days ago

I think I managed to solve my issue by simply persisting the state. I added this directive to my Caddyfile:

  tailscale {
    webui false
    state_dir "/caddy-state"
  }

And then mounted a persistent docker volume to /caddy-state.

Initially I just ran a script to remove the machines via API calls during deploy, but that no longer seems to be needed.
