nix-community / infra

nix-community infrastructure [maintainer=@zowoq]
https://nix-community.org
MIT License

Add GitHub Action runners to builder #422

Closed zimbatm closed 1 year ago

zimbatm commented 1 year ago

We're using https://github.com/numtide/srvos/blob/master/roles/github-actions-runner.nix in a number of places now, maybe the community could also benefit from having faster CI by pushing the builds to permanent machines?

Eg: https://github.com/nix-community/nix-vscode-extensions/issues/4
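For reference, the underlying NixOS module is roughly this shape; a minimal sketch, where the repo URL, token path and labels are placeholders and srvos layers its own hardening on top of something like this:

```nix
# Minimal sketch of a self-hosted runner via the nixpkgs `services.github-runners` module.
# All values below are placeholders, not an actual config.
services.github-runners.nix-community-example = {
  enable = true;
  url = "https://github.com/nix-community/some-repo";   # placeholder repo
  tokenFile = "/run/secrets/github-runner-token";       # registration token kept out of the store
  ephemeral = true;                                      # fresh runner state for every job
  extraLabels = [ "nix" "self-hosted-linux" ];
};
```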

zowoq commented 1 year ago

https://docs.github.com/en/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security

We recommend that you only use self-hosted runners with private repositories. This is because forks of your public repository can potentially run dangerous code on your self-hosted runner machine by creating a pull request that executes the code in a workflow.

Honestly don't think running our own is a good idea, they are a way bigger surface than any other CI we could run.

zimbatm commented 1 year ago

I think we can make it reasonably secure with https://github.com/numtide/srvos/issues/50

zowoq commented 1 year ago

Sorry but I guess I have a strong opinion on this, very uncomfortable with us hosting github actions runners vs. probably anything else.

zimbatm commented 1 year ago

Maybe we should talk about our security posture as a whole. Here are some things that I see:

a. Hydra is running as root. I wouldn't be surprised if the Perl code had a bunch of security holes allowing escalation, given that nobody is super fluent in it.
b. The Hercules and Hydra Nix builds are running arbitrary code. With FODs, it's easy to poke at the whole system as a build user.
c. All the repos have access to the Cachix write token. That makes it easy to poison the cache.
d. Systemd units can be hardened quite a bit, further than Nix sandboxed builds (sketched below).

So we already rely on trust quite a bit. Most of our system is one kernel elevation exploit away from being pwned. (as is the main Hydra by the way).

We could eliminate (c) by forcing users that want to publish to the shared cache to use our hosted GitHub Actions runner.

I don't have a really good answer for (a) and (b). Maybe there are some mitigations that we can put in place?
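To make (d) concrete, this is the kind of systemd sandboxing I mean; an illustrative sketch only, the unit name and the exact set of options are not our actual config:

```nix
# Illustrative systemd hardening for a runner-style unit (placeholder unit name).
systemd.services."github-runner-example".serviceConfig = {
  DynamicUser = true;
  NoNewPrivileges = true;
  PrivateTmp = true;
  PrivateDevices = true;
  ProtectSystem = "strict";
  ProtectHome = true;
  ProtectKernelModules = true;
  ProtectKernelTunables = true;
  ProtectControlGroups = true;
  RestrictNamespaces = true;
  RestrictAddressFamilies = [ "AF_INET" "AF_INET6" "AF_UNIX" ];
  SystemCallFilter = [ "@system-service" ];
};
```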

zowoq commented 1 year ago

Maybe we should talk about our security posture as a whole.

Yeah the current situation probably isn't the best but I see self hosting actions as biting off more than we can chew.

Hydra

IIRC we started with nixpkgs hydra, switched to the upstream hydra flake, switched back to nixpkgs hydra. At the moment we're running it for three projects, hard to say that it's worth the hassle. I'd just drop hydra in favour of buildbot and do what we can to help those projects move over.

We could eliminate (c) by forcing users that want to publish to the shared cache to use our hosted GitHub Actions runner.

Afraid I don't understand this?

zimbatm commented 1 year ago

I see self hosting actions as biting off more than we can chew.

Can you expand a bit on the issue relative to the other options? Is it the maintenance overhead?

zimbatm commented 1 year ago

Afraid I don't understand this?

We can have the node set up with cachix watch-store so only the system has access to the push token. That means we can decommission the org-wide secret and use the self-hosted runners instead.
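Roughly something like this on the runner host; a sketch, where the secret path and cache name are placeholders:

```nix
# Sketch: push store paths from a system-level service so CI jobs never see the token.
systemd.services.cachix-watch-store = {
  wantedBy = [ "multi-user.target" ];
  after = [ "network-online.target" ];
  serviceConfig = {
    # File containing CACHIX_AUTH_TOKEN=..., readable only by this unit.
    EnvironmentFile = "/run/secrets/cachix-token";
    ExecStart = "${pkgs.cachix}/bin/cachix watch-store nix-community";
    Restart = "always";
  };
};
```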

Mic92 commented 1 year ago

I have to agree with both sides on some points.

Github Actions

I think some further hardening of the github actions runner service is required to bring it to the same level as our sandboxed hercules builds (https://github.com/numtide/srvos/issues/50). After that I don't think it is inherently more insecure than our nix builds.

Hydra

We currently do not build hydra pull requests, which limits the attack surface to people who have been given explicit access to a repo. I do not see anything running as root though; this might have been the case historically. There could probably be more hardening done for individual services though.

Cachix key

However, after that, removing the cachix key from our github org would give us some security benefits (though this might be a regression for macOS support, where we do not have any builder).

FOD

The network access for fixed-output derivations is far from ideal. I don't think it could actually be used to compromise other builds, but it could be used to send out spam or DDoS other services. I think as a first step we should configure something like squid to at least filter some tcp ports - I don't think we need udp support at all for our fetchers.
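Something along these lines, assuming the NixOS squid module; the `extraConfig` option name and the port list are assumptions, not a tested setup:

```nix
# Sketch: an egress proxy that only allows the TCP ports our fetchers need.
services.squid = {
  enable = true;
  extraConfig = ''
    acl fetcher_ports port 443   # https
    acl fetcher_ports port 80    # plain http tarballs
    acl fetcher_ports port 22    # fetchgit over ssh, if we need it
    http_access deny !fetcher_ports
    http_access allow localhost
    http_access deny all
  '';
};
```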

zimbatm commented 1 year ago

A sub-point in that security posture is that a user who is given "trusted-user" can probably escalate access because they can run arbitrary post-build-hooks as the nix-daemon user (root). Not sure if Hydra is given that access?

zowoq commented 1 year ago

Can you expand a bit on the issue relative to the other options? Is it the maintenance overhead?

Yeah, it seems rather complicated and fragile with a dependency tree unlike anything else we're running. Also unlike hercules and hydra wouldn't we be allowing PR builds from anyone on our infra? If so that doesn't seem like a good idea.

We can have the node set up with cachix watch-store so only the system has access to the push token. That means we can decommission the org-wide secret and use the self-hosted runners instead.

Isn't cache poisoning still an issue with this? (same as the other CI systems I guess except it builds PRs and we'd be maintaining it instead of github?)

zowoq commented 1 year ago

Should we look at isolating the CI systems from the host's nix store and hosting separate caches for projects instead of everyone using one big cache?

Mic92 commented 1 year ago

Isn't cache poisoning still an issue with this? (same as the other CI systems I guess except it builds PRs and we'd be maintaining it instead of github?)

If you do not allow users to control the nix daemon that is used to build packages, they can only send derivations, and the result is uploaded directly to the binary cache. This is different from how actions currently work, where they could modify packages and then push them to the cache from github actions.

Mic92 commented 1 year ago

Should we look at isolating the CI systems from the host's nix store and hosting separate caches for projects instead of everyone using one big cache?

Isolating the host store from the CI store might be worthwhile but probably requires a bit of trial and error (maybe a containerized nix-daemon?). At least in theory I don't see how cache poisoning can be done without an exploit. It would be great if we could somehow keep one cache because it makes us more efficient as a community, even if we have to put more thought into how to design it in a secure way.

zowoq commented 1 year ago

Isolating the host store from the CI store might be worthwhile but probably requires a bit of trial and error (maybe a containerized nix-daemon?)

I've thought about something like this a couple of times, might be interesting to pursue.

somehow keep one cache

https://github.com/zhaofengli/attic might be something we could use when it matures.


So the token should really be removed but I don't think that we should self host github runners.

For cached builds we already have hercules and hydra, we could set up buildkite and watch-store if we want to offer an alternative system that isn't nix specific. I assume that eventually we'll be running buildbot here as well.

We can't really do anything about caching darwin builds without setting up hardware, and at the moment there don't seem to be darwin equivalents of our current methods of managing deployment, secrets, etc, so we'd be managing it manually anyway.

I wouldn't say it's a good solution but I don't really see an issue with projects that want to stay on actions needing to use their own cachix cache.

Perhaps as a compromise we could have another untrusted shared cache just for actions, either cachix or something we host ourselves?

zimbatm commented 1 year ago

Aside from the runner discussion, the architecture that I found works best with Nix is to have a central machine, with remote builders attached to it. Everybody should be on the same network because the nix daemon protocol is sensitive to latency. The remote builders have a watch-store setup to replicate their content to the binary cache. It also protects the remote macOS runners a bit because only the nix-daemon interacts with them.
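In NixOS terms that topology looks roughly like this on the central machine (host names, key paths and job counts are placeholders), with each builder additionally running cachix watch-store:

```nix
# Sketch: the central host dispatches builds to remote builders over ssh.
nix.distributedBuilds = true;
nix.buildMachines = [
  {
    hostName = "build01.example.org";      # placeholder Linux builder
    system = "x86_64-linux";
    sshUser = "nix";
    sshKey = "/etc/nix/builder_ed25519";
    maxJobs = 16;
    supportedFeatures = [ "big-parallel" "kvm" "nixos-test" ];
  }
  {
    hostName = "mac01.example.org";        # placeholder macOS builder
    system = "aarch64-darwin";
    sshUser = "nix";
    sshKey = "/etc/nix/builder_ed25519";
    maxJobs = 4;
  }
];
```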

So my idea was that, in the short term, we put the GitHub Actions runners, Hydra and Hercules agents on that central host and let it dispatch the build jobs to the remote builders. Remove the shared Cachix auth token from GitHub so we get this quick win.

I think we also want to build a long-term plan but I don't know what it looks like yet exactly. It probably involves hardening Nix itself or wrapping it in sandboxes.

So back to the initial topic, I don't understand why Buildkite is OK, but GitHub Actions isn't. Both are packaged and work fine. GitHub Actions has a bit more security because we're able to hide the join token, and I believe that it also benefits from GitHub's SRE proactive measures to combat abuse. Both are proprietary. Buildkite started restricting their free plan. And GitHub Actions integrates the best with GitHub of course.

This is what I see, but maybe I'm missing some data?

zowoq commented 1 year ago

So my idea was that, in the short term, we put the GitHub Actions runners, Hydra and Hercules agents on that central host and let it dispatch the build jobs to the remote builders. Remove the shared Cachix auth token from GitHub so we get this quick win.

The token needs to be removed but that does not mean that we have to have a self hosted actions runner.

It is an ongoing maintenance and security burden that I don't think is really comparable to anything else we're running; it's a moving target, and I don't think it's much of a stretch to imagine we end up having to maintain it in nixpkgs if it gets neglected.

I think we also want to build a long-term plan but I don't know what it looks like yet exactly.

Once we start with self hosted actions I think we're basically making a long term commitment to the org.

I don't understand why Buildkite is OK, but GitHub Actions isn't. Both are packaged and work fine.

Buildkite module, packaging, etc is very simple compared to actions runner.

Buildkite started restricting their free plan

Ah, I wasn't aware of this, I wouldn't have suggested it.

Hercules agents on that central host and let it dispatch the build jobs to the remote builders

Does hercules support this?

zimbatm commented 1 year ago

Sorry but I guess I have a strong opinion on this, very uncomfortable with us hosting github actions runners vs. probably anything else.

I'm sorry, but this is not how we generate consensus or generally should be operating in this project. Everybody has opinions. The way to generate a shared understanding is to explain your position. Instead, I had to work hard here to get the information out, and it's still not super clear what the issue is.

At first, I thought it was a security concern, or an open source reason, but apparently, it's related to the packaging. I agree that the package is more complicated than, for example, BuildKite. But then buildbot is fine, which has tens of thousands more LOC. I don't know how you weigh those things against each other.

We don't have to use GitHub Actions. I'm suggesting it because we have a working module that has been secured reasonably well and will be maintained independently. It's also fine if we disagree. What I don't want is to be beholden to opinions and feelings if they are not accompanied by a rational discussion.

zowoq commented 1 year ago

I'm sorry, but this is not how we generate consensus or generally should be operating in this project. Everybody has opinions. The way to generate a shared understanding is to explain your position. Instead, I had to work hard here to get the information out, and it's still not super clear what the issue is.

At first, I thought it was a security concern, or an open source reason, but apparently, it's related to the packaging.

What I don't want is to be beholden to opinions and feelings if they are not accompanied by a rational discussion.

I agree that my comment you're quoting wasn't helpful, but this seems a bit out of proportion? I can have multiple concerns, and I had mentioned them previously:

Can you expand a bit on the issue relative to the other options? Is it the maintenance overhead?

Yeah, it seems rather complicated and fragile with a dependency tree unlike anything else we're running.

Also from the same comment as above this security concern doesn't seem to have been addressed yet:

Also unlike hercules and hydra wouldn't we be allowing PR builds from anyone on our infra?


I agree that the package is more complicated than, for example, BuildKite. But then buildbot is fine, which has tens of thousands more LOC. I don't know how you weigh those things against each other.

I'm suggesting it because we have a working module that has been secured reasonably well and will be maintained independently.

I didn't mention LOC? The packaging and module for buildbot (and hercules/hydra as well really) are also fairly simple compared to github actions.

The other CI systems don't have the 30 day update window that github has for the runner, which is what I meant by "moving target" (and which doesn't seem to have been mentioned previously), and I don't think github cares about how complicated it is for us to maintain.

I think we can say with a reasonable level of certainty that buildbot, hercules and hydra will still be functional on nixos 3/6/12 months from now; I don't see that we can say the same about the github runner.

zimbatm commented 1 year ago

Alright, I think I made my point.

Regarding security, the attack surface is smaller than FOD builds. GHA is running in a systemd unit that is more sandboxed than a FOD build.

The GHA runners are regularly updated in nixpkgs, with five maintainers in total. They are also getting exercised quite a bit as several customers use them. I understand the instinctive reaction of thinking this is bad, but if I look at the facts, it's really not that bad.

With something like Buildbot and Hydra we have an additional DB to manage. There is also a lot more code to run in total, and more that can go wrong. There are new attack surfaces in the UI frontend bits. I think Buildkite was already put aside. So that leaves us with Hercules CI, Garnix and GHA.

zowoq commented 1 year ago

Alright, I think I made my point.

I don't understand?

Regarding security, the attack surface is smaller than FOD builds. GHA is running in a systemd unit that is more sandboxed than a FOD build.

So we're okay with running unreviewed PRs on our own hardware?

The GHA runners are regularly updated in nixpkgs, with five maintainers in total.

Doesn't mean that it'll always be updated in the 30 day window or that it'll still even work on NixOS in the future?

They are also getting exercised quite a bit as several customers use them.

What are these customers' expectations regarding GHA?

I understand the instinctive reaction of thinking this is bad, but if I look at the facts, it's really not that bad.

I don't understand this either?

So that leaves us with Hercules CI, Garnix and GHA.

This seems to be the first time Garnix has been mentioned?

I'm not familiar with it beyond seeing it on a couple of my PRs against Numtide repos; looks okay, I guess? It still mentions that it is "beta", but I'm not sure what that means?


I'll try explaining my view another way:

If we're going to start pushing the org onto self hosted GHA (so they can still have cached builds and we can remove the shared token), I want to be able to say to the org that it will actually be a reliable service.

I don't see that I can say that, as we've no guarantee that we get the updates done inside the 30 day window or that we can even get the updates to build and run at all with NixOS/Nixpkgs.

We're basically just hoping that github doesn't screw us?

I am looking at whether we can use the upstream binary via some workaround (fhsenv, container, etc.), which may mean this becomes less of an issue, but so far all I've done is skim through some stuff.
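The rough shape of the fhsenv idea, to make it concrete; the package list and the runner path are guesses from skimming, not something I've tested:

```nix
# Sketch: wrap the upstream actions/runner release tarball in an FHS environment.
pkgs.buildFHSUserEnv {
  name = "github-runner-upstream";
  targetPkgs = pkgs: with pkgs; [
    # guessed runtime deps of the upstream .NET-based runner
    icu openssl zlib krb5 lttng-ust curl git
  ];
  # run.sh from the unpacked upstream release, placed at a conventional (placeholder) path
  runScript = "/opt/actions-runner/run.sh";
}
```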

zimbatm commented 1 year ago

I don't understand where the fear and uncertainty are coming from. We depend on GitHub not to screw us on so many dimensions, and the platform has been stable for us, hasn't it? We have multiple maintainers of the package that can react to the 30 day window; it's used in production. FODs are already running arbitrary code on our machines.

That being said, I will stop pushing for this. One thing I 100% agree on is that the infra team should not take on more infrastructure than they can manage.