nix-community / infra

nix-community infrastructure [maintainer=@zowoq]
https://nix-community.org
MIT License

Hydra & Hercules-CI resources for RISC-V projects #715

Open RaitoBezarius opened 1 year ago

RaitoBezarius commented 1 year ago

Hello there,

Together with @0x4A6F, and after discussing with others on the Exotic Nix channel (https://matrix.to/#/#exotic:nixos.org), we (channel: https://matrix.to/#/#nixos-on-risc-v:matrix.org) want to organize and coordinate efforts towards RISC-V support in nixpkgs.

For this, we would like to pool the build capacity that individual Nix community members own and make it available as a set of remote builders for the nix-community project, in two forms for now:

Usually, with other architectures, it is possible to buy a big machine and have the nix-community system administrators manage it. In this situation, as RISC-V is still early and SG2042 CPUs are not necessarily cost-efficient yet, we would like to start right away with a different model, where users bring their own build capacity and together form a larger pool spread all over the world.
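As a rough illustration of what pooling contributed builders could look like on the client side, here is a minimal Python sketch that renders entries in the whitespace-separated Nix machines-file format (store URI, systems, SSH key, max jobs, speed factor, supported features). The host names, the `nix` user, and the `Builder` helper are hypothetical and not part of any actual nix-community setup.

```python
from dataclasses import dataclass


@dataclass
class Builder:
    """One contributed remote builder (all values here are hypothetical)."""
    host: str
    user: str = "nix"
    systems: tuple = ("riscv64-linux",)
    ssh_key: str = "-"       # "-" leaves the field unset
    max_jobs: int = 1
    speed_factor: int = 1
    features: tuple = ()     # e.g. ("big-parallel",)


def machines_line(b: Builder) -> str:
    """Render one builder as a line of a Nix machines file."""
    fields = [
        f"ssh://{b.user}@{b.host}",
        ",".join(b.systems),
        b.ssh_key,
        str(b.max_jobs),
        str(b.speed_factor),
        ",".join(b.features) or "-",
    ]
    return " ".join(fields)


def machines_file(builders) -> str:
    """Join all builders into the text of a machines file."""
    return "\n".join(machines_line(b) for b in builders) + "\n"


if __name__ == "__main__":
    pool = [
        Builder(host="vf2.example.org", max_jobs=4),
        Builder(host="sg2042.example.net", max_jobs=32, speed_factor=4,
                features=("big-parallel",)),
    ]
    print(machines_file(pool), end="")
```

A list like this only describes capacity; it says nothing about whether a given builder is trustworthy or currently online, which is what the questions below are about.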

This brings a new set of headaches regarding trust and security, given that the owners of that build capacity can probably stay root and can be anyone, so here are my open questions:

(1) What should be the threat model for end users of that build capacity? What should be communicated?

(2) Should a build capacity accepted by nix-community be configurable only by nix-community sysadmins, thereby stripping its owners of access until they factory reset the physical machine?

(3) How to convey the expectations of cheap remote builder capacity that can go down at any time?

(4) Should those remote builders be subject to a vetting process similar to being accepted to a community builder?

(5) Should those remote builders be subject to a strict NixOS expression? Given that RISC-V is an early platform, it would require sysadmins to react in a timely fashion to follow one of the big developments like https://github.com/zhaofengli/nixos-riscv64 or https://github.com/misuzu/nixos-vf2, potentially sending them together into nixos-hardware or something like that.

Personally, I would find it interesting to work with anyone, without a complicated trust vetting process, and offer this as capacity, because then we can focus on the hard problems. nix-community seems an interesting hub for such things, but maybe this should be advertised as an external initiative, as the management issues are difficult to overcome?

Also, I note that if we had a model where owners cannot access their own machines, it's unclear how the infra team could repair those machines, which usually have no serial access; only their owners can intervene.

zowoq commented 1 year ago

At this stage I don't think integrating it into this repo and our existing CI is a good idea. I think it would be easier as an external initiative so the interested parties can organise themselves.

We could still organise under the nix-community org, but that may be problematic for Hercules: AFAIK we can't have separate pools of agents under the same GitHub org.

RaitoBezarius commented 1 year ago

Alright, we will do that then :-).

zimbatm commented 1 year ago

It would be cool to offer more architectures for Hercules and Hydra if we can. The main question is how to organize ourselves so that the coordination overhead stays minimal.

One way would be if the hardware was "donated" to the nix-community org. Meaning that we would fully manage the configuration on the machine. The machine should also be running at a reliable location because the nix remote builder protocol doesn't deal with downtime very well. Ideally, we would have a point of contact to handle manual interventions.

zowoq commented 1 year ago

> Meaning that we would fully manage the configuration on the machine.

If this were to happen, who's volunteering to be responsible for keeping these machines patched?

zimbatm commented 1 year ago

If we can get to a point where there is a nixos-unstable-riscv channel that is up to date, it wouldn't be more work for us. But I don't know how far along the RISC-V port is. It's a bit of a chicken-and-egg problem.

zowoq commented 1 year ago

> If we can get to a point where there is a nixos-unstable-riscv channel that is up to date, it wouldn't be more work for us. But I don't know how far along the RISC-V port is. It's a bit of a chicken-and-egg problem.

We can't leave outdated machines running indefinitely. If the "community" doesn't keep up with maintenance on the riscv platform, it'll end up being the nix-community admins who have to deal with it themselves.

To move forward with this I'd want at least one admin to take personal responsibility for it, i.e. they would deal with organising this platform and handling ongoing maintenance so the other admins don't need to.

zimbatm commented 1 year ago

Sounds good, I am OK taking up that role. I agree that we want nixpkgs to be in a relatively stable state before adding the builders to the infrastructure.

zowoq commented 1 year ago

Just to be really clear about my position:

I'm not willing to accept any responsibility for fixing non-trivial issues with riscv.

I expect the admin who takes responsibility for this will be responsive and fairly quick to deal with any issues we have, either fixing it themselves or getting someone from the "community" to fix it.

I think our other machines and services should always take priority over riscv.

Mic92 commented 1 year ago

I am maintaining a riscv64 machine in our university cluster. I cross-compile from x86_64 Linux; my last patches were upstreamed during the 22.09 release and there have been no regressions since then. It works surprisingly well given how niche the architecture seems. A lot of companies and enthusiasts seem to fix stuff everywhere.

Mic92 commented 1 year ago

> Just to be really clear about my position:
>
> I'm not willing to accept any responsibility for fixing non-trivial issues with riscv.
>
> I expect the admin who takes responsibility for this will be responsive and fairly quick to deal with any issues we have, either fixing it themselves or getting someone from the "community" to fix it.
>
> I think our other machines and services should always take priority over riscv.

If you don't have the capacity to maintain this, we should still allow someone else to step up and provide a hercules-ci machine for nix-community. There could potentially even be a separate nixpkgs channel in case there is a blockage.

zowoq commented 1 year ago

> If you don't have the capacity to maintain this, we should still allow someone else to step up and allow them to provide a hercules-ci machine for nix-community.

I think you've misunderstood my point.

In the last two months we've expanded a fair bit: we've gone from four to eight machines, two of which are a different platform (nix-darwin/aarch64-darwin), we've added a public-facing service (lemmy), and we are also proposing to add another aarch64-linux machine.

I doubt that we (the nix-community admins) have the capacity to maintain riscv. I'm willing to give it a go provided someone is willing to take responsibility for it. To put it simply: I'm not going to be left holding the bag.