nix-community / infra

nix-community infrastructure [maintainer=@zowoq]
https://nix-community.org
MIT License

Policy for third-party hardware donation #1343

Open zimbatm opened 3 days ago

zimbatm commented 3 days ago

Sometimes it's easier for organizations or individuals to lend out hardware (rather than donate money through Open Collective). There is an opportunity to gain access to additional build capacity and to different kinds of hardware (e.g., GPU, RISC-V, MIPS, ...).

Before pursuing this, let's discuss what that would look like.

What are the requirements on our side?

Some threads:

zowoq commented 3 days ago

Anything exotic will be a problem for the Hercules agent as it is written in Haskell. Using such a machine just as a remote builder for buildbot/hydra should mean the cache key isn't an issue?
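
For context, a minimal sketch of what that could look like on the coordinating NixOS host, assuming a hypothetical donated RISC-V box (hostname, user and key path are placeholders): the exotic machine only needs sshd and the Nix daemon, no Hercules agent running on it.

```nix
{
  nix.distributedBuilds = true;

  nix.buildMachines = [
    {
      # Hypothetical donated RISC-V machine used purely as a remote builder.
      hostName = "donated-riscv.example.org";
      sshUser = "nixbuilder";
      sshKey = "/etc/nix/id_donated_riscv";
      system = "riscv64-linux";
      maxJobs = 4;
      speedFactor = 1;
      supportedFeatures = [ "big-parallel" ];
    }
  ];
}
```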

Mic92 commented 3 days ago

We would still need to trust the build results.

zowoq commented 2 days ago

We would still need to trust the build results.

I don't understand your point? Isn't trusting the build results a given?

Mic92 commented 2 days ago

We would still need to trust the build results.

I don't understand your point? Isn't trusting the build results a given?

I think we should communicate how builders for different architectures are secured, e.g. Hetzner will have safer access policies than machines in someone's basement. Then users can decide whether they are OK with this.

zimbatm commented 2 days ago

Remote builders sound good.

One requirement could be that we are the only admins on the machine. It doesn't prevent physical tampering but reduces the attack surface if the host provider gets hacked.
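
A rough sketch of what that requirement could look like on a donated NixOS machine (the account name and key are placeholders, not anything we actually deploy): the admin set is pinned declaratively, so the hardware owner keeps no login of their own.

```nix
{
  # Only accounts declared in the configuration exist on the machine;
  # the owner can't add ad-hoc users outside of our configuration.
  users.mutableUsers = false;

  # Placeholder admin account for the nix-community team.
  users.users.community-admin = {
    isNormalUser = true;
    extraGroups = [ "wheel" ];
    openssh.authorizedKeys.keys = [
      "ssh-ed25519 AAAA...placeholder"
    ];
  };

  # Key-based SSH only, no direct root logins.
  services.openssh.enable = true;
  services.openssh.settings = {
    PasswordAuthentication = false;
    PermitRootLogin = "no";
  };
}
```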

ConnorBaker commented 2 days ago

I’m not familiar with how the infrastructure you’ve set up for builds vs. caching works, but one of the concerns I’ve had when trying to stand up infrastructure for consistently building CUDA packages and serving a binary cache is that there can be a lot of traffic between nodes. It was enough of a bottleneck between the three desktops in my basement that I moved over to 10GbE networking for everything, and I’m still saturating it.

I don’t know if the remote build protocol takes into account closure size or data locality when deciding which machines should build different things, but there can be a lot of movement on the network, which can be a bottleneck or, in the case of cloud providers, a hefty egress fee.

So a couple of questions from me:

  1. Do the machines you’ve set up for Hydra also serve as a cache? If so, have you run into issues with the amount of egress you’re doing?
  2. Is your Cachix the main binary cache for projects under this umbrella? If so, how large is it (if you’re willing or able to disclose)?
  3. Are the machines working as hydra builders all on the same network (or Hetzner region)?
  4. What are your thoughts on ephemeral Azure builders? I’ve found the HBv3 spot instances in US East extremely competitive price-wise, and have been looking into tooling to automate scaling them up and down.
  5. File-level deduplication in the cache would be fairly important, as CUDA enablement involves realizing a new copy of a derivation rather than building on an existing one; most projects don’t separate the code generation for CPUs from GPUs well enough for us to be able to re-use existing portions of builds. Is there anything like that set up currently?

I’d love to learn more about any of the challenges you all have faced setting up and maintaining this infrastructure!

zowoq commented 2 days ago

remote build protocol takes into account closure size or data locality when deciding which machines should build different things

No, it doesn't.
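
For anyone curious, the coordinator roughly just walks the configured machine list and matches on system type, mandatory/supported features, speed factor and free job slots; which store paths a builder already holds never enters into it. A sketch with two hypothetical builders (names are placeholders):

```nix
{
  nix.buildMachines = [
    {
      # Faster builder: preferred (higher speedFactor) whenever it has
      # free slots and satisfies the derivation's required features.
      hostName = "hel1-builder.example";
      system = "x86_64-linux";
      maxJobs = 16;
      speedFactor = 2;
      supportedFeatures = [ "kvm" "big-parallel" ];
    }
    {
      # Slower fallback builder; selection ignores closure size and
      # which paths either machine already has in its store.
      hostName = "basement-builder.example";
      system = "x86_64-linux";
      maxJobs = 4;
      speedFactor = 1;
      supportedFeatures = [ ];
    }
  ];
}
```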

Do the machines you’ve set up for Hydra also serve as a cache?

No.

Is your Cachix the main binary cache for projects under this umbrella? If so, how large is it (if you’re willing or able to disclose)?

Yes. 1 TB, sponsored by Cachix.

Are the machines working as hydra builders all on the same network (or Hetzner region)?

Yes, our linux machines are all in HEL1, though that isn't exactly intentional; we usually choose based on price. We also have two macOS builders in FSN1.

What are your thoughts on ephemeral Azure builders? I’ve found the HBv3 spot instances in US East extremely competitive price-wise, and have been looking into tooling to automate scaling them up and down.

Haven't used them. IIRC there was some discussion about Azure stuff in the nixos org, maybe with the infra team or the foundation?

File-level deduplication in the cache would be fairly important, as CUDA enablement involves realizing a new copy of a derivation rather than building on an existing one; most projects don’t separate the code generation for CPUs from GPUs well enough for us to be able to re-use existing portions of builds. Is there anything like that set up currently?

No.

I’d love to learn more about any of the challenges you all have faced setting up and maintaining this infrastructure!

I don't think we've had any real technical challenges so far; we've basically just been limited by funding. Once we started the Open Collective we expanded as the funding increased. Building CUDA, ROCm, etc. is probably going to be the first time we've really needed to give thought to some of these topics.

zimbatm commented 16 hours ago

Yes, our linux machines are all in HEL1.

See also https://docs.hetzner.com/robot/general/traffic/ Each server comes with 10TB of egress.

Mic92 commented 14 hours ago

Yes, our linux machines are all in HEL1.

See also https://docs.hetzner.com/robot/general/traffic/ Each server comes with 10TB of egress.

No, traffic is unmetered for physical machines with 1 Gbit links. The traffic limit only applies to VMs.