moby / buildkit

concurrent, cache-efficient, and Dockerfile-agnostic builder toolkit

Discussion: BuildKit network dependencies #4099

Open tonistiigi opened 1 year ago

tonistiigi commented 1 year ago

ref https://github.com/moby/buildkit/issues/1337
ref https://github.com/moby/buildkit/pull/3960

message ExecOp {
  Meta meta = 1;
  repeated Mount mounts = 2;
  NetMode network = 3;
  SecurityMode security = 4;
  repeated SecretEnv secretenv = 5;

  repeated NetworkInterface networks = 6;

  // note that the Inputs array for this ExecOp also needs to define all
  // the inputs used by any mount of the ServicePeers this ExecOp has access to
}

message NetworkInterface {
  string name = 1; // unique per execop
  string IP = 2; // eg. 10.0.0.1/24
  int32 mtu = 3; // probably could be just fixed
  repeated Peer peers = 4;
}

message Peer {
  string IP = 1;
  oneof peer {
    SessionPeer session = 2;
    ServicePeer service = 3;
    Link link = 4;
  }
}

message SessionPeer {
  string sessionID = 1;
  string peerID = 2; // client needs to configure handler for such ID
}

message ServicePeer {
  string id = 1; // Same string with same Job group means same process (assuming that they also have equal inputs)
  NewContainerRequest container = 2;
  InitMessage init = 3;

  // stopsignal?
}

// Link can be used to link two containers already defined under an ExecOp.
// The definitions of the peers have already been provided, so only a link to an
// existing ServicePeer ID is needed. If no such ID is found, it is a runtime error.
message Link {
  string ID = 1;
}

message NewContainerRequest {
  string ContainerID = 1;
  // NEW: these mounts can use input indexes from the ExecOp inputs array
  repeated pb.Mount Mounts = 2;
  pb.NetMode Network = 3;
  pb.Platform platform = 4;
  pb.WorkerConstraints constraints = 5;
  repeated pb.HostIP extraHosts = 6;
  string hostname = 7;

  repeated NetworkInterface networks = 8;
}

Network dependencies can be created between:

- a build step and the client (SessionPeer)
- a build step and a service container started by buildkitd (ServicePeer)
- a build step and a service container already defined elsewhere in the same ExecOp (Link)
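For illustration, a hypothetical construction of these messages via generated Go types (a sketch only: pb.NetworkInterface, the oneof wrapper names, and field spellings are all assumed, since none of this exists in the pb package today):

execOp := &pb.ExecOp{
    Meta:   meta,
    Mounts: mounts,
    Networks: []*pb.NetworkInterface{{
        Name: "eth0",        // unique per ExecOp
        IP:   "10.0.0.2/24",
        Mtu:  1420,
        Peers: []*pb.Peer{
            // traffic for 10.0.0.1 is tunneled to the client over the session
            {IP: "10.0.0.1", Peer: &pb.Peer_Session{Session: &pb.SessionPeer{
                SessionID: sessionID,
                PeerID:    "default", // client configures a handler for this ID
            }}},
            // traffic for 10.0.0.3 goes to a service container run by buildkitd
            {IP: "10.0.0.3", Peer: &pb.Peer_Service{Service: &pb.ServicePeer{
                Id:        "db",
                Container: containerReq, // NewContainerRequest; its mounts must also appear in the ExecOp inputs
            }}},
        },
    }},
}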

All communication is via WireGuard packets. There is no support for, or dependency on, any custom CNI backend or special setup.

Implementation is based on the go-wireguard/netstack library https://pkg.go.dev/golang.zx2c4.com/wireguard@v0.0.0-20230704135630-469159ecf7d1/tun/netstack . This is imported in both the client and buildkitd. The buildkitd side can optionally create a native WireGuard interface instead, but that would require the two BuildKit containers to have a discoverable endpoint between them (e.g. the simplest is to use a CNI bridge for that).

// example connection logic
// (imports needed: bufio, bytes, encoding/hex, fmt, net/netip, strings,
// github.com/pkg/errors, golang.zx2c4.com/wireguard/conn,
// golang.zx2c4.com/wireguard/device, golang.zx2c4.com/wireguard/tun/netstack)

// gNet provides userspace Dial/Listen on top of the tunnel (see the usage sketch below)
tunDev, gNet, err := netstack.CreateNetTUN([]netip.Addr{myIP}, []netip.Addr{}, mtu)
if err != nil {
    return errors.Wrap(err, "failed to create tun device")
}

wgDev := device.NewDevice(tunDev, conn.NewDefaultBind(), device.NewLogger(device.LogLevelVerbose, "wireguard: "))

// keys are auto-generated by the library; public keys are shared via the session API when needed
for _, peer := range me.Peers {
    wgConf := bytes.NewBuffer(nil)
    fmt.Fprintf(wgConf, "private_key=%s\n", hex.EncodeToString(me.PrivateKey[:]))
    fmt.Fprintf(wgConf, "public_key=%s\n", hex.EncodeToString(peer.PublicKey[:]))
    if peer.Endpoint != nil {
        fmt.Fprintf(wgConf, "endpoint=%s\n", peer.Endpoint) // Endpoint is a dummy unique value for the custom bind
    }

    ips := make([]string, len(peer.AllowedIPs))
    for i, ip := range peer.AllowedIPs {
        ips[i] = ip.String()
    }

    fmt.Fprintf(wgConf, "allowed_ip=%s\n", strings.Join(ips, ","))
    fmt.Fprintf(wgConf, "persistent_keepalive_interval=%d\n", 10)

    if err := wgDev.IpcSetOperation(bufio.NewReader(wgConf)); err != nil {
        return errors.Wrap(err, "failed to set wg device config")
    }
}
if err := wgDev.Up(); err != nil {
    return errors.Wrap(err, "failed to bring wg device up")
}

There are no open ports or extra communication channels for network traffic. Everything goes over the gRPC connection from the session endpoint. Daemon-side communication is either between buildkitd (running netstack) and a tuntap device, or directly between two WireGuard interfaces.

For that, instead of the conn.NewDefaultBind() call above, a custom implementation of conn.Bind is needed: https://pkg.go.dev/golang.zx2c4.com/wireguard@v0.0.0-20230704135630-469159ecf7d1/conn#Bind

type Bind interface {
    // Open puts the Bind into a listening state on a given port and reports the actual
    // port that it bound to. Passing zero results in a random selection.
    // fns is the set of functions that will be called to receive packets.
    Open(port uint16) (fns []ReceiveFunc, actualPort uint16, err error)

    // Close closes the Bind listener.
    // All fns returned by Open must return net.ErrClosed after a call to Close.
    Close() error

    // SetMark sets the mark for each packet sent through this Bind.
    // This mark is passed to the kernel as the socket option SO_MARK.
    SetMark(mark uint32) error

    // Send writes one or more packets in bufs to address ep. The length of
    // bufs must not exceed BatchSize().
    Send(bufs [][]byte, ep Endpoint) error

    // ParseEndpoint creates a new endpoint from a string.
    ParseEndpoint(s string) (Endpoint, error)

    // BatchSize is the number of buffers expected to be passed to
    // the ReceiveFuncs, and the maximum expected to be passed to Send.
    BatchSize() int
}

E.g. a new API registered on the session endpoint defines a streaming endpoint that sends init/data/close etc. packets and implements this interface.
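As an illustration, a minimal sketch of such a bind; packetStream is a hypothetical transport (e.g. wrapping a gRPC bidi stream registered on the session endpoint), while conn.Bind, conn.ReceiveFunc, and conn.Endpoint are the real wireguard-go interfaces shown above (imports: net, net/netip, golang.zx2c4.com/wireguard/conn):

// hypothetical transport, e.g. a gRPC bidi stream from the session API
type packetStream interface {
    Recv() (pkt []byte, from string, err error)
    Send(pkt []byte, to string) error
    Close() error
}

// streamEndpoint identifies a peer by an opaque string instead of ip:port
type streamEndpoint string

func (e streamEndpoint) ClearSrc()           {}
func (e streamEndpoint) SrcToString() string { return "" }
func (e streamEndpoint) DstToString() string { return string(e) }
func (e streamEndpoint) DstToBytes() []byte  { return []byte(e) }
func (e streamEndpoint) DstIP() netip.Addr   { return netip.Addr{} }
func (e streamEndpoint) SrcIP() netip.Addr   { return netip.Addr{} }

type streamBind struct {
    stream packetStream
}

func (b *streamBind) Open(port uint16) ([]conn.ReceiveFunc, uint16, error) {
    // one ReceiveFunc that blocks on the stream and hands packets to wireguard-go
    recv := func(bufs [][]byte, sizes []int, eps []conn.Endpoint) (int, error) {
        pkt, from, err := b.stream.Recv()
        if err != nil {
            return 0, net.ErrClosed // required by the Bind contract after Close
        }
        sizes[0] = copy(bufs[0], pkt)
        eps[0] = streamEndpoint(from)
        return 1, nil
    }
    return []conn.ReceiveFunc{recv}, port, nil // no real port; echo the requested one
}

func (b *streamBind) Close() error { return b.stream.Close() }

func (b *streamBind) SetMark(mark uint32) error { return nil } // no kernel socket, nothing to mark

func (b *streamBind) Send(bufs [][]byte, ep conn.Endpoint) error {
    for _, buf := range bufs {
        if err := b.stream.Send(buf, ep.DstToString()); err != nil {
            return err
        }
    }
    return nil
}

func (b *streamBind) ParseEndpoint(s string) (conn.Endpoint, error) {
    return streamEndpoint(s), nil
}

func (b *streamBind) BatchSize() int { return 1 }

With a bind like this, the dummy endpoint values in the earlier config are just opaque stream identifiers; ParseEndpoint never has to resolve a real ip:port.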

netstack.CreateNetTUN() https://pkg.go.dev/golang.zx2c4.com/wireguard@v0.0.0-20230704135630-469159ecf7d1/tun/netstack#CreateNetTUN

@vito @pchico83

sipsma commented 1 year ago

I like this idea a lot; a nice side-effect is that it seems like it would probably work even if buildkitd is not root and is spinning up rootless containers via the oci worker, thanks to running netstack in userspace?

A mild implementation-detail concern is how much a buildkit client (i.e. buildx, the dagger cli, etc.) is going to get bloated by importing+running wireguard's netstack, both in terms of binary size and performance overhead.

If the client is using the docker-container connhelper and buildkitd->client connections are via session attachable (w/ grpc-in-grpc), I think at that point we'd be tunneling wireguard-over-grpc-over-grpc-over-stdio-pipes. Entirely possible that works great and doesn't matter, just seems like something to look out for.

tonistiigi commented 1 year ago

> spinning up rootless containers via the oci worker, thanks to running netstack in userspace?

Possibly, yes. I guess in that case creating a native wg interface would be too privileged.

> A mild implementation-detail concern is how much a buildkit client (i.e. buildx, the dagger cli, etc.) is going to get bloated by importing+running wireguard's netstack, both in terms of binary size and performance overhead.

I haven't measured how big these imports are. On the buildkitd side, it is needed. On the client side, it would probably be a good idea to use a separate import that enables the SessionPeer handler support.

jedevc commented 1 year ago

> If the client is using the docker-container connhelper and buildkitd->client connections are via session attachable (w/ grpc-in-grpc), I think at that point we'd be tunneling wireguard-over-grpc-over-grpc-over-stdio-pipes. Entirely possible that works great and doesn't matter, just seems like something to look out for.

Potentially, for the moby and dagger engine cases, we could have a new ClientOpt WithNetworkDialer (bad name), similar to how today we have WithSessionDialer. The client could open a direct connection to a reachable endpoint, and then perform WireGuard connections through it, without an intermediate gRPC tunneling layer.
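A rough sketch of how that option might be used, assuming it follows the shape of existing ClientOpts; WithNetworkDialer itself, its signature, and the dial address are all hypothetical:

// hypothetical: give the client a direct packet path to buildkitd,
// so wireguard traffic skips the session's gRPC tunnel
c, err := client.New(ctx, "docker-container://buildkitd",
    client.WithNetworkDialer(func(ctx context.Context) (net.Conn, error) {
        return net.Dial("tcp", "127.0.0.1:9999") // illustrative reachable endpoint
    }),
)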