Open gaocegege opened 2 years ago
We merged multiple base layers into the image, thus the size is large. diff should be used to reduce the size. Ref https://github.com/tensorchord/MIDI/pull/54/commits/e64786b91c821b250b06d228e8cbc5be0bcad59d
The base image nvidia/cuda:11.2-devel-ubuntu20.04 is 4GB, but our base image is 7GB. We should figure out where the extra 3GB comes from.
Perhaps we could make MIDI work in a way similar to docker-buildx, which uses the BuildKit library bundled into the Docker daemon with the docker driver, so that the image is actually built by dockerd and we don't need to load the image manually.
Cool! I think it is a great idea. Let's investigate how the docker buildx plugin does it.
buildx does not build into the local docker daemon by default either. We need to specify --load to load the artifact into docker.
But there is an optimization we could use:
```
$ docker volume inspect
[
    {
        "CreatedAt": "2022-05-05T17:30:23+08:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/buildx_buildkit_amazing_albattani0_state/_data",
        "Name": "buildx_buildkit_amazing_albattani0_state",
        "Options": null,
        "Scope": "local"
    }
]
```
```
$ docker inspect container
{
    "Type": "volume",
    "Source": "buildx_buildkit_amazing_albattani0_state",
    "Target": "/var/lib/buildkit"
}
```
We could create a volume to keep the cache persistent.
Buildx also relies on docker load:
```go
w = &waitingWriter{
	PipeWriter: pw,
	f: func() {
		resp, err := c.ImageLoad(ctx, pr, false)
		defer close(done)
		if err != nil {
			pr.CloseWithError(err)
			w.mu.Lock()
			w.err = err
			w.mu.Unlock()
			return
		}
		prog := progress.WithPrefix(status, "", false)
		progress.FromReader(prog, "importing to docker", resp.Body)
	},
	done:   done,
	cancel: cancel,
}
return w, func() {
	pr.Close()
}, nil
```
Now we use diff and merge ops to reduce the size, but docker load is still slow:
7GB image load takes ~17s
> 7GB image load takes ~17s
Wow! It is really awesome! :100:
Yeah, it comes from the containerd diff, I think.
Maybe we can have a look at how docker build loads the image into its local image store.
I found that sending the tarball still takes ~30s on my machine. Is this expected?
Yeah, it is expected in the current design. Sending the tarball is the docker load process.
I dived a little deeper into buildx and I am sure that docker load is not necessarily required by buildx.
Excerpt from the official documentation: https://docs.docker.com/engine/reference/commandline/buildx_create/#driver
> docker driver: Uses the builder that is built into the docker daemon. With this driver, the --load flag is implied by default on buildx build. However, building multi-platform images or exporting cache is not currently supported.
A PoC is available as well: https://gist.github.com/knight42/6c128a2edf7cebcb6816343da833295a. The built image is present in docker images without docker load.
After learning about that, I have been trying to get rid of docker load in envd, but it is unfortunate that the version of the bundled buildkitd in docker engine 20.10.14 is v0.8.3-4-gbc07b2b8, while mergeop is introduced in v0.10.3.
That said, even though the bundled buildkitd in docker might become new enough to support mergeop in the future, I think we still need some fallback mechanism, like using the docker-container driver as we do now.
@knight42 Thanks for the research!
> but it is unfortunate that the version of the bundled buildkitd in docker engine 20.10.14 is v0.8.3-4-gbc07b2b8, while mergeop is introduced in v0.10.3
I am wondering why we should use docker 20.10.14, is it the version that supports built-in load?
> while mergeop is introduced in v0.10.3.
Currently, we use buildkit v0.10.1, and merge op is supported in this version. I am not sure if it only works after v0.10.3 :thinking:
Got the problem here.
failed to solve LLB: failed to solve: failed to load LLB: unknown API capability mergeop
The client returns the error that we cannot use merge op if we eliminate docker load.
> I am wondering why we should use docker 20.10.14, is it the version that supports built-in load?
Nope, it is just the version of the dockerd on my laptop.
> while mergeop is introduced in v0.10.3.
> Currently, we use buildkit v0.10.1, and merge op is supported in this version. I am not sure if it only works after v0.10.3 :thinking:
Sorry, I double-checked the MergeOp PR; the merge op was actually introduced in v0.10.0.
> The client returns the error that we cannot use merge op if we eliminate docker load.
Yeah, since we heavily leverage merge op in envd, if we want to get rid of docker load, we need to make sure the bundled buildkitd in dockerd supports merge op.
Docker 20.10.16 still bundles v0.8.3-4-gbc07b2b8. I am afraid we need to wait for the next docker milestone.
Things that we need to confirm:
Using docker's bundled buildkitd directly is not possible now. We will figure out whether we can mount some directory in the envd_buildkitd container to achieve a similar experience.
Here https://github.com/moby/moby/blob/master/builder/builder-next/controller.go#L44-L220 docker creates a buildkit daemon (control.Controller).
And the most important part is https://github.com/moby/moby/blob/master/builder/builder-next/worker/worker.go#L83. Docker has a new worker type, moby:
```go
bk, err := buildkit.New(buildkit.Opt{
	SessionManager:      sm,
	Root:                filepath.Join(config.Root, "buildkit"),
	Dist:                d.DistributionServices(),
	NetworkController:   d.NetworkController(),
	DefaultCgroupParent: cgroupParent,
	RegistryHosts:       d.RegistryHosts(),
	BuilderConfig:       config.Builder,
	Rootless:            d.Rootless(),
	IdentityMapping:     d.IdentityMapping(),
	DNSConfig:           config.DNSConfig,
	ApparmorProfile:     daemon.DefaultApparmorProfile(),
})
```
buildkit (builder-next's control.Controller) uses dockerd's DistributionServices, so the images' blobs and metadata are stored in the docker image store directly. There is no need to load images.
https://github.com/moby/moby/issues/9935#issuecomment-297586182
The maintainers said it is not possible to run multiple docker daemons on one data root /var/lib/docker:
> The Docker daemon was explicitly designed to have exclusive access to /var/lib/docker. Nothing else should touch, poke, or tickle any of the Docker files hidden there.
> Why is that? It's one of the hard learned lessons from the dotCloud days. The dotCloud container engine worked by having multiple processes accessing /var/lib/dotcloud simultaneously. Clever tricks like atomic file replacement (instead of in-place editing), peppering the code with advisory and mandatory locking, and other experiments with safe-ish systems like SQLite and BDB only got us so far; and when we refactored our container engine (which eventually became Docker) one of the big design decisions was to gather all the container operations under a single daemon and be done with all that concurrent access nonsense.
> This means that if you share your /var/lib/docker directory between multiple Docker instances, you're gonna have a bad time. Of course, it might work, especially during early testing. "Look ma, I can docker run ubuntu!" But try to do something more involved (pull the same image from two different instances…) and watch the world burn.
But I still think it is possible to run a (minimal) docker daemon in our envd_buildkitd container. Only /var/lib/docker/image is needed in envd_buildkitd.
The main concern above is that multiple daemons on the same data root (/var/lib/docker/) may break consistency. Let's have a look at the directory layout of the image part in /var/lib/docker:
```
/var/lib/docker/image/overlay2
├── distribution
│   ├── diffid-by-digest
│   └── v2metadata-by-diffid
├── imagedb
│   ├── content
│   └── metadata
├── layerdb
│   ├── mounts
│   ├── sha256
│   └── tmp
└── repositories.json
```
distribution is used to communicate with the OCI image registry, so it is not used in envd_buildkitd. imagedb and layerdb are actually key-value stores where the key is the file name (a hex digest), so they should not be affected by concurrent daemons.
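To make the content-addressing point concrete: entries under imagedb/content (and chain IDs under layerdb/sha256) are named by the digest of their content, so two daemons that store the same bytes end up writing the same file name. A minimal sketch of the keying scheme (illustrative only, not docker's code; the config literal is made up):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// contentKey returns the hex SHA-256 digest of a blob. Identical content
// always maps to the same key, which is why concurrent writers of the same
// blob do not conflict on the file name.
func contentKey(blob []byte) string {
	return fmt.Sprintf("%x", sha256.Sum256(blob))
}

func main() {
	cfg := []byte(`{"architecture":"amd64","os":"linux"}`) // made-up config blob
	fmt.Println("sha256:" + contentKey(cfg))
}
```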
The last one, repositories.json, stores the map from image tag to image ID:
```json
{
    "ubuntu": {
        "ubuntu:20.04": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5",
        "ubuntu@sha256:47f14534bda344d9fe6ffd6effb95eefe579f4be0d508b7445cf77f61a0e5724": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5"
    }
}
```
It may be affected by concurrent daemons, but we may have some workarounds. For example, we can rename the image via the docker API instead of tagging it at the low level, so that we avoid manipulating this JSON file directly.
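For illustration, reading that mapping is straightforward; a hedged sketch assuming the exact shape shown above (the helper name is made up, and note that some docker versions additionally wrap the map in a top-level "Repositories" key):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookupImageID resolves a tag or digest reference to an image ID using the
// repo -> (reference -> image ID) shape shown above.
func lookupImageID(raw []byte, repo, ref string) (string, error) {
	var repos map[string]map[string]string
	if err := json.Unmarshal(raw, &repos); err != nil {
		return "", err
	}
	return repos[repo][ref], nil
}

func main() {
	raw := []byte(`{"ubuntu": {"ubuntu:20.04": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5"}}`)
	id, err := lookupImageID(raw, "ubuntu", "ubuntu:20.04")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```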
Thus in the buildkit exporter code, we should remove logic like this:
```go
if e.opt.ReferenceStore != nil {
	targetNames := strings.Split(e.targetName, ",")
	for _, targetName := range targetNames {
		tagDone := oneOffProgress(ctx, "naming to "+targetName)
		tref, err := distref.ParseNormalizedNamed(targetName)
		if err != nil {
			return nil, err
		}
		if err := e.opt.ReferenceStore.AddTag(tref, digest.Digest(id), true); err != nil {
			return nil, tagDone(err)
		}
		_ = tagDone(nil)
	}
}
```
Or we set ReferenceStore to nil.
We can create the image service outside of docker. Then I will check whether it is possible to embed it into the buildkitd process.
```go
package main

import (
	"context"
	"path/filepath"

	"github.com/docker/docker/api/types"
	_ "github.com/docker/docker/daemon/graphdriver/overlay2"
	"github.com/docker/docker/daemon/images"
	dmetadata "github.com/docker/docker/distribution/metadata"
	"github.com/docker/docker/image"
	"github.com/docker/docker/layer"
	"github.com/docker/docker/pkg/idtools"
	refstore "github.com/docker/docker/reference"
)

func main() {
	root := "/var/lib/docker"
	graphDriver := "overlay2"

	layerStore, err := layer.NewStoreFromOptions(layer.StoreOptions{
		Root:                      root,
		MetadataStorePathTemplate: filepath.Join(root, "image", "%s", "layerdb"),
		GraphDriver:               graphDriver,
		GraphDriverOptions:        []string{},
		IDMapping:                 idtools.IdentityMapping{},
		ExperimentalEnabled:       false,
	})
	if err != nil {
		panic(err)
	}
	m := layerStore.Map()
	for k, v := range m {
		println(k, v)
	}

	imageRoot := filepath.Join(root, "image", graphDriver)
	ifs, err := image.NewFSStoreBackend(filepath.Join(imageRoot, "imagedb"))
	if err != nil {
		panic(err)
	}
	imageStore, err := image.NewImageStore(ifs, layerStore)
	if err != nil {
		panic(err)
	}
	im := imageStore.Map()
	for k, v := range im {
		println(k, v.Size)
	}

	refStoreLocation := filepath.Join(imageRoot, `repositories.json`)
	rs, err := refstore.NewReferenceStore(refStoreLocation)
	if err != nil {
		panic(err)
	}
	_ = rs

	distributionMetadataStore, err := dmetadata.NewFSMetadataStore(filepath.Join(imageRoot, "distribution"))
	if err != nil {
		panic(err)
	}
	_ = distributionMetadataStore

	imgSvcConfig := images.ImageServiceConfig{
		DistributionMetadataStore: distributionMetadataStore,
		ImageStore:                imageStore,
		LayerStore:                layerStore,
		ReferenceStore:            rs,
	}
	imageService := images.NewImageService(imgSvcConfig)
	is, err := imageService.Images(context.TODO(), types.ImageListOptions{})
	if err != nil {
		panic(err)
	}
	imageService.DistributionServices()
	for _, i := range is {
		println(i.ID)
	}
}
```
https://github.com/tensorchord/buildkit/pull/1/files
I am working on it. It is not easy... :worried:
Things we may need to change:
nerdctl uses a newer buildkit https://github.com/containerd/nerdctl/blob/e77e05b5fd252274e3727e0439e9a2d45622ccb9/Dockerfile.d/SHA256SUMS.d/buildkit-v0.10.3. Can we leverage this?
> nerdctl uses a newer buildkit https://github.com/containerd/nerdctl/blob/e77e05b5fd252274e3727e0439e9a2d45622ccb9/Dockerfile.d/SHA256SUMS.d/buildkit-v0.10.3. Can we leverage this?
We are using a newer buildkit than nerdctl.
It is possible! https://github.com/gaocegege/buildkit/pull/1/files reuses the docker image store for caching the docker image in buildkit.
The buildkit instance in the container owns its own image cache. This PR reuses /var/lib/docker/image/overlay2/ instead of a separate cache.
A new exporter, envd, is introduced in the buildkit container.
The image is loaded into the docker host successfully, but it requires a dockerd restart to find the new image. It seems that dockerd does not watch the filesystem; I will figure it out.
```
buildctl build ... --output type=envd,name=gaoce
[+] Building 1.5s (4/4) FINISHED
 => docker-image://docker.io/library/python:3.8                                              1.5s
 => => resolve docker.io/library/python:3.8                                                  1.5s
 => CACHED ls                                                                                0.0s
 => CACHED pip install -i https://mirror.sjtu.edu.cn/pypi/web/simple jupyter                 0.0s
 => exporting to image                                                                       0.0s
 => => exporting layers                                                                      0.0s
 => => writing image sha256:470747d54520023ee32931048063d1f383d52046ba95625a3d41411805850893 0.0s
 => => naming to gaoce                                                                       0.0s
```
Found the root cause. docker/docker/layer.Store loads /var/lib/docker/image/overlay2/layerdb in its New func, so the new layer cannot be found. :worried:
```go
// newStoreFromGraphDriver creates a new Store instance using the provided
// metadata store and graph driver. The metadata store will be used to restore
// the Store.
func newStoreFromGraphDriver(root string, driver graphdriver.Driver) (Store, error) {
	caps := graphdriver.Capabilities{}
	if capDriver, ok := driver.(graphdriver.CapabilityDriver); ok {
		caps = capDriver.Capabilities()
	}
	ms, err := newFSMetadataStore(root)
	if err != nil {
		return nil, err
	}
	ls := &layerStore{
		store:       ms,
		driver:      driver,
		layerMap:    map[ChainID]*roLayer{},
		mounts:      map[string]*mountedLayer{},
		locker:      locker.New(),
		useTarSplit: !caps.ReproducesExactDiffs,
	}
	ids, mounts, err := ms.List()
	if err != nil {
		return nil, err
	}
	for _, id := range ids {
		l, err := ls.loadLayer(id)
		if err != nil {
			logrus.Debugf("Failed to load layer %s: %s", id, err)
			continue
		}
		if l.parent != nil {
			l.parent.referenceCount++
		}
	}
	for _, mount := range mounts {
		if err := ls.loadMount(mount); err != nil {
			logrus.Debugf("Failed to load mount %s: %s", mount, err)
		}
	}
	return ls, nil
}
```
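In other words, this is the classic load-once cache: the metadata directory is snapshotted into layerMap at construction and never re-read, so layers written later by another process stay invisible until a restart. A stripped-down illustration of the pattern (not docker's code; all names are made up):

```go
package main

import "fmt"

// store mimics docker's layerStore: it copies the backing metadata into an
// in-memory map once, at construction, and serves every lookup from that map.
type store struct {
	layerMap map[string]string
}

// newStore performs the one-time load, analogous to iterating ms.List() and
// calling loadLayer for each id in newStoreFromGraphDriver above.
func newStore(metadata map[string]string) *store {
	s := &store{layerMap: map[string]string{}}
	for id, diff := range metadata {
		s.layerMap[id] = diff
	}
	return s
}

func (s *store) Get(id string) (string, bool) {
	v, ok := s.layerMap[id]
	return v, ok
}

func main() {
	onDisk := map[string]string{"layer-a": "diff-a"}
	s := newStore(onDisk)

	// Another process (e.g. a buildkit exporter writing to layerdb) adds a
	// layer after the store was constructed:
	onDisk["layer-b"] = "diff-b"

	_, ok := s.Get("layer-b")
	fmt.Println(ok) // false: the store never re-reads the backing metadata
}
```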
> It's not really possible without significant changes in the code-base (and adding a lot of complexity); as mentioned: those daemon's won't know what's still being used by other daemons, so if one daemon pulls an image, the other daemon's won't know it's being pulled (so don't "see" the image in the list of images that's available locally), and if a daemon removes an image, the other daemons will fail (because an image that they expected to be there is suddenly gone).
I think it is the end game. I am closing the issue since it is not possible.
https://github.com/docker/buildx/issues/1132#issuecomment-1146518194
docker 22.06-beta supports merge op. We can add a check in envd: if the docker version is 20.xx, we use the runc worker; if it is 22.xx, we use the moby worker.
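A sketch of such a gate, assuming we only inspect the major component of the reported server version (the helper name and exact policy are hypothetical, not envd's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// workerFor picks a buildkit worker from a dockerd version string, following
// the policy above: docker >= 22.x bundles a buildkitd with merge op support,
// so the moby worker can be used; otherwise fall back to the runc worker
// (docker-container driver, docker load required).
func workerFor(serverVersion string) string {
	major, err := strconv.Atoi(strings.SplitN(serverVersion, ".", 2)[0])
	if err != nil || major < 22 {
		return "runc"
	}
	return "moby"
}

func main() {
	fmt.Println(workerFor("20.10.14")) // runc
	fmt.Println(workerFor("22.06.0"))  // moby
}
```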
Description
The image is built in buildkit and does not exist on the docker host, so we need to pipe the buildkit-built image into the docker host. docker load takes about 100s for a 20GB base image, which is too slow. We need to optimize it.
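As a rough sanity check on the numbers reported in this thread (treating GB as GiB), the load already runs at a few hundred MiB/s, so the cost is largely inherent in copying the bytes rather than in an inefficient implementation, which is why eliminating the copy altogether is attractive:

```go
package main

import "fmt"

// mibPerSecond is a back-of-the-envelope throughput: size in GB (treated as
// GiB) divided by elapsed seconds, expressed in MiB/s.
func mibPerSecond(sizeGB, seconds float64) float64 {
	return sizeGB * 1024 / seconds
}

func main() {
	// Figures reported in this thread.
	fmt.Printf("7GB in ~17s:   %.0f MiB/s\n", mibPerSecond(7, 17))   // ~422 MiB/s
	fmt.Printf("20GB in ~100s: %.0f MiB/s\n", mibPerSecond(20, 100)) // ~205 MiB/s
}
```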