Open gaocegege opened 2 years ago
We merged multiple base layers into the image, thus the size is large. diff should be used to reduce the size. Ref https://github.com/tensorchord/MIDI/pull/54/commits/e64786b91c821b250b06d228e8cbc5be0bcad59d
The base image nvidia/cuda:11.2-devel-ubuntu20.04 is 4GB, but our base image is 7GB. We should figure out where the extra 3GB comes from.
Perhaps we could make MIDI work in a way similar to docker-buildx, which uses the BuildKit library bundled into the Docker daemon with the docker driver, so that the image is actually built by dockerd and we don't need to load the image manually.
Cool! I think it is a great idea. Let's investigate how the docker buildx plugin does it.
buildx does not build into the local docker daemon by default either. We need to specify --load to load the artifact into docker.
But there is an optimization we could use:
```
$ docker volume inspect
[
    {
        "CreatedAt": "2022-05-05T17:30:23+08:00",
        "Driver": "local",
        "Labels": null,
        "Mountpoint": "/var/lib/docker/volumes/buildx_buildkit_amazing_albattani0_state/_data",
        "Name": "buildx_buildkit_amazing_albattani0_state",
        "Options": null,
        "Scope": "local"
    }
]
```
```
$ docker inspect container
{
    "Type": "volume",
    "Source": "buildx_buildkit_amazing_albattani0_state",
    "Target": "/var/lib/buildkit"
}
```
We could create a volume to keep the cache persistent.
Buildx also relies on docker load:
```go
w = &waitingWriter{
	PipeWriter: pw,
	f: func() {
		resp, err := c.ImageLoad(ctx, pr, false)
		defer close(done)
		if err != nil {
			pr.CloseWithError(err)
			w.mu.Lock()
			w.err = err
			w.mu.Unlock()
			return
		}
		prog := progress.WithPrefix(status, "", false)
		progress.FromReader(prog, "importing to docker", resp.Body)
	},
	done:   done,
	cancel: cancel,
}
return w, func() {
	pr.Close()
}, nil
```
Now we use diff and merge ops to reduce the size, but docker load is still slow:
7GB image load takes ~17s
> 7GB image load takes ~17s
Wow! It is really awesome! :100:
Yeah, it comes from the containerd diff, I think.
Maybe we can have a look at how docker build loads the image into its local image store.
I found that sending the tarball still takes ~30s on my machine. Is this expected?
Yeah, it is expected in the current design. Sending the tarball is the docker load process.
I dived a little deeper into buildx and I am sure that docker load is not necessarily required by buildx.
Excerpt from the official documentation: https://docs.docker.com/engine/reference/commandline/buildx_create/#driver
> docker driver: Uses the builder that is built into the docker daemon. With this driver, the --load flag is implied by default on buildx build. However, building multi-platform images or exporting cache is not currently supported.
A PoC is available as well: https://gist.github.com/knight42/6c128a2edf7cebcb6816343da833295a. The built image is present in docker images without docker load.
After learning about that, I have been trying to get rid of docker load in envd, but it is unfortunate that the version of the bundled buildkitd in docker engine 20.10.14 is v0.8.3-4-gbc07b2b8, while mergeop is introduced in v0.10.3.
That said, even though the bundled buildkitd in docker might become new enough to support mergeop in the future, I think we still need some fallback mechanism, like using the docker-container driver as we do now.
@knight42 Thanks for the research!
> but it is unfortunate that the version of the bundled buildkitd in docker engine 20.10.14 is v0.8.3-4-gbc07b2b8, while mergeop is introduced in v0.10.3
I am wondering why we should use docker 20.10.14, is it the version that supports built-in load?
> while mergeop is introduced in v0.10.3.
Currently, we use buildkit v0.10.1, and merge op is supported in this version. I am not sure if it only works after v0.10.3 :thinking:
Got the problem here.
failed to solve LLB: failed to solve: failed to load LLB: unknown API capability mergeop
The client returns the error that we cannot use merge op if we eliminate docker load.
> I am wondering why we should use docker 20.10.14, is it the version that supports built-in load?
Nope, it is just the version of the dockerd on my laptop.
> while mergeop is introduced in v0.10.3.
> Currently, we use buildkit v0.10.1, and merge op is supported in this version. I am not sure if it only works after v0.10.3 :thinking:
Sorry, I double-checked the MergeOp PR; the merge op was actually introduced in v0.10.0.
> The client returns the error that we cannot use merge op if we eliminate docker load.
Yeah, since we heavily leverage merge op in envd, if we want to get rid of docker load, we need to make sure the bundled buildkitd in dockerd supports merge op.
Docker 20.10.16 still bundles v0.8.3-4-gbc07b2b8. I am afraid we need to wait for the next docker milestone.
Things that we need to confirm:
Using docker's bundled buildkitd directly is not possible now. We will figure out whether we can mount some directory in the envd_buildkitd container to achieve a similar experience.
Here https://github.com/moby/moby/blob/master/builder/builder-next/controller.go#L44-L220 docker creates a buildkit daemon (control.Controller).
And the most important part is https://github.com/moby/moby/blob/master/builder/builder-next/worker/worker.go#L83. Docker has a new worker type, moby:
```go
bk, err := buildkit.New(buildkit.Opt{
	SessionManager:      sm,
	Root:                filepath.Join(config.Root, "buildkit"),
	Dist:                d.DistributionServices(),
	NetworkController:   d.NetworkController(),
	DefaultCgroupParent: cgroupParent,
	RegistryHosts:       d.RegistryHosts(),
	BuilderConfig:       config.Builder,
	Rootless:            d.Rootless(),
	IdentityMapping:     d.IdentityMapping(),
	DNSConfig:           config.DNSConfig,
	ApparmorProfile:     daemon.DefaultApparmorProfile(),
})
```
buildkit (builder-next's control.Controller) uses dockerd's DistributionServices, so the images' blobs and metadata are stored in the docker image store directly. There is no need to load images.
https://github.com/moby/moby/issues/9935#issuecomment-297586182
The maintainers said it is not possible to run multiple docker daemons on one data root /var/lib/docker:
> The Docker daemon was explicitly designed to have exclusive access to /var/lib/docker. Nothing else should touch, poke, or tickle any of the Docker files hidden there.
> Why is that? It's one of the hard learned lessons from the dotCloud days. The dotCloud container engine worked by having multiple processes accessing /var/lib/dotcloud simultaneously. Clever tricks like atomic file replacement (instead of in-place editing), peppering the code with advisory and mandatory locking, and other experiments with safe-ish systems like SQLite and BDB only got us so far; and when we refactored our container engine (which eventually became Docker) one of the big design decisions was to gather all the container operations under a single daemon and be done with all that concurrent access nonsense.
> This means that if you share your /var/lib/docker directory between multiple Docker instances, you're gonna have a bad time. Of course, it might work, especially during early testing. "Look ma, I can docker run ubuntu!" But try to do something more involved (pull the same image from two different instances…) and watch the world burn.
But I still think it is possible to run a (minimal) docker daemon in our envd_buildkitd container. Only /var/lib/docker/image is needed in envd_buildkitd.
The main concern above is that multiple daemons on the same data root (/var/lib/docker/) may break consistency. Let's have a look at the directory layout of the image part in /var/lib/docker:
```
/var/lib/docker/image/overlay2
├── distribution
│   ├── diffid-by-digest
│   └── v2metadata-by-diffid
├── imagedb
│   ├── content
│   └── metadata
├── layerdb
│   ├── mounts
│   ├── sha256
│   └── tmp
└── repositories.json
```
distribution is used to communicate with the OCI image registry, so it is not used in envd_buildkitd. imagedb and layerdb are actually key-value stores where the key is the file name (a hex digest), so they should not be affected by concurrent daemons.
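To make the content-addressing point concrete: entries under imagedb/content (and chain IDs under layerdb/sha256) are named by the digest of their content, so two daemons that store the same bytes end up writing the same file name. A minimal sketch of the keying scheme (illustrative only, not docker's code; the config literal is made up):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// contentKey returns the hex SHA-256 digest of a blob. Identical content
// always maps to the same key, which is why concurrent writers of the same
// blob do not conflict on the file name.
func contentKey(blob []byte) string {
	return fmt.Sprintf("%x", sha256.Sum256(blob))
}

func main() {
	cfg := []byte(`{"architecture":"amd64","os":"linux"}`) // made-up config blob
	fmt.Println("sha256:" + contentKey(cfg))
}
```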
The last one, repositories.json, stores the map from image tag to image ID:
```json
{
    "ubuntu": {
        "ubuntu:20.04": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5",
        "ubuntu@sha256:47f14534bda344d9fe6ffd6effb95eefe579f4be0d508b7445cf77f61a0e5724": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5"
    }
}
```
It may be affected by concurrent daemons, but we may have some workarounds. For example, we can rename the image via the docker API instead of tagging it at the low level, so that we avoid manipulating this JSON file directly.
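For illustration, reading that mapping is straightforward; a hedged sketch assuming the exact shape shown above (the helper name is made up, and note that some docker versions additionally wrap the map in a top-level "Repositories" key):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// lookupImageID resolves a tag or digest reference to an image ID using the
// repo -> (reference -> image ID) shape shown above.
func lookupImageID(raw []byte, repo, ref string) (string, error) {
	var repos map[string]map[string]string
	if err := json.Unmarshal(raw, &repos); err != nil {
		return "", err
	}
	return repos[repo][ref], nil
}

func main() {
	raw := []byte(`{"ubuntu": {"ubuntu:20.04": "sha256:53df61775e8856a464ca52d4cd9eabbf4eb3ceedbde5afecc57e417e7b7155d5"}}`)
	id, err := lookupImageID(raw, "ubuntu", "ubuntu:20.04")
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```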
Thus in the buildkit exporter code, we should remove logic like this:
```go
if e.opt.ReferenceStore != nil {
	targetNames := strings.Split(e.targetName, ",")
	for _, targetName := range targetNames {
		tagDone := oneOffProgress(ctx, "naming to "+targetName)
		tref, err := distref.ParseNormalizedNamed(targetName)
		if err != nil {
			return nil, err
		}
		if err := e.opt.ReferenceStore.AddTag(tref, digest.Digest(id), true); err != nil {
			return nil, tagDone(err)
		}
		_ = tagDone(nil)
	}
}
```
Or we set ReferenceStore to nil.
We can create the image service outside of docker. Then I will check whether it is possible to embed it into the buildkitd process.
```go
package main

import (
	"context"
	"path/filepath"

	"github.com/docker/docker/api/types"
	_ "github.com/docker/docker/daemon/graphdriver/overlay2"
	"github.com/docker/docker/daemon/images"
	dmetadata "github.com/docker/docker/distribution/metadata"
	"github.com/docker/docker/image"
	"github.com/docker/docker/layer"
	"github.com/docker/docker/pkg/idtools"
	refstore "github.com/docker/docker/reference"
)

func main() {
	root := "/var/lib/docker"
	graphDriver := "overlay2"

	layerStore, err := layer.NewStoreFromOptions(layer.StoreOptions{
		Root:                      root,
		MetadataStorePathTemplate: filepath.Join(root, "image", "%s", "layerdb"),
		GraphDriver:               graphDriver,
		GraphDriverOptions:        []string{},
		IDMapping:                 idtools.IdentityMapping{},
		ExperimentalEnabled:       false,
	})
	if err != nil {
		panic(err)
	}
	m := layerStore.Map()
	for k, v := range m {
		println(k, v)
	}

	imageRoot := filepath.Join(root, "image", graphDriver)
	ifs, err := image.NewFSStoreBackend(filepath.Join(imageRoot, "imagedb"))
	if err != nil {
		panic(err)
	}
	imageStore, err := image.NewImageStore(ifs, layerStore)
	if err != nil {
		panic(err)
	}
	im := imageStore.Map()
	for k, v := range im {
		println(k, v.Size)
	}

	refStoreLocation := filepath.Join(imageRoot, `repositories.json`)
	rs, err := refstore.NewReferenceStore(refStoreLocation)
	if err != nil {
		panic(err)
	}
	_ = rs

	distributionMetadataStore, err := dmetadata.NewFSMetadataStore(filepath.Join(imageRoot, "distribution"))
	if err != nil {
		panic(err)
	}
	_ = distributionMetadataStore

	imgSvcConfig := images.ImageServiceConfig{
		DistributionMetadataStore: distributionMetadataStore,
		ImageStore:                imageStore,
		LayerStore:                layerStore,
		ReferenceStore:            rs,
	}
	imageService := images.NewImageService(imgSvcConfig)
	is, err := imageService.Images(context.TODO(), types.ImageListOptions{})
	if err != nil {
		panic(err)
	}
	imageService.DistributionServices()
	for _, i := range is {
		println(i.ID)
	}
}
```
https://github.com/tensorchord/buildkit/pull/1/files
I am working on it. It is not easy... :worried:
Things we may need to change:
nerdctl uses a newer buildkit https://github.com/containerd/nerdctl/blob/e77e05b5fd252274e3727e0439e9a2d45622ccb9/Dockerfile.d/SHA256SUMS.d/buildkit-v0.10.3. Can we leverage this?
> nerdctl uses a newer buildkit https://github.com/containerd/nerdctl/blob/e77e05b5fd252274e3727e0439e9a2d45622ccb9/Dockerfile.d/SHA256SUMS.d/buildkit-v0.10.3. Can we leverage this?
We are using a newer buildkit than nerdctl.
It is possible! https://github.com/gaocegege/buildkit/pull/1/files reuses the docker image store for caching the docker image in buildkit.
The buildkit instance in the container owns its own image cache. This PR reuses /var/lib/docker/image/overlay2/ instead of a separate cache.
A new exporter, envd, is introduced in the buildkit container.
The image is loaded into the docker host successfully, but it requires a dockerd restart to find the new image. It seems that dockerd does not watch the filesystem; I will figure it out.
```
buildctl build ... --output type=envd,name=gaoce
[+] Building 1.5s (4/4) FINISHED
 => docker-image://docker.io/library/python:3.8                                              1.5s
 => => resolve docker.io/library/python:3.8                                                  1.5s
 => CACHED ls                                                                                0.0s
 => CACHED pip install -i https://mirror.sjtu.edu.cn/pypi/web/simple jupyter                 0.0s
 => exporting to image                                                                       0.0s
 => => exporting layers                                                                      0.0s
 => => writing image sha256:470747d54520023ee32931048063d1f383d52046ba95625a3d41411805850893 0.0s
 => => naming to gaoce                                                                       0.0s
```
Found the root cause. docker/docker/layer.Store loads /var/lib/docker/image/overlay2/layerdb in its New func, so the new layer cannot be found. :worried:
```go
// newStoreFromGraphDriver creates a new Store instance using the provided
// metadata store and graph driver. The metadata store will be used to restore
// the Store.
func newStoreFromGraphDriver(root string, driver graphdriver.Driver) (Store, error) {
	caps := graphdriver.Capabilities{}
	if capDriver, ok := driver.(graphdriver.CapabilityDriver); ok {
		caps = capDriver.Capabilities()
	}
	ms, err := newFSMetadataStore(root)
	if err != nil {
		return nil, err
	}
	ls := &layerStore{
		store:       ms,
		driver:      driver,
		layerMap:    map[ChainID]*roLayer{},
		mounts:      map[string]*mountedLayer{},
		locker:      locker.New(),
		useTarSplit: !caps.ReproducesExactDiffs,
	}
	ids, mounts, err := ms.List()
	if err != nil {
		return nil, err
	}
	for _, id := range ids {
		l, err := ls.loadLayer(id)
		if err != nil {
			logrus.Debugf("Failed to load layer %s: %s", id, err)
			continue
		}
		if l.parent != nil {
			l.parent.referenceCount++
		}
	}
	for _, mount := range mounts {
		if err := ls.loadMount(mount); err != nil {
			logrus.Debugf("Failed to load mount %s: %s", mount, err)
		}
	}
	return ls, nil
}
```
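In other words, this is the classic load-once cache: the metadata directory is snapshotted into layerMap at construction and never re-read, so layers written later by another process stay invisible until a restart. A stripped-down illustration of the pattern (not docker's code; all names are made up):

```go
package main

import "fmt"

// store mimics docker's layerStore: it copies the backing metadata into an
// in-memory map once, at construction, and serves every lookup from that map.
type store struct {
	layerMap map[string]string
}

// newStore performs the one-time load, analogous to iterating ms.List() and
// calling loadLayer for each id in newStoreFromGraphDriver above.
func newStore(metadata map[string]string) *store {
	s := &store{layerMap: map[string]string{}}
	for id, diff := range metadata {
		s.layerMap[id] = diff
	}
	return s
}

func (s *store) Get(id string) (string, bool) {
	v, ok := s.layerMap[id]
	return v, ok
}

func main() {
	onDisk := map[string]string{"layer-a": "diff-a"}
	s := newStore(onDisk)

	// Another process (e.g. a buildkit exporter writing to layerdb) adds a
	// layer after the store was constructed:
	onDisk["layer-b"] = "diff-b"

	_, ok := s.Get("layer-b")
	fmt.Println(ok) // false: the store never re-reads the backing metadata
}
```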
> It's not really possible without significant changes in the code-base (and adding a lot of complexity); as mentioned: those daemon's won't know what's still being used by other daemons, so if one daemon pulls an image, the other daemon's won't know it's being pulled (so don't "see" the image in the list of images that's available locally), and if a daemon removes an image, the other daemons will fail (because an image that they expected to be there is suddenly gone).
I think it is the end game. I am closing the issue since it is not possible.
https://github.com/docker/buildx/issues/1132#issuecomment-1146518194
docker 22.06-beta supports merge op. We can add a check in envd: if the docker version is 20.xx, we use the runc worker; if it is 22.xx, we use the moby worker.
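A sketch of such a gate, assuming we only inspect the major component of the reported server version (the helper name and exact policy are hypothetical, not envd's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// workerFor picks a buildkit worker from a dockerd version string, following
// the policy above: docker >= 22.x bundles a buildkitd with merge op support,
// so the moby worker can be used; otherwise fall back to the runc worker
// (docker-container driver, docker load required).
func workerFor(serverVersion string) string {
	major, err := strconv.Atoi(strings.SplitN(serverVersion, ".", 2)[0])
	if err != nil || major < 22 {
		return "runc"
	}
	return "moby"
}

func main() {
	fmt.Println(workerFor("20.10.14")) // runc
	fmt.Println(workerFor("22.06.0"))  // moby
}
```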
Description
The image is built in buildkit and does not exist on the docker host, so we need to pipe the buildkit-built image into the docker host. docker load takes about 100s for a 20GB base image, which is too slow. We need to optimize it.
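As a rough sanity check on the numbers reported in this thread (treating GB as GiB), the load already runs at a few hundred MiB/s, so the cost is largely inherent in copying the bytes rather than in an inefficient implementation, which is why eliminating the copy altogether is attractive:

```go
package main

import "fmt"

// mibPerSecond is a back-of-the-envelope throughput: size in GB (treated as
// GiB) divided by elapsed seconds, expressed in MiB/s.
func mibPerSecond(sizeGB, seconds float64) float64 {
	return sizeGB * 1024 / seconds
}

func main() {
	// Figures reported in this thread.
	fmt.Printf("7GB in ~17s:   %.0f MiB/s\n", mibPerSecond(7, 17))   // ~422 MiB/s
	fmt.Printf("20GB in ~100s: %.0f MiB/s\n", mibPerSecond(20, 100)) // ~205 MiB/s
}
```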