[Docker] reduce image size, use targeted builds

x4e-jonas commented 1 year ago

The Docker v3.12.4 image is > 4GB and keeps growing. For comparison, the eclipse-temurin:17-jre base image is only 90MB.

Docker images should contain only the required runtime libraries, since containers are meant to be lightweight. Even on a modern system, it takes several minutes to download and extract the vitrivr/cineast image. This makes deployment harder and leads to longer downtimes. Furthermore, if you use cloud native infrastructure, this could lead to higher costs due to the bandwidth and storage requirements. Since all dependencies are compressed into two single JARs, it's also impossible for Docker to cache or deduplicate image layers.

I took a look into the image: 2GB are used by resources (I'm not sure if there is any room for optimization), and cineast-api.jar and cineast-cli.jar are about 1GB each. When you look into the JARs, you will see that they share a large amount of data. If you look further into that data, most of it consists of native binaries (e.g. tensorflow or ffmpeg). The JARs even contain the same libraries for multiple platforms, like x86, arm, windows, macos etc.
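To put a number on the overlap between the two JARs, here is a minimal Python sketch (not part of the Cineast build). It treats two entries as identical when their name, CRC-32 and uncompressed size match, and reports how many compressed bytes of the first archive are duplicated in the second. The function name `shared_entries` is made up for illustration.

```python
import zipfile


def shared_entries(jar_a, jar_b):
    """Return (shared_bytes, total_bytes) measured on jar_a.

    shared_bytes counts the compressed size of entries in jar_a that also
    exist in jar_b with the same name, CRC-32 and uncompressed size.
    """
    def index(path):
        # Map entry name -> (crc, uncompressed size, compressed size).
        with zipfile.ZipFile(path) as zf:
            return {
                info.filename: (info.CRC, info.file_size, info.compress_size)
                for info in zf.infolist()
                if not info.is_dir()
            }

    a, b = index(jar_a), index(jar_b)
    shared = sum(
        comp
        for name, (crc, size, comp) in a.items()
        if name in b and b[name][:2] == (crc, size)
    )
    total = sum(comp for _, _, comp in a.values())
    return shared, total
```

Calling `shared_entries("cineast-api.jar", "cineast-cli.jar")` would then give a rough upper bound on what a shared dependency layer could save.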

So here are some thoughts:

Can the two JARs share a common dependency to avoid duplication?
Are both JARs required in the context of Docker, or would it make sense to have two different images/tags for api and cli?
The image runs only on linux/amd64 and therefore it should only contain the libraries for linux/amd64. Is there a way to use targeted builds with gradle for Docker? In the future, docker buildx could be used to support other platforms.
Alternatively, use the OS package manager to install native libraries if available.
This may also apply to the general release process of the JARs. It would also avoid unexpected issues on platforms that are not included in the JAR archives, e.g. anything other than glibc or older ARM platforms.
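Before setting up a targeted build, it may help to see how much of a JAR belongs to each platform. The sketch below groups compressed entry sizes by platform directory. It assumes the native libraries sit under directories named like `linux-x86_64/` or `windows-x86_64/`. That naming is a guess for illustration, so the pattern needs to be adapted to what the Cineast JARs actually contain.

```python
import re
import zipfile
from collections import Counter

# Assumed (hypothetical) layout: a path segment such as "linux-x86_64/".
PLATFORM_DIR = re.compile(
    r"(?:^|/)((?:linux|macosx|windows|android)-[a-z0-9_]+)/"
)


def bytes_per_platform(jar_path):
    """Sum compressed entry sizes per platform directory in a JAR."""
    sizes = Counter()
    with zipfile.ZipFile(jar_path) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            match = PLATFORM_DIR.search(info.filename)
            key = match.group(1) if match else "platform-independent"
            sizes[key] += info.compress_size
    return dict(sizes)
```

Everything that is not listed under the target platform or under `platform-independent` is a candidate for exclusion in a linux/amd64-only build.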

lucaro commented 1 year ago

Yes, the way the current build process is set up leads to some redundancy between cineast-api and cineast-cli, since they are both built to be self-contained and to work independently of each other, but are generally both generated. At least for some of the libraries that come with binary dependencies, it should also be possible to explicitly exclude the ones for other operating systems. This would, however, require a specific build path for linux/amd64 and would prevent the Docker image from running on any other platform (which probably won't work anyway due to some other dependencies). OS-level native libraries aren't really an option due to the way they are linked. The resources are just the way they are; there isn't really a way to reduce the size there.

lucaro commented 1 year ago

If you want to have a look at a targeted build via gradle that only builds the cineast-api target as an application with only the linux/amd64 binary dependencies, contributions are always appreciated 😉

Spiess commented 1 year ago

Regarding the resources, there is the possibility of implementing a "lazy download" strategy, such that Cineast downloads missing resources at runtime.

The benefit would be that we could build significantly smaller images, and that use cases which do not require the deep learning features would not need this lengthy download. The downside is that the resources would not be contained in the image by default.

x4e-jonas commented 1 year ago

The "lazy download" strategy would be very easy to implement in the Docker image. I'm just worried that this slows down the (re)start of the application. Can you share some details on the stage at which those resources are required? I'd propose caching those resources in an attached volume rather than inside the image and just checking for updates at runtime.
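A minimal sketch of the volume-cache idea, in Python for brevity (Cineast itself would do this in Java or in the container entrypoint). The function name, the `base_url` parameter and the directory layout are all hypothetical. A resource is downloaded only when it is not yet in the cache directory, and it is written to a temporary file first so that an interrupted download never leaves a partial file behind. Checking for updates is not covered here.

```python
import os
import shutil
import tempfile
import urllib.request
from pathlib import Path


def ensure_resource(name, cache_dir, base_url, opener=urllib.request.urlopen):
    """Return the local path of `name`, downloading it only on a cache miss."""
    target = Path(cache_dir) / name
    if target.exists():
        return target  # cache hit: nothing to download

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write next to the target, then rename atomically into place.
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=".part")
    try:
        with os.fdopen(fd, "wb") as tmp, opener(f"{base_url}/{name}") as resp:
            shutil.copyfileobj(resp, tmp)
        os.replace(tmp_name, target)
    except BaseException:
        # Remove the partial file so the next start retries cleanly.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target
```

With `cache_dir` pointing at a mounted volume, only the first start pays the download cost, and later restarts find the files already in place.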

x4e-jonas commented 1 year ago

I just noticed that the releases also contain libraries like JUnit, Mockito etc. Is this intentional? They are not huge, but I doubt that anyone is using them in a production environment.

silvanheller commented 1 year ago

Generally, all resources are used by features, either during extraction or at runtime. I like a lazy download strategy, and I'm also open to caching those resources in an attached volume. As for the second point, I'm sure there's some optimization to be done w.r.t. library naming, e.g. no testing library is required at normal runtime / in a production environment. Feel free to open a PR for both issues.

vitrivr / cineast

[Docker] reduce image size, use targeted builds #354