Open x4e-jonas opened 1 year ago
Yes, the way the current build process is set up leads to some redundancy between `cineast-api` and `cineast-cli`, since both are built to be self-contained and work independently of each other, but both are generally generated. At least for some of the libraries that come with binary dependencies, it should also be possible to explicitly exclude the variants for other operating systems. However, this would require a dedicated build path for `linux/amd64` and prevent the Docker image from running on any other platform (which probably won't work anyway due to some other dependencies). OS-level native libraries aren't really an option due to the way they are linked. The resources are just the way they are; there isn't really a way to reduce the size there.
If you want to take a look at a targeted build via Gradle that builds only the `cineast-api` target as an application with only the `linux/amd64` binary dependencies, contributions are always appreciated 😉
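One way to restrict the native binaries to a single platform, assuming they come in via the JavaCPP presets (as the TensorFlow and FFmpeg bindings typically do), would be Bytedeco's platform plugin. This is a sketch of the idea, not a tested change to Cineast's build:

```groovy
// build.gradle — hypothetical sketch: limit JavaCPP "-platform" artifacts
// to a single target instead of bundling natives for every OS/arch.
plugins {
    id 'org.bytedeco.gradle-javacpp-platform' version '1.5.10'
}

// With the plugin applied, only the linux-x86_64 natives end up on the
// runtime classpath and therefore in the fat JAR.
ext {
    javacppPlatform = 'linux-x86_64'
}
```

The plugin rewrites the transitive `-platform` dependencies, so no per-dependency exclusion rules are needed.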
Regarding the resources, one possibility would be to implement a "lazy download" strategy, where Cineast fetches missing resources at runtime. The benefit would be significantly smaller images, and use cases that do not need the deep learning features would be spared the lengthy download; the downside is that the resources would no longer be contained in the image by default.
The "lazy download" strategy would be very easy to implement in the Docker image. I'm just worried that this slows down the (re)start of the application. Can you share some details on at what stage those resources are required? I'd propose to cache those resources in an attached volume rather than inside the image and just check for updates at runtime.
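A minimal sketch of such a lazy-download helper, assuming resources are addressed by name and cached in an attached volume (`LazyResources` and `ensureResource` are hypothetical names, not part of Cineast's API):

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of a "lazy download" strategy: resources live in an
// attached volume (cacheDir) and are only fetched when missing.
public class LazyResources {

    // Returns the cached path, downloading the resource first if it is absent.
    public static Path ensureResource(Path cacheDir, String name, URI source) throws IOException {
        Path target = cacheDir.resolve(name);
        if (Files.exists(target)) {
            return target; // cache hit: no download, so (re)starts stay fast
        }
        Files.createDirectories(cacheDir);
        try (InputStream in = source.toURL().openStream()) {
            // Download to a temp file first, then move atomically, so a crash
            // mid-download never leaves a partial resource in the cache.
            Path tmp = Files.createTempFile(cacheDir, name, ".part");
            Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        }
        return target;
    }
}
```

With a volume mounted at the cache directory, the download cost is paid once per volume rather than once per container restart, which should address the restart-time concern.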
I just noticed that the releases also contain libraries like JUnit, Mockito, etc. Is this intentional? They are not huge, but I doubt that anyone uses them in a production environment.
Generally, all resources are used by features, either during extraction or at runtime. I like the lazy download strategy, and I'm also open to caching those resources in an attached volume. As for the second point, I'm sure there is some optimization to be done with respect to which libraries get bundled; for example, no testing library is required at normal runtime / in a production environment. Feel free to open a PR for both issues.
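On the test-library point: in Gradle, keeping JUnit and Mockito out of production artifacts is usually just a matter of declaring them in the test scope. A sketch with illustrative versions:

```groovy
dependencies {
    // test-only scope: never reaches the runtime classpath or the fat JAR
    testImplementation 'org.junit.jupiter:junit-jupiter:5.10.2'
    testImplementation 'org.mockito:mockito-core:5.11.0'

    // a declaration like this, by contrast, would bundle the library
    // into every release artifact:
    // implementation 'org.mockito:mockito-core:5.11.0'
}
```

If the testing libraries show up in releases, the likely cause is that they are declared with `implementation` (or the deprecated `compile`) scope somewhere in the build.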
The Docker image for v3.12.4 is > 4 GB and keeps growing. For comparison, the `eclipse-temurin:17-jre` base image is only 90 MB. Docker images should only contain the required runtime libraries, and the aim of containers is to be lightweight. Even on a modern system, it takes several minutes to download and extract the `vitrivr/cineast` image. This makes deployment harder and leads to longer downtimes. Furthermore, with cloud-native infrastructure, this can lead to higher costs due to the bandwidth and storage requirements. Since all dependencies are compressed into two single JARs, it's also impossible for Docker to cache or deduplicate image layers.

I took a look into the image: 2 GB are used by `resources` (I'm not sure if there is any room for optimization), and `cineast-api.jar` and `cineast-cli.jar` are about 1 GB each. When you look into the JARs, you will see that they share a large amount of data. If you look further into that data, most of it is native binaries (e.g. TensorFlow or FFmpeg). They even contain the same libraries for multiple platforms: x86, ARM, Windows, macOS, etc.

So here are some thoughts:
- The image is built for `linux/amd64` and therefore should only contain the libraries for `linux/amd64`. Is there a way to use targeted builds with Gradle for Docker? In the future, `docker buildx` could be used to support other platforms.
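On the layer-caching point, a common remedy is to stop shipping one fat JAR and instead copy the rarely changing dependencies and the frequently changing application code as separate layers. A hypothetical sketch; the paths and main class are illustrative, not Cineast's actual build output:

```dockerfile
FROM eclipse-temurin:17-jre

# Dependency JARs change rarely; keeping them in their own layer lets Docker
# cache and deduplicate them across image versions.
COPY build/dependencies/ /app/lib/

# The resources could be mounted as a volume instead of baked into the image.
VOLUME /app/resources

# The application classes change with every release but form a small layer.
COPY build/libs/cineast-api.jar /app/

# Main class name is assumed for illustration.
ENTRYPOINT ["java", "-cp", "/app/cineast-api.jar:/app/lib/*", "org.vitrivr.cineast.api.Main"]
```

With this split, pulling a new release only transfers the small application layer as long as the dependencies are unchanged.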