opencontainers / image-spec

OCI Image Format
https://www.opencontainers.org/
Apache License 2.0

performance: what can image-spec do to improve handling of large images? #1190

Open rchincha opened 1 month ago

rchincha commented 1 month ago

Now that OCI artifacts have landed and are getting mindshare and use cases, some issues are popping up. Best to standardize them.

Perhaps time to resurrect this? https://groups.google.com/a/opencontainers.org/g/dev/c/Zk3yf45HIdA

rchincha commented 1 month ago

https://github.com/opencontainers/runtime-spec/issues/1254

samuelkarp commented 1 month ago

https://github.com/kubernetes/enhancements/pull/4642 is relevant to this too

rchincha commented 1 month ago

https://docs.google.com/document/d/1Bs4fnP8rhPMaoPoLSYVvuRq-z9vkGPQ0rKbmfH4I7js/edit#heading=h.xw1gqgyqs5b ^ from the kubeflow community

cyphar commented 1 month ago

https://github.com/project-machine/puzzlefs was made to solve the problems my OCIv2 proposal discussed quite a few years ago. I haven't looked into it very deeply unfortunately, and I don't think it will help much with large artefact-filled images.

(My view has slowly moved to thinking that CDC and other compression methods make more sense on the distribution side. If we did that, it would be possible to make large images with any content equally deduplicated. There are downsides to this approach too, but embedding CDC parameters into the image-spec seems like a repeat of the nightmare we've had with compression algorithm settings but now with the added issue that changing the settings would cause you to waste cross-image deduplication.)
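
(For concreteness, here is a minimal Go sketch of what content-defined chunking does; the gear-style rolling hash and the parameters below are purely illustrative and not taken from any existing proposal or tool:)

```go
// Toy content-defined chunking (CDC) sketch. Chunk boundaries are chosen
// where a gear-style rolling hash over the trailing bytes hits a mask, so
// boundaries depend on local content rather than absolute offsets; a small
// insertion only disturbs the chunk it lands in. Parameters (mask, min/max
// sizes) are illustrative, not taken from any particular CDC scheme.
package main

import (
	"crypto/sha256"
	"fmt"
	"math/rand"
)

const (
	minChunk = 2 * 1024
	maxChunk = 64 * 1024
	boundary = 0x1FFF // 13 mask bits -> roughly 8 KiB average chunks
)

var gear [256]uint64

func init() {
	rng := rand.New(rand.NewSource(1))
	for i := range gear {
		gear[i] = rng.Uint64()
	}
}

// chunks splits data into content-defined chunks.
func chunks(data []byte) [][]byte {
	var out [][]byte
	start := 0
	var h uint64
	for i := 0; i < len(data); i++ {
		h = (h << 1) + gear[data[i]] // gear rolling hash over trailing bytes
		size := i - start + 1
		if (size >= minChunk && h&boundary == 0) || size >= maxChunk {
			out = append(out, data[start:i+1])
			start, h = i+1, 0
		}
	}
	if start < len(data) {
		out = append(out, data[start:])
	}
	return out
}

func main() {
	rng := rand.New(rand.NewSource(42))
	a := make([]byte, 256*1024)
	rng.Read(a)

	// b is a with two bytes inserted near the front: with fixed-size blocks
	// every later block would change, but CDC chunks re-synchronize.
	b := append(append(append([]byte{}, a[:100]...), 0xFF, 0xEE), a[100:]...)

	seen := map[[32]byte]bool{}
	for _, c := range chunks(a) {
		seen[sha256.Sum256(c)] = true
	}
	shared, total := 0, 0
	for _, c := range chunks(b) {
		total++
		if seen[sha256.Sum256(c)] {
			shared++
		}
	}
	fmt.Printf("chunks in b: %d, byte-identical to chunks of a: %d\n", total, shared)
}
```

Running it shows that after the two-byte insertion near the front, all but the first chunk of the modified stream still hash identically to chunks of the original, which is the property that makes CDC attractive for distribution-side deduplication.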

hsiangkao commented 1 month ago

https://github.com/project-machine/puzzlefs was made to solve the problems my OCIv2 proposal discussed quite a few years ago. I haven't looked into it very deeply unfortunately, and I don't think it will help much with large artefact-filled images.

Recently I happened to find that @gregkh already mentioned EROFS many years ago in the OCI community :-).. https://groups.google.com/a/opencontainers.org/g/dev/c/icXssT3zQxE/m/N4YZsbZcAwAJ

I may need to rephrase the EROFS story again here: rather than just reinventing a wheel for Android only, the original goal was to address Squashfs runtime performance issues, since Squashfs doesn't satisfy high-performance use cases like smartphones; users won't accept noticeable dynamic app latencies (and currently most Android vendors have already switched to EROFS, since they hit the same issue when applying compression). The Squashfs on-disk format hasn't even been updated for a decade (it currently lacks even a filesystem UUID), and various earlier improvement attempts (at least at the time I decided to redesign a high-performance image filesystem format) were ignored [1][2][3].

The goal of the EROFS filesystem was to launch a general-purpose, high-performance image filesystem project for various use cases, ranging from system firmware and container images to app sandboxes and even AI data models. For example, people could use the same image both as system firmware on raw block devices (as in Container OS use cases) and as a container image. If a new on-disk feature could benefit most image use cases, we will consider adding it after discussion, and new contributors are always welcome.

From my own perspective, although the OCI tar format has many flaws, the format is at least quite simple, and various operating systems can parse tar without any barrier. Besides, the Docker image format has existed for almost a decade, and many base layers are already shipped in the tar layer format. For a public cloud vendor (like my current employer, Alibaba Cloud), image compatibility is quite important for our customers, and I guess other cloud vendors have the same concern, since there are plenty of old OCI-compatible runtimes to consider. If people would like on-demand fetching, there are already technologies that address it, such as SOCI and stargz. If people want to mount a filesystem directly in-kernel (although I'm not sure why that requirement is really important compared with performance and OS-boot concerns, unlike the system firmware use cases), they could use a Squashfs or EROFS index over OCI tar data instead.

I'd be very happy if the OCI community had a chance to consider using EROFS in some form, but my opinion is that we may need to improve the current OCI format to overcome the current high-priority OCI image concerns first. That said, if some specific area like AI models needs a specific filesystem blob, I think EROFS layer blobs for such use cases are fine too; by the way, EROFS already has an IANA-registered media type, "vnd.erofs".

My own experience is that EROFS has only slowly come into wider use recently because many server users are still on 3.10 or 4.18 kernels. That doesn't matter for system image use cases like our original Android system images (users upgrade the whole system if they decide to use EROFS), but it may take many years before actual users adopt a new in-kernel feature for something like container images.

(My view has slowly moved to thinking that CDC and other compression methods make more sense on the distribution side. If we did that, it would be possible to make large images with any content equally deduplicated. There are downsides to this approach too, but embedding CDC parameters into the image-spec seems like a repeat of the nightmare we've had with compression algorithm settings but now with the added issue that changing the settings would cause you to waste cross-image deduplication.)

Actually EROFS has already had a variant of CDC since Linux 6.1; it's unlike traditional CDC, but the result is almost the same. My experience is that CDC works well on text materials but brings little benefit to executable binaries (which I guess are what we care about more in terms of image sizes and runtime performance), because jump and data-load instructions kill almost all possibility of such data deduplication, as in the following code snippets from two minor versions of libc:

[image: side-by-side disassembly of two libc minor versions]

In reality, the end result for executable binaries and the like is eventually something like page-unaligned block-based deduplication (like reflink) or file-based deduplication (like ostree).

IMHO, a CDC-like approach without compression is suitable for archive and transfer uses (like casync and the like), but as a kernel filesystem developer I would have certain reservations about it for runtime uses due to its block/page-unaligned chunks. CDC is unfriendly to page cache sharing (or FSDAX secure-container memory sharing), and data movement is almost always needed. That extra data movement also slows down performance compared with reflink-style approaches unless compression is also in play; note that EROFS has had a compressed data deduplication feature for two years, since 2022.

I think the only way to deduplicate these executable binaries is "delta compression", but I'm not sure whether that really belongs as a new on-disk feature in the Linux kernel anyway. I guess most users are already happy with ostree or the like; it needs careful evaluation though.

[1] https://lore.kernel.org/all/af77c1f80e2725c4cf1bf106d6add820b3b0eed5.1523276963.git.geliangtang@gmail.com
    https://lore.kernel.org/all/975b0f7acbb65445551ee374a2dd38d553ac2e6a.1523326310.git.geliangtang@gmail.com
    https://lore.kernel.org/all/1702a314dc9de4626fbefc788213a578be88f184.1533630854.git.geliangtang@gmail.com
    https://lore.kernel.org/all/15428d5047390927114ad49d7721b3da2bdf40ef.1548403955.git.geliangtang@gmail.com
    https://lore.kernel.org/all/d6cbe74944ad1a6be21cc74b99b30d18cba140c5.1548406694.git.geliangtang@gmail.com
[2] https://lore.kernel.org/all/20190717114151.10508-1-zbestahu@gmail.com
    https://lore.kernel.org/all/20190717120644.11128-1-zbestahu@gmail.com
    https://lore.kernel.org/all/20190719020653.8396-1-zbestahu@gmail.com
[3] https://lore.kernel.org/all/81a996d7-ba4c-e5a0-d0ce-11951f1fd612@huawei.com

rchincha commented 1 month ago

@hsiangkao thanks for calling out all the salient points.

Just curious, how well is in-kernel erofs supported, in terms of community size, history, etc.? Is there a recommended minimum Linux kernel version? Is there a recommended userspace erofs implementation?

gregkh commented 1 month ago

On Thu, May 30, 2024 at 11:57:17AM -0700, Ramkumar Chinchani wrote:

@hsiangkao thanks for calling out all the salient points.

Just curious, how well is in-kernel erofs supported, in terms of community size, history, etc.? Is there a recommended minimum Linux kernel version?

It is very well supported and used in a few hundred million, if not over a billion, devices every day (i.e. it is one of the very, very few file systems that Android allows to be used in its systems).

Highly recommended.

As for "minimum Linux kernel version", please always just use the latest stable Linux kernel version for any kernel feature. To use an older one is never recommended :)

hsiangkao commented 1 month ago

@hsiangkao thanks for calling out all the salient points.

Just curious, how well is in-kernel erofs supported, in terms of community size, history, etc.? Is there a recommended minimum Linux kernel version? Is there a recommended userspace erofs implementation?

I fully agree with Greg's point: always use the latest stable kernel. Anyway, to answer your question, it depends on the feature requirements. If the intention is just to use the EROFS format as an index (like a stargz-style TOC) referring to tar data (for lazy pulling), I think Linux 5.4+ is enough. The current distro configs can be checked at https://oracle.github.io/kconfigs/?config=UTS_RELEASE&config=EROFS_FS If compatibility is really the main concern, I'd suggest using Squashfs. And erofs-utils provides an official userspace implementation as an alternative approach anyway.
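
(For a quick runtime check on a given host, a minimal Go probe of /proc/filesystems can be used; this is just an illustrative sketch, not part of erofs-utils, and a filesystem built as a module only shows up there once the module is loaded:)

```go
// Quick runtime probe for EROFS support on the current kernel: check
// /proc/filesystems for an "erofs" entry. Caveat: a filesystem built as a
// module only appears here once the module is loaded, so a missing entry
// is not definitive (one could fall back to loading the module or doing a
// kernel version check).
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func kernelHasEROFS() (bool, error) {
	f, err := os.Open("/proc/filesystems")
	if err != nil {
		return false, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// Lines look like "nodev\tproc" or "\text4"; the last field is the name.
		fields := strings.Fields(sc.Text())
		if len(fields) > 0 && fields[len(fields)-1] == "erofs" {
			return true, nil
		}
	}
	return false, sc.Err()
}

func main() {
	ok, err := kernelHasEROFS()
	if err != nil {
		fmt.Fprintln(os.Stderr, "probe failed:", err)
		os.Exit(1)
	}
	fmt.Println("kernel reports erofs support:", ok)
}
```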

rchincha commented 1 month ago

Additional considerations ...

overlayfs (Linux kernel version 4.x, but also supported by various BSDs), squashfs (Linux kernel version 2.6.x, also supported by various BSDs, at least recently) and erofs (Linux kernel 5.x, not supported on *BSD?). MS Windows support is another matter altogether.

hsiangkao commented 1 month ago

Additional considerations ...

overlayfs (Linux kernel version 4.x, but also supported by various BSDs), squashfs (Linux kernel version 2.6.x, also supported by various BSDs, at least recently) and erofs (Linux kernel 5.x, not supported on *BSD?). MS Windows support is another matter altogether.

I'm quite open to that, since EROFS is not designed for any one specific use case. If the OCI community considers EROFS in some form (or as an alternative), that would be quite awesome. If not, EROFS will still keep adding new features to serve generic image use cases. EROFS feature development is always active, driven by Android vendors, some cloud vendors, etc.

rchincha commented 4 weeks ago

Adding more notes ...

OCI artifacts may package "many small-ish files", such as a container image rootfs, or "a few very large files", such as AI models.

ChaoyiHuang commented 3 weeks ago

Some thoughts here. Large model files are really large; for example, Llama 3 70B fp16 is about 141 GB.

One way to handle such a huge file is to use the same storage for the image registry and the compute nodes, i.e. the model file can be stored in a distributed file system in raw format, shared between the registry and the compute nodes, so there is no data transfer between the registry backend and the compute node.

When the client on the compute node pulls the model blob, the registry returns the location of the model file; the client finds that it's located on a file system the compute node can access, so no blob download is required.
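
(To illustrate the flow, here is a minimal Go sketch of such a client-side check; the annotation key and path below are hypothetical, not an existing OCI or registry feature:)

```go
// Sketch of the "resolve instead of download" idea: if the registry can tell
// the client where a blob already lives on storage that the compute node
// mounts, the client just verifies access (and ideally the digest) and skips
// the transfer. The annotation key used here ("org.example.blob.path") is
// hypothetical, not an OCI-defined annotation.
package main

import (
	"fmt"
	"os"
)

// resolveBlob returns a locally usable path for the blob if the shared
// location exists and is accessible; otherwise ok is false and the caller
// should fall back to a normal registry pull.
func resolveBlob(annotations map[string]string) (path string, ok bool) {
	p, present := annotations["org.example.blob.path"] // hypothetical annotation
	if !present {
		return "", false
	}
	if _, err := os.Stat(p); err != nil {
		return "", false
	}
	return p, true
}

func main() {
	// In a real client these annotations would come from the manifest
	// returned by the registry.
	annotations := map[string]string{
		"org.example.blob.path": "/mnt/sharedfs/models/llama3-70b-fp16.bin", // hypothetical path
	}
	if p, ok := resolveBlob(annotations); ok {
		fmt.Println("using shared-storage blob at", p, "- no download needed")
	} else {
		fmt.Println("shared location not accessible; falling back to blob download")
	}
}
```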

hsiangkao commented 3 weeks ago

Some thoughts here. Large model files are really large; for example, Llama 3 70B fp16 is about 141 GB.

One way to handle such a huge file is to use the same storage for the image registry and the compute nodes, i.e. the model file can be stored in a distributed file system in raw format, shared between the registry and the compute nodes, so there is no data transfer between the registry backend and the compute node.

When the client on the compute node pulls the model blob, the registry returns the location of the model file; the client finds that it's located on a file system the compute node can access, so no blob download is required.

Anyway, you could also treat OCI artifacts (a kind of object storage) as shared immutable storage (like a read-only mini gfs2 or ocfs2). That way you also don't need to download any blob locally in advance (potentially hundreds of GiB); just use virtual block device clients with nbd/tcmu/ublk, or (if you really need some local caching) a caching framework like fscache.

rchincha commented 3 weeks ago

Some thoughts here. Large model files are really large; for example, Llama 3 70B fp16 is about 141 GB.

^ how compressible is this model file?

rchincha commented 3 weeks ago

Is there interest in porting erofs-utils to golang, since most utilities in this world are golang-based? Mainly interested in creating an erofs layer/image (so that it is compatible with overlayfs).

gregkh commented 3 weeks ago

On Thu, Jun 06, 2024 at 01:10:01PM -0700, Ramkumar Chinchani wrote:

Is there interest in porting erofs-utils to golang, since most utilities in this world are golang-based?

The language the code is in should not matter, as you end up with a binary in the end. So this should not be an issue at all.

Mainly interested in creating an erofs layer/image (so that it is compatible with overlayfs).

Great, but the language of the tools does not prevent this :)

good luck!

greg k-h

rchincha commented 3 weeks ago

If the goal is to produce and consume erofs layers - so that they can just be copied over and mounted - then there are two touch points, which may or may not be OK with binary invocations:

  1. Tools that produce said layers and images
  2. Container runtimes that overlay mount these layers

Maybe as an initial PoC, Go bindings (cgo) instead?
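
(For comparison, the plain binary-invocation route is only a thin wrapper. A minimal Go sketch, assuming erofs-utils is installed and the basic `mkfs.erofs IMAGE DIRECTORY` invocation; compression and other options would be appended as extra arguments:)

```go
// Minimal sketch of the "just invoke the binary" route: build an EROFS layer
// by shelling out to mkfs.erofs from Go, rather than binding liberofs via cgo.
// Assumes erofs-utils is installed and that the plain
// "mkfs.erofs IMAGE DIRECTORY" form is accepted.
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// buildEROFSLayer packs srcDir into an EROFS image at imgPath.
func buildEROFSLayer(imgPath, srcDir string) error {
	cmd := exec.Command("mkfs.erofs", imgPath, srcDir)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		return fmt.Errorf("mkfs.erofs failed: %w", err)
	}
	return nil
}

func main() {
	if len(os.Args) != 3 {
		fmt.Fprintf(os.Stderr, "usage: %s <layer.erofs> <rootfs-dir>\n", os.Args[0])
		os.Exit(1)
	}
	if err := buildEROFSLayer(os.Args[1], os.Args[2]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("wrote", os.Args[1])
}
```

cgo bindings would avoid the external binary, at the cost of tying the Go tool to the C library's API while it is still moving.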

tianon commented 3 weeks ago

I recall reading (perhaps incorrectly! 🙈❤️) that many kernel filesystems are not designed to be hardened against attacker-controlled raw input, but given the use cases for erofs, I'm guessing that its implementation is hardened against malicious inputs? 👀😇

hsiangkao commented 3 weeks ago

I recall reading (perhaps incorrectly! 🙈❤️) that many kernel filesystems are not designed to be hardened against attacker-controlled raw input, but given the use cases for erofs, I'm guessing that its implementation is hardened against malicious inputs? 👀😇

This is really best-effort stuff. Unlike generic filesystems with complex metadata and journalling (where consistency issues between different kinds of metadata are always challenging), the EROFS core on-disk format is quite simple [1]. The EROFS project addresses any new syzkaller fuzzing reports, and we also have our own fuzzer to find potential bugs. However, EROFS is not a completely frozen filesystem project, so new useful on-disk/runtime features will be added over time according to new scenarios/inputs, which means new issues may be raised (we are all human and not bug-free).

Unlike some other filesystems, EROFS will address newly found/reported issues promptly, and that is all the guarantee I can give. So yes, in brief, the implementation is hardened against malicious inputs on a best-effort basis. Alternatively, we could find some way to let users stick to core stable features only, but that looks like a non-technical issue anyway (again, the latest stable kernels are always preferred, to pick up all kernel fixes).

[1] https://erofs.docs.kernel.org/en/latest/core_ondisk.html

hsiangkao commented 3 weeks ago

Is there interest in porting erofs-utils to golang, since most utilities in this world are golang-based? Mainly interested in creating an erofs layer/image (so that it is compatible with overlayfs).

Some runtimes, like gVisor [1], have already landed core on-disk EROFS support in their own Go form to enable efficient image passthrough to sandboxes. But some alternative approach (like cgo) would be helpful, since EROFS is still under active development and keeping multiple language implementations up to date is somewhat challenging with limited time and engineering resources (although we may have an experimental Rust implementation developed by students later). Anyway, C is still a very portable language across architectures / platforms / distributions.

[1] https://github.com/google/gvisor/pull/9486
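
(As a tiny taste of what a Go-side consumer involves, here is a sketch that only detects the EROFS superblock magic, 0xE0F5E1E2 at offset 1024 per the EROFS on-disk documentation; full metadata parsing, as gVisor does, is considerably more involved:)

```go
// Detect whether a blob looks like an EROFS image by reading the on-disk
// superblock magic (0xE0F5E1E2, little-endian, at offset 1024). This only
// identifies the format; it does not parse any further metadata.
package main

import (
	"encoding/binary"
	"fmt"
	"os"
)

const (
	erofsSuperOffset = 1024
	erofsMagicV1     = 0xE0F5E1E2
)

func isEROFS(path string) (bool, error) {
	f, err := os.Open(path)
	if err != nil {
		return false, err
	}
	defer f.Close()

	var magic [4]byte
	if _, err := f.ReadAt(magic[:], erofsSuperOffset); err != nil {
		return false, nil // too small to hold an EROFS superblock
	}
	return binary.LittleEndian.Uint32(magic[:]) == erofsMagicV1, nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintf(os.Stderr, "usage: %s <image>\n", os.Args[0])
		os.Exit(1)
	}
	ok, err := isEROFS(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("erofs image:", ok)
}
```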