nlewo / nix2container

An archive-less dockerTools.buildImage implementation
Apache License 2.0

External hinting system for automatic layering #113

Open mikepurvis opened 7 months ago

mikepurvis commented 7 months ago

Some limitations of the "popularity" based approach for automatic layering:

- It doesn't account for the size of the store paths.
- It doesn't have any temporal context, for example optimizing blobs for how much their contents change over time.

None of these are huge issues with "small" images, but they start to really limit the effectiveness of the layering once there are thousands of store paths going into an image.

One possible way to improve this situation would be to have some kind of external scanner tool that could examine a bunch of related images, and perhaps also instances of those images/closures over time, and produce an output that could be checked into source control and used to better optimize automatic layer generation for successive builds. Because the file is checked in, builds remain pure, and the developer stays in control of how frequently to update the hint file (likely in conjunction with dependency changes or flake updates).
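To make the idea concrete, here is a minimal sketch of what such a scanner could emit. Everything here is hypothetical (the `build_hints` function, the field names, the `shared_by` ranking); the real inputs would be the closures of the related images, however they are obtained:

```python
import json

def build_hints(image_closures):
    """Hypothetical hint-file generator (not part of nix2container today).

    image_closures: list of sets of store paths, one set per examined image.
    Counts how many images contain each path, then groups paths by that
    count, so widely shared paths can be placed in their own stable layers.
    """
    counts = {}
    for closure in image_closures:
        for path in closure:
            counts[path] = counts.get(path, 0) + 1

    groups = {}
    for path, n in counts.items():
        groups.setdefault(n, []).append(path)

    return {
        "version": 1,
        "groups": [
            {"shared_by": n, "paths": sorted(paths)}
            for n, paths in sorted(groups.items(), reverse=True)
        ],
    }

# Two toy closures sharing glibc and bash but differing in the application:
closures = [
    {"/nix/store/aaa-glibc", "/nix/store/bbb-bash", "/nix/store/ccc-app1"},
    {"/nix/store/aaa-glibc", "/nix/store/bbb-bash", "/nix/store/ddd-app2"},
]
print(json.dumps(build_hints(closures), indent=2))
```

The layering algorithm could then prefer to put each `shared_by`-group into its own layer, most-shared first, before falling back to the popularity heuristic for the remainder.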

If there's interest in such a thing, perhaps this ticket can be a place to discuss what such a file could look like and how it would be most effective to collect the data.

nlewo commented 6 months ago

That would be really fun to implement: collecting all image graphs and finding common subgraphs to isolate into layers!
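A toy sketch of that common-subgraph idea (the graph encoding and function name are assumptions; real reference graphs would come from the image Nix expressions or something like `nix path-info -r`):

```python
def common_subgraph(graphs):
    """graphs: list of dicts mapping store path -> set of referenced paths.

    Returns the paths present in every graph, pruned so the result is
    closed under references (a path is kept only if all of its references
    are also kept). A reference-closed common subgraph is safe to isolate
    into shared layers. Note: in Nix a store path's references are fixed,
    so we can read them from any one graph.
    """
    common = set(graphs[0])
    for g in graphs[1:]:
        common &= set(g)

    # Drop paths whose references fall outside the common set, repeating
    # until the set is stable.
    changed = True
    while changed:
        changed = False
        for p in list(common):
            if not graphs[0][p] <= common:
                common.discard(p)
                changed = True
    return common

# Two toy image graphs: glibc <- bash <- app{1,2}
g1 = {"/nix/store/aaa-glibc": set(),
      "/nix/store/bbb-bash": {"/nix/store/aaa-glibc"},
      "/nix/store/ccc-app1": {"/nix/store/bbb-bash"}}
g2 = {"/nix/store/aaa-glibc": set(),
      "/nix/store/bbb-bash": {"/nix/store/aaa-glibc"},
      "/nix/store/ddd-app2": {"/nix/store/bbb-bash"}}
print(common_subgraph([g1, g2]))
```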

> some kind of external scanner tool that could examine a bunch of related images

I don't know exactly what you mean by "related images", but I think that to generate a pertinent profile we would need the whole closure graph of all the images, which is not available in the built images. This means the profile would have to be generated by consuming the images' Nix expressions. Maybe we could generate a useful profile from the image JSON file, but it would be suboptimal, and I don't see an advantage to consuming the image JSON file instead of the Nix expression.

> It doesn't account for the size of the store paths.

I think this should be added to the current algorithm, because generating tiny, deep layers doesn't make sense.
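As a sketch of how size could be folded into the current popularity heuristic (the scoring function and all names are hypothetical; real sizes could come from e.g. `nix path-info --size`):

```python
def layer_score(path, popularity, sizes):
    # Hypothetical scoring: weight a store path's popularity by its size,
    # so that tiny paths don't claim one of the limited layer slots.
    return popularity[path] * sizes[path]

def pick_layer_paths(paths, popularity, sizes, max_layers):
    """Return the paths most worth giving dedicated layers."""
    ranked = sorted(paths,
                    key=lambda p: layer_score(p, popularity, sizes),
                    reverse=True)
    return ranked[:max_layers]

# Toy data: a popular big path, a popular tiny path, an unpopular big path.
popularity = {"/nix/store/aaa-glibc": 10,
              "/nix/store/eee-icons": 10,
              "/nix/store/ccc-app": 1}
sizes = {"/nix/store/aaa-glibc": 30_000_000,
         "/nix/store/eee-icons": 4_000,
         "/nix/store/ccc-app": 50_000_000}
print(pick_layer_paths(list(sizes), popularity, sizes, 2))
```

With pure popularity the tiny icons path would outrank the application; weighting by size gives the two layer slots to the paths where deduplication actually saves bytes.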

> It doesn't have any temporal context, for example optimizing blobs for how much their contents change over time.

In practice, I'm not sure this would be convenient, since the analyzer would have to check out several commits to compute a profile. Alternatively, we could store the graph in the image filesystem or image metadata: this would also allow fetching a bunch of images from a registry to compute a profile.
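If snapshots of a closure over time were available (from checked-out commits, or from images fetched off a registry), a crude churn signal could be computed like this. This is a hypothetical sketch: it simply treats a store path that appears or disappears between consecutive builds as "changed", which is a reasonable proxy in Nix since any content change produces a new store path:

```python
from collections import defaultdict

def churn(snapshots):
    """snapshots: chronological list of sets of store paths, one per build.

    Returns, per path, the fraction of consecutive build pairs in which it
    appeared or disappeared -- high-churn paths should not share layers
    with stable ones.
    """
    flips = defaultdict(int)
    seen = set().union(*snapshots)
    for prev, cur in zip(snapshots, snapshots[1:]):
        for p in seen:
            if (p in prev) != (p in cur):
                flips[p] += 1
    n = len(snapshots) - 1
    return {p: flips[p] / n for p in seen}

# Toy history: glibc is stable, the app is rebuilt every time.
snapshots = [
    {"/nix/store/aaa-glibc", "/nix/store/v1-app"},
    {"/nix/store/aaa-glibc", "/nix/store/v2-app"},
    {"/nix/store/aaa-glibc", "/nix/store/v3-app"},
]
print(churn(snapshots))
```

The layering could then keep low-churn paths together in long-lived layers and push high-churn paths toward the top of the stack, where a rebuild invalidates as little as possible.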