Improve mutate.Squash so that it handles opaque whiteouts and hard links.
Prior to this PR, oci-tools relied on Extract from the github.com/google/go-containerregistry (GGCR) module. Unfortunately, Extract has a couple of known issues:
Fixing these is not trivial due to the way the GGCR code works. The original code reads the layers in reverse order, allowing it to handle files that are modified/deleted efficiently, since the "upper" layer content is always read first. Although this works well in the vast majority of cases, the bugs above highlight some of the challenges:
In general, whiteouts apply only to the content of lower-level layers. This is simple to implement for explicit whiteouts, since an explicit whiteout (which deletes an entry) should never appear in a layer containing a matching entry (which modifies an entry). Opaque whiteouts are more challenging, since they apply recursively to a path, and may overlap with entries in the layer containing the whiteout. To handle opaque whiteouts correctly, we must ensure that an entry is affected only by whiteouts from higher-level layers, never by a whiteout in its own layer.
Hard links are especially challenging, since they reference content from another entry. This creates subtle edge cases such as:
A hard link need not appear in the same layer as the content it references. With GGCR's approach of reading layers in reverse order, this means that a hard link could be encountered before the content it references. The output squashed layer must contain the content before any hard link(s) that reference it, which is the opposite of the order in which we encounter them.
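As a rough illustration of the entry kinds discussed above, the sketch below classifies a tar entry by its name and type flag. It assumes the whiteout naming used by the OCI image layer format (a `.wh.` prefix for explicit whiteouts and a `.wh..wh..opq` marker for opaque whiteouts). `classify`, `entryKind`, and the `kind*` constants are hypothetical names chosen for illustration, not code from this PR.

```go
package main

import (
	"archive/tar"
	"path"
	"strings"
)

const (
	whiteoutPrefix = ".wh."         // explicit whiteout: ".wh.<name>"
	opaqueMarker   = ".wh..wh..opq" // opaque whiteout marker inside a directory
)

// entryKind is a hypothetical classification of layer entries.
type entryKind int

const (
	kindRegular entryKind = iota
	kindDirectory
	kindHardLink
	kindExplicitWhiteout
	kindOpaqueWhiteout
)

// classify reports the kind of an entry and the cleaned path it affects.
// For whiteouts, the returned path is the path being hidden, not the
// name of the marker file itself.
func classify(name string, typeflag byte) (entryKind, string) {
	name = path.Clean(name)
	dir, base := path.Split(name)
	dir = path.Clean(dir) // "" becomes ".", "a/b/" becomes "a/b"

	switch {
	case base == opaqueMarker:
		// Applies to everything beneath dir in lower layers.
		return kindOpaqueWhiteout, dir
	case strings.HasPrefix(base, whiteoutPrefix):
		// Applies to the sibling named after the prefix.
		return kindExplicitWhiteout, path.Join(dir, strings.TrimPrefix(base, whiteoutPrefix))
	case typeflag == tar.TypeLink:
		return kindHardLink, name
	case typeflag == tar.TypeDir:
		return kindDirectory, name
	default:
		return kindRegular, name
	}
}
```

A caller iterating over a layer with `archive/tar` would pass `hdr.Name` and `hdr.Typeflag` from each header. The opaque marker is checked before the generic prefix because it also begins with `.wh.`.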
To work around these issues, the proposed code here:
Reads image layers in reverse order, accumulating several pieces of state from the current layer:
layerWhiteouts accumulates whiteouts from the current layer.
Hard links encountered are appended to imageLinks for later processing.
layerEntries accumulates entries that are encountered in the current layer that are not directories, hard links, or whiteouts. If these are not "shadowed", they are written to the flattened output; otherwise, the contents are temporarily stored.
imageShadows accumulates the effects upper level layers have on lower level layers. This covers both regular entries as well as explicit/opaque whiteouts.
When the end of a layer is encountered:
The effects of the layer's whiteouts (layerWhiteouts) are merged into imageShadows.
We write out any entries from imageLinks that reference entries present in the current layer (layerEntries). If the link target was shadowed, we ensure the content is written out with the link name, and that any other links are updated to reference it.
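The shadowing part of the steps above can be sketched roughly as follows. This is a simplified in-memory model, not the PR's implementation: it omits hard links and entry content entirely, and `layerEntry`, `shadows`, and `squash` are hypothetical names. The point it tries to show is that a layer's whiteouts are held back until the end of that layer, so an opaque whiteout never hides entries from its own layer.

```go
package main

import "path"

// entryType is a hypothetical, simplified entry classification.
type entryType int

const (
	typeRegular  entryType = iota // file, symlink, etc.
	typeWhiteout                  // explicit whiteout of path
	typeOpaque                    // opaque whiteout of directory path
)

type layerEntry struct {
	typ  entryType
	path string
}

// shadows records what upper layers hide from lower layers.
type shadows struct {
	paths map[string]bool // exact paths replaced or deleted
	trees map[string]bool // directories whose descendants are hidden
}

func newShadows() *shadows {
	return &shadows{paths: map[string]bool{}, trees: map[string]bool{}}
}

// hides reports whether p is hidden, either directly or via an ancestor.
func (s *shadows) hides(p string) bool {
	if s.paths[p] {
		return true
	}
	for dir := path.Dir(p); ; dir = path.Dir(dir) {
		if s.trees[dir] {
			return true
		}
		if dir == "." || dir == "/" {
			return false
		}
	}
}

func (s *shadows) merge(o *shadows) {
	for p := range o.paths {
		s.paths[p] = true
	}
	for p := range o.trees {
		s.trees[p] = true
	}
}

// squash takes layers ordered bottom-first and returns the surviving
// paths, reading the layers in reverse (top-first) order.
func squash(layers [][]layerEntry) []string {
	image := newShadows() // effects of all layers above the current one
	var out []string
	for i := len(layers) - 1; i >= 0; i-- {
		layer := newShadows() // effects of the current layer only
		for _, e := range layers[i] {
			switch e.typ {
			case typeOpaque:
				layer.trees[e.path] = true
			case typeWhiteout:
				layer.paths[e.path] = true
				layer.trees[e.path] = true
			default:
				if !image.hides(e.path) {
					out = append(out, e.path)
				}
				layer.paths[e.path] = true
			}
		}
		// Only now do this layer's effects apply, and only to lower layers.
		image.merge(layer)
	}
	return out
}
```

For example, given a bottom layer containing `a/old`, `b/keep`, and `c`, and a top layer containing an opaque whiteout of `a`, a new file `a/new`, and an explicit whiteout of `c`, this sketch keeps `a/new` and `b/keep` and drops `a/old` and `c`. Checking entries against `image` rather than `layer` is what keeps `a/new` from being hidden by the opaque whiteout that shares its layer.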