Open DYefimov opened 2 years ago
In general, I see reproducibility as a best effort, but not a guarantee. For the guarantee, you'd need to match the tooling that produced the image, and that tooling would need to provide a reproducibility guarantee itself. There are a lot of variables, including things like gzip compression levels, various attributes in the tar headers, seekable tar formats (estargz), and various digest algorithms. The JSON schemas can be extended with custom fields, and some implementations aren't consistent with ordering of those fields or the white space used in the JSON.
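As a small illustration of the JSON point above, the following sketch hashes two documents that are equivalent as JSON but differ in whitespace and key order. The field values are made up for the example, and `sha256sum` from GNU coreutils is assumed to be available.

```sh
#!/usr/bin/env sh
# Two hypothetical JSON documents that parse to the same object
# but differ in whitespace and key order.
compact='{"architecture":"amd64","os":"linux"}'
spaced='{ "os": "linux", "architecture": "amd64" }'

# Print only the hex digest of stdin (assumes GNU coreutils sha256sum).
digest() {
  sha256sum | cut -d ' ' -f 1
}

d1=$(printf '%s' "$compact" | digest)
d2=$(printf '%s' "$spaced" | digest)

echo "compact: sha256:$d1"
echo "spaced:  sha256:$d2"
if [ "$d1" != "$d2" ]; then
  echo "digests differ although the JSON content is equivalent"
fi
```

Because the digest is computed over raw bytes, any tool that re-serializes such a document can change its content address without changing its meaning.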
Ideally we'll identify as many of these as possible and specify a canonical standard for everyone to follow, to maximize the possibility of reproducibility. However, consumers of image content will also be flexible in what they allow, to maximize the portability of content and compatibility between tools.
Given this, are there any specific changes needed to the image-spec right now, or should this be closed and we can revisit individual spec issues on a case-by-case basis?
Abstract
OCI chose the tar format as the basis for the image storage layer, while not specifying any constraints on the tar format itself, AFAIK.
In #805 @vbatts says:

> That unpacking and repacking of the _same_ content is deterministic

I found that this does not always hold true, so the content-addressable scheme might be affected.
Steps to reproduce (with Linux + GNU tar):

- `skopeo` copy the `hello-world` image from the registry into a local OCI layout
- `gunzip` the `tar.gz` layer holding the hello binary
- `tar x` the hello binary inside
- `tar c` the extracted hello binary

I refined the case down to differences between the GNU tar implementation and the Golang one.
Given that most containerization software nowadays is written in Go, someone might find this useful. As a side note, I don't intend to dig deeper into this and hope that more experienced OCI/Golang people will pick it up.
Testcase and explanation
Please take a look at this test case (`tar_issue_test.sh`):

```sh
#!/usr/bin/env sh
set -e

SKOPEO_IMG=quay.io/skopeo/stable:latest

# Print the environment the test runs in.
uname -srvmpio
docker --version
docker run --rm \
  --security-opt seccomp=unconfined \
  $SKOPEO_IMG --version
tar --version | head -n 1
echo '================================'

IMAGE_NAME=hello-world
TMP_DIR=$(mktemp -dt tar_issue_test.XXXXXXXX)
mkdir "$TMP_DIR/$IMAGE_NAME"
echo "Created \"$TMP_DIR\""
trap "echo \"Removing \\\"$TMP_DIR\\\"\"; rm -rf \"$TMP_DIR\"" EXIT

# Copy the image from the registry into a local OCI layout.
docker run --rm \
  --security-opt seccomp=unconfined \
  --user $(id -u):$(id -g) \
  -v "$TMP_DIR/$IMAGE_NAME":"/$IMAGE_NAME" \
  $SKOPEO_IMG \
  copy docker://$IMAGE_NAME oci:$IMAGE_NAME:latest

# Take the layer blob and decompress it.
mkdir "$TMP_DIR/$IMAGE_NAME/testdir"
cd "$TMP_DIR/$IMAGE_NAME/testdir"
mv \
  "$TMP_DIR/$IMAGE_NAME/blobs/sha256/2db29710123e3e53a794f2694094b9b4338aa9ee5c40b930cb8063a1be392c54" \
  "./src.tar.gz"
gunzip -q ./src.tar.gz
echo '================================'

# Unpack and repack the same content with GNU tar.
tar xvf ./src.tar # contains just the "hello" binary
SOURCE_DATE_EPOCH=$(date +%s)
tar \
  --format=ustar \
  -b 1 \
  --sort=name \
  --numeric-owner --owner=0 --group=0 \
  --mtime="@${SOURCE_DATE_EPOCH}" --clamp-mtime \
  -cf repacked.tar hello
chmod g-w repacked.tar
echo '================================'

# Compare the original and the repacked archive.
set -x
ls -lt --time-style=full-iso
tar -tvf src.tar
tar -tvf repacked.tar
cmp -l src.tar repacked.tar || true
hexdump -C src.tar | head
hexdump -C repacked.tar | head
set +x
echo '================================'
```

and its output in my environment (kernel a bit outdated for irrelevant reasons):
There are two differences between the original and the repacked tar files (bytes 102 and 154). The second one is the header checksum, which differs as a direct consequence of the first.
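The dependency between the two differing bytes can be sketched as follows. The tar header checksum is a plain sum of the 512 header bytes, with the checksum field itself (offsets 148-155) counted as eight spaces, so changing one mode digit from `0` to `1` raises the sum by exactly one. The helper names and the synthetic, otherwise zero-filled headers are assumptions of this example, not part of the original test case.

```sh
#!/usr/bin/env sh
# Sum all 512 bytes of the first tar header, counting the checksum field
# itself (offsets 148-155) as eight ASCII spaces, and print it as octal.
tar_header_sum() {
  dd if="$1" bs=512 count=1 2>/dev/null |
    od -An -v -tu1 |
    awk '{
      for (i = 1; i <= NF; i++) {
        sum += (n >= 148 && n < 156) ? 32 : $i
        n++
      }
    } END { printf "%06o\n", sum }'
}

# Hypothetical helper: write a zero-filled 512-byte block whose only
# populated field is the mode string at offset 100.
make_header() {  # usage: make_header FILE MODESTRING
  dd if=/dev/zero of="$1" bs=512 count=1 2>/dev/null
  printf '%s' "$2" | dd of="$1" bs=1 seek=100 conv=notrunc 2>/dev/null
}

work=$(mktemp -d)
make_header "$work/a.hdr" 0100775   # mode as found in src.tar
make_header "$work/b.hdr" 0000775   # mode as written by GNU tar
sum_a=$(tar_header_sum "$work/a.hdr")
sum_b=$(tar_header_sum "$work/b.hdr")
echo "checksum with S_IFREG:    $sum_a"
echo "checksum without S_IFREG: $sum_b"
rm -rf "$work"
```

Running `tar_header_sum src.tar` on a real archive should print the same six octal digits that are stored at offset 148 of its first header.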
As you can see, the original tar file has one extra bit set at offset 0x65. Bytes 101-108 of the header (1-indexed, as reported by `cmp`) hold the file mode of the entry. So in `src.tar` the mode string (octal) of the hello binary is `0100775`, while in the one repacked by GNU tar it is `0000775`. That extra bit corresponds to `S_IFREG`, which the stat() syscall returns for regular files.

"Possible" root cause and question
GNU tar truncates the first three triplets of the mode string, while Golang tar does not.
GNU states:
POSIX tar.h knows nothing about S_IFREG.
Doesn't it mean Golang tar is not POSIX compliant? Am I missing something?
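The truncation in question amounts to masking the file-type bits out of `st_mode` before writing the header. A minimal sketch with shell arithmetic, assuming the Linux values of `S_IFMT` and `S_IFREG`:

```sh
#!/usr/bin/env sh
# st_mode of the hello binary as stored in src.tar: file type plus permissions.
st_mode=0100775

# Linux values from <sys/stat.h>, stated here as assumptions of this sketch.
S_IFMT=0170000    # mask selecting the file-type bits
S_IFREG=0100000   # file-type bits of a regular file

# Keep only the permission, setuid, setgid and sticky bits.
masked=$(printf '%07o' $(($st_mode & 07777)))
# Isolate the file-type bits.
type_bits=$(printf '%07o' $(($st_mode & $S_IFMT)))

echo "mode field after masking: $masked"
echo "file-type bits:           $type_bits"
if [ "$type_bits" = "$S_IFREG" ]; then
  echo "the extra bit is S_IFREG"
fi
```

The masked value matches the mode string GNU tar wrote into `repacked.tar`, while the unmasked value matches the one found in `src.tar`.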
There are some inconsistencies in the above, like `POSIX` vs `--format=ustar`, etc. Of course, I double- and triple-checked all of them, with the same result.

[UPDATE] After a bit more tracing...
- `Golang` seems to be fine, at least for the `S_IFREG` part: it truncates it right after the fstatat call, and again in the tar package itself.
- `Skopeo` seems to be alright: it pulls `application/vnd.docker.image.rootfs.diff.tar.gzip` from docker.io and silently puts it as `application/vnd.oci.image.layer.v1.tar+gzip`, according to the spec.

So in the end, the docker.io registry somehow stores a non-canonical tarball in the rootfs blob of `library/hello-world`. Where that extra `S_IFREG` bit came from is unknown. Nevertheless, it violates the statement by @vbatts "That unpacking and repacking of the _same_ content is deterministic", which also affects the content-addressable scheme and reproducibility.
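To check whether some other decompressed layer carries the same kind of non-canonical mode, one could read the mode field of its first header directly. This is only a sketch: the function names are hypothetical, only the first entry is inspected, and the input is assumed to be an uncompressed tar file such as `src.tar`.

```sh
#!/usr/bin/env sh
# Print the mode field (7 octal digits at offset 100) of the first tar header.
tar_first_mode() {
  dd if="$1" bs=1 skip=100 count=7 2>/dev/null
}

# Succeed if the mode field carries bits outside the permission mask 07777,
# for example the S_IFREG file-type bit.
has_type_bits() {
  mode=$(tar_first_mode "$1")
  [ $((0$mode & ~07777)) -ne 0 ]
}
```

A possible use is `has_type_bits src.tar && echo "file-type bits present: $(tar_first_mode src.tar)"`.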