Does OCI allow "cross-layer hardlinks"? And how should they be handled?

Hi, OCI. I was writing an OCI image parser and quickly realized that the behavior of hardlinks is seriously underspecified.

First, let's recall that a hardlink is a filesystem entry that points to the same "file" (inode) as another filesystem entry. Modifying the file through one hardlink therefore implicitly changes what every other entry for that inode sees, which effectively provides a means of implicit communication. Treating hardlinks as independent regular files can cause runtime errors if an application relies on that shared-inode behavior. Second, as a reminder, OCI image layers are tar archives (e.g. the POSIX ustar and pax formats), which allow hardlink entries and duplicate paths.
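The shared-inode behavior described above can be checked with a few POSIX calls. This is only a minimal sketch: the function names `write_text` and `hardlink_shares_content` are made up for illustration, and it assumes `/tmp` is writable and sits on a filesystem that supports hard links.

```c
#define _XOPEN_SOURCE 700 /* for mkdtemp() under strict -std=c11 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Write text to path, truncating existing content in place (the inode is kept).
 * Returns 0 on success, -1 on error. */
static int write_text(const char *path, const char *text)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* Create a, hard-link b to it, write through b, then read back through a.
 * Returns 1 if both names share an inode and a shows the new content,
 * 0 if not, -1 on error. */
int hardlink_shares_content(void)
{
    char dir[] = "/tmp/hardlink-demo-XXXXXX"; /* assumed scratch location */
    char a[64], b[64], buf[16] = {0};
    struct stat sa, sb;
    int result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);

    if (write_text(a, "old") == 0 && link(a, b) == 0 &&
        write_text(b, "new") == 0 &&
        stat(a, &sa) == 0 && stat(b, &sb) == 0) {
        int fd = open(a, O_RDONLY);
        if (fd >= 0) {
            ssize_t n = read(fd, buf, sizeof buf - 1);
            close(fd);
            if (n >= 0)
                result = (sa.st_ino == sb.st_ino && strcmp(buf, "new") == 0);
        }
    }

    /* Best-effort cleanup of the scratch directory. */
    unlink(a);
    unlink(b);
    rmdir(dir);
    return result;
}
```

Calling `hardlink_shares_content()` from a `main` should return 1 on an ordinary Linux filesystem. An extractor that stores `b` as an independent copy would break exactly this property.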

The current specification does say something about hardlinks, but not enough to answer the following questions:

1. What if a layer contains an invalid hardlink, for example one pointing to a nonexistent path? Should we consider the image invalid, or just ignore the link?

   tar files can be thought of as a sequence of POSIX file metadata and content. As far as I know, most tar programs process an archive in order: they scan it from head to tail and create each file, directory, hardlink, or symlink according to its entry header, leaving validity checks to the OS filesystem.

For example, the following tar file can be extracted successfully, where ./b is a hard link to ./a:
```
./a
./b => /a
```
However, the following tar file may fail to extract, because ./a has not been created yet when ./b is scanned:
```
./b => /a
./a
```
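The ordering problem can be imitated without a real tar parser by applying a list of entries in order, the way a streaming extractor would. This is a rough sketch, not how any particular tar implementation works: the `entry` struct, `extract_in_order`, `try_order`, and the two sample arrays are hypothetical names, and link targets are assumed to be relative to the extraction root.

```c
#define _XOPEN_SOURCE 700 /* for mkdtemp() under strict -std=c11 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

enum entry_type { ENTRY_FILE, ENTRY_HARDLINK };

struct entry {
    enum entry_type type;
    const char *name;     /* path relative to the extraction root */
    const char *linkname; /* hard link target, NULL for regular files */
};

/* The two orderings from the examples above. */
const struct entry GOOD_ORDER[] = {
    { ENTRY_FILE, "a", NULL }, { ENTRY_HARDLINK, "b", "a" } };
const struct entry BAD_ORDER[] = {
    { ENTRY_HARDLINK, "b", "a" }, { ENTRY_FILE, "a", NULL } };

/* Apply entries one by one and let the filesystem validate them.
 * Returns how many entries were applied before the first failure. */
int extract_in_order(const char *root, const struct entry *entries, int count)
{
    char path[256], target[256];

    for (int i = 0; i < count; i++) {
        snprintf(path, sizeof path, "%s/%s", root, entries[i].name);
        if (entries[i].type == ENTRY_FILE) {
            int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
            if (fd < 0)
                return i;
            close(fd);
        } else {
            snprintf(target, sizeof target, "%s/%s", root, entries[i].linkname);
            if (link(target, path) != 0)
                return i; /* typically ENOENT: target not extracted yet */
        }
    }
    return count;
}

/* Extract into a fresh scratch directory, then clean up.
 * Returns the value of extract_in_order(), or -1 if setup fails. */
int try_order(const struct entry *entries, int count)
{
    char root[] = "/tmp/tar-order-XXXXXX"; /* assumed scratch location */
    char path[256];

    if (mkdtemp(root) == NULL)
        return -1;
    int applied = extract_in_order(root, entries, count);
    for (int i = 0; i < count; i++) {
        snprintf(path, sizeof path, "%s/%s", root, entries[i].name);
        unlink(path); /* best effort; the entry may not exist */
    }
    rmdir(root);
    return applied;
}
```

With these definitions, `try_order(GOOD_ORDER, 2)` should apply both entries, while `try_order(BAD_ORDER, 2)` should stop at the first one. A cross-layer hardlink is the same failure unless the earlier layers are already on disk.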
2. When creating the filesystem bundle, what should we do if a subsequent layer contains an entry for a path that is a hardlink in a previous layer (sharing an inode with other filesystem entries)? Should we unlink that entry from the previous inode and create a new inode with the data from the new layer, or update the existing inode with the data (so that all hardlinked entries are affected)?
3. When building an OCI image, how should it be recorded in the image if the user creates hardlinks to files from a previous layer? In that case the layer may be an invalid tar file on its own, yet extract successfully as long as the previous layers are extracted first, in order.

(I believe there are more problems related to the tar format. Comments are welcome.)

There is an existing issue about hardlinks and symlinks: https://github.com/opencontainers/image-spec/issues/857 . But I believe it does not cover all the problems I listed above.

For question 3, I ran an experiment with Docker. I wrote a simple statically linked C program that can create a copy, a hardlink, or a symlink, and print the device and inode numbers of given paths:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy the contents of source into a new file at target (a new inode). */
int
copy(const char* source, const char* target)
{
  FILE* sf = fopen(source, "rb");
  if (sf == NULL) {
    perror("Error opening source file");
    return EXIT_FAILURE;
  }

  FILE* tf = fopen(target, "wb");
  if (tf == NULL) {
    perror("Error opening target file");
    fclose(sf);
    return EXIT_FAILURE;
  }

  char buffer[4096];
  size_t bytesRead;
  while ((bytesRead = fread(buffer, 1, sizeof(buffer), sf)) > 0) {
    if (fwrite(buffer, 1, bytesRead, tf) != bytesRead) {
      perror("Error writing to target file");
      fclose(sf);
      fclose(tf);
      return EXIT_FAILURE;
    }
  }

  fclose(sf);
  fclose(tf);
  return EXIT_SUCCESS;
}

int
main(int argc, char* argv[])
{
  /* Modes c, h and s take a source and a target; mode p takes one or more
   * paths. Checking argc here avoids reading past the end of argv. */
  if (argc < 3 || (argv[1][0] != 'p' && argc != 4)) {
    fprintf(stderr,
            "Usage: %s <c|h|s> <source_file> <destination_file>\n"
            "       %s p <path>...\n",
            argv[0],
            argv[0]);
    return EXIT_FAILURE;
  }

  const char* mode = argv[1];

  if (mode[0] == 'c') {
    /* Independent copy: new inode with the same content. */
    if (copy(argv[2], argv[3]) == 0) {
      printf("File copied successfully: %s -> %s\n", argv[2], argv[3]);
    } else {
      return EXIT_FAILURE;
    }
  }

  else if (mode[0] == 'h') {
    /* Hard link: a second name for the same inode. */
    if (link(argv[2], argv[3]) == 0) {
      printf("Hard link created successfully: %s -> %s\n", argv[3], argv[2]);
    } else {
      perror("Error creating hard link");
      return EXIT_FAILURE;
    }
  }

  else if (mode[0] == 's') {
    /* Symbolic link: a separate inode that stores the target path. */
    if (symlink(argv[2], argv[3]) == 0) {
      printf("Symbolic link created successfully: %s -> %s\n", argv[3], argv[2]);
    } else {
      perror("Error creating symbolic link");
      return EXIT_FAILURE;
    }
  }

  else if (mode[0] == 'p') {
    /* Print "<dev>-<ino>" twice per path: first following symlinks (stat),
     * then for the entry itself (lstat). */
    for (int i = 2; i < argc; ++i) {
      printf("%s: ", argv[i]);

      struct stat st;
      if (stat(argv[i], &st) == 0) {
        printf("%lu-%lu ", (unsigned long)st.st_dev, (unsigned long)st.st_ino);
      } else {
        perror(argv[i]);
      }

      struct stat lst;
      if (lstat(argv[i], &lst) == 0) {
        printf("%lu-%lu ", (unsigned long)lst.st_dev, (unsigned long)lst.st_ino);
      } else {
        perror(argv[i]);
      }

      printf("\n");
    }
  }

  else {
    fprintf(stderr, "Invalid mode: %s\n", mode);
    return EXIT_FAILURE;
  }

  return EXIT_SUCCESS;
}
```

Then I built an image from scratch with the compiled C program:

```dockerfile
FROM scratch
COPY a.out /a.out
RUN ["/a.out", "c", "/a.out", "/a"]
RUN ["/a.out", "h", "/a.out", "/b"]
RUN ["/a.out", "s", "/a.out", "/c"]
CMD ["/a.out", "p", "/a.out", "/a", "/b", "/c"]
```

When I run the image on the same machine that built it, here's the output:

```
/a.out: 97-19830717 97-19830717
/a: 97-19830716 97-19830716
/b: 97-19830717 97-19830717
/c: 97-19830717 97-19830718
```

We can see that /b shares its inode with /a.out, i.e. it is a hardlink, as expected.

However, when I use docker save to dump the image into a .tar.gz file, I find that the /b entry in layer 3 has type 0, which means it is stored as a regular file rather than as a hardlink. To confirm my suspicion, I copied the .tar.gz file to another machine with Docker, and the result is:

```
/a.out: 120-962490983 120-962490983
/a: 120-962490987 120-962490987
/b: 120-962490991 120-962490991
/c: 120-962490983 120-962782553
```

This means /b is now a regular file, which is not expected. Or is it? Either way, this example suggests that even Docker does not handle this situation consistently.
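For anyone who wants to repeat the type check on a layer tarball: in the POSIX ustar header, each entry starts with a 512-byte block whose typeflag byte sits at offset 156. The helper below is a small sketch that classifies one such header block; the name `tar_entry_kind` is hypothetical, and reading and decompressing the layer is left out.

```c
/* Layout constants from the POSIX ustar header format. */
enum { TAR_BLOCK_SIZE = 512, TAR_TYPEFLAG_OFFSET = 156 };

/* Classify a tar entry from its 512-byte header block.
 * header must point to at least TAR_BLOCK_SIZE bytes. */
const char *tar_entry_kind(const unsigned char *header)
{
    switch (header[TAR_TYPEFLAG_OFFSET]) {
    case '0':
    case '\0': /* NUL is the historical encoding of a regular file */
        return "regular file";
    case '1':
        return "hard link";
    case '2':
        return "symbolic link";
    case '5':
        return "directory";
    default:
        return "other";
    }
}
```

For the image above, the header of /b in layer 3 would be reported as "regular file", where a preserved link would show up as "hard link".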

@vsoch @cyphar @reidpr

For anyone curious, here's the OCI image (linux/amd64) that I got from docker save:

cross-layer-link.oci.tar.gz

Personally, I think the following two rules should be stated in the specification:

1. The tar entries in a layer are extracted in order, as with any tar archive, except that .wh. whiteout files are applied first. Extraction starts from an empty root directory /, and if any error occurs during extraction, the image is considered invalid.
2. If a subsequent layer overrides paths that are hardlinks created in previous layers, only the files at those paths are affected: they are recreated from the tar entry data in the subsequent layer instead of updating the existing inode.
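The difference between the two override strategies can be shown with plain POSIX calls. This is an illustrative sketch only: `put`, `replace_entry`, and `sibling_keeps_old_content` are made-up names, and it assumes `/tmp` is writable and supports hard links.

```c
#define _XOPEN_SOURCE 700 /* for mkdtemp() under strict -std=c11 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Open path for writing with the given extra flags and write text.
 * Returns 0 on success, -1 on error. */
static int put(const char *path, const char *text, int flags)
{
    int fd = open(path, O_WRONLY | flags, 0644);
    if (fd < 0)
        return -1;
    size_t len = strlen(text);
    ssize_t n = write(fd, text, len);
    close(fd);
    return n == (ssize_t)len ? 0 : -1;
}

/* Override an existing path with new data.
 * in_place != 0: truncate and rewrite the existing inode (all links change).
 * in_place == 0: unlink the name, then create a fresh inode (proposed rule). */
int replace_entry(const char *path, const char *text, int in_place)
{
    if (in_place)
        return put(path, text, O_TRUNC);
    if (unlink(path) != 0)
        return -1;
    return put(path, text, O_CREAT | O_EXCL);
}

/* "Lower layer": a with hard link b. "Upper layer": override b.
 * Returns 1 if a still holds the old content, 0 if it changed, -1 on error. */
int sibling_keeps_old_content(int in_place)
{
    char dir[] = "/tmp/override-demo-XXXXXX"; /* assumed scratch location */
    char a[64], b[64], buf[16] = {0};
    int result = -1;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);

    if (put(a, "old", O_CREAT | O_EXCL) == 0 && link(a, b) == 0 &&
        replace_entry(b, "new", in_place) == 0) {
        int fd = open(a, O_RDONLY);
        if (fd >= 0) {
            if (read(fd, buf, sizeof buf - 1) >= 0)
                result = strcmp(buf, "old") == 0;
            close(fd);
        }
    }

    unlink(a);
    unlink(b);
    rmdir(dir);
    return result;
}
```

Under the second rule above, extraction would behave like `replace_entry(path, data, 0)`: `sibling_keeps_old_content(0)` should report that the other name is untouched, whereas the in-place variant changes it too.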

I can't think of a good solution for this one because a layered filesystem creates multiple issues with hard links. First, we can't assume layer order from within the context of a single layer. Each layer may be reused in multiple images, with different preceding layers. That means extracting one image could create a hard link to a file in a preceding layer, and that file might not match the one present when another image is extracted, potentially allowing data leakage across images.

The other challenge I have with creating a solution is that hard links are to an inode, but one file that is modified across layers could have multiple inodes, and I don't think there's a way to know with the layered filesystem that's performing a copy-on-write whether that write would have changed the original inode or not. That means a file created in one layer, and either modified in the running container or modified in a later layer, could lose the associated connection with a hard link.

My initial inclination is to say that hard links only have limited support within a single layer, and once a hard link attempts to go beyond a layer boundary, either by creating that hard link in a later layer, or by performing a copy-on-write of any file included in a hard link, it may be converted to a regular file and no longer be associated with the other file.

I ran a small test using overlay2 and it looks like the hard link source is pulled up to the new layer, but not other files also linked to that file:

```
$ cat df.hard-link
FROM alpine

RUN echo hello >foo.txt \
 && ln foo.txt foo-same-layer.txt
RUN ln foo.txt foo-new-layer.txt

$ docker run -it --rm test-hard-link
/ # ls -li foo*.txt
39231816 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo-new-layer.txt
39231807 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo-same-layer.txt
39231816 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo.txt
/ # ln foo.txt foo-container.txt
/ # ls -li foo*.txt
39232186 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo-container.txt
39231816 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo-new-layer.txt
39231807 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo-same-layer.txt
39232186 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo.txt
/ # echo world >foo.txt
/ # ls -li foo*.txt
39232186 -rw-r--r--    2 root     root             6 Oct 24 20:32 foo-container.txt
39231816 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo-new-layer.txt
39231807 -rw-r--r--    2 root     root             6 Oct 24 20:24 foo-same-layer.txt
39232186 -rw-r--r--    2 root     root             6 Oct 24 20:32 foo.txt
```

This aligns with the kernel documentation on overlay filesystems without the index configured:

> Enabled with the mount option or module option “index=on” or with the kernel config option CONFIG_OVERLAY_FS_INDEX=y.
>
> If this feature is disabled and a file with multiple hard links is copied up, then this will “break” the link. Changes will not be propagated to other names referring to the same inode.

https://docs.kernel.org/filesystems/overlayfs.html#index

> 2. When creating the filesystem bundle, what should we do if a subsequent layer contains an entry for a path that is a hardlink in a previous layer (sharing an inode with other filesystem entries)? Should we unlink that entry from the previous inode and create a new inode with the data from the new layer, or update the existing inode with the data (so that all hardlinked entries are affected)?

I think most implementations delegate this to the underlying union filesystem implementation. However, a link should not go outside of the layer (aka changeset). If it does, the cross-layer link is not recorded: the link is broken and the file is stored as an individual file. Here's the relevant spec text that covers that scenario:

> The corresponding files that share the link with the > 1 linkcount may be outside the directory that the changeset is being produced from, in which case the linkname is not recorded in the changeset.
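One way to read that rule is that a layer writer only records a linkname when another name for the same inode has already been written to the same changeset. The sketch below illustrates that reading and is not taken from any real implementation: `struct changeset`, `linkname_for`, and the fixed `MAX_SEEN` cap are all hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>

#define MAX_SEEN 64 /* arbitrary cap for this sketch */

/* One multiply-linked inode already written to the current changeset. */
struct seen_inode {
    dev_t dev;
    ino_t ino;
    const char *path;
};

struct changeset {
    struct seen_inode seen[MAX_SEEN];
    size_t count;
};

/* Decide how path is stored in the changeset.
 * Returns the earlier path to record as linkname (tar typeflag '1'),
 * or NULL when the entry has to be stored as a regular file because no
 * other name for its inode is part of this changeset (yet). */
const char *linkname_for(struct changeset *cs, const char *path,
                         dev_t dev, ino_t ino, nlink_t nlink)
{
    if (nlink <= 1)
        return NULL; /* not hard-linked at all */

    for (size_t i = 0; i < cs->count; i++) {
        if (cs->seen[i].dev == dev && cs->seen[i].ino == ino)
            return cs->seen[i].path;
    }

    /* First name for this inode in the changeset: remember it so that
     * later names in the same changeset can link to it. */
    if (cs->count < MAX_SEEN) {
        cs->seen[cs->count].dev = dev;
        cs->seen[cs->count].ino = ino;
        cs->seen[cs->count].path = path;
        cs->count++;
    }
    return NULL;
}
```

With a fresh `struct changeset` per layer, a link created in the same layer as its target gets a linkname, while a link whose other name lives only in an earlier layer is stored as a regular file, which is consistent with the type 0 entry seen in the docker save output earlier in the thread.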

opencontainers/image-spec #1204