opencontainers / image-spec

OCI Image Format
https://www.opencontainers.org/
Apache License 2.0

Does OCI allow "cross-layer hardlinks"? And how should they be handled? #1204

Open yhgu2000 opened 1 week ago

yhgu2000 commented 1 week ago

Hi, OCI. I was writing an OCI image parser and quickly realized that the handling of hardlinks has some serious undefined behavior.

First, recall that a hardlink is a filesystem entry that points to the same file (inode) as another filesystem entry. Modifying the file through one hardlink therefore implicitly changes what every other entry for that inode sees, which effectively provides a means of implicit communication. Treating hardlinks as independent regular files can cause runtime errors if an application relies on that shared-inode behavior. Second, as a reminder, OCI image layers are tar archives (for example in the POSIX ustar or pax format), and tar allows both hardlinks and duplicate paths.

The current specification does say something about hardlinks, but not enough to answer the following questions:

  1. What if a layer contains an invalid hardlink, for example one pointing to a non-existent path? Should we consider the image invalid, or just ignore the entry?

    A tar file can be thought of simply as an array of POSIX file metadata and content. As far as I know, most tar programs process an archive in order. That is, they scan it from head to tail and create each file, directory, hardlink, or symlink according to its entry header, leaving the validity check to the OS filesystem.

    For example, the following tar file can be extracted successfully, where ./b is a hardlink to ./a:

    ./a
    ./b => /a

    However, the following tar file may fail to extract, because ./a has not been created yet when ./b is scanned:

    ./b => /a
    ./a
  2. When creating the filesystem bundle, what should we do if a subsequent layer has an entry for a path that is a hardlink in a previous layer (sharing an inode with other filesystem entries)? Should we unlink the filesystem entry from the previous inode and create a new inode with the data from the new layer, or update the existing inode with the data (so that all hardlinked filesystem entries are affected)?

  3. When building an OCI image, how should it be recorded in the image if the user creates hardlinks to files from a previous layer? In that case the layer on its own may be an invalid tar archive, yet it extracts successfully provided the previous layers have been extracted in order first.

  4. (I believe there are more problems with regard to the tar format. Comments are welcome.)
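
The in-order behavior described in question 1 can be sketched as a small check over the entries of a single layer. This is only an illustration: `struct entry`, `enum entry_type`, and `first_dangling_hardlink` are hypothetical names, the entries are assumed to be already parsed into memory, and paths are compared as literal strings without normalization or any lookup in lower layers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one tar entry (not a type from any library). */
enum entry_type { ENTRY_FILE, ENTRY_DIR, ENTRY_HARDLINK, ENTRY_SYMLINK };

struct entry {
  enum entry_type type;
  const char* path;     /* path of this entry inside the layer */
  const char* linkname; /* link target; NULL unless the entry is a link */
};

/*
 * Returns the index of the first hardlink whose target was not created by an
 * earlier entry of the same layer, or -1 if every hardlink resolves.
 * Assumption: names are compared literally; a real extractor would normalize
 * them and would also have to decide whether lower layers count (question 3).
 */
int
first_dangling_hardlink(const struct entry* entries, size_t count)
{
  for (size_t i = 0; i < count; ++i) {
    if (entries[i].type != ENTRY_HARDLINK)
      continue;

    int found = 0;
    for (size_t j = 0; j < i && !found; ++j) {
      if (strcmp(entries[j].path, entries[i].linkname) == 0)
        found = 1;
    }
    if (!found)
      return (int)i;
  }
  return -1;
}
```

With the names written in the same form, the first listing in question 1 (./a before ./b) yields -1, and the second listing (./b before ./a) yields 0, the index of the hardlink that an in-order extractor would fail on.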

There is an existing issue about hardlinks and symlinks: https://github.com/opencontainers/image-spec/issues/857 . But I believe it does not cover all the problems I list above.


For question 3, I did an experiment with Docker. I wrote a simple statically linked C program that can create a copy, a hardlink, or a symlink, and can print the device and inode numbers of a list of paths:

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int
copy(const char* source, const char* target)
{
  FILE* sf = fopen(source, "r");
  if (sf == NULL) {
    perror("Error opening source file");
    return EXIT_FAILURE;
  }

  FILE* tf = fopen(target, "w");
  if (tf == NULL) {
    perror("Error opening target file");
    fclose(sf);
    return EXIT_FAILURE;
  }

  char buffer[4096];
  size_t bytesRead;
  while ((bytesRead = fread(buffer, 1, sizeof(buffer), sf)) > 0) {
    if (fwrite(buffer, 1, bytesRead, tf) != bytesRead) {
      perror("Error writing to target file");
      fclose(sf);
      fclose(tf);
      return EXIT_FAILURE;
    }
  }

  fclose(sf);
  fclose(tf);
  return EXIT_SUCCESS;
}

int
main(int argc, char* argv[])
{
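  /* Mode 'p' takes one or more paths; every other mode takes exactly two. */
  if (argc < 3 || (argv[1][0] != 'p' && argc != 4)) {
    fprintf(stderr,
            "Usage: %s <c|h|s> <source_file> <destination_file>\n"
            "       %s p <path>...\n",
            argv[0],
            argv[0]);
    return EXIT_FAILURE;
  }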

  const char* mode = argv[1];
  const char* source = argv[2];
  const char* target = argv[3];

  if (mode[0] == 'c') {
    if (copy(source, target) == 0) {
      printf("File copied successfully: %s -> %s\n", source, target);
    } else {
      return EXIT_FAILURE;
    }
  }

  else if (mode[0] == 'h') {
    if (link(source, target) == 0) {
      printf("Hard link created successfully: %s -> %s\n", target, source);
    } else {
      perror("Error creating hard link");
      return EXIT_FAILURE;
    }
  }

  else if (mode[0] == 's') {
    if (symlink(source, target) == 0) {
      printf("Symbolic link created successfully: %s -> %s\n", target, source);
    } else {
      perror("Error creating symbolic link");
      return EXIT_FAILURE;
    }
  }

  else if (mode[0] == 'p') {
    for (int i = 2; i < argc; ++i) {
      printf("%s: ", argv[i]);

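      /* stat() follows symlinks: identity of whatever the path resolves to. */
      struct stat st;
      if (stat(argv[i], &st) == 0) {
        printf(
          "%lu-%lu ", (unsigned long)st.st_dev, (unsigned long)st.st_ino);
      } else {
        perror(argv[i]);
      }

      /* lstat() does not follow symlinks: identity of the entry itself. */
      struct stat lst;
      if (lstat(argv[i], &lst) == 0) {
        printf(
          "%lu-%lu ", (unsigned long)lst.st_dev, (unsigned long)lst.st_ino);
      } else {
        perror(argv[i]);
      }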

      printf("\n");
    }
  }

  else {
    fprintf(stderr, "Invalid mode: %s\n", mode);
    return EXIT_FAILURE;
  }

  return EXIT_SUCCESS;
}

Then I built an image from scratch with the compiled C program:

FROM scratch
COPY a.out /a.out
RUN ["/a.out", "c", "/a.out", "/a"]
RUN ["/a.out", "h", "/a.out", "/b"]
RUN ["/a.out", "s", "/a.out", "/c"]
CMD ["/a.out", "p", "/a.out", "/a", "/b", "/c"]

When I ran the image on the same machine that built it, this was the output (device-inode from stat, then from lstat):

/a.out: 97-19830717 97-19830717 
/a: 97-19830716 97-19830716 
/b: 97-19830717 97-19830717 
/c: 97-19830717 97-19830718 

We can see that /b reports the same device and inode numbers as /a.out, so it is a hardlink to /a.out, as expected.

However, when I used docker save to dump the image into a .tar.gz file, I found that the /b entry in layer 3 has type 0, which means it is stored as a regular file rather than as a hardlink. To confirm my suspicion, I copied the .tar.gz file to another machine with Docker, and the result was:

/a.out: 120-962490983 120-962490983 
/a: 120-962490987 120-962490987 
/b: 120-962490991 120-962490991 
/c: 120-962490983 120-962782553 

This means /b is now an independent regular file rather than a hardlink to /a.out. Is that expected or not? Either way, this example suggests that even Docker does not handle this situation consistently.
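
For anyone who wants to repeat the "type 0" check on a layer, here is a minimal sketch that classifies a single 512-byte tar header block. It relies only on the ustar header layout (typeflag at byte offset 156, followed by the 100-byte linkname field). The function names are hypothetical, and long names carried in pax extended headers or GNU extension entries are not handled.

```c
#include <string.h>

#define TAR_BLOCK_SIZE 512
#define TAR_TYPEFLAG_OFFSET 156 /* ustar header: 1-byte typeflag */
#define TAR_LINKNAME_OFFSET 157 /* ustar header: 100-byte linkname */
#define TAR_LINKNAME_SIZE 100

/* Returns a short description of the entry kind stored in a header block. */
const char*
tar_entry_kind(const unsigned char header[TAR_BLOCK_SIZE])
{
  switch (header[TAR_TYPEFLAG_OFFSET]) {
    case '0':
    case '\0': /* older archives use NUL for regular files */
      return "regular file";
    case '1':
      return "hardlink";
    case '2':
      return "symlink";
    case '5':
      return "directory";
    default:
      return "other";
  }
}

/* Copies the link target into out, which must hold at least 101 bytes. */
void
tar_entry_linkname(const unsigned char header[TAR_BLOCK_SIZE], char* out)
{
  memcpy(out, header + TAR_LINKNAME_OFFSET, TAR_LINKNAME_SIZE);
  out[TAR_LINKNAME_SIZE] = '\0'; /* the field is not always NUL-terminated */
}
```

A hardlink that survived the build would show up as "hardlink" with /a.out (or a.out) as its linkname, while the /b entry described above reports "regular file" and carries its own copy of the data.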

yhgu2000 commented 1 week ago

@vsoch @cyphar @reidpr

yhgu2000 commented 1 week ago

For anyone curious, here's the OCI image (linux/amd64) that I got from docker save:

cross-layer-link.oci.tar.gz

yhgu2000 commented 1 week ago

Personally, I think the specification should state the following two rules:

  1. The tar entries in a layer are extracted in order, as with any normal tar archive, except that .wh. whiteout files are applied first. Extraction starts from an empty root directory /, and if any error occurs during extraction, the image is considered invalid.

  2. If a subsequent layer overrides paths that are hardlinks created in previous layers, only the entries at those paths are affected. They are recreated from the tar entry data in the subsequent layer, instead of the existing inode being updated.
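
Rule 2 can be illustrated with plain POSIX calls. This is a sketch of one possible way to apply a regular-file entry from an upper layer, not a description of what any existing runtime does. The names `same_inode` and `replace_entry` are hypothetical, and ownership, timestamps, and extended attributes are ignored.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if both paths name the same inode on the same device, 0 if not,
 * and -1 if either path cannot be examined. Symlinks are not followed. */
int
same_inode(const char* a, const char* b)
{
  struct stat sa, sb;
  if (lstat(a, &sa) != 0 || lstat(b, &sb) != 0)
    return -1;
  return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}

/* Replaces the entry at path instead of writing through it, so any other
 * hardlinks to the old inode keep the old content. Returns 0 on success. */
int
replace_entry(const char* path, const void* data, size_t size, mode_t mode)
{
  /* Drop the old name; ENOENT only means the path is new in this layer. */
  if (unlink(path) != 0 && errno != ENOENT)
    return -1;

  /* O_EXCL ensures a fresh inode is created at this path. */
  int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, mode);
  if (fd < 0)
    return -1;

  const char* p = data;
  while (size > 0) {
    ssize_t n = write(fd, p, size);
    if (n < 0) {
      if (errno == EINTR)
        continue; /* interrupted before writing anything: retry */
      close(fd);
      return -1;
    }
    p += n;
    size -= (size_t)n;
  }
  return close(fd);
}
```

If /b is a hardlink to /a.out and an upper layer carries a new /b, calling `replace_entry` on /b leaves /a.out untouched, and `same_inode` on the two paths then returns 0. Opening the existing /b with O_TRUNC instead would rewrite the shared inode and change /a.out as well, which is the alternative that rule 2 excludes.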