uktrade / stream-zip

Python function to construct a ZIP archive on the fly
https://stream-zip.docs.trade.gov.uk/
MIT License
110 stars 9 forks source link

Creating nested path hierarchies and empty directories inside the ZIP archive? #55

Closed Arcitec closed 1 year ago

Arcitec commented 1 year ago

Hey, small question.

From what I've observed from the ZIP spec, it seems like the "directory hierarchy" of ZIP files are just an illusion, and that it's really just POSIX-style names (forward slash separator), where each file has its own full path embedded straight into the file "name" data (no "actual nested folders"), such as name = foo/bar/baz/readme.txt. I've also noticed that empty directories are stored as "files" which are named things like foo/bar/baz/ with a trailing slash at the end, and with a DATA LENGTH of ZERO. The trailing slash seems to be the only signifier that a "file" is actually a directory?

For instance, when I created a ZIP with the standard utilities on Linux, I observed that it has an entry for the "parent folder" as a standalone ZIP entry, before it has an entry for the actual file inside that folder.

The hierarchy in this example ZIP is:

Here's my test ZIP:

backslashes.zip

So with that backstory about how directories seem to work in ZIPs, I'm very curious about the correct ways to handle directories in stream-zip:

  1. If I want to create a deeply nested directory, is there anything I need to think about and be careful about? Or can I just immediately do name = foo/bar/baz/readme.txt without having to manually create the parent path components (foo/, foo/bar/ and foo/bar/baz/)?
  2. Is it safe to do interleaved writes, like creating foo/a/1.txt, then bar/b/1.txt, then foo/a/2.txt? Meaning is it safe to add directories in a jumbled order to the ZIP file contents? I assume that stream-zip remembers which directories it has already written to the ZIP and doesn't create those paths again.
  3. And lastly... is there any way to add empty directories? Sometimes there are empty ones in the data I am packing, and I'd prefer to preserve that empty hierarchy if it's possible. Using some kind of "if directory doesn't exist in ZIP yet, create an empty one in the ZIP"? I thought of possibly creating a file named the/directory/ with a trailing slash and providing b"" (empty bytes) as the data?
  4. If the idea in point 3 would work, would there be any risk of clashing with or overwriting any actual files/subfolders from that hierachy if they already existed? Or any other failures, such as it generating new directory entries for the same "directory file/" over and over again? (i.e. if a/b/ was already auto-generated and added earlier, and I then manually try to write a/b/, I guess I'd end up with a duplicate directory entry)?
  5. If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories? It could be as simple as "if the user provides a tuple which ONLY contains a name and timestamp and nothing else, add that as a directory entry to the ZIP, if it hasn't been added before".
Arcitec commented 1 year ago

Oh, I just had a look inside a ZIP createad with stream-zip and it doesn't create empty entries for the other directories in the hierarchy. Very interesting. The entries went straight to the deeply nested file:

somefile.dat
deeply/nested/file.dat

Whereas if I had created the same ZIP on Linux with regular tools, I would have had these entries in the ZIP:

somefile.dat
deeply/
deeply/nested/
deeply/nested/file.dat

But I suppose that the ZIP spec is flexible enough that it doesn't need the intermediary "this is a parent folder on the road to glory" stuff.

So if this is the case, I suppose that if I want to add empty folders, it's as easy as something like this:

  1. Do a pass through all files only, and add them to the archive.
  2. Do a pass through all leafs (deepest path nodes) regardless of type, and then check if is_dir() and len(children) == 0 to find all empty folders that weren't added yet (due to not containing any files), and then manually insert those empty directories all at the end of the archive via simple foo/bar/thisemptyfolder/ entries at the end, with 0-byte b"" data.

Would something like that solve all of these issues?

michalc commented 1 year ago

So if this is the case, I suppose that if I want to add empty folders, it's as easy as something like this:

Do a pass through all files only, and add them to the archive. Do a pass through all leafs (deepest path nodes) regardless of type, and then check if is_dir() and len(children) == 0 to find all empty folders that weren't added yet (due to not containing any files), and then manually insert those empty directories all at the end of the archive via simple foo/bar/thisemptyfolder/ entries at the end, with 0-byte b"" data.

Would something like that solve all of these issues?

Sounds reasonable to me. I guess you don't absolutely have to have two passes - you can in one pass output files, while maintaining a list of folders that don't have any children. Or just output all folders like it sounds like other tools do. But that's a detail.

I have to admit I haven't tested zipping folders much, especially empty folders - it was never really a priority. So I suspect would be good to check the output using a few different unzip-ers, just in case.

michalc commented 1 year ago

If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories?

On this - I think ending the name in a forward slash should create a directory: https://github.com/uktrade/stream-zip/commit/66c500b71127121e0f795e44d28ce21e08b16361

Arcitec commented 1 year ago

Hi Michal! I've been digging deeply into this, so before answering, I'll just separately dump the information I've dug up here, for reference since it could be useful (you mentioned not looking much into directory storage in ZIPs)! :)

ZIP Specification

   4.3.8  File data

      Immediately following the local header for a file
      SHOULD be placed the compressed or stored data for the file.
      If the file is encrypted, the encryption header for the file 
      SHOULD be placed after the local header and before the file 
      data. The series of [local file header][encryption header]
      [file data][data descriptor] repeats for each file in the 
      .ZIP archive. 

      Zero-byte files, directories, and other file types that 
      contain no content MUST NOT include file data.
   4.4.17 file name: (Variable)

       4.4.17.1 The name of the file, with optional relative path.
       The path stored MUST NOT contain a drive or
       device letter, or a leading slash.  All slashes
       MUST be forward slashes '/' as opposed to
       backwards slashes '\' for compatibility with Amiga
       and UNIX file systems etc.  If input came from standard
       input, there is no file name field.  
 4.4.3 version needed to extract (2 bytes)

        4.4.3.1 The minimum supported ZIP specification version needed 
        to extract the file, mapped as above.  This value is based on 
        the specific format features a ZIP program MUST support to 
        be able to extract the file.  If multiple features are
        applied to a file, the minimum version MUST be set to the 
        feature having the highest value. New features or feature 
        changes affecting the published format specification will be 
        implemented using higher version numbers than the last 
        published value to avoid conflict.

        4.4.3.2 Current minimum feature versions are as defined below:

         1.0 - Default value
         1.1 - File is a volume label
         2.0 - File is a folder (directory)
         2.0 - File is compressed using Deflate compression
         2.0 - File is encrypted using traditional PKWARE encryption
         2.1 - File is compressed using Deflate64(tm)
         2.5 - File is compressed using PKWARE DCL Implode 
         2.7 - File is a patch data set 
         4.5 - File uses ZIP64 format extensions
         4.6 - File is compressed using BZIP2 compression*
         5.0 - File is encrypted using DES
         5.0 - File is encrypted using 3DES
         5.0 - File is encrypted using original RC2 encryption
         5.0 - File is encrypted using RC4 encryption
         5.1 - File is encrypted using AES encryption
         5.1 - File is encrypted using corrected RC2 encryption**
         5.2 - File is encrypted using corrected RC2-64 encryption**
         6.1 - File is encrypted using non-OAEP key wrapping***
         6.2 - Central directory encryption
         6.3 - File is compressed using LZMA
         6.3 - File is compressed using PPMd+
         6.3 - File is encrypted using Blowfish
         6.3 - File is encrypted using Twofish
   4.4.15 external file attributes: (4 bytes)

       The mapping of the external attributes is
       host-system dependent (see 'version made by').  For
       MS-DOS, the low order byte is the MS-DOS directory
       attribute byte.  If input came from standard input, this
       field is set to zero.

This document taught us a few things:

image

External File Attributes

   4.4.15 external file attributes: (4 bytes)

       The mapping of the external attributes is
       host-system dependent (see 'version made by').  For
       MS-DOS, the low order byte is the MS-DOS directory
       attribute byte.  If input came from standard input, this
       field is set to zero.

How to interpret the "external file attributes" depends on the OS that made the ZIP, which is controlled by this field:

   4.4.2 version made by (2 bytes)

        4.4.2.1 The upper byte indicates the compatibility of the file
        attribute information.  If the external file attributes 
        are compatible with MS-DOS and can be read by PKZIP for 
        DOS version 2.04g then this value will be zero.  If these 
        attributes are not compatible, then this value will 
        identify the host system on which the attributes are 
        compatible.  Software can use this information to determine
        the line record format for text files etc.  

        4.4.2.2 The current mappings are:

         0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
         1 - Amiga                     2 - OpenVMS
         3 - UNIX                      4 - VM/CMS
         5 - Atari ST                  6 - OS/2 H.P.F.S.
         7 - Macintosh                 8 - Z-System
         9 - CP/M                     10 - Windows NTFS
        11 - MVS (OS/390 - Z/OS)      12 - VSE
        13 - Acorn Risc               14 - VFAT
        15 - alternate MVS            16 - BeOS
        17 - Tandem                   18 - OS/400
        19 - OS X (Darwin)            20 thru 255 - unused

        4.4.2.3 The lower byte indicates the ZIP specification version 
        (the version of this document) supported by the software 
        used to encode the file.  The value/10 indicates the major 
        version number, and the value mod 10 is the minor version 
        number.  

Alright, so how do we identify directories?

We simply have to look at what 7-zip does:

// Just checks if a string ends in a "/".
bool HasTailSlash(const AString &name, UINT
  #if defined(_WIN32) && !defined(UNDER_CE)
    codePage
  #endif
  )
{
  if (name.IsEmpty())
    return false;
  char c;
    #if defined(_WIN32) && !defined(UNDER_CE)
    if (codePage != CP_UTF8)
      c = *CharPrevExA((WORD)codePage, name, name.Ptr(name.Len()), 0);
    else
    #endif
    {
      c = name.Back();
    }
  return (c == '/');
}

bool CItem::IsDir() const
{
  // FIXME: we can check InfoZip UTF-8 name at first.
  if (NItemName::HasTailSlash(Name, GetCodePage()))
    return true;

  Byte hostOS = GetHostOS();

  if (Size == 0 && PackSize == 0 && !Name.IsEmpty() && Name.Back() == '\\')
  {
    // do we need to use CharPrevExA?
    // .NET Framework 4.5 : System.IO.Compression::CreateFromDirectory() probably writes backslashes to headers?
    // so we support that case
    switch (hostOS)
    {
      case NHostOS::kFAT:
      case NHostOS::kNTFS:
      case NHostOS::kHPFS:
      case NHostOS::kVFAT:
        return true;
    }
  }

  if (!FromCentral)
    return false;

  UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);

  switch (hostOS)
  {
    case NHostOS::kAMIGA:
      switch (highAttrib & NAmigaAttrib::kIFMT)
      {
        case NAmigaAttrib::kIFDIR: return true;
        case NAmigaAttrib::kIFREG: return false;
        default: return false; // change it throw kUnknownAttributes;
      }
    case NHostOS::kFAT:
    case NHostOS::kNTFS:
    case NHostOS::kHPFS:
    case NHostOS::kVFAT:
      return ((ExternalAttrib & FILE_ATTRIBUTE_DIRECTORY) != 0);
    case NHostOS::kAtari:
    case NHostOS::kMac:
    case NHostOS::kVMS:
    case NHostOS::kVM_CMS:
    case NHostOS::kAcorn:
    case NHostOS::kMVS:
      return false; // change it throw kUnknownAttributes;
    case NHostOS::kUnix:
      return MY_LIN_S_ISDIR(highAttrib);
    default:
      return false;
  }
}

So, breaking the algorithm down:

So that's a full overview of "what is a directory?" in the ZIP spec.

Arcitec commented 1 year ago

@michalc Thanks again for the answer and the link to your commit, it was super helpful! :)

If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories?

On this - I think ending the name in a forward slash should create a directory: 66c500b

Now, with the backstory of the ZIP spec, it becomes possible to evaluate what that code is doing! :)

The only missing puzzle pieces to ensure that stream-zip is 100% valid are:

Arcitec commented 1 year ago

@michalc Okay, it was good to do this investigation! I see that stream-zip has a bug.

https://github.com/uktrade/stream-zip/blob/31245ce4f89aeede3e260db1430f4fb42911a51d/stream_zip.py#L157C1-L160C49

            return central_directory_header_struct.pack(
                45,           # Version made by
                3,            # System made by (UNIX)
                45,           # Version required

It's marking the directories ("ends with slash" as MS-DOS DIRECTORY which is a flag that only makes sense if the directory is marked as "Made on FAT/NTFS" as mentioned above) but it marks the archive itself as "Made on Unix".

So we'll need to fix this. Thankfully I doubt that many people ever used the "embed directory" feature of stream-zip. And extractors such as 7-Zip use other heuristics (namely, the filenames) to detect directories when the Extended Attributes are wrong. :)

Arcitec commented 1 year ago

Here are the correct attributes when creating "Made on UNIX" ZIP archives, which are necessary to match the UNIX heuristics 7-Zip uses when looking at Extended Attributes:

    UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);

// ...

    case NHostOS::kUnix:  // kUnix     =  3, [The same value that stream-zip uses]
      return MY_LIN_S_ISDIR(highAttrib);
#define MY_LIN_S_IFMT  00170000

#define MY_LIN_S_IFDIR  0040000

#define MY_LIN_S_ISDIR(m)   (((m) & MY_LIN_S_IFMT) == MY_LIN_S_IFDIR)

I can't fully interpret this at the moment. The first part with UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF); means that it's shifting the ExternalAttrib value right by 16 bits. So already, that's something that I think stream-zip isn't doing properly. It looks like stream-zip instead writes the attribute (0x10 at the moment) to the LOW 16 bits of External Attributes. (Edit: As you pointed out, this bug also exists in Python's own zipfile module).

Next, after it has shifted the high bits into becoming low bits instead, it then needs to check the actual attributes.

To check whether something has the Unix "is a directory" attribute, it's first AND-ing the attributes with 00170000 and then checking if the remaining result is exactly 0040000. (Sidenote: I've never seen this way of writing the numbers before. I think it's just zero-padded base10 integers, but that would have to be verified to ensure that it's not Octal data!)

stream-zip needs to use that technique when marking things as directory. Not using DOS 0x10. :)

I can see that stream-zip currently does this, which means that it's storing UNIX PERMISSIONS in the top 16 bits of External Attributes. And then incorrectly placing 0x10 in the lower bits:

            external_attr = \
                (perms << 16) | \
                (0x10 if name_encoded[-1:] == b'/' else 0x0)  # MS-DOS directory

The solution would be something like this:

            external_attr = \
                (perms << 16) | \
                ((0040000 << 16) if name_encoded[-1:] == b'/' else 0x0)  # Unix directory

(I have not tested this! I simply assumed that the 00170000 masking is totally useless for our encoding purposes and is only used in 7-zip to filter out "permissions" etc from the attribute block, so the solution I proposed is most likely correct...)

I hope this helps. I would make a pull request if I fully understood the Unix External Attributes, but I don't. At least "all the info" to solve it exists in this ticket now. :)

Edit in case it got lost in all the noise: stream-zip also needs to throw a fatal error if anyone tries to create a directory (ends with slash) where input bytes length is > 0. Because that's an illegal ZIP.

Edit: The ultimate "success test" afterwards would be to try encoding a directory WITHOUT trailing slash, marked as "Made on UNIX", with the "is a directory" External Attribute, and then opening it in 7-Zip. If it shows as a directory in 7-Zip, then the metadata is correct.

michalc commented 1 year ago

So lots here - have to admit I haven't read it all quite yet. But: stream-zip is very similar to Python's zipfile module https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L546-L552

Arcitec commented 1 year ago

@michalc Yeah, there's lots of bugs in Python's zipfile module. This doesn't surprise me. :) It was old and crufty even when it was first invented like 20 years ago.

Arcitec commented 1 year ago

@michalc I have edited the bug summary/solution post, the most up to date info is all in this one:

https://github.com/uktrade/stream-zip/issues/55#issuecomment-1603859720

Edit: Added potential code solution.

Arcitec commented 1 year ago

I might be able to make a pull request for this if you want to wait. :) Gonna investigate the weird 00 padding in C++ (most likely meaningless visual padding for a regular int), and then do some tests of the proposed solution.

Edit: Nope, the leading 0 was octal, as was my other theory...! I'll be making a pull request and updating the tests today. The findings are below. It's really just a bitmask (IFMT) + a single bit flag (IFDIR).

#include <iostream>

#define MY_LIN_S_IFMT  00170000

#define MY_LIN_S_IFDIR  0040000

int main() {
    // Write C++ code here
    std::cout << "MY_LIN_S_IFMT: " << MY_LIN_S_IFMT << std::endl;
    std::cout << "MY_LIN_S_IFDIR: " << MY_LIN_S_IFDIR << std::endl;

    return 0;
}

Result:

MY_LIN_S_IFMT: 61440
MY_LIN_S_IFDIR: 16384

This number looked funny (power of 2) so I used a binary converter and confirmed that it's really just a power-of-2 bit flag:

// MY_LIN_S_IFMT  61440 (octal 00170000) as binary representation:
1111000000000000

// MY_LIN_S_IFDIR 16384 (octal 0040000) as binary representation:
0100000000000000

So 7-Zip just used a super weird way of writing what I'd personally write as 1 << 14 (16384) for more clarity.

And both are 16-bit values (2 bytes), which matches up with the fact that 7-Zip does UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF); to discard the lower two bytes and cleanly filter out the top two bytes.

I'm normally on Linux where my dev environment is, but I'll set up dev tools here on Windows now and send in a pull request today. :)

Edit: Pull request submitted.

michalc commented 1 year ago

I've opted to go for a documentation approach to this in https://github.com/uktrade/stream-zip/pull/60, documenting the directories must end in a forward slash, and should have the S_IFDIR mode. Potentially a code change could be made down the line, but I'm tempted to leave it for now since from what I can tell, this works in all clients, and is The Right Thing in terms of what ends up in the file.