Closed Arcitec closed 1 year ago
Oh, I just had a look inside a ZIP createad with stream-zip
and it doesn't create empty entries for the other directories in the hierarchy. Very interesting. The entries went straight to the deeply nested file:
somefile.dat
deeply/nested/file.dat
Whereas if I had created the same ZIP on Linux with regular tools, I would have had these entries in the ZIP:
somefile.dat
deeply/
deeply/nested/
deeply/nested/file.dat
But I suppose that the ZIP spec is flexible enough that it doesn't need the intermediary "this is a parent folder on the road to glory" stuff.
So if this is the case, I suppose that if I want to add empty folders, it's as easy as something like this:
if is_dir() and len(children) == 0
to find all empty folders that weren't added yet (due to not containing any files), and then manually insert those empty directories all at the end of the archive via simple foo/bar/thisemptyfolder/
entries at the end, with 0-byte b""
data.Would something like that solve all of these issues?
So if this is the case, I suppose that if I want to add empty folders, it's as easy as something like this:
Do a pass through all files only, and add them to the archive. Do a pass through all leafs (deepest path nodes) regardless of type, and then check if is_dir() and len(children) == 0 to find all empty folders that weren't added yet (due to not containing any files), and then manually insert those empty directories all at the end of the archive via simple foo/bar/thisemptyfolder/ entries at the end, with 0-byte b"" data.
Would something like that solve all of these issues?
Sounds reasonable to me. I guess you don't absolutely have to have two passes - you can in one pass output files, while maintaining a list of folders that don't have any children. Or just output all folders like it sounds like other tools do. But that's a detail.
I have to admit I haven't tested zipping folders much, especially empty folders - it was never really a priority. So I suspect would be good to check the output using a few different unzip-ers, just in case.
If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories?
On this - I think ending the name in a forward slash should create a directory: https://github.com/uktrade/stream-zip/commit/66c500b71127121e0f795e44d28ce21e08b16361
Hi Michal! I've been digging deeply into this, so before answering, I'll just separately dump the information I've dug up here, for reference since it could be useful (you mentioned not looking much into directory storage in ZIPs)! :)
4.3.8 File data
Immediately following the local header for a file
SHOULD be placed the compressed or stored data for the file.
If the file is encrypted, the encryption header for the file
SHOULD be placed after the local header and before the file
data. The series of [local file header][encryption header]
[file data][data descriptor] repeats for each file in the
.ZIP archive.
Zero-byte files, directories, and other file types that
contain no content MUST NOT include file data.
4.4.17 file name: (Variable)
4.4.17.1 The name of the file, with optional relative path.
The path stored MUST NOT contain a drive or
device letter, or a leading slash. All slashes
MUST be forward slashes '/' as opposed to
backwards slashes '\' for compatibility with Amiga
and UNIX file systems etc. If input came from standard
input, there is no file name field.
4.4.3 version needed to extract (2 bytes)
4.4.3.1 The minimum supported ZIP specification version needed
to extract the file, mapped as above. This value is based on
the specific format features a ZIP program MUST support to
be able to extract the file. If multiple features are
applied to a file, the minimum version MUST be set to the
feature having the highest value. New features or feature
changes affecting the published format specification will be
implemented using higher version numbers than the last
published value to avoid conflict.
4.4.3.2 Current minimum feature versions are as defined below:
1.0 - Default value
1.1 - File is a volume label
2.0 - File is a folder (directory)
2.0 - File is compressed using Deflate compression
2.0 - File is encrypted using traditional PKWARE encryption
2.1 - File is compressed using Deflate64(tm)
2.5 - File is compressed using PKWARE DCL Implode
2.7 - File is a patch data set
4.5 - File uses ZIP64 format extensions
4.6 - File is compressed using BZIP2 compression*
5.0 - File is encrypted using DES
5.0 - File is encrypted using 3DES
5.0 - File is encrypted using original RC2 encryption
5.0 - File is encrypted using RC4 encryption
5.1 - File is encrypted using AES encryption
5.1 - File is encrypted using corrected RC2 encryption**
5.2 - File is encrypted using corrected RC2-64 encryption**
6.1 - File is encrypted using non-OAEP key wrapping***
6.2 - Central directory encryption
6.3 - File is compressed using LZMA
6.3 - File is compressed using PPMd+
6.3 - File is encrypted using Blowfish
6.3 - File is encrypted using Twofish
4.4.15 external file attributes: (4 bytes)
The mapping of the external attributes is
host-system dependent (see 'version made by'). For
MS-DOS, the low order byte is the MS-DOS directory
attribute byte. If input came from standard input, this
field is set to zero.
This document taught us a few things:
2.0
version identifier does NOT have to be used by the zipped data. In fact, when using Linux utilities to make ZIPs of folders, the "minimum version required" field for the local file headers of the directory entries are set to 0A (10) 00 (0)
as seen here (the "version required" is 2 bytes according to the spec), regardless of whether we interpret the two bytes as Little Endian or Big Endian, so that means the value is "10" (which is then cast into a string and split in the middle, so "10" decimal means version 1.0 of the ZIP format, which is before directory support was even added to the spec itself, hehe), and this means that the "minimum version required" attribute of the file is totally useless for the purpose of identifying folders and doesn't have any role in marking folders:0A (10) 00 (0)
version value for directories in their own ZIPs. 4.4.15 external file attributes: (4 bytes)
The mapping of the external attributes is
host-system dependent (see 'version made by'). For
MS-DOS, the low order byte is the MS-DOS directory
attribute byte. If input came from standard input, this
field is set to zero.
How to interpret the "external file attributes" depends on the OS that made the ZIP, which is controlled by this field:
4.4.2 version made by (2 bytes)
4.4.2.1 The upper byte indicates the compatibility of the file
attribute information. If the external file attributes
are compatible with MS-DOS and can be read by PKZIP for
DOS version 2.04g then this value will be zero. If these
attributes are not compatible, then this value will
identify the host system on which the attributes are
compatible. Software can use this information to determine
the line record format for text files etc.
4.4.2.2 The current mappings are:
0 - MS-DOS and OS/2 (FAT / VFAT / FAT32 file systems)
1 - Amiga 2 - OpenVMS
3 - UNIX 4 - VM/CMS
5 - Atari ST 6 - OS/2 H.P.F.S.
7 - Macintosh 8 - Z-System
9 - CP/M 10 - Windows NTFS
11 - MVS (OS/390 - Z/OS) 12 - VSE
13 - Acorn Risc 14 - VFAT
15 - alternate MVS 16 - BeOS
17 - Tandem 18 - OS/400
19 - OS X (Darwin) 20 thru 255 - unused
4.4.2.3 The lower byte indicates the ZIP specification version
(the version of this document) supported by the software
used to encode the file. The value/10 indicates the major
version number, and the value mod 10 is the minor version
number.
We simply have to look at what 7-zip does:
CItem::IsDir
HasTailSlash
// Just checks if a string ends in a "/".
bool HasTailSlash(const AString &name, UINT
#if defined(_WIN32) && !defined(UNDER_CE)
codePage
#endif
)
{
if (name.IsEmpty())
return false;
char c;
#if defined(_WIN32) && !defined(UNDER_CE)
if (codePage != CP_UTF8)
c = *CharPrevExA((WORD)codePage, name, name.Ptr(name.Len()), 0);
else
#endif
{
c = name.Back();
}
return (c == '/');
}
bool CItem::IsDir() const
{
// FIXME: we can check InfoZip UTF-8 name at first.
if (NItemName::HasTailSlash(Name, GetCodePage()))
return true;
Byte hostOS = GetHostOS();
if (Size == 0 && PackSize == 0 && !Name.IsEmpty() && Name.Back() == '\\')
{
// do we need to use CharPrevExA?
// .NET Framework 4.5 : System.IO.Compression::CreateFromDirectory() probably writes backslashes to headers?
// so we support that case
switch (hostOS)
{
case NHostOS::kFAT:
case NHostOS::kNTFS:
case NHostOS::kHPFS:
case NHostOS::kVFAT:
return true;
}
}
if (!FromCentral)
return false;
UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);
switch (hostOS)
{
case NHostOS::kAMIGA:
switch (highAttrib & NAmigaAttrib::kIFMT)
{
case NAmigaAttrib::kIFDIR: return true;
case NAmigaAttrib::kIFREG: return false;
default: return false; // change it throw kUnknownAttributes;
}
case NHostOS::kFAT:
case NHostOS::kNTFS:
case NHostOS::kHPFS:
case NHostOS::kVFAT:
return ((ExternalAttrib & FILE_ATTRIBUTE_DIRECTORY) != 0);
case NHostOS::kAtari:
case NHostOS::kMac:
case NHostOS::kVMS:
case NHostOS::kVM_CMS:
case NHostOS::kAcorn:
case NHostOS::kMVS:
return false; // change it throw kUnknownAttributes;
case NHostOS::kUnix:
return MY_LIN_S_ISDIR(highAttrib);
default:
return false;
}
}
So, breaking the algorithm down:
If the name ends in a /
, it's immediately treated as a directory.Is 0 bytes, ends in a backslash \, and the ZIP was made by a Windows host OS
: That's a directory. This is just a "special case" meant for old versions of Windows and PowerShell which used to create illegal ZIPs that used backslashes as directory separators, even though that's forbidden in the ZIP spec. Modern versions of Windows (starting from like 2 years ago) and its tools finally properly use forward slashes instead. "Thanks", Microsoft for making things harder to parse, as usual./
and (illegal) \
./some/thing
even though thing
IS a directory.Look at the External Attributes field of the entry, check which Host OS made the ZIP, and check if those OS-native file attributes contain what that exact OS uses to signal "this is a directory"
: If so, this is a directory too.So that's a full overview of "what is a directory?" in the ZIP spec.
@michalc Thanks again for the answer and the link to your commit, it was super helpful! :)
If it's not currently possible to add empty directories, is there any chance of some API for adding empty directories?
On this - I think ending the name in a forward slash should create a directory: 66c500b
Now, with the backstory of the ZIP spec, it becomes possible to evaluate what that code is doing! :)
/
, stream-zip
marks the "External Attribute" with the bit for 0x10
.#define FILE_ATTRIBUTE_DIRECTORY 0x0010
.stream-zip
marks it as a MS-DOS Directory.The only missing puzzle pieces to ensure that stream-zip
is 100% valid are:
stream-zip
must mark the "Host OS that created this ZIP" as MS-DOS, otherwise the "External Attribute" for the file breaks/becomes meaningless. I haven't verified if it does this. It needs to do this regardless of what Host OS stream-zip
is running on. The list of correct operating systems is listed in the 4.4.2 version made by (2 bytes)
table that I copied from the ZIP spec in the previous message.MS-DOS 0x10 == directory
attribute are seen in 7-zips implementation (quoted in previous message), and those are in turn defined in CPP/7zip/Archive/Zip/ZipHeader.h:
NHostOS::kFAT = 0,
NHostOS::kNTFS = 11, // filesystem used by Windows NT
NHostOS::kHPFS = 6, // filesystem used by OS/2 (and NT 3.x)
NHostOS::kVFAT = 14, // filesystem used by Windows 95, NT
stream-zip
must mark the "version made by" as one of those values to be legal when using 0x10
to signify directories. I would say that "NTFS" is the most appropriate of those alternatives. I'll try to find out if stream-zip
already does this.kTOPS20 = 10, // pkzip 2.50 NTFS
. So that's some kind of legacy NTFS. To use modern NTFS as the format, the "version made by" value must be 11
.stream-zip
should also always throw an error if input data length is > 0
if something is a directory (ends in a trailing slash). I don't think it does that sanity check at the moment, and would therefore create illegal directories. The spec says "Zero-byte files, directories, and other file types that contain no content MUST NOT include file data."@michalc Okay, it was good to do this investigation! I see that stream-zip
has a bug.
return central_directory_header_struct.pack(
45, # Version made by
3, # System made by (UNIX)
45, # Version required
It's marking the directories ("ends with slash" as MS-DOS DIRECTORY which is a flag that only makes sense if the directory is marked as "Made on FAT/NTFS" as mentioned above) but it marks the archive itself as "Made on Unix".
So we'll need to fix this. Thankfully I doubt that many people ever used the "embed directory" feature of stream-zip
. And extractors such as 7-Zip use other heuristics (namely, the filenames) to detect directories when the Extended Attributes are wrong. :)
Here are the correct attributes when creating "Made on UNIX" ZIP archives, which are necessary to match the UNIX heuristics 7-Zip uses when looking at Extended Attributes:
UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);
// ...
case NHostOS::kUnix: // kUnix = 3, [The same value that stream-zip uses]
return MY_LIN_S_ISDIR(highAttrib);
#define MY_LIN_S_IFMT 00170000
#define MY_LIN_S_IFDIR 0040000
#define MY_LIN_S_ISDIR(m) (((m) & MY_LIN_S_IFMT) == MY_LIN_S_IFDIR)
I can't fully interpret this at the moment. The first part with UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);
means that it's shifting the ExternalAttrib value right by 16 bits. So already, that's something that I think stream-zip
isn't doing properly. It looks like stream-zip
instead writes the attribute (0x10
at the moment) to the LOW 16 bits of External Attributes. (Edit: As you pointed out, this bug also exists in Python's own zipfile module).
Next, after it has shifted the high bits into becoming low bits instead, it then needs to check the actual attributes.
To check whether something has the Unix "is a directory" attribute, it's first AND-ing the attributes with 00170000
and then checking if the remaining result is exactly 0040000
. (Sidenote: I've never seen this way of writing the numbers before. I think it's just zero-padded base10 integers, but that would have to be verified to ensure that it's not Octal data!)
stream-zip
needs to use that technique when marking things as directory. Not using DOS 0x10
. :)
I can see that stream-zip
currently does this, which means that it's storing UNIX PERMISSIONS in the top 16 bits of External Attributes. And then incorrectly placing 0x10 in the lower bits:
external_attr = \
(perms << 16) | \
(0x10 if name_encoded[-1:] == b'/' else 0x0) # MS-DOS directory
The solution would be something like this:
external_attr = \
(perms << 16) | \
((0040000 << 16) if name_encoded[-1:] == b'/' else 0x0) # Unix directory
(I have not tested this! I simply assumed that the 00170000
masking is totally useless for our encoding purposes and is only used in 7-zip to filter out "permissions" etc from the attribute block, so the solution I proposed is most likely correct...)
I hope this helps. I would make a pull request if I fully understood the Unix External Attributes, but I don't. At least "all the info" to solve it exists in this ticket now. :)
Edit in case it got lost in all the noise: stream-zip
also needs to throw a fatal error if anyone tries to create a directory (ends with slash) where input bytes length is > 0. Because that's an illegal ZIP.
Edit: The ultimate "success test" afterwards would be to try encoding a directory WITHOUT trailing slash, marked as "Made on UNIX", with the "is a directory" External Attribute, and then opening it in 7-Zip. If it shows as a directory in 7-Zip, then the metadata is correct.
So lots here - have to admit I haven't read it all quite yet. But: stream-zip is very similar to Python's zipfile module https://github.com/python/cpython/blob/3.11/Lib/zipfile.py#L546-L552
@michalc Yeah, there's lots of bugs in Python's zipfile module. This doesn't surprise me. :) It was old and crufty even when it was first invented like 20 years ago.
@michalc I have edited the bug summary/solution post, the most up to date info is all in this one:
https://github.com/uktrade/stream-zip/issues/55#issuecomment-1603859720
Edit: Added potential code solution.
I might be able to make a pull request for this if you want to wait. :) Gonna investigate the weird 00
padding in C++ (most likely meaningless visual padding for a regular int), and then do some tests of the proposed solution.
Edit: Nope, the leading 0 was octal, as was my other theory...! I'll be making a pull request and updating the tests today. The findings are below. It's really just a bitmask (IFMT) + a single bit flag (IFDIR).
#include <iostream>
#define MY_LIN_S_IFMT 00170000
#define MY_LIN_S_IFDIR 0040000
int main() {
// Write C++ code here
std::cout << "MY_LIN_S_IFMT: " << MY_LIN_S_IFMT << std::endl;
std::cout << "MY_LIN_S_IFDIR: " << MY_LIN_S_IFDIR << std::endl;
return 0;
}
Result:
MY_LIN_S_IFMT: 61440
MY_LIN_S_IFDIR: 16384
This number looked funny (power of 2) so I used a binary converter and confirmed that it's really just a power-of-2 bit flag:
// MY_LIN_S_IFMT 61440 (octal 00170000) as binary representation:
1111000000000000
// MY_LIN_S_IFDIR 16384 (octal 0040000) as binary representation:
0100000000000000
So 7-Zip just used a super weird way of writing what I'd personally write as 1 << 14
(16384) for more clarity.
And both are 16-bit values (2 bytes), which matches up with the fact that 7-Zip does UInt16 highAttrib = (UInt16)((ExternalAttrib >> 16 ) & 0xFFFF);
to discard the lower two bytes and cleanly filter out the top two bytes.
I'm normally on Linux where my dev environment is, but I'll set up dev tools here on Windows now and send in a pull request today. :)
Edit: Pull request submitted.
I've opted to go for a documentation approach to this in https://github.com/uktrade/stream-zip/pull/60, documenting the directories must end in a forward slash, and should have the S_IFDIR
mode. Potentially a code change could be made down the line, but I'm tempted to leave it for now since from what I can tell, this works in all clients, and is The Right Thing in terms of what ends up in the file.
Hey, small question.
From what I've observed from the ZIP spec, it seems like the "directory hierarchy" of ZIP files are just an illusion, and that it's really just POSIX-style names (forward slash separator), where each file has its own full path embedded straight into the file "name" data (no "actual nested folders"), such as
name = foo/bar/baz/readme.txt
. I've also noticed that empty directories are stored as "files" which are named things likefoo/bar/baz/
with a trailing slash at the end, and with a DATA LENGTH of ZERO. The trailing slash seems to be the only signifier that a "file" is actually a directory?For instance, when I created a ZIP with the standard utilities on Linux, I observed that it has an entry for the "parent folder" as a standalone ZIP entry, before it has an entry for the actual file inside that folder.
The hierarchy in this example ZIP is:
A folder\with backslashes
: This is the actual folder name, it's a POSIX name so backslash is legal. And it's stored in the ZIP asA folder\with backslashes/
A folder\with backslashes/my file\withbackslash.txt
: This is the filename within the folder. It's just a file with a backslash in the name.Here's my test ZIP:
backslashes.zip
So with that backstory about how directories seem to work in ZIPs, I'm very curious about the correct ways to handle directories in
stream-zip
:name = foo/bar/baz/readme.txt
without having to manually create the parent path components (foo/
,foo/bar/
andfoo/bar/baz/
)?foo/a/1.txt
, thenbar/b/1.txt
, thenfoo/a/2.txt
? Meaning is it safe to add directories in a jumbled order to the ZIP file contents? I assume thatstream-zip
remembers which directories it has already written to the ZIP and doesn't create those paths again.the/directory/
with a trailing slash and providingb""
(empty bytes) as the data?a/b/
was already auto-generated and added earlier, and I then manually try to writea/b/
, I guess I'd end up with a duplicate directory entry)?