selmf / unarr

A decompression library for rar, tar, zip and 7z archives
GNU Lesser General Public License v3.0

API limits the maximum archive entry size on 32bit systems #15

Open r0ptr opened 3 years ago

r0ptr commented 3 years ago

The current API, namely the functions ar_entry_get_size and ar_entry_uncompress, uses size_t in its signatures, which limits the maximum size of archive entries on 32-bit systems. Would you consider changing the API to use a fixed-width 64-bit unsigned integer type, so that both 32-bit and 64-bit systems could handle large archive entries?
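For reference, these are the signatures in question, plus one possible shape of a 64-bit-safe variant; the ar_entry_get_size64 name below is purely hypothetical and not an existing unarr function:

```c
#include <stdbool.h>
#include <stdint.h>
#include <unarr.h>

/* current unarr API: size_t caps reported entry sizes at SIZE_MAX,
   i.e. 4 GiB - 1 on a 32-bit system */
size_t ar_entry_get_size(ar_archive *ar);
bool   ar_entry_uncompress(ar_archive *ar, void *buffer, size_t count);

/* hypothetical 64-bit variant (not part of unarr) that could report the
   true size; the uncompress buffer itself can stay size_t, since a single
   chunk can never exceed the address space anyway */
uint64_t ar_entry_get_size64(ar_archive *ar);
```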

r0ptr commented 3 years ago

Oh shoot, after looking into it a bit more, it seems not even the LZMA SDK supports archive entries whose size exceeds 4294967295 bytes when compiled for 32 bits (at least not the C version of the 7z SDK).

selmf commented 3 years ago

The 7z SDK decompresses all data into memory before returning it. This will of course fail on systems that can't address that much memory.

selmf commented 3 years ago

The problem with size_t and huge file entries is that size_t can only represent the largest chunk of memory your system can address. This means that even if you changed the uncompress function to always take a 64-bit buffer size, the call would still fail on a 32-bit system, for the simple reason that such a buffer is too large to address. There is a simple way around this, though: use a smaller buffer and call uncompress repeatedly, decompressing the data in chunks that you write to disk (see the sketch below).
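A minimal sketch of that chunked approach, assuming the archive is already positioned at a parsed entry; note that the total size is still truncated through size_t here, which is the very bug under discussion:

```c
#include <stdbool.h>
#include <stdio.h>
#include <unarr.h>

/* Stream the current entry to disk in fixed-size chunks, so that no more
   than sizeof(buffer) bytes are ever held in memory at once. */
static bool extract_entry_chunked(ar_archive *ar, FILE *out)
{
    unsigned char buffer[64 * 1024];
    size_t remaining = ar_entry_get_size(ar); /* truncated on 32-bit for >4 GiB */
    while (remaining > 0) {
        size_t chunk = remaining < sizeof(buffer) ? remaining : sizeof(buffer);
        if (!ar_entry_uncompress(ar, buffer, chunk))
            return false;
        if (fwrite(buffer, 1, chunk, out) != chunk)
            return false;
        remaining -= chunk;
    }
    return true;
}
```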

This still leaves the problem with the entry size. size_t is used in this context mainly because it indicates a size. If the data type prevents us from reporting the true size, that is a bug and it should be fixed. The issue I see is that changing the return type will require digging deeper into the respective archive implementations, and we need to take care to change size_t to a 64-bit type only in the correct contexts.

I also need to consider how to handle the (minor) API breakage this might incur for 32-bit users.

r0ptr commented 3 years ago

The 7z SDK has this piece of code in the SzArEx_Extract function:

```c
  if (*tempBuf == NULL || *blockIndex != folderIndex)
  {
    UInt64 unpackSizeSpec = SzAr_GetFolderUnpackSize(&p->db, folderIndex);
    /*
    UInt64 unpackSizeSpec =
        p->UnpackPositions[p->FolderToFile[(size_t)folderIndex + 1]] -
        p->UnpackPositions[p->FolderToFile[folderIndex]];
    */
    size_t unpackSize = (size_t)unpackSizeSpec;

    if (unpackSize != unpackSizeSpec)
      return SZ_ERROR_MEM;
```
This is clearly designed to fail on 32-bit systems; I'm guessing they wanted to fail early here because the implementation also has problems deeper down...

Regarding the problem you mention, that addressing such a large piece of memory is impossible: that is true, but I work around it with memory mapping. I map the largest possible free chunk of memory, unpack that much data into it, unmap it, map the next chunk, unpack into that, and so on (roughly as sketched below).
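Roughly, a POSIX sketch of that windowed scheme; the window size, the helper name, and the use of ar_entry_uncompress as the decompression step are illustrative assumptions, not the exact code I use:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>
#include <unarr.h>

/* Map, fill and unmap one window of the output file at a time, so the
   process never addresses more than WINDOW_SIZE bytes at once.
   Build with -D_FILE_OFFSET_BITS=64 on 32-bit so off_t is 64-bit. */
#define WINDOW_SIZE (64u * 1024 * 1024) /* 64 MiB; keeps offsets page-aligned */

static bool extract_via_mmap(ar_archive *ar, int fd, uint64_t total_size)
{
    if (ftruncate(fd, (off_t)total_size) != 0) /* pre-size the output file */
        return false;
    for (uint64_t offset = 0; offset < total_size; ) {
        size_t window = (size_t)(total_size - offset < WINDOW_SIZE
                                 ? total_size - offset : WINDOW_SIZE);
        void *map = mmap(NULL, window, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, (off_t)offset);
        if (map == MAP_FAILED)
            return false;
        bool ok = ar_entry_uncompress(ar, map, window); /* unpack into the window */
        munmap(map, window);
        if (!ok)
            return false;
        offset += window;
    }
    return true;
}
```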

selmf commented 3 years ago

Don't bother too much with the 7z SDK. This memory limitation is the main reason I have marked 7z support as experimental. The underlying decompression code should be able to handle large files just fine, but the C code for archive extraction insists on decompressing huge blocks into a memory cache instead of returning the files as they are decompressed. Fixing this would need a partial rewrite of the SDK.

r0ptr commented 3 years ago

I see. Does that mean extraction of large files (>4 GB) from 7z archives will never be supported on 32-bit systems?

selmf commented 3 years ago

No, it just means that much more work is involved to make it happen. SzArEx_Extract, and probably other parts of 7zDec.c, would need to be rewritten to decompress directly into a caller-supplied buffer while keeping the decompression state and dictionary between calls, instead of decompressing everything into a private buffer and handing out access to that.
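As an illustration of that streaming pattern, the SDK's low-level LZMA decoder can already be driven incrementally; a sketch for plain LZMA only, ignoring the 7z folder/coder plumbing and with minimal error handling:

```c
#include <stdio.h>
#include "LzmaDec.h" /* low-level decoder from the LZMA SDK */

/* Drain LZMA output in small chunks while the decoder keeps its state and
   dictionary across calls, instead of decoding the whole stream into one
   private buffer. dec must already be set up via LzmaDec_Allocate() and
   LzmaDec_Init(); src holds the compressed data. */
static SRes decode_streaming(CLzmaDec *dec, const Byte *src, SizeT srcLen, FILE *out)
{
    Byte chunk[1 << 16];
    SizeT srcPos = 0;
    for (;;) {
        SizeT destLen = sizeof(chunk);      /* in: capacity, out: bytes produced */
        SizeT srcRemain = srcLen - srcPos;  /* in: available, out: bytes consumed */
        ELzmaStatus status;
        SRes res = LzmaDec_DecodeToBuf(dec, chunk, &destLen,
                                       src + srcPos, &srcRemain,
                                       LZMA_FINISH_ANY, &status);
        if (res != SZ_OK)
            return res;
        if (fwrite(chunk, 1, destLen, out) != destLen)
            return SZ_ERROR_WRITE;
        srcPos += srcRemain;
        if (status == LZMA_STATUS_FINISHED_WITH_MARK)
            return SZ_OK;
        if (destLen == 0 && srcRemain == 0) /* no progress: input exhausted */
            return SZ_OK;
    }
}
```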

The question is: is it worth it? It might actually be a better idea to rebuild the parsing code from scratch in C99 and only use the low-level parts of the 7z SDK. That way we have better control over what is happening, and it becomes easier to implement full support for 7z archives rather than only the limited subset the SDK provides.

selmf commented 2 years ago

Hey, just a quick heads-up. I recently rechecked this issue to see if I could include a fix in the upcoming unarr release, but the problem goes deeper than just the 7z SDK and the API signatures. The bad pattern of using size_t for file sizes is present in a lot of the internal code and structures. Working on this without proper unit tests to catch regressions is asking for trouble.

I will check if I can improve the situation in the next development cycle. By then I should have a proper test system set up.