icsharpcode / SharpZipLib

#ziplib is a Zip, GZip, Tar and BZip2 library written entirely in C# for the .NET platform.
http://icsharpcode.github.io/SharpZipLib/
MIT License

Different binary produced when producing ZIP to Azure Blob Stream #744

Closed · carlreid closed this issue 2 years ago

carlreid commented 2 years ago

We found that our ZIP files could not be opened correctly on Macs; ditto fails with the error: ditto: Couldn't read pkzip signature.

After further investigation, it seems that ZIP files produced to a local file (which can be opened on a Mac) are different from those produced to a stream persisting to Azure Blob Storage. However, I also tested simply opening a file stream and using the blob stream to persist it, which resulted in an identical binary. I think this indicates that something in the zipping process is causing the problem.

From a binary comparison, it seems the differences are at the start and end of the files, though I am not sure which parts are significant, or what they indicate could be going wrong. On the left is the file persisted via the blob stream; on the right, the file written locally to disk. Start:
[screenshot: hex comparison of the start of both files]
End:
[screenshot: hex comparison of the end of both files]

You should be able to reproduce the issue with the following code. You can uncomment the various sections if you wish to write a local zip, or to upload a local file to a blob store. Otherwise, the uncommented code should produce a ZIP that fails to open on a Mac.

using System;
using System.Net.Http;
using System.Threading.Tasks;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;
using ICSharpCode.SharpZipLib.Zip;

internal class CreateZipFile
{
    public static void Main(string[] args)
    {
        MainAsync(args).GetAwaiter().GetResult();
    }

    private static async Task MainAsync(string[] args)
    {
        var httpClient = new HttpClient();

        var _blobServiceClient = new BlobServiceClient("DefaultEndpointsProtocol=https;AccountName=<replace_with_account_name>;AccountKey=<replace_with_a_key>;EndpointSuffix=core.windows.net");
        var _containerClient = _blobServiceClient.GetBlobContainerClient("artifacts");
        var blobName = System.IO.Path.ChangeExtension(Guid.NewGuid().ToString(), "zip");
        var blockClient = _containerClient.GetBlockBlobClient(blobName);
        var artifactUploadStream = blockClient.OpenWrite(overwrite: true);

        try
        {
            string[] filenames = new[] {
                "https://peach.blender.org/wp-content/uploads/bbb-splash.png"
            };

            // Writing an already produced ZIP to blob store results in matching binary.

            //var localFileMs = new MemoryStream(await System.IO.File.ReadAllBytesAsync("C:\\DevTools\\Debug\\o-local.zip"));
            //localFileMs.Position = 0;
            //await localFileMs.CopyToAsync(artifactUploadStream);
            //await artifactUploadStream.FlushAsync();
            //await artifactUploadStream.DisposeAsync();
            //await localFileMs.DisposeAsync();
            //return;

            // Local file writing produces expected ZIP
            //using var zipStream = new ZipOutputStream(System.IO.File.Create("C:\\DevTools\\Debug\\o-local.zip"))
            using var zipStream = new ZipOutputStream(artifactUploadStream)
            {
                IsStreamOwner = true
            };

            var downloadStream = await httpClient.GetStreamAsync(filenames[0]);

            var zipFileEntry = new ZipEntry("big-buck-bunny.png")
            {
                CompressionMethod = CompressionMethod.Deflated,
                DateTime = new DateTime(2021, 7, 13, 4, 28, 44),
            };

            zipStream.PutNextEntry(zipFileEntry);
            await downloadStream.CopyToAsync(zipStream);
            zipStream.CloseEntry();
            zipStream.Finish();
            zipStream.Close();
        }
        catch (Exception ex)
        {
            Console.WriteLine("Exception during processing {0}", ex);
        }
    }
}

I'm not sure where else to investigate, or whether this is more an issue with the blob block stream, or even a mix of both.

Edit: Here are the files to compare: remote.zip local.zip

piksel commented 2 years ago

Could you either upload the files to http://archivediag.azurewebsites.net/ or attach them to your issue?

carlreid commented 2 years ago

Sure! Not sure why I didn't attach them in the first place. The website you linked just returned: Status: could not retrieve status... (The specified blob does not exist.)

Anyway, I attached them to the initial message πŸ™‚

piksel commented 2 years ago

Yeah, that's just because it hasn't begun processing them yet. I really ought to fix that message...

piksel commented 2 years ago

Yeah, okay, the remote one is made using descriptors, which is when the file sizes and CRC are appended after each entry's data. This is because the BlobStream is non-seekable, so you cannot go back to the entry's header and patch in the correct values. There is no such constraint on a file, where you are free to seek backwards and forwards as much as you want.

These are the reports btw:

I have never used ditto, but perhaps it doesn't support descriptor-type zip files?
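For reference, one way to check whether an archive was written with descriptors is to inspect bit 3 of the general-purpose flag in the first local file header. This helper is a sketch (not part of SharpZipLib or this thread), based on the ZIP local file header layout:

```csharp
using System.IO;

static class ZipFlagCheck
{
    // Returns true if the first local file header has bit 3 (0x0008) of
    // the general-purpose flag set, meaning sizes/CRC are stored in a
    // data descriptor after the entry data rather than in the header.
    public static bool UsesDescriptors(Stream zip)
    {
        var header = new byte[8];
        if (zip.Read(header, 0, 8) < 8) return false;
        // Local file header signature: 'P' 'K' 0x03 0x04
        if (header[0] != 0x50 || header[1] != 0x4B ||
            header[2] != 0x03 || header[3] != 0x04) return false;
        // General-purpose bit flag is a little-endian ushort at offset 6.
        int flags = header[6] | (header[7] << 8);
        return (flags & 0x0008) != 0;
    }
}
```

Running this over the attached remote.zip and local.zip should show the descriptor bit set only in the remote one.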

carlreid commented 2 years ago

Hmm alright, that would put it in line with an old topic I found on ditto not handling some things with the CRC.

It's not that we're using ditto directly; it's that the built-in Archive Utility on macOS uses it behind the scenes.

I had read another issue, about adding a file to an already existing zip on a blob, where you mentioned the stream is non-seekable, and thought it might somehow be related. From what you said, it makes complete sense.

Bit of a side question, but is that archivediag something I could generate? Seems very useful for debugging 😁

Regarding this issue, I guess you can close it, and we'll have to investigate our own way around the issue. Unless you have some further thoughts?
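For anyone looking for a way around the non-seekable constraint: one option, assuming the archive fits in memory, is to build the ZIP in a seekable MemoryStream first, so ZipOutputStream can seek back and patch the sizes/CRC into the local headers instead of emitting descriptors, and then copy the finished bytes to the blob stream. A sketch (the class and method names are illustrative, not from this thread):

```csharp
using System.IO;
using System.Threading.Tasks;
using ICSharpCode.SharpZipLib.Zip;

static class SeekableZipWriter
{
    // Builds the archive in a seekable MemoryStream so ZipOutputStream can
    // seek back and patch sizes/CRC into each local header (no descriptors),
    // then copies the finished bytes to the non-seekable destination.
    // Assumes the whole archive fits in memory.
    public static async Task WriteAsync(Stream source, Stream destination, string entryName)
    {
        using var buffer = new MemoryStream();
        using (var zipStream = new ZipOutputStream(buffer) { IsStreamOwner = false })
        {
            zipStream.PutNextEntry(new ZipEntry(entryName)
            {
                CompressionMethod = CompressionMethod.Deflated,
            });
            await source.CopyToAsync(zipStream);
            zipStream.CloseEntry();
            zipStream.Finish();
        }
        buffer.Position = 0;
        await buffer.CopyToAsync(destination);
    }
}
```

In the repro above this would be called as await SeekableZipWriter.WriteAsync(downloadStream, artifactUploadStream, "big-buck-bunny.png"). The trade-off is holding the whole archive in memory; for large archives a temporary FileStream (also seekable) is the safer buffer.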

piksel commented 2 years ago

ArchiveDiag can be found in the tools/archivediag branch.

I will do some more investigation to see if it's possible to create .zip files that ditto (and Archive Utility) can read. I found the topic you linked as well, but there is no actual mention that descriptors are unsupported, so there might be something else that ditto is more sensitive to.

carlreid commented 2 years ago

Awesome, thanks for pointing me to the branch.

Alright, I'll let you do some further digging, really appreciate the effort!

carlreid commented 2 years ago

@piksel I guess you didn't manage to progress on a way to make an archive that ditto could handle?

Just did a bit more reading, and it seems at least that a page blob would allow seeking. Though it comes with a lot of other constraints, like the size needing to be a multiple of 512 bytes, and it seems I need to provide a total size up front, even though I do not know it.

I was thinking it may be possible to just set some max size (say 10GB), write as usual to the stream while tracking the actual number of bytes written, then resize it down afterwards, which will require padding up to the nearest 512-byte boundary. Though would that padding still make it an invalid ZIP file? πŸ€”
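The rounding itself is straightforward. A sketch of the page-aligned size calculation (512 is the Azure page blob page size; the helper names are illustrative):

```csharp
static class PageAlign
{
    public const int PageSize = 512;

    // Rounds a byte count up to the next multiple of the page size.
    public static long RoundUp(long size) =>
        (size + PageSize - 1) / PageSize * PageSize;

    // Number of zero bytes to append so the blob ends on a page boundary.
    public static long PaddingFor(long size) => RoundUp(size) - size;
}
```

For the failing request below, 1528176 bytes would round up to 1528320 (144 bytes of padding). As for validity: readers locate the end-of-central-directory record by scanning backwards from the end of the file, so trailing zero padding is usually tolerated, but stricter tools may still complain.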

I hit an issue trying to write though, something about a header range being invalid:

ErrorCode: InvalidHeaderValue

Additional Information:
HeaderName: Range
HeaderValue: bytes=0-1528176

Content:
<?xml version="1.0" encoding="utf-8"?><Error><Code>InvalidHeaderValue</Code><Message>The value for one of the HTTP headers is not in the correct format.
RequestId:d0e7ed93-b01e-0052-1712-563093000000
Time:2022-04-22T06:30:41.2944096Z</Message><HeaderName>Range</HeaderName><HeaderValue>bytes=0-1528176</HeaderValue></Error>
piksel commented 2 years ago

Perhaps it has something to do with the 512-byte alignment requirement? 1528176 is not evenly divisible by 512. You would have to get a stack trace to find where 1528176 comes from, I guess. The header looks to be per spec afaics: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range

piksel commented 2 years ago

I did some more investigation on zip files using descriptors on macOS. I produced a zip file using InfoZip:

dd if=/dev/random bs=1M count=100 | zip | dd of=rand3.zip

I confirmed that it was actually using descriptors, and also noted that TestArchive considered the archive invalid: https://pub.p1k.se/sharpziplib/archivediag/rand3.zip.html

But that one could be read by ditto. So there is clearly some discrepancy between the files that SharpZipLib produces, and the ones that InfoZip produces...

carlreid commented 2 years ago

> But that one could be read by ditto. So there is clearly some discrepancy between the files that SharpZipLib produces, and the ones that InfoZip produces...

Oh you beauty! Thanks for taking the time to investigate further, and thanks for the fix!