icsharpcode / SharpZipLib

#ziplib is a Zip, GZip, Tar and BZip2 library written entirely in C# for the .NET platform.
http://icsharpcode.github.io/SharpZipLib/
MIT License

Getting error "Index was outside the bounds of the array" when writing uncompressed data #413

Closed daviddassau closed 4 years ago

daviddassau commented 4 years ago

Steps to reproduce

  1. Have a .txt.bz2 file that has more header columns than actual values. Example:

     | first_name | last_name | middle_name |
     | John | Jones | |

Expected behavior

When running BZip2.Decompress(fileToDecompressAsStream, decompressedStream, true), it should write the data from the .bz2 file to the .txt file.

Actual behavior

I'm getting an error stating "Index was outside the bounds of the array".

I know that this issue is more of a "bad data" problem, rather than a problem with SharpZipLib. However, I was hoping you could help with finding a solution. Ideally, I would like to decompress the .bz2 file, and either remove the extra header column or give all the rows a NULL value. But I can't find a way to do this. Any help you could provide would be very much appreciated!

Version of SharpZipLib

1.2.0

piksel commented 4 years ago

Yeah, the library does not care about the contents, so I am not sure what is going on. Are you processing the lines somehow? Are you splitting the rows on "|" and then assigning the values to the corresponding header keys or something like that? It does seem outside the scope of the library, but if you could provide the code you are using I can take a look.

daviddassau commented 4 years ago

@piksel thank you so much for replying! I would be more than happy to supply you with some of my code. Hopefully it will give you a better idea of what the issue may be. For reference, I always get the error on the first line of the try block: BZip2.Decompress(fileToDecompressAsStream, decompressedStream, true);

private static void DecompressBZ2File()
{
    string bz2FilePath = @"C:\temp\PandoraData\pandoraData.txt.bz2";
    string txtFilePath = @"C:\temp\PandoraData\pandoraData.txt";

    FileInfo zipFileName = new FileInfo(bz2FilePath);

    using (FileStream fileToDecompressAsStream = zipFileName.OpenRead())
    {
        using (FileStream decompressedStream = File.Create(txtFilePath))
        {
            try
            {
                BZip2.Decompress(fileToDecompressAsStream, decompressedStream, true);
                Console.WriteLine("Successfully decompressed BZ2 file!");
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
            }
        }
    }
}

piksel commented 4 years ago

Okay, this should have nothing to do with the contents (header vs value count). Rather, it would seem like the bz2 format in the file is either incorrect, or incorrectly read by the library. Could you provide the full stack trace of the error? It's in ex.StackTrace if you're not debugging through Visual Studio.

daviddassau commented 4 years ago

@piksel Sure thing! Here's what came out of the catch block when I changed it to Console.WriteLine(ex.StackTrace);

Index was outside the bounds of the array.
   at ICSharpCode.SharpZipLib.BZip2.BZip2InputStream.RecvDecodingTables() in C:\projects\sharpziplib\src\ICSharpCode.SharpZipLib\BZip2\BZip2InputStream.cs:line 466
   at ICSharpCode.SharpZipLib.BZip2.BZip2InputStream.GetAndMoveToFrontDecode() in C:\projects\sharpziplib\src\ICSharpCode.SharpZipLib\BZip2\BZip2InputStream.cs:line 579
   at ICSharpCode.SharpZipLib.BZip2.BZip2InputStream.InitBlock() in C:\projects\sharpziplib\src\ICSharpCode.SharpZipLib\BZip2\BZip2InputStream.cs:line 379
   at ICSharpCode.SharpZipLib.BZip2.BZip2InputStream..ctor(Stream stream) in C:\projects\sharpziplib\src\ICSharpCode.SharpZipLib\BZip2\BZip2InputStream.cs:line 112
   at ICSharpCode.SharpZipLib.BZip2.BZip2.Decompress(Stream inStream, Stream outStream, Boolean isStreamOwner) in C:\projects\sharpziplib\src\ICSharpCode.SharpZipLib\BZip2\BZip2.cs:line 27
   at StreamingUsageConsole.Services.Pandora.GetObjectTest.DecompressBZ2File(String pathAndFileName, String jsonFile) in C:\NaxosRepos\utility-data-streamingdatadownload\StreamingUsageConsole\Services\Pandora\GetObjectTest.cs:line 128

piksel commented 4 years ago

That's odd. That row just initializes an array. I have no idea how that could be throwing an IndexOutOfRangeException. Are you using the nuget package? What is your environment?

daviddassau commented 4 years ago

Yes, I am most definitely using the nuget package. Here are my using statements that are currently being utilized:

using System;
using System.IO;
using System.Threading.Tasks;
using Amazon;
using Amazon.S3;
using Amazon.S3.Model;
using ICSharpCode.SharpZipLib.BZip2;

Regarding my environment, I'm using Visual Studio 2019, coding with .NET Framework 4.7.2. I'm definitely willing to share more of my code, as well as the downloaded .bz2 file, if it would potentially be helpful to you. I am 100% grateful for all the help you've given me thus far, though!

piksel commented 4 years ago

I really have to go to bed, but if you could provide the bzip2 file I would probably have enough to reproduce. I'll take a look at it as soon as I can.

daviddassau commented 4 years ago

@piksel Ok, thank you so much for your help! Let me know if you have any issues downloading/viewing this file. I had to zip it up in order to upload it to GitHub. naxos_US_2019-07-07.txt.zip

Numpsy commented 4 years ago

fwiw, I gave this a quick go with the latest source and the above file and didn't see any exception, but the extracted file was tiny (only included the headers I think). (7-Zip pulled a lot more out of it).

Could this be related to multi-stream BZip2 files (and the lack of support for them)?

daviddassau commented 4 years ago

@Numpsy I'm running into the same issue as well, in regards to it successfully extracting, but only containing the headers.

piksel commented 4 years ago

That definitely sounds like the issue. It would also make sense that a file like this would use multiple streams. I found an interesting wrapper in a blog post: https://chaosinmotion.blog/2011/07/29/and-another-curiosity-multi-stream-bzip2-files/ The same approach should be possible with BZip2InputStream. Adding support for it in the library shouldn't be too hard either, but that's not a short-term solution.
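For readers following along, here is a minimal sketch of what such a wrapper might look like with BZip2InputStream. It is not the code from the blog post or from any gist in this thread. The names MultiStreamBZip2 and DecompressAll are hypothetical, and the sketch assumes that the input is seekable, that the decoder stops reading exactly at the end of each bzip2 stream, and that the file contains no trailing padding after the last stream.

```csharp
using System;
using System.IO;
using ICSharpCode.SharpZipLib.BZip2;

// Hypothetical helper, not part of SharpZipLib.
static class MultiStreamBZip2
{
    // Decompresses every concatenated bzip2 stream found in `input` into `output`.
    public static void DecompressAll(Stream input, Stream output)
    {
        if (!input.CanSeek)
        {
            throw new ArgumentException(
                "Input must be seekable so the end of the file can be detected.",
                nameof(input));
        }

        // Assumption: after one stream is fully decoded, input.Position
        // points at the header of the next stream (or at the end of the file).
        while (input.Position < input.Length)
        {
            // IsStreamOwner = false keeps `input` open when this decoder is disposed.
            using (var bz = new BZip2InputStream(input) { IsStreamOwner = false })
            {
                bz.CopyTo(output);
            }
        }
    }
}
```

A caller would open the .bz2 file with File.OpenRead, create the target with File.Create, and pass both streams to DecompressAll. If the file has extra bytes after the final stream, the loop above would try to parse them as another header and fail, so a production version would need to handle that case.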

Numpsy commented 4 years ago

The (somewhat old) issue #162 references that same blog post.

piksel commented 4 years ago

Yeah, it was literally the first Google result :D

piksel commented 4 years ago

That totally works!

SharpZipLib Issue 413

Decompressing WITHOUT MultiStreams:
  Successfully decompressed BZ2 file.
  Output file size: 206 byte(s) (1 line(s))
  Decompression time: 0.041s

Decompressing WITH MultiStreams:
  Successfully decompressed BZ2 file.
  Output file size: 172220713 byte(s) (969524 line(s))
  Decompression time: 16.279s

Source: https://gist.github.com/piksel/7ade2571713b992e4c532a93385067f8

I am currently working on a PR to fix this inside BZip2InputStream instead.

daviddassau commented 4 years ago

Sorry I'm a little late getting back to the conversation. Thank you both so much for taking a look into this issue for me! I was, however, able to find a workaround that uses 7-Zip. All I had to do was reference the 7-Zip .exe from my project, call the Command Prompt, and pass in a single argument with the BZ2 file and where exactly it should be decompressed to. And it worked! However, I am looking forward to seeing the solution you come up with, @piksel. Thank you once again!
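The exact command was not posted, so the following is only a rough sketch of how such a 7-Zip workaround could look when the executable is started directly rather than through the Command Prompt. SevenZipWorkaround, BuildArguments and Extract are hypothetical names, and the path to the 7-Zip executable is assumed to be supplied by the caller. The "e", "-o" and "-y" switches are standard 7-Zip command-line options.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper illustrating the 7-Zip workaround described above.
static class SevenZipWorkaround
{
    // "e" extracts, "-o<dir>" sets the output directory (no space after -o),
    // and "-y" answers yes to any prompts such as overwrite confirmations.
    // Note: outputDirectory should not end with a backslash, because a
    // trailing backslash would escape the closing quote.
    public static string BuildArguments(string archivePath, string outputDirectory)
    {
        return $"e \"{archivePath}\" \"-o{outputDirectory}\" -y";
    }

    public static void Extract(string sevenZipExePath, string archivePath, string outputDirectory)
    {
        var startInfo = new ProcessStartInfo
        {
            FileName = sevenZipExePath,
            Arguments = BuildArguments(archivePath, outputDirectory),
            UseShellExecute = false,
            CreateNoWindow = true,
        };

        using (var process = Process.Start(startInfo))
        {
            process.WaitForExit();
            if (process.ExitCode != 0)
            {
                throw new InvalidOperationException(
                    $"7-Zip exited with code {process.ExitCode}.");
            }
        }
    }
}
```

Treating every non-zero exit code as a failure is a deliberately strict choice in this sketch; 7-Zip also uses exit code 1 for warnings, which some callers may prefer to tolerate.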