squid-box / SevenZipSharp

Fork of SevenZipSharp on CodePlex
GNU Lesser General Public License v3.0

unimplemented and slow extraction of solid archives containing many entries #21

Open fartwhif opened 6 years ago

fartwhif commented 6 years ago
public void ExtractFiles(ExtractFileCallback extractFileCallback)
{
    DisposedCheck();
    InitArchiveFileData(false);
    if (IsSolid)
    {
        // solid strategy
    }
    else
    {
        // non-solid strategy
    }
}

I commented out the first branch so that it always uses the non-solid implementation, but:

I've been trying to extract all files from some solid 7z files. It's fast at first, but the further it gets the slower it becomes, eventually slowing to almost a stop; because of this it takes 6 hours instead of roughly 1 minute.

It's doing a triangle: each extraction seems to redo the work of all the entries before it.

ExtractFile(Index, Stream);

exhibits the same slowdown.
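
For reference, the calling pattern that shows the slowdown is roughly the following (a minimal sketch; the archive path is a placeholder and the output handling is simplified):

using System.IO;
using SevenZip;

// SevenZipBase.SetLibraryPath(...) may be needed first, depending on setup.
using (var extractor = new SevenZipExtractor(@"big-solid-archive.7z"))
{
    // Extract every entry one at a time. With a solid archive, reaching
    // entry i apparently requires decoding everything before it, so each
    // iteration gets slower than the last.
    for (int i = 0; i < extractor.ArchiveFileData.Count; i++)
    {
        using (var ms = new MemoryStream())
        {
            extractor.ExtractFile(i, ms);
        }
    }
}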

fartwhif commented 6 years ago

I'm looking at it now. It seems the library has to extract a range of entries in order to reach the right one (maybe it has to?), and the overall strategy perhaps needs to be reworked to fix the slowdown, maybe catering more to the library and sacrificing the ability to skip certain files. I'll tinker more and post again.

fartwhif commented 6 years ago

I managed to hack the code to make this work.

Before: 06:02:11.8560134. After: 00:00:12.6008404. That's about 1810 times the speed.

I may make a PR for this; it's rather hacky, but it's worth a shot.

fartwhif commented 6 years ago

Here it is if you want to take a look: https://github.com/fartwhif/SevenZipSharp/tree/speed

I'm going to test further before making a PR

squid-box commented 6 years ago

I'll have a closer look a little later, but my initial reaction is that I'm hesitant to change the public method signatures (though I haven't checked enough to see whether that can be avoided).

I do like the speed increase and will probably test it myself. Do you have an example archive to test against, or at least the specs of the one you used (number of files and size)?

fartwhif commented 6 years ago

The test was done with three threads, each working on a different 7z file (not volume archives). Each 7z contains thousands of files; all three together total 770 MB and 6984 files.

One thing is for sure: the massive memory usage may need to be looked at (it shoots up to 2 GB for a moment; I'm not sure whether that's my test apparatus' fault). The GC seems to do its job well with default settings, and it should be up to the caller, not SevenZipSharp, to detach memory streams so they can be cleaned up. Perhaps that's all it is: the short lifetime of the MemoryStreams within my test apparatus makes memory usage skyrocket because things are moving so fast, and with a debugger attached the GC is extra lax. The whole spike flashes by, and if a fake stream were used I don't think it would happen at all. I highly doubt tinkering with the GC within SevenZipSharp could change this aspect of the test.
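
To illustrate what I mean by a fake stream for benchmarking (my own sketch, not part of SevenZipSharp): a sink stream that counts bytes but keeps nothing, so the test harness itself never holds extracted data in memory:

using System;
using System.IO;

// A throwaway sink for benchmarking: accepts writes, retains nothing.
class CountingNullStream : Stream
{
    public long BytesWritten { get; private set; }

    public override bool CanRead => false;
    public override bool CanSeek => false;
    public override bool CanWrite => true;
    public override long Length => BytesWritten;
    public override long Position { get => BytesWritten; set { } }

    public override void Flush() { }
    public override int Read(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    public override long Seek(long offset, SeekOrigin origin) => BytesWritten;
    public override void SetLength(long value) { }
    public override void Write(byte[] buffer, int offset, int count) => BytesWritten += count;
}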

[attached graph: the hockeystick / triangle pattern]

The problem fixed here would be exacerbated if you threw everything into a single solid 7z of 770 MB with 6984 files. I think the test I did actually reduced the magnitude of the improvement measured, so a better test is also in order.

Another abstraction could be used to hide those stream classes. That's fine, just add another overload please; I wouldn't expect a full rewrite so fast! An abstraction could be used to keep the same signature as this: public void ExtractFiles(ExtractFileCallback extractFileCallback). As I understand it, if the underlying stream is null then 7z skips extracting that entry.
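
For context, this is roughly how the existing ExtractFiles(ExtractFileCallback) signature is driven from the caller's side, as I understand the CodePlex-era API (the ExtractFileCallbackArgs property and enum names here are from memory, so treat this as a sketch); leaving both ExtractToStream and ExtractToFile unset is what should make 7z skip an entry:

extractor.ExtractFiles(args =>
{
    switch (args.Reason)
    {
        case ExtractFileCallbackReason.Start:
            // Assign a target to extract into, or assign nothing to skip
            // this entry (this is where the "null stream" question comes in).
            if (!args.ArchiveFileInfo.IsDirectory)
            {
                args.ExtractToStream = new MemoryStream();
            }
            break;

        case ExtractFileCallbackReason.Done:
            // The entry has been written; consume and release the stream.
            args.ExtractToStream?.Dispose();
            break;

        case ExtractFileCallbackReason.Failure:
            args.CancelExtraction = true;
            break;
    }
});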

fartwhif commented 6 years ago

Before:

the hockeystick / triangle pattern

// outer loop over every entry index
for (N = 0; N < indexes.Length; N++)
{
    // inner loop driven from within 7z while extracting entry N
    for (G = 0; G <= N; G++)
    {
        if (G == N)
        {
            // assign my real stream for the entry I actually want
        }
        else
        {
            // assign a fake stream; I could be mistaken, but I think 7z is
            // actually extracting these uninteresting entries too, because
            // the stream it gets is not null, it's a "null stream".
        }
    }
}
To obtain every entry in a solid archive it uses an outer loop to infer the callback index, and since it is extracting uninteresting entries along the way, you can imagine why it's 1810x slower than the direct method (and moderately slower even for single-entry extractions).
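
To put a rough number on it (my own back-of-the-envelope, assuming each callback with a non-null stream costs one entry decode): for the 6984 entries in my test set, the triangle pattern decodes about N(N+1)/2 entries in total, versus N for the direct pattern, which is the same order of magnitude as the measured 1810x speedup once entry sizes and other overhead are factored in.

long n = 6984;
long triangle = n * (n + 1) / 2;       // 24,391,620 entry decodes
long direct = n;                       //      6,984 entry decodes
Console.WriteLine(triangle / direct);  // ~3492x more decode work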

After: with the hack it's doing something like the following. A Func<OutStreamWrapper> getStream parameter has to be added to many signatures to push the 7z callback all the way out to the caller, so that the outer loop can be eliminated.

the direct pattern

tell 7z to extract entries 0 through indexes.Length - 1 (all entries)

GetArchiveExtractCallback works for this too {
    callback from 7z
    caller assigns stream

    callback from 7z
    tell caller entry is extracted to stream and entry details
}

This is the caller I'm using. It has to check whether the MemoryStream is null; it's important to know that (maybe it only happens for folders, I'm not sure, but perhaps not). Perhaps the previous signature could be implemented around this.

OutStreamWrapper osw = null;
MemoryStream ms = null;

// Supplies the output stream that 7z writes the next entry into.
Func<OutStreamWrapper> getStream = () =>
{
    ms = new MemoryStream();
    osw = new OutStreamWrapper(ms, false);
    osw.BytesWritten += (source, e) =>
    {
        //bw.Write(e.Value);
    };
    return osw;
};

Action<FileInfoEventArgs> fileExtractStart = args =>
{
};

// Runs after each entry has been written to the MemoryStream.
Action<FileInfoEventArgs> fileExtractComplete = args =>
{
    if (ms != null && !args.FileInfo.IsDirectory && args.FileInfo.FileName != "[no name]")
    {
        Result res = new Result()
        {
            ArchiveIndex = args.FileInfo.Index,
            FileStream = ms,
            Path = args.FileInfo.FileName,
            Created = args.FileInfo.CreationTime,
            Modified = args.FileInfo.LastWriteTime,
            Length = (long)args.FileInfo.Size
        };
        using (MD5 md5 = MD5.Create())
        {
            // Hash the extracted entry, then release the stream so the GC can reclaim it.
            res.FileStream.Seek(0, SeekOrigin.Begin);
            byte[] hash = md5.ComputeHash(res.FileStream);
            res.Md5 = hash.AsHex();
            res.Md5Split = hash.Md5Split();
            GetAllChecksums(res);
            //Global.IncrementCounterAndDisplay();
            res.FileStream.Dispose();
            res.FileStream = null;
            result.Files.Add(res);
        }
    }
};

arc.ExtractFiles(null, getStream, fileExtractStart, fileExtractComplete);

I don't know the full extent of the slowdown; perhaps it only occurs when you need to extract all entries from a solid 7z that contains many files totalling a few hundred MB compressed. Perhaps the same type of problem affects other scenarios as well. I would definitely look into that "null stream" in the hockeystick / triangle pattern.

fartwhif commented 6 years ago

Do you want me to make a PR? I'll leave up my fork so you can reference it.

squid-box commented 6 years ago

Go ahead, I haven't really had time to look into this yet and can't promise when I will.

TimzChen commented 4 years ago

Using sharpcompress to decompress is a temporary solution for me.

TomArrow commented 3 years ago

I am not 100% sure this is the same issue, but I have a 7zip file that's about 20 MB in size and contains about half a million text files (which differ from each other only slightly). Unpacked, we're looking at tens of gigabytes, so working with the plain files is impractical. My software only needs to iterate through all the files in the archive in sequential order, but with the way things are now (I think), it gets progressively slower and slower.

What's also specifically important to me is to read the files into memory and not write them to disk, as that would be ridiculously slow for obvious reasons.

The perfect API, in my opinion, would be one of two things: either an "ExtractNext()" function that remembers the state of the last extraction and can continue from there, and/or some kind of asynchronous(?) call that is passed a callback function that gets called every time a file is extracted, so that it can be handled with a lambda, like a loop. For example:

ExtractedFile file;
while ((file = extractor.ExtractNext()) != null)
{
    // file.FileName contains the filename if necessary
    byte[] data = file.GetBytes();
}

or

extractor.SequentialExtractFiles((ExtractedFile file) =>
{
    // file.FileName contains the filename if necessary
    byte[] data = file.GetBytes();
});

Edit: In the latter case, it would be ideal if the callbacks were called in the correct order, i.e. the callback for file 2 would only be called after the callback for file 1 has completed, and so on. It could (but wouldn't have to) extract the next files in the background on other threads. Maybe this could be made an option where you specify whether you care about the callbacks being called in the correct order, but I don't see any need for an option to turn that off.
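
As a sketch of what I mean (my own illustration; ExtractedFile, ExtractNext and SequentialExtractFiles are the proposed names, not an existing API): a background task extracts entries in archive order and queues them, while the caller's callback runs on the consuming side in that same order:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public void SequentialExtractFiles(Action<ExtractedFile> callback)
{
    // Bounded queue so the producer can't run too far ahead of the consumer.
    var queue = new BlockingCollection<ExtractedFile>(boundedCapacity: 8);

    var producer = Task.Run(() =>
    {
        try
        {
            ExtractedFile file;
            while ((file = ExtractNext()) != null)  // proposed API from above
            {
                queue.Add(file);                    // archive order preserved
            }
        }
        finally
        {
            queue.CompleteAdding();
        }
    });

    // Callbacks fire strictly in archive order on this thread.
    foreach (var file in queue.GetConsumingEnumerable())
    {
        callback(file);
    }

    producer.Wait();
}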

I realize it's probably not a terribly common problem to have, but it hurts a lot.

Edit 2: I can confirm sharpcompress does not have this issue. What took me 6 hours in optimized Release code with this library, sharpcompress did within a minute in Debug mode.
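
For reference, the SharpCompress workaround looks roughly like this (a minimal sketch based on its archive/reader API; the path is a placeholder). ExtractAllEntries() walks the solid 7z in archive order, so each entry is decoded exactly once and can be processed in memory:

using System.IO;
using SharpCompress.Archives.SevenZip;

using (var archive = SevenZipArchive.Open(@"big-solid-archive.7z"))
using (var reader = archive.ExtractAllEntries())
{
    while (reader.MoveToNextEntry())
    {
        if (reader.Entry.IsDirectory)
        {
            continue;
        }

        using (var entryStream = reader.OpenEntryStream())
        using (var ms = new MemoryStream())
        {
            entryStream.CopyTo(ms);   // keep the file in memory
            byte[] data = ms.ToArray();
            // process data; reader.Entry.Key is the entry's file name
        }
    }
}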