Memory issues when working with larger files

JordanPJamieson commented 3 years ago

I have a question related to a previously created issue and wanted to follow up. (https://github.com/tsolomko/SWCompression/issues/25)

Has there been any progress on resolving this issue regarding memory spikes when working with larger files.

For example, I'm currently trying to utilize the library to create a tar file and currently we are forced to pass the entire files data contents at once (load the entire files data into memory at a time) which causes rapid/exponential spikes in memory as Tar Entry objects are created. This could be fixed by reading chunks of a file at a time while creating the tar file.

Any feedback would be greatly appreciated.

Thanks a lot!

tsolomko commented 3 years ago

To answer your main question: no, there has been no progress on this issue.

However, I see now that there is a lot of demand for this kind of functionality. As such, I will try to do something about it, starting with your specific case (creating TAR). I will report back with some sort of implementation soon (hopefully, by the end of this week, but no guarantees).

JordanPJamieson commented 3 years ago

Thanks a lot for the quick response - I deeply appreciate it.

This functionality would help immensely.

For my specific case, I'm trying to create a tar file and write it to disk efficiently without holding so much data in memory. So a function that accepts tar entries (with a URL location of the tar/file entry rather than the Data of the file) and a URL to write the data in chunks to would be very helpful as well.

Thanks again!

tsolomko commented 3 years ago

So I did implement something. I added a set of relatively simple APIs that use Foundation's FileHandle:

public struct TarReader {
    public init(fileHandle: FileHandle)
    public mutating func next() throws -> TarEntry?
}

public struct TarWriter {
    public init(fileHandle: FileHandle, format: TarContainer.Format = .pax)
    public mutating func append(_ entry: TarEntry) throws
    public func finalize() throws
}

These should help avoid loading everything into memory at once and only do it as necessary.

I've tried to check if they indeed help with memory usage. I used /usr/bin/time -lp command to measure "maximum resident set size" and tarred "Microsoft Word.app" as a test file (2.26 GB). The results are a bit confusing. For the TarContainer.open function I've got 3.265 GB vs 1.746 GB for the version with TarReader. For the TarContainer.create and its TarWriter version I've got 3.515 GB vs. 2.279 GB, respectively. The full terminal output is available in the attached file (though it may be a bit cryptic). term_output.txt

So, what do you think about all of this? I should note that I have no idea whether or not this metric of "maximum resident set size" is a good one to evaluate memory usage. As I assume you did some memory-related measurements before, any advice on how to better estimate memory usage would be appreciated!

P.S. If you are interested in testing these new functions yourself, they are available in tar-rw branch.

philmitchell commented 2 years ago

I just submitted a PR to address this, but only for the case of untarring. If you want, I'll look at tarring, too ... but I happen to need only untarring :)

I used Xcode instruments to confirm that with the addition of autoreleasepool, the new rw-api branch can untar quite large files without any increase in transient or persistent memory usage.

Thanks Timofey ... and happy new year!

tsolomko commented 2 years ago

Hi @philmitchell,

Thank you for PR and happy new year to you too!

I had a look and, wow, this autoreleasepool thing does wonders: repeating the same tests as I did previously in this thread results in 0.19 GB "maximum resident set size" for when using TarReader + autoreleasepool vs. 2.93 GB for TarContainer.open.

As I understand now, the new TarReader/Writer types won't help with memory consumption issues by themselves: users will still have to use autoreleasepool to achieve the desirable effect. So I wonder, if it's possible to move the autoreleasepool inside TarReader, because otherwise these new types seem useless.

One way to achieve this would be to add a new function, something like TarReader.processNext<Result>(_ f: (TarEntry) -> Result) throws -> Result, which obtains a new TarEntry (similar tonext()) and calls the user-supplied function on it inside the autoreleasepool.

Anyway, this question is for me think about later in the week (alongside with merging your PR, etc.), so thank you once again!

tsolomko commented 2 years ago

I have some news to share.

First of all, as I discovered today, autoreleasepool is a Darwin-only thing, so it is not available on Linux and Windows. Apart from having to provide compatibility implementation for it on those platforms, I wonder, what is the situation with memory usage there. As in, if this new TarReader/Writer stuff will be even helpful on Linux/Windows. I've also found on Swift forums some relevant discussion about the autoreleasepool availability.

Secondly, I merged tar-rw branch into develop and made some additional changes:

TarReader.next() has been renamed into TarReader.read().
TarReader.process(_:) has been added (this is the function that automatically uses autoreleasepool).
Removed --use-rw-api option from swcomp tar: it will now automatically use TarReader/Writer, if possible (which means when there are no additional compression options supplied).

Note that there are no changes to TarWriter. As much as I would like to make it use autoreleasepool internally, so users wouldn't have to think about these subtleties, I don't think it is possible. So the only solution there would be to mention in documentation something along the lines of "you should wrap autoreleasepool around the TarWriter stuff".

philmitchell commented 2 years ago

Ha, that's super interesting! It was surprising to discover that we had to manually release memory during the file read, but now I'm really curious to know how it works in Swift on Linux/Windows. I'm swamped at the moment, but when I can will look into this.

Regarding removal of --use-rw-api: good call, I agree it should be the default. I actually ran some (very crude) tests to see if there was a performance hit when reading the tarfile in chunks and if anything it performs better. If that's correct, there's no reason to do it any other way.

tsolomko commented 2 years ago

4.8.0 has been released, which includes the new functionality discussed here!

tsolomko / SWCompression

Memory issues when working with larger files #28