Closed — hasufell closed this issue 10 months ago
@dcoutts
Ok, so I figured it out with help from #haskell. The memory leaks here are expected and cannot be fixed with lazy bytestring:

The lazy bytestring that points to the file contents and the one that points to the rest (`bs'` here) share the same thunk. `bs'` is passed on to the next iteration of `unfoldEntries` and so blocks stream fusion. Then `BS.writeFile` in `unpack` will force the entire file contents into memory. It will be freed on the next iteration though, so the memory peak of unpacking is determined by the largest file.
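For illustration, here is a hedged, self-contained sketch of the pattern described above. The `writeEntry` function and its `size` parameter are my own invention, not tar's actual API: `BL.splitAt` makes the two halves share the same underlying chunks, and converting the first half to a strict bytestring for `BS.writeFile` forces that whole entry into memory at once.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.Int (Int64)

-- Illustrative sketch (not tar's real API): `content` and `rest` share
-- the same underlying lazy chunks after `splitAt`. `BL.toStrict` plus
-- the strict `BS.writeFile` forces all of `content` into memory; it
-- becomes garbage on the next iteration, so peak residency tracks the
-- largest entry in the archive.
writeEntry :: FilePath -> Int64 -> BL.ByteString -> IO BL.ByteString
writeEntry path size bs = do
  let (content, rest) = BL.splitAt size bs
  BS.writeFile path (BL.toStrict content)  -- forces the whole entry
  pure rest

main :: IO ()
main = do
  let archive = BL.fromStrict (BS.replicate 100 0x41)  -- pretend payload
  rest <- writeEntry "entry.bin" 64 archive
  print (BL.length rest)  -- 36
```

The loop hands `rest` to the next iteration, which is exactly the sharing that blocks fusion in the real code.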
I've attempted a prototype of converting to streamly and it worked in constant memory. But I'm having efficiency issues that need sorting out: https://github.com/hasufell/tar-bytestring/commit/bd21bf979d4666806c7fd4850bf3103a532ffc36#diff-cafc5a404a211b294e8976f28f99048766774cc6414507cc2f5032dc5b5bc376R71
@Bodigrim I think going with streamly and getting rid of the `Entries` type is the way to go. I'll do more experimentation.
I'm afraid that streamly brings in too many dependencies. Any chance we can get away with a more lightweight (and probably low-tech) solution?
> I'm afraid that streamly brings in too many dependencies.

Why? Is there an upper bound on the number of dependencies?

> Any chance we can get away with a more lightweight (and probably low tech) solution?

I don't think so.
There is no hard limit. But I am afraid of a situation where `cabal-install` or `hackage-server` cannot upgrade to a new GHC because one of `tar`'s dependencies is lagging behind. This is not a dig at streamly (FWIW I think that it is excellently maintained), rather a general consideration that such decisions should not be taken lightly.
On the other hand, I do understand that being unable to unpack a GHC tar archive is pretty much a deal breaker :)
CC @emilypi @fgaz @gbaz
I'm more afraid of `tar` blowing up users' machines due to memory leaks than I am of adding a well-maintained, high-performance library as a dependency.
@Bodigrim I agree that streamly brings in too much. Contrary to what Julian says, there are a few options:

A low-tech, zero non-boot dependency solution would be vector streams:

`type Octets = Data.Vector.Fusion.Stream.Monadic.Stream IO Word8`

A lighter-cost but still very performant solution would be the `streaming` package, which doesn't add too many non-boot packages: http://hackage.haskell.org/package/streaming. This one has stood the test of production for long-running apps and large streams.
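As a sketch of what the vector-streams option might look like in practice, here is a minimal example over `Data.Vector.Fusion.Stream.Monadic` from the boot-adjacent `vector` package. The `Octets` alias follows the suggestion above; `checksum` is a made-up function for illustration, not a proposed API:

```haskell
import qualified Data.Vector.Fusion.Stream.Monadic as S
import Data.Word (Word8)

-- The alias suggested above: a monadic stream of bytes.
type Octets = S.Stream IO Word8

-- Illustrative consumer: strictly fold the stream into a byte sum.
-- Stream fusion means no intermediate structure is allocated.
checksum :: Octets -> IO Int
checksum = S.foldl' (\acc w -> acc + fromIntegral w) 0

main :: IO ()
main = do
  n <- checksum (S.fromList [1, 2, 3, 250])
  print n  -- 256
```

A real tar consumer would build the stream from file chunks rather than a list, but the fold/unfold shape is the same.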
> Why? Is there an upper bound on the number of dependencies?

Any new dependencies would add new dependencies upstream of 67 other packages. We have to consider them carefully. I'm not fundamentally opposed to adding dependencies, FWIW.
Dep tree for reference
> A lighter-cost but still very performant solution would be the streaming package

It's nowhere near streamly: https://github.com/composewell/streaming-benchmarks#monadic-streams

> This one has stood the test of production for long-running apps and large streams.

Streamly is used in production as well.
> Any new dependencies would add new dependencies upstream of 67 other packages.

Sorry, what do you mean with 67 packages? Afais the only additional packages it would pull in would be roughly `network`, `monad-control` and `heaps`.
> A low-tech, zero non-boot dependency solution would be vector streams

Have never done anything like that. Would you be volunteering to implement a solution that way?

Edit: also relevant: https://github.com/composewell/streamly/issues/533
> It's nowhere near streamly

Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors. As a rule, anyone implementing solutions claiming to be a panacea needs more vetting than otherwise. To be clear, it's not off the table.
> Streamly is used in production as well

Links?
> Sorry, what do you mean with 67 packages? Afais the only additional packages it would pull in would be roughly network, monad-control and heaps.

Reverse dep lookup says many packages depend on `tar`, and you would be inserting these new dependencies into their build stream.
> Would you be volunteering to implement a solution that way?

It's not within my bandwidth currently.
> Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors.

Tar isn't complicated. And tbf, it's mostly IO-bound, I'd say. So even using `conduit` would theoretically be an option.
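To illustrate the kind of constant-memory IO that conduit would provide, here is a hedged, self-contained sketch. The file names and payload size are invented for the example; this is not a proposed tar API, just the generic streaming-copy shape:

```haskell
import Conduit (runConduitRes, sourceFile, sinkFile, (.|))
import qualified Data.ByteString as BS

main :: IO ()
main = do
  -- Set up a dummy 1 MB input file for the demonstration.
  BS.writeFile "in.bin" (BS.replicate 1000000 0x61)
  -- Stream it to a new file: chunks flow from source to sink one at a
  -- time, so residency stays constant regardless of file size.
  runConduitRes $ sourceFile "in.bin" .| sinkFile "out.bin"
  sz <- BS.length <$> BS.readFile "out.bin"
  print sz  -- 1000000
```

A tar unpacker built this way would interleave header parsing into the pipeline, but the IO-bound copy above is the core of the work.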
> Links?

I don't have a complete list. I've also used it myself in a performance-critical setting of an event-sourced platform.
> Reverse dep lookup says many packages depend on tar, and you would be inserting these new dependencies into their build stream.

Yes.
> It's not within my bandwidth currently.

Well, the bandwidth and motivation of contributors have an impact on viable options.
This doesn't appear to be a leak, just an algorithm that has a footprint bounded by the largest file in the tarball, which for many use cases should be just fine. (In particular, many things using this lib will need to have the unpacked files resident in memory anyway, to operate on them.) In my opinion it would suffice simply to document this. A more purely streaming interface will have different tradeoffs, and there's certainly a use-case for that too -- but that could be provided by another library just fine.
I strongly disagree with that.
'tar' takes the most common name that people searching for a proper tar library will come across, but it is currently the worst library, with a plethora of bugs, and it fails to correctly unpack most tarballs found in the wild.

The only use case that seems to work somewhat is that of cabal-install, and even there things break if you happen to create a filepath that is too long.

The state of this library is embarrassing, and it is not enough to point out that it works for small archives and corner cases.

It should become the go-to library, or it should be incorporated into cabal-install and then abandoned on hackage, so users don't waste their time tracking down functional bugs and performance issues.

I'm sorry if that sounds harsh, but these things affect people in production who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.
> I'm sorry if that sounds harsh, but these things affect people in production, who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.
@Bodigrim, @davean and myself just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility. I have repeatedly seen borderline abusive behavior from you, and I am not going to tolerate it any further. If you can't voice your negativity in a constructive manner, don't voice it at all.
> @Bodigrim, @davean and myself just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility.

That's untrue. I haven't brought up a single ad-hominem argument, and I have explained why simply documenting the current behavior is not enough: that's too low a standard for a core library.
> I have repeatedly seen borderline abusive behavior from you

That's a strong claim to make in public without proper backup, and I do not appreciate it.
> If you can't voice your negativity in a constructive manner, don't voice it at all.

You mean all the patches I wrote are not constructive? :)
Let's all take a step back and at least recognize that we all want tar to be better, and that we mostly just disagree on the path, not the end point.

I'm willing to put some time into going through some of the outstanding PRs that have languished too long, to help triage them, if that would help provide an amicable way forward.
Folks, please let's be kinder to each other.
I suggest we wait some time for https://github.com/composewell/streamly/issues/533. Let's focus on other issues in the meantime.
Streamly is now split into streamly-core, which only depends on boot libraries: https://hackage.haskell.org/package/streamly-core
@hasufell could you give a try to d94a988be4311b830149a9f8fc16739927e5fc1c? I'm not really sure why ;) but it seems to make things better:
```
$ cabal run htar -- -x -f ghc-8.4.3-x86_64-fedora27-linux.tar +RTS -s -M30M
   2,742,724,576 bytes allocated in the heap
      14,786,128 bytes copied during GC
       1,970,912 bytes maximum residency (175 sample(s))
         908,576 bytes maximum slop
              23 MiB total memory in use (4 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       337 colls,     0 par    0.008s   0.008s     0.0000s    0.0001s
  Gen  1       175 colls,     0 par    0.016s   0.019s     0.0001s    0.0003s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    1.565s  (  1.860s elapsed)
  GC      time    0.024s  (  0.027s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    1.590s  (  1.890s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,752,331,400 bytes per MUT second

  Productivity  98.5% of total user, 98.4% of total elapsed
```
I've added a CI job to test memory consumption on large files, seems all good.
It seems to work.
Using a decompressed GHC bindist (e.g. ghc-8.4.3-x86_64-fedora27-linux.tar.xz) and running the following line with profiling enabled shows a memory consumption of ~266 MB:
The profiling info is attached here. All I see is PINNED memory. I can't make any sense out of it.
This also happens in my fork (which is the only tar package that can extract GHC bindists without errors): https://hackage.haskell.org/package/tar-bytestring
This does not happen with:

- the `01-index.tar` archive (which is probably packed with `tar`? GHC bindists are not)