haskell / tar

Reading, writing and manipulating ".tar" archive files.
https://hackage.haskell.org/package/tar

Memory leak in unpack? #57

Closed: hasufell closed this issue 10 months ago

hasufell commented 4 years ago

Running the following command on a decompressed GHC bindist (e.g. ghc-8.4.3-x86_64-fedora27-linux.tar.xz), with profiling enabled, shows a memory consumption of ~266 MB:

htar +RTS -p -hc -RTS -x -f ghc-8.4.3-x86_64-fedora27-linux.tar

The profiling info is attached here. All I see is PINNED memory. I can't make any sense out of it.

This also happens in my fork (which is the only tar package that can extract GHC bindists without errors): https://hackage.haskell.org/package/tar-bytestring

This does not happen with:

hasufell commented 4 years ago

@dcoutts

hasufell commented 3 years ago

Ok, so I figured it out with help from #haskell. The memory leaks here are expected and cannot be fixed with lazy bytestring:

https://github.com/haskell/tar/blob/a0d722c1f6052bf144017d131a715ea1ae599964/Codec/Archive/Tar/Read.hs#L117-L119

The lazy bytestring that points to the file contents and the one that points to the rest (bs' here) share the same thunk. bs' is passed on to the next iteration of unfoldEntries and so blocks stream fusion.

Then BS.writeFile in unpack will force the entire file contents into memory. It will be freed on the next iteration though. So the memory peak of unpacking is determined by the largest file.
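
To illustrate, here is a simplified sketch of that split (not the actual Read.hs source; the helper names are mine):

    import           Data.Int (Int64)
    import qualified Data.ByteString.Lazy as LBS

    -- Sketch of the split in Codec.Archive.Tar.Read: this entry's contents and
    -- the rest of the archive come from splitting one lazy ByteString, so both
    -- refer to the same underlying thunk.
    readEntry :: Int64 -> LBS.ByteString -> (LBS.ByteString, LBS.ByteString)
    readEntry size bs =
      let content = LBS.take size bs  -- this entry's file contents
          bs'     = LBS.drop size bs  -- rest of the archive, fed to the next iteration
      in (content, bs')

    -- As described above, writing the contents forces them into memory; they
    -- are freed on the next iteration, so peak residency is roughly the size
    -- of the largest file in the archive.
    writeEntry :: FilePath -> LBS.ByteString -> IO ()
    writeEntry = LBS.writeFile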

I've attempted a prototype of converting to streamly and it worked in constant memory. But I'm having efficiency issues that need sorting out: https://github.com/hasufell/tar-bytestring/commit/bd21bf979d4666806c7fd4850bf3103a532ffc36#diff-cafc5a404a211b294e8976f28f99048766774cc6414507cc2f5032dc5b5bc376R71

@Bodigrim I think going with streamly and getting rid of the Entries type is the way to go. I'll do more experimentation.

Bodigrim commented 3 years ago

I'm afraid that streamly brings in too many dependencies. Any chance we can get away with a more lightweight (and probably low tech) solution?

hasufell commented 3 years ago

I'm afraid that streamly brings in too many dependencies.

Why? Is there an upper bound of number of dependencies?

Any chance we can get away with a more lightweight (and probably low tech) solution?

I don't think so.

Bodigrim commented 3 years ago

There is no hard limit. But I am afraid of a situation where cabal-install or hackage-server cannot upgrade to a new GHC because one of tar's dependencies is lagging behind. This is not a dig at streamly (FWIW I think it is excellently maintained), rather a general consideration that such decisions should not be taken lightly.

On the other hand, I do understand that being unable to unpack a GHC tar archive is pretty much a deal breaker :)

CC @emilypi @fgaz @gbaz

hasufell commented 3 years ago

I'm more afraid of tar blowing up users' machines due to memory leaks than I am of adding a well-maintained high-performance library as a dependency.

emilypi commented 3 years ago

@Bodigrim I agree that streamly brings in too much. Contrary to what Julian says, there are a few options:

  1. A low-tech, zero non-boot dependency solution would be vector streams (a sketch follows after this list):

    type Octets = Data.Vector.Fusion.Stream.Monadic.Stream IO Word8

  2. A lighter-cost but still very performant solution would be the streaming package, which doesn't add too many non-boot packages: http://hackage.haskell.org/package/streaming. This one has stood the test of production for long-running apps and large streams.
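
A minimal sketch of option 1, assuming vector's Data.Vector.Fusion.Stream.Monadic as the stream type (the helper names and the chunk size are illustrative, not an existing API):

    import qualified Data.ByteString as BS
    import qualified Data.Vector.Fusion.Stream.Monadic as S
    import           Data.Word (Word8)
    import           System.IO (Handle, IOMode (ReadMode), withFile)

    -- Hypothetical byte-stream type, as suggested in option 1 above.
    type Octets = S.Stream IO Word8

    -- Unfold a handle into a stream of bytes, reading 32K chunks internally.
    -- (A production version would stream chunks rather than individual bytes
    -- to avoid per-byte overhead.)
    octetsFromHandle :: Handle -> Octets
    octetsFromHandle h = S.unfoldrM step BS.empty
      where
        step buf
          | not (BS.null buf) = pure (Just (BS.head buf, BS.tail buf))
          | otherwise = do
              chunk <- BS.hGetSome h 32768
              pure $ if BS.null chunk
                       then Nothing
                       else Just (BS.head chunk, BS.tail chunk)

    -- Example consumer: count the bytes of an archive in constant memory.
    main :: IO ()
    main = withFile "ghc-8.4.3-x86_64-fedora27-linux.tar" ReadMode $ \h ->
      S.foldl' (\n _ -> n + 1) (0 :: Int) (octetsFromHandle h) >>= print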

Why? Is there an upper bound of number of dependencies?

Any new dependencies would add new dependencies upstream of 67 other packages. We have to carefully consider them. I'm not fundamentally opposed to adding dependencies, FWIW.

fgaz commented 3 years ago

Dep tree for reference (deps attached).

hasufell commented 3 years ago

lighter-cost but still very performant solution would be the streaming package

It's nowhere near streamly: https://github.com/composewell/streaming-benchmarks#monadic-streams

This one has stood the test of production for long-running apps and large streams.

Streamly is used in production as well.

Any new dependencies would add new dependencies upstream of 67 other packages.

Sorry, what do you mean by 67 packages? AFAICS the only additional packages it would pull in would be roughly network, monad-control and heaps.

A low-tech, zero non-boot dependency solution would be vector streams:

I've never done anything like that. Would you be volunteering to implement a solution that way?


edit: also relevant https://github.com/composewell/streamly/issues/533

emilypi commented 3 years ago

It's nowhere near streamly

Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors. As a rule, solutions claiming to be a panacea need more vetting than otherwise. To be clear, it's not off the table.

Streamly is used in production as well

Links?

Sorry, what do you mean by 67 packages? AFAICS the only additional packages it would pull in would be roughly network, monad-control and heaps.

Reverse dep lookup says many packages depend on tar, and you would be inserting these new dependencies into their build stream.

Would you be volunteering to implement a solution that way?

It's not within my bandwidth currently.

hasufell commented 3 years ago

Yes, benchmarks exist and they say things. Whether they say useful things is a question left up to the implementation, and streamly's benchmarks are not super useful for gauging complicated behaviors.

Tar isn't complicated. And to be fair, it's mostly IO-bound, I'd say, so even using conduit would theoretically be an option (a sketch follows below).
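
For context, a constant-memory streaming copy with conduit is about this much code (illustrative only; the file names are placeholders and this is not a tar unpacker):

    import Conduit (runConduitRes, sinkFile, sourceFile, (.|))

    -- Stream one file into another in constant memory: the kind of IO-bound
    -- plumbing a streaming unpacker would be built on.
    main :: IO ()
    main = runConduitRes $ sourceFile "big.tar" .| sinkFile "big-copy.tar"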

Links?

I don't have a complete list.

I've also used it myself in a performance critical setting of an event sourced platform.

Reverse dep lookup says many packages depend on tar, and you would be inserting these new dependencies into their build stream.

Yes.

It's not within my bandwidth currently.

Well, the bandwidth and motivation of contributors has an impact on viable options.

gbaz commented 3 years ago

This doesn't appear to be a leak, just an algorithm whose footprint is bounded by the largest file in the tarball, which for many use cases should be just fine. (In particular, many things using this lib will need to have the unpacked files resident in memory anyway, to operate on them.) In my opinion it would suffice simply to document this. A more purely streaming interface will have different tradeoffs, and there's certainly a use case for that too -- but that could be provided by another library just fine.

hasufell commented 3 years ago

I strongly disagree with that.

'tar' takes up the most common name that people searching for a proper tar library will come across, but it is currently the worst option, with a plethora of bugs, and it fails to correctly unpack most tarballs found in the wild.

The only use case that seems to work somewhat is that of cabal-install, and even there things break if you happen to create a filepath that is too long.

The state of this library is embarrassing and it is not enough to point out that it works for small archives and corner cases.

It should either be the go-to library, or be incorporated into cabal-install and then abandoned on Hackage, so users don't waste their time tracking down functional bugs and performance issues.


I'm sorry if that sounds harsh, but these things affect people in production, who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.

emilypi commented 3 years ago

I'm sorry if that sounds harsh, but these things affect people in production, who made the wrong choice of relying on this lib without investigating its state. Existing users aren't the only concern. Potential users are too.

@Bodigrim, @davean and myself just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility. I have repeatedly seen borderline abusive behavior from you, and I am not going to tolerate it any further. If you can't voice your negativity in a constructive manner, don't voice it at all.

hasufell commented 3 years ago

@Bodigrim, @davean and myself just picked this package up from Duncan weeks ago to see about improving it. You are browbeating the people who are actually trying to help improve the state of this library, and it is completely uncalled for and unacceptable that you're acting with such hostility.

That's untrue. I haven't made a single ad hominem argument, and I have explained why I think simply documenting the current behavior is not enough, because that's too low a standard for a core library.

I have repeatedly seen borderline abusive behavior from you

That's a strong claim to make in public without proper backup and I do not appreciate that.

If you can't voice your negativity in a constructive manner, don't voice it at all.

You mean all the patches I wrote are not constructive? :)

cartazio commented 3 years ago

Let’s all take a step back and at least recognize that we all want tar to be better, and that we mostly just disagree on the path, not the end point.

I’m willing to put some time into going through some of the outstanding PRs that have languished too long to help triage them if that would help provide an amicable way forward.

Bodigrim commented 3 years ago

Folks, please let's be kinder to each other.

I suggest we wait some time for https://github.com/composewell/streamly/issues/533. Let's focus on other issues in the meantime.

hasufell commented 10 months ago

Streamly is now split into streamly-core, which only depends on boot libraries: https://hackage.haskell.org/package/streamly-core

Bodigrim commented 10 months ago

@hasufell could you give a try to d94a988be4311b830149a9f8fc16739927e5fc1c? I'm not really sure why ;) but it seems to make things better:

$ cabal run htar -- -x -f ghc-8.4.3-x86_64-fedora27-linux.tar +RTS -s -M30M
   2,742,724,576 bytes allocated in the heap
      14,786,128 bytes copied during GC
       1,970,912 bytes maximum residency (175 sample(s))
         908,576 bytes maximum slop
              23 MiB total memory in use (4 MiB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       337 colls,     0 par    0.008s   0.008s     0.0000s    0.0001s
  Gen  1       175 colls,     0 par    0.016s   0.019s     0.0001s    0.0003s

  INIT    time    0.001s  (  0.001s elapsed)
  MUT     time    1.565s  (  1.860s elapsed)
  GC      time    0.024s  (  0.027s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    1.590s  (  1.890s elapsed)

  %GC     time       0.0%  (0.0% elapsed)

  Alloc rate    1,752,331,400 bytes per MUT second

  Productivity  98.5% of total user, 98.4% of total elapsed

Bodigrim commented 10 months ago

I've added a CI job to test memory consumption on large files, seems all good.

hasufell commented 10 months ago

It seems to work.