snapframework / snap

Top-level package for the official Snap Framework libraries, includes the snaplets API as well as infrastructure for sessions, auth, and templates.
http://snapframework.com/
BSD 3-Clause "New" or "Revised" License

Is it possible that /admin/reload would make the server leak memory? #137

Closed adinapoli closed 9 years ago

adinapoli commented 9 years ago

Hello folks,

Sorry if this is no longer relevant for the upcoming 1.0 release, but I thought it would be useful to bring it up nevertheless.

I have a web server that essentially restarts itself if certain conditions are met; in particular, I have a main snaplet that holds the state of a connection to a RabbitMQ broker. If that connection is lost, for example due to a network failure, I use /admin/reload to reload the state of my snaplet back into memory.

As part of a simulated catastrophic failure, the server executed the reload function quite frantically until the broker came back online. While monitoring the server's memory, however, I noticed a huge amount of allocation. Profiling reveals the following:

screen shot 2015-02-09 at 15 47 06: https://cloud.githubusercontent.com/assets/442035/6108771/0fa0d9be-b076-11e4-94b9-15ce0b03cf39.png

screen shot 2015-02-09 at 15 52 56: https://cloud.githubusercontent.com/assets/442035/6108778/17d5f31c-b076-11e4-80a8-0e0119b9b00e.png

screen shot 2015-02-09 at 15 50 56: https://cloud.githubusercontent.com/assets/442035/6108782/1fcc5bba-b076-11e4-8247-c0e87a921f9e.png

I couldn't actually profile with -hr as I get this GHC error:

hermes-server: internal error: Invalid object in isRetainer(): 39
    (GHC version 7.8.3 for x86_64_apple_darwin)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
[1]    19712 abort      hermes-server -p 9939 +RTS -hr -L120

As far as I can see one possible suspect is this Chunk data structure, which may be coming from text:

http://hackage.haskell.org/package/text-1.0.0.0/docs/Data-Text-Lazy-Internal.html
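
For reference, a lazy Text is essentially a cons-list of strict chunks (lazy ByteString has the same shape in Data.ByteString.Lazy.Internal), which is why a retained lazy value shows up in a heap profile as Chunk constructors plus ARR_WORDS for the chunk payloads. A paraphrased sketch of the type, renamed here to avoid clashing with the real module:

    import qualified Data.Text as T

    -- Paraphrase of the shape of Data.Text.Lazy.Internal.Text: a chain of
    -- strict chunks with a lazy tail, so each retained chunk is one Chunk
    -- constructor plus an ARR_WORDS payload in the heap profile.
    data LazyText
        = Empty
        | Chunk !T.Text LazyText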

Browsing through the snap code, I can see that the reload function does nothing but call an IO action which yields an Either Text Text:

https://github.com/snapframework/snap/blob/fadf49ac8bea51bc9c3d7ab3d794a3cd7b55a8da/src/Snap/Snaplet/Internal/Types.hs#L54

In particular, I can see that you are forcing the result of this IO action in the reload handler:

https://github.com/snapframework/snap/blob/fadf49ac8bea51bc9c3d7ab3d794a3cd7b55a8da/src/Snap/Snaplet/Internal/Types.hs#L398

But despite that, since Either is a lazy data structure, the Text contained within will not be forced, and this might explain the leak. I'm not fully convinced by this track though, as ByteString also has a Chunk type, and that ARR_WORDS might be telling us exactly that.
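
To make that distinction concrete, here is a minimal, self-contained sketch (the reloadResult name is hypothetical, not Snap code): forcing the Either to weak head normal form only evaluates the Left/Right constructor, while forcing the payload needs something like deepseq's force:

    import Control.DeepSeq (force)      -- from the deepseq package
    import Control.Exception (evaluate)
    import qualified Data.Text.Lazy as TL

    -- Hypothetical stand-in for the reload action's result.
    reloadResult :: IO (Either TL.Text TL.Text)
    reloadResult = return (Right (TL.replicate 1000000 (TL.pack "x")))

    main :: IO ()
    main = do
        r <- reloadResult
        _ <- evaluate r          -- WHNF: only the Right constructor is forced;
                                 -- the lazy Text inside stays a thunk
        _ <- evaluate (force r)  -- deep: forces the Text chunks as well
                                 -- (Either and Text both have NFData instances)
        return ()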

Do you guys think:

  1. This is a bug in my application code
  2. It's a potential bug in Snap
  3. It's not a bug but a "feature": there is no way to protect ourselves from this happening.

Thank you in advance!

Alfredo

mightybyte commented 9 years ago

Hmm, since Text is such a commonly used structure, it's hard to say what might be causing this problem. It's entirely possible that this leak could be a problem with snap's reloading code, but it's hard to know for sure. I think there are two ways you could approach this.

  1. Try to fix the problem yourself, but be aware that you might need to fix something in Snap.
  2. Build us a minimal demonstration of the problem that we can use to debug Snap.

It's hard to say which would be easier for you.

adinapoli commented 9 years ago

Hi Doug, thanks for the prompt reply.

I think that, considering the delicate transition the project is facing (0.x -> 1.0), you and Greg are probably in the best position to troubleshoot what's happening, so I would be inclined to go with option two.

I will try to whip up a minimal example reproducing the bug later this week. Thanks :)

Alfredo

gregorycollins commented 9 years ago

I'm betting this has to do with some snaplet (either heist or something in your user code) doing a lazy modifyMVar instead of a strict one; that would mean that until the contents of the MVar are demanded, the MVar would contain a thunk chain. Similar caution applies to e.g. IORef.
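
A minimal, self-contained sketch of that failure mode (the counter is hypothetical, not Snap code): the lazy update leaves a chain of unevaluated (+1) applications inside the MVar, while evaluating the new value before it is put back keeps the contents flat:

    import Control.Concurrent.MVar
    import Control.Exception (evaluate)
    import Control.Monad (replicateM_)

    -- Lazy update: modifyMVar_ stores whatever thunk the function returns.
    leakyBump :: MVar Int -> IO ()
    leakyBump mv = modifyMVar_ mv (\n -> return (n + 1))

    -- Strict update: force the new value before it goes back into the MVar.
    strictBump :: MVar Int -> IO ()
    strictBump mv = modifyMVar_ mv (\n -> evaluate (n + 1))

    main :: IO ()
    main = do
        mv <- newMVar (0 :: Int)
        replicateM_ 1000000 (leakyBump mv)  -- heap fills with a (+1) thunk chain
        readMVar mv >>= print               -- the whole chain is forced only here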

Are you using the heist cache tag anywhere?

adinapoli commented 9 years ago

Hello Greg, I'm not using the heist cache anywhere, but my snaplet does indeed use a lot of IORefs, having to deal with the mutable nature of the RabbitMQ brokering. I have always been very cautious though, and I am using atomicModifyIORef' everywhere I need to modify the content of a ref.

However, I can see that the underlying RabbitMQ library is using modifyMVar in a couple of places:

https://github.com/hreinhardt/amqp/search?utf8=%E2%9C%93&q=modifyMVar

So indeed, that may be the source of the leak. During this week, I will do two things:

  1. Come up with an experimental patch for the amqp library, which uses the strict version of the MVar operations, and then profile again;
  2. Reproduce a minimal example of snaplet reloading, but this time without any state, to see if the leak comes from the server rather than from the application code.

That should give us a better picture of what's going on.

Thanks guys! Alfredo

adinapoli commented 9 years ago

I think we can close the case.

After managing to run the profiling with the -hr option enabled (the crash seems to be an already-reported GHC bug, which can be partially worked around by removing the -N threading option when running the program), and after adding a few cost-centre annotations, the culprit was evident:

screen shot 2015-02-10 at 13 06 35

(It is well known that http-client's Manager is a pretty heavy object to create, and I'm pretty sure that my code was not strict enough, causing more and more Managers to be created on every reload.)
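
For what it's worth, a small sketch of the shape of that fix (the function names here are hypothetical): build the http-client Manager once and thread it through, rather than letting code that runs on every reload allocate a fresh one:

    import Network.HTTP.Client (Manager, defaultManagerSettings, newManager)

    -- Anti-pattern: allocating a Manager inside code that runs on every
    -- reload means each reload leaves another heavyweight Manager behind.
    managerPerReload :: IO Manager
    managerPerReload = newManager defaultManagerSettings

    -- Better: create the Manager once at startup and pass it to whatever
    -- needs it, so reloads reuse the same connection pool.
    withSharedManager :: (Manager -> IO a) -> IO a
    withSharedManager action = do
        mgr <- newManager defaultManagerSettings
        action mgr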

I am inclined to close this issue now: I still have a marginal leak in my profiling, but I do not think it is in any way related to the reloading portion of the framework.

Sorry for the noise guys!