status-im / nim-chronos

Chronos - An efficient library for asynchronous programming
https://status-im.github.io/nim-chronos/
Apache License 2.0

Memory leak with HttpRequestRef.getBody proc #377

Closed termermc closed 1 year ago

termermc commented 1 year ago

When reading a full request body into memory, roughly 4x the body size is allocated and is never freed, even after running GC_runOrc.

Compiler flags: --threads:on --mm:orc

Nim version: 1.6.12 (latest stable)
OS: Arch Linux, kernel 6.2.9.arch1-1

Reproduction instructions:

# Start the server in another terminal and notice it's using less than 1MiB RAM

# Create a 100MiB file
fallocate -l 100M /tmp/100m

# Send the 100MiB file verbatim
# The issue also reproduces with form data via the HttpRequestRef.post proc; memory usage may be even worse there
curl -i -X POST 'http://localhost:8080/' --data-binary '@/tmp/100m'

# Notice that the server is now using over 400MiB RAM, and it never gets freed, even after manual ORC collections

Reproduction program:

import std/[os]
import chronos, chronos/apps/http/httpserver

proc server() {.thread.} =
    ## Server thread

    proc handler(rf: RequestFence): Future[HttpResponseRef] {.async.} =
        if not rf.isOk:
            return dumbResponse()

        let req = rf.get()

        let bodyRaw = await req.getBody()

        return await req.respond(Http200, "Body length: " & $bodyRaw.len)

    var serverRes = HttpServerRef.new(
        initTAddress("127.0.0.1", 8080),
        handler,
        maxRequestBodySize = 1024 * 1024 * 1024,
        socketFlags = {ServerFlags.TcpNoDelay, ServerFlags.ReuseAddr, ServerFlags.ReusePort},
    )

    if serverRes.isErr:
        echo "ERR!"
        quit 1

    let server = serverRes.get()

    server.start()
    runForever()

var serverThread: Thread[void]
createThread(serverThread, server)

proc gc() {.thread.} =
    ## Garbage collector thread

    while true:
        # Sleep for 10s and then do full collect
        sleep(10_000)
        GC_runOrc()
        echo "Ran full collect"

var gcThread: Thread[void]
createThread(gcThread, gc)

runForever()
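
For completeness, the program was built with the flags listed above; assuming it is saved as repro.nim (the name is arbitrary), the build-and-run command would be something like:

# Compile and start the reproduction server with the flags from above
nim c -r --threads:on --mm:orc repro.nim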

I've also noticed that simple allocations made in the request handler often don't get freed.

Allocating 100MiB with let str = newString(1024 * 1024 * 100) doesn't get freed (under refc it actually takes up more space), but memory usage usually doesn't grow past the first request. If you allocate 1GiB with let str = newString(1024 * 1024 * 1024), though, it gets freed immediately (except on refc, where it stays allocated forever).
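
To be concrete, the allocation test was roughly this in place of the getBody call inside the handler (an illustrative sketch, not the exact code I ran):

# Inside handler, instead of: let bodyRaw = await req.getBody()
let str = newString(1024 * 1024 * 100)        # 100MiB: not freed under ORC in my tests
# let str = newString(1024 * 1024 * 1024)     # 1GiB: freed immediately, except under refc
return await req.respond(Http200, "Allocated: " & $str.len)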

termermc commented 1 year ago

This may be due to Nim's memory allocator not returning memory to the OS, because the issue seems to go away with -d:useMalloc. Is this still an issue, then?

Edit: see the comment below.

termermc commented 1 year ago

Actually, the problem is less severe with -d:useMalloc, but if you hammer the server with file uploads over many connections, a huge amount of memory is allocated but never freed. You can test this using the tool bombardier (similar to wrk) with the following command:

bombardier -m POST -c 20 -d 10s --body-file '/tmp/100m' http://localhost:8080/

This POSTs the file using 20 connections for a duration of 10 seconds.

The peak memory I observed was ~500MiB, and after the test finished, ~300MiB remained and was never freed. I'm assuming this has to do with connections that never finished sending their data not having their resources closed. I'm not entirely sure about the implementation details, but that's what it looks like to me.

The issue is much worse under refc, and it looks like very little of it is freed in that case.

The underlying issue seems to be that the leak occurs when clients disconnect unexpectedly, because bombardier never succeeded in finishing any of its requests.

termermc commented 1 year ago

Experimenting further, I disabled the request timeout in bombardier and noticed that the leak is no longer present. This confirmed my suspicion that prematurely terminated requests cause the program to leak memory.

As long as requests aren't terminated during an upload, the program does not leak memory with getBody or post.
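
For anyone who wants to reproduce the premature termination by hand without bombardier, one way (just a suggestion, not something I used in the tests above) is to throttle curl and give it a short max-time so it aborts mid-upload:

# Throttle the upload to 1MB/s and abort after 2 seconds, leaving the request terminated mid-body
curl --limit-rate 1M -m 2 -X POST 'http://localhost:8080/' --data-binary '@/tmp/100m'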

termermc commented 1 year ago

I'm not seeing any reliable way to check whether the underlying client connection is closed, which may be part of the problem. I've checked the request's HttpState, and it's always Alive. I've also checked the read and write stream states, and they're both always Running. As far as I can tell, there's no way for me to check whether a request's client connection is closed, which means more opportunities for memory and resource leaks. Is there some other way of determining this?

arnetheduck commented 1 year ago

We'll be looking into this, but a point of order for anyone coming to this thread: ORC/ARC is practically unusable in the 1.6 series, thus this does not constitute a valid use.

arnetheduck commented 1 year ago

So, running the latest chronos with nim 1.6.12 and refc, there is no leak here: the memory eventually gets recycled. That's the way the GC works: it leaves a high memory-usage watermark because it generally never returns memory to the OS.

However, there are a few things to point out about this code. In particular, no matter what chronos does, it will unavoidably create at least 2 copies of the body (one in the future and one in the local variable) - this is just the way it works, and one would have to use the streaming interface instead to get lower memory usage: see getBodyReader here - https://github.com/status-im/nim-chronos/blob/94ca0c38478e47ca806dc912b2a725fc41b532c2/chronos/apps/http/httpserver.nim#L376
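
Roughly, a streaming handler would look something like the sketch below. This is only an illustration: it assumes getBodyReader returns a result wrapping an HttpBodyReader and that the usual AsyncStreamReader calls (atEof, readOnce, closeWait) are used to drain it; check the exact signatures in your chronos version and add proper error handling.

proc handler(rf: RequestFence): Future[HttpResponseRef] {.async.} =
    if not rf.isOk:
        return dumbResponse()

    let req = rf.get()

    # Assumption: getBodyReader returns a result holding an HttpBodyReader
    let readerRes = req.getBodyReader()
    if readerRes.isErr():
        return await req.respond(Http400, "Cannot read request body")

    let reader = readerRes.get()
    var buf = newSeq[byte](64 * 1024)   # one reusable 64KiB chunk instead of the whole body
    var total = 0
    try:
        while not reader.atEof():
            let n = await reader.readOnce(addr buf[0], buf.len)
            if n == 0:
                break
            # Process buf[0 ..< n] here instead of accumulating it in memory
            total += n
    finally:
        await reader.closeWait()

    return await req.respond(Http200, "Body length: " & $total)

This keeps memory usage bounded by the chunk size regardless of how large the uploaded body is.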

I'm gonna keep this open for a few days but it looks like there is no fire here (beyond orc bugs in nim which will not be solved generally until 2.0).

termermc commented 1 year ago

@arnetheduck It leaks on ORC but not refc, as was stated in the original issue. That has been my experience with it, but it seems using ORC pretty much destroys any non-leaking guarantees that refc had, so I don't know how much control Chronos has over this.

arnetheduck commented 1 year ago

Right, so the thing is that ORC in nim 1.6 is thoroughly broken and all fixes to it go into the devel branch, where new bugs and leaks in it are discovered weekly - there's nothing that chronos can do about this.

We're careful under refc not to hold on to references longer than needed, thus allowing the GC to work efficiently. In ORC, this should translate to generally lower memory usage than refc because that memory will be released slightly earlier. However, there's not much we can do in chronos itself, i.e. beyond recommending the use of a stream, and asserting that whatever memory refc uses goes into a pool that gets reused and that we've verified this reuse works well in long-running applications.

arnetheduck commented 1 year ago

As an aside, I would recommend raising upstream bugs if you find specific leaks in ORC under the latest devel versions - they're addressing any such issues with urgency right now to whip it into good shape ;)

termermc commented 1 year ago

@arnetheduck I didn't mean to waste the Chronos team's time with this issue. Back when I submitted it, I assumed that ORC worked just as well as refc, so as far as I could tell the only possible cause for the leak must have been Chronos. Months later, I can say with confidence that I can't trust ORC, and if I had known that when I found the issue, I would have submitted it on the Nim repo instead.