trealla-prolog / go

Trealla Prolog embedded in Go using WASM
MIT License

Memory leaks (wasmtime vs wasmer) #2

Closed jfmc closed 1 year ago

jfmc commented 2 years ago

Would wasmtime provide any advantage over wasmer?

guregu commented 2 years ago

There's one blocker for using wasmtime: https://github.com/bytecodealliance/wasmtime-go/issues/34 We need an easy way to capture stdout to receive results from the Prolog interpreter. We could work around it using files but it's not ideal.

wasmer solves this with a little C shim: https://github.com/wasmerio/wasmer-go/blob/b4462e6583f8d7b964e32bdd8d065cf96fba6c08/wasmer/wasi.go#L18-L54 So it probably wouldn't be too hard to add this to wasmtime.

I think we could also export Go functions into WASM and have Trealla call them via FFI to report results, but I'm not quite sure how that works or whether it's possible yet.

I would be especially interested in switching to any WASM runtime with good Windows support. Then we can support all major platforms out of the box.

guregu commented 2 years ago

This looks promising as well: https://github.com/tetratelabs/wazero It's more in the spirit of this library, I think. Not sure if its performance is anywhere close to the others. wazero struggles with my broken realpath implementation in TPL, so I need to fix that and do some experimenting.

enoperm commented 1 year ago

Sorry for "hijacking" this issue, though I do believe what I have observed is not entirely unrelated to the current state of wasmer-go.

I have played around with the library and really like the concept, though I have run into some issues with memory usage.

As an example, the code below demonstrates three separate issues:

  1. Using a single engine instance for a large number of queries will cause memory usage to climb until around 7 GiB or so, at which point all subsequent queries get stuck - I suspect it has to do with the allocator, though I have not verified it. I'm pretty sure a leak is involved somewhere, because I do not think any state should be retained from just querying for nl.
  2. Using a single engine instance for a large number of queries that involve host-interop will cause memory usage to climb until around 2 GiB, at which point the interop-related code (which, I think, uses an int32 to slice into WASM-allocated memory) panics:
    $ go run main.go -- --with-interop
    2023/04/20 14:59:30
    2023/04/20 15:00:53 trealla: query error: trealla: panic: runtime error: slice bounds out of range [-2130246640:]
    exit status 1
  3. The reason I described the issue here lies with this leak. In theory, it is not an issue if a WASM application leaks memory, since it is very cheap to just throw the instance away and allocate a fresh one... except doing so consumes memory without bound. The runtime does not seem to ever release the instantiated WASM modules/memory, and the result is a far larger, faster memory leak than just using a persistent instance.
package main

import (
    "context"
    "fmt"
    "log"
    "math/rand"
    "os"
    "time"

    "github.com/trealla-prolog/go/trealla"
)

func newEngine(ctx context.Context) (trealla.Prolog, error) {
    engine, err := trealla.New()
    if err != nil {
        return nil, err
    }

    err = engine.Register(ctx, "invoke", 1, trealla.Predicate(func(pl trealla.Prolog, subquery trealla.Subquery, goal trealla.Term) trealla.Term {
        v := fmt.Sprintf("%v", rand.Float64())
        return trealla.Compound{
            Functor: "invoke",
            Args: []trealla.Term{
                trealla.Atom(v),
            },
        }
    }))
    return engine, err
}

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    engine, err := newEngine(ctx)
    if err != nil {
        log.Fatal(err)
    }

    log.Print()
    defer func() {
        recover()
        log.Print()
        os.Exit(2)
    }()

    getEngine := func() trealla.Prolog {
        return engine
    }

    // perhaps ugly, but suffices for testing
    query := `nl`
    for _, arg := range os.Args[1:] {
        switch arg {
        case "--with-interop":
            query = `invoke(A)`

        case "--with-new-instance":
            getEngine = func() trealla.Prolog {
                engine, _ := newEngine(ctx)
                return engine
            }
        }
    }

    for {
        func() {
            ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
            defer cancel()
            engine := getEngine()
            doWork(ctx, query, engine)
        }()
    }
}

func doWork(ctx context.Context, query string, engine trealla.Prolog) {
    resultSet := engine.Query(ctx, query)
    defer resultSet.Close()

    for resultSet.Next(ctx) {
    }

    if err := resultSet.Err(); err != nil {
        log.Println(err)
        os.Exit(1)
    }
}

Since the alternative is getting stuck forever, I'm not entirely sure the int32 issue should be fixed just yet - personally, I'd rather crash and restart than never get anywhere at all. But assuming I'm right about the source of the leak when each query is served by a new instance, a different WASM runtime might be more beneficial than it seems at first glance.

guregu commented 1 year ago

Thanks @enoperm, that is very concerning. I’ll look into it 👀

enoperm commented 1 year ago

Do remember to use a separate cgroup or something if you run the repro code with --with-new-instance. It is a good way to test your OOM killer...

guregu commented 1 year ago

I think the leak around instantiating new Prologs is my fault, not wasmer's - it seems I forgot to add a Close method to Prolog. I'll add one and we'll see if it helps.

I think it would also not be too hard to move to wasmtime, Trealla has some stuff now to capture stdout/stderr so we can work around the wasmtime-go issue.

guregu commented 1 year ago

To be more specific, the wasmer bindings rely on runtime finalizers to free memory when objects aren't explicitly closed, so a tight loop could prevent the finalizers from ever being called. Adding a Close method to the Prolog interface should help with that.

enoperm commented 1 year ago

It is possible that I was a bit harsh towards wasmer, I suppose. I'm not an expert on the Go runtime, so when I dug into the code and saw that finalizers exist, I just assumed they'd surely be called by the GC before the kernel triggers the OOM killer. Then I encountered the previously linked issue, and it kind of fit: if the WASM runtime never frees things anyway, then finalizers are pointless.

guregu commented 1 year ago

I'm not too sure how Go's runtime handles them either, TBH, so no worries. I tried adding a Close() method to the Prolog instance that calls wasmer's instance.Close() but it doesn't seem to help. I'll try and figure out whether it's a problem here or with wasmer-go. I'll also check if trealla-js has the same issue or not (I remember measuring it a while back and it was fine) which will rule out Trealla itself. Thanks again for the test code, it is very helpful.

guregu commented 1 year ago

trealla-js looks fine so it shouldn't be a problem with Trealla itself. I think the issue here is we hold a global wasmer.Store which probably just infinitely accumulates memory. So we'd need to instantiate a new module for each Prolog to avoid that, but then we run into the wasmer-go bug you linked. I'll see about using a different library.

BTW, the interop bug with the negative pointers should be fixed in the next release, but it can still run out of memory if usage hits 4 GB. Not sure if any of the Go libraries support 64-bit WASM, but if they do it should be possible to fix that.

guregu commented 1 year ago

Good news: I got wasmtime working (in the wasmtime branch if you want to check it out). Running the test program with --with-new-instance and closing the engine in the loop allows it to properly garbage collect the memory. Also, it looks like wasmtime properly supports 64-bit Wasm, so I'll play around with compiling Trealla to that this weekend. Bad news: it seems there's still a leak with queries against a single Prolog instance. Looks like this is happening on the Wasm side, probably from the stuff I added to get wasmtime to work ;). Should be able to fix it. Also, the Tak function benchmark seems about twice as slow on wasmtime vs wasmer; not sure what's up with that.

guregu commented 1 year ago

Found the leak! Got the default test stable at around ~64MB. Looks like there's some weirdness with the interop stuff still, so once that gets fixed we should be good and I'll push the new stuff.

guregu commented 1 year ago

I believe this is what was causing the interop memory issue: https://github.com/trealla-prolog/trealla/issues/162 Once we fix that everything should be solid.

guregu commented 1 year ago

More good news: we've fixed all the leaks. Just need to tweak the interop stuff to work better with wasmtime and I can make a new release.

enoperm commented 1 year ago

I have plugged in the wasmtime branch into the toy project where I initially detected the memory leak and I can confirm that it can now serve each request on a new Trealla instance just fine without any apparent leaks. :heavy_check_mark:

guregu commented 1 year ago

Great! I have some more changes I'm just about to push. Will test it a bit more and make a new release shortly.

guregu commented 1 year ago

Hmm, seems like wasmtime can crash sometimes. Probably my fault, investigating.

guregu commented 1 year ago

Whoops, was a concurrency issue, fixed now. Made an issue for wasm64 support as well: #8