microsoft / garnet

Garnet is a remote cache-store from Microsoft Research that offers strong performance (throughput and latency), scalability, storage, recovery, cluster sharding, key migration, and replication features. Garnet can work with existing Redis clients.
https://microsoft.github.io/garnet/
MIT License
10.35k stars 522 forks source link

NullReferenceException in ExecuteScript Method When Executing Lua Scripts #635

Closed danielchristianschroeter closed 1 month ago

danielchristianschroeter commented 2 months ago

Describe the bug

When attempting to execute Lua scripts using GarnetServer, a NullReferenceException is thrown in the ExecuteScript method. This issue occurs when using the EVAL or EVALSHA commands.

A NullReferenceException is thrown, indicating that an object reference is not set to an instance of an object.

Error Message:

2024-09-04 18:28:21 04::28::21 fail: Session[0] [192.168.80.2:6379] [192.168.80.1:42348] [01400C58] Error executing Lua script callback
2024-09-04 18:28:21 04::28::21 crit: Session[0] [192.168.80.2:6379] [192.168.80.1:42348] [01400C58] ProcessMessages threw an exception: System.NullReferenceException: Object reference not set to an instance of an object.    at Garnet.server.RespServerSession.ExecuteScript(Int32 count, LuaRunner scriptRunner) in /src/libs/server/Lua/LuaCommands.cs:line 235    at Garnet.server.RespServerSession.TryEVAL() in /src/libs/server/Lua/LuaCommands.cs:line 97    at Garnet.server.RespServerSession.ProcessArrayCommands[TGarnetApi](RespCommand cmd, TGarnetApi& storageApi) in /src/libs/server/Resp/RespServerSession.cs:line 664    at Garnet.server.RespServerSession.ProcessMessages() in /src/libs/server/Resp/RespServerSession.cs:line 420    at Garnet.server.RespServerSession.TryConsumeMessages(Byte* reqBuffer, Int32 bytesReceived) in /src/libs/server/Resp/RespServerSession.cs:line 299

The LuaRunner object might not be properly initialized or retrieved, leading to a null reference when attempting to execute the script.

This report is related to the Lua implementation in https://github.com/microsoft/garnet/pull/550

Steps to reproduce the bug

  1. Enable Lua scripting in GarnetServer.

  2. Attempt to execute a Lua script using the EVAL or EVALSHA command. Go Code Using RedisLock from https://github.com/bsm/redislock/blob/main/redislock.go:

    func (rs *RedisStorage) Lock(ctx context.Context, name string) error {
    
    key := rs.prefixLock(name)
    
    for {
        // try to obtain lock
        lock, err := rs.locker.Obtain(ctx, key, lockTTL, &redislock.Options{})
    
        // lock successfully obtained
        if err == nil {
            // store the lock in sync.map, needed for unlocking
            rs.locks.Store(key, lock)
            // keep the lock fresh as long as we hold it
            go func(ctx context.Context, lock *redislock.Lock) {
                for {
                    // refresh the Redis lock
                    err := lock.Refresh(ctx, lockTTL, nil)
                    if err != nil {
                        return
                    }
    
                    select {
                    case <-time.After(lockRefreshInterval):
                    case <-ctx.Done():
                        return
                    }
                }
            }(ctx, lock)
    
            return nil
        }
    
        // check for unexpected error
        if err != redislock.ErrNotObtained {
            return fmt.Errorf("Unable to obtain lock for %s: %v", key, err)
        }
    
        // lock already exists, wait and try again until cancelled
        select {
        case <-time.After(lockPollInterval):
        case <-ctx.Done():
            return ctx.Err()
        }
    }
    }

    Related Lua scripts: https://github.com/bsm/redislock/blob/main/obtain.lua https://github.com/bsm/redislock/blob/main/refresh.lua

  3. Observe the NullReferenceException in the logs of garnet server.

Expected behavior

The Lua script should execute without throwing a NullReferenceException.

Screenshots

No response

Release version

v1.0.19

IDE

No response

OS version

No response

Additional context

Docker Compose service configuration: docker-compose.yml

services:
  garnet:
    image: 'ghcr.io/microsoft/garnet:1.0.19'
    ulimits:
      memlock: -1
    ports:
      - "6379:6379"
    volumes:
      - garnetdata:/data
      - ./garnet.conf:/app/garnet.conf
    command: ["--config-import-path", "/app/garnet.conf"]

garnet.conf

{
    /* Enable Lua scripts on server. */
    "EnableLua" : true,

    /* Run Lua scripts as a transaction (lock keys - run script - unlock keys). */
    "LuaTransactionMode" : true,
}
mmclure-msft commented 2 months ago

I'm seeing the same error when trying to use Redisson maps. Dug down to the following line in LuaCommands.cs:

https://github.com/microsoft/garnet/blob/f281bc3dde9bf8b695279f518c8a25e16609c8de/libs/server/Lua/LuaCommands.cs#L38

which does not return a value. Looking at storeScriptCache shows it to be empty: image So it looks like somehow the script isn't properly getting stored.

mmclure-msft commented 2 months ago

Digging into my case further, it is correct that there is no value returned. If I run with Redis I get a NOSCRIPT error back on the same execution. So I may be running into something different.

mmclure-msft commented 2 months ago

Looks like this is at least part of the problem: https://github.com/microsoft/garnet/blob/f281bc3dde9bf8b695279f518c8a25e16609c8de/libs/server/Lua/LuaCommands.cs#L235 ex.innerException is null. Changing that line to while (!RespWriteUtils.WriteError("ERR " + (ex.InnerException ?? ex).Message, ref dcurr, dend)) fixes the crash and forwards the error on to the client.

badrishc commented 2 months ago

The fix LGTM, if you can create a PR that would be great.

badrishc commented 2 months ago

@danielchristianschroeter - please check whether this change unblocks your example.

mmclure-msft commented 2 months ago

Created #642

danielchristianschroeter commented 2 months ago

@mmclure-msft Thanks to your suggested changes, the NullReferenceException is resolved. However, I’m still encountering "Error executing Lua script callback". Any chance to get more information?

2024-09-05 17:11:55 03::11::55 fail: Session[0] [172.22.0.1:37528] [005C39D4] Error executing Lua script callback

mmclure-msft commented 2 months ago

I'm seeing similar issues. I think there may be subtle (or not so subtle) differences in how NLua parses the scripts vs. the library that's used by Redis. Setting --logger-level Trace on the command line may help get more details on what's failing - for example I'm seeing

08::32::03 fail: Session[0] [10.0.0.123:6379] [127.0.0.1:64740] [004B8E6D] Error executing Lua script System.IndexOutOfRangeException: Index was outside the bounds of the array.    at System.Array.GetFlattenedIndex(Int32 rawIndex)    at System.Array.GetValue(Int32 index)    at NLua.MetaFunctions.TryAccessByArray(Lua luaState, Type objType, Object obj, Object index)    at NLua.MetaFunctions.GetMethodInternal(Lua luaState)    at NLua.MetaFunctions.GetMethod(IntPtr state)    at KeraLua.NativeMethods.lua_pcallk(IntPtr luaState, Int32 nargs, Int32 nresults, Int32 errorfunc, IntPtr ctx, IntPtr k)    at KeraLua.Lua.PCall(Int32 arguments, Int32 results, Int32 errorFunctionIndex)    at NLua.Lua.CallFunction(Object function, Object[] args, Type[] returnTypes)    at NLua.Lua.CallFunction(Object function, Object[] args)    at NLua.LuaFunction.Call(Object[] args)    at Garnet.server.LuaRunner.RunInternal(String[] keys, String[] argv) in C:\src\garnet\libs\server\Lua\LuaRunner.cs:line 374    at Garnet.server.LuaRunner.RunTransactionInternal(String[] keys, String[] argv) in C:\src\garnet\libs\server\Lua\LuaRunner.cs:line 323    at Garnet.server.LuaRunner.Run(Int32 count, SessionParseState parseState) in C:\src\garnet\libs\server\Lua\LuaRunner.cs:line 281    at Garnet.server.RespServerSession.ExecuteScript(Int32 count, LuaRunner scriptRunner) in C:\src\garnet\libs\server\Lua\LuaCommands.cs:line 196

But I think it might be best to track this under a new issue once we have more details.

danielchristianschroeter commented 2 months ago

Thank you!

I already tried setting the logger level to Trace, but I still don’t get additional information in my case. All I receive is:

fail: Session[0] [192.168.0.1:51102] [005818D1] Error executing Lua script callback

Without more details, it’s hard to debug. But of course, we can close this issue if the PR is merged.

danielchristianschroeter commented 2 months ago

After adding detailed logging to the ExecuteScript function within libs/server/Lua/LuaCommands.cs, I identified that the Lua script is attempting to call a function named status_reply, which is not defined.

Log Output:

2024-09-05 17:49:16 03::49::16 info: Session[0] [192.168.48.1:41822] [0283D742] Executing Lua script with count: 5
2024-09-05 17:49:16 03::49::16 info: Session[0] [192.168.48.1:41822] [0283D742] Lua script input: Garnet.server.SessionParseState
2024-09-05 17:49:16 03::49::16 fail: Session[0] [192.168.48.1:41822] [0283D742] Error executing Lua script callback. Inner Exception: (null) NLua.Exceptions.LuaScriptException: [string "..."]:2: attempt to call a nil value (field 'status_reply')

Relevant Code: https://github.com/microsoft/garnet/blob/f281bc3dde9bf8b695279f518c8a25e16609c8de/libs/server/Lua/LuaCommands.cs#L192

private unsafe bool ExecuteScript(int count, LuaRunner scriptRunner)
{
    try
    {
        logger?.LogInformation("Executing Lua script with count: {count}", count);
        logger?.LogInformation("Lua script input: {parseState}", parseState);

        object scriptResult = scriptRunner.Run(count, parseState);
        logger?.LogInformation("Lua script executed successfully. Result: {scriptResult}", scriptResult);

        if (scriptResult != null)
        {
            if (scriptResult is string s)
            {
                while (!RespWriteUtils.WriteAsciiBulkString(s, ref dcurr, dend))
                    SendAndReset();
            }
            else if ((scriptResult as byte?) != null && (byte)scriptResult == 36) //equals to $
            {
                while (!RespWriteUtils.WriteDirect((byte[])scriptResult, ref dcurr, dend))
                    SendAndReset();
            }
            else if (scriptResult as Int64? != null)
            {
                while (!RespWriteUtils.WriteInteger((Int64)scriptResult, ref dcurr, dend))
                    SendAndReset();
            }
            else if (scriptResult as ArgSlice? != null)
            {
                while (!RespWriteUtils.WriteBulkString(((ArgSlice)scriptResult).ToArray(), ref dcurr, dend))
                    SendAndReset();
            }
            else if (scriptResult as Object[] != null)
            {
                // Two objects one boolean value and the result from the Lua Call
                while (!RespWriteUtils.WriteAsciiBulkString((scriptResult as Object[])[1].ToString().AsSpan(), ref dcurr, dend))
                    SendAndReset();
            }
        }
        else
        {
            while (!RespWriteUtils.WriteDirect(CmdStrings.RESP_ERRNOTFOUND, ref dcurr, dend))
                SendAndReset();
        }
    }
    catch (LuaScriptException ex)
    {
        logger?.LogError(ex, "Error executing Lua script callback. Inner Exception: {InnerException}", ex.InnerException);
        while (!RespWriteUtils.WriteError("ERR " + (ex.InnerException ?? ex).Message, ref dcurr, dend))
            SendAndReset();
        return true;
    }
    catch (Exception ex)
    {
        logger?.LogError(ex, "Error executing Lua script");
        while (!RespWriteUtils.WriteError("ERR " + ex.Message, ref dcurr, dend))
            SendAndReset();
        return true;
    }
    return true;
}
mmclure-msft commented 2 months ago

I think the main issue is that much of the API documented at https://redis.io/docs/latest/develop/interact/programmability/lua-api/ is not yet implemented. It will probably require community involvement to add the functionality to bring Garnet Lua scripting to parity with Redis. Definitely not a couple of small PRs 😄

mmclure-msft commented 2 months ago

@danielchristianschroeter can you see if https://github.com/mmclure-msft/garnet/commit/b357a632174cad2eab3e74fd80bf7320608352d0 fixes your status_reply error? EDIT: Never mind - there seems to be some issue with returning a table - this needs more investigation.

danielchristianschroeter commented 2 months ago

@mmclure-msft Thank you for your assistance! Your suggestions were very helpful.

Log Output: info: Session[0] [172.24.0.1:40662] [027B1BBF] Lua script executed successfully. Result: table

Even though the Lua script executes successfully, my code still fails with ErrNotObtained and Unable to obtain lock: context deadline exceeded. This happens because the result is returned as a table instead of a string, which causes compatibility issues with the original Redis-based code.

To fix this, in addition to your suggested changes, I modified the code in libs/server/Lua/LuaRunner.cs to ensure the Lua script returns a string, as required for compatibility with the original Redis code (referencing this line).

Here’s the quick and dirty test approach I used:

        object RunInternal(string[] keys, string[] argv)
        {
            if (keys != this.keys)
            {
                if (keys == null)
                {
                    this.keys = null;
                    sandbox_env["KEYS"] = this.keys;
                }
                else
                {
                    if (this.keys != null && keys.Length == this.keys.Length)
                        Array.Copy(keys, this.keys, keys.Length);
                    else
                    {
                        this.keys = keys;
                        sandbox_env["KEYS"] = this.keys;
                    }
                }
            }
            if (argv != this.argv)
            {
                if (argv == null)
                {
                    this.argv = null;
                    sandbox_env["ARGV"] = this.argv;
                }
                else
                {
                    if (this.argv != null && argv.Length == this.argv.Length)
                        Array.Copy(argv, this.argv, argv.Length);
                    else
                    {
                        this.argv = argv;
                        sandbox_env["ARGV"] = this.argv;
                    }
                }
            }

            var result = function.Call();
            return ProcessLuaResult(result);
        }

        object ProcessLuaResult(object[] result)
        {
            if (result.Length > 0)
            {
                var firstResult = result[0];
                if (firstResult is LuaTable luaTable)
                {
                    if (luaTable["ok"] != null)
                    {
                        return luaTable["ok"].ToString();
                    }
                }
                return firstResult;
            }
            return null;
        }

        /// <summary>
        /// Runs the precompiled Lua function
        /// </summary>
        /// <returns></returns>
        public object Run()
        {
            var result = function.Call();
            return ProcessLuaResult(result);
        }

Log Output After Changes: info: Session[0] [172.29.0.1:60954] [037A9C05] Lua script executed successfully. Result: 1

After applying these changes, the code now executes successfully.

mmclure-msft commented 2 months ago

Interesting. Per https://redis.io/docs/latest/develop/interact/programmability/lua-api it's supposed to return a table - I'll need to experiment with real Redis and see what happens.

danielchristianschroeter commented 2 months ago

I see your point, but for some reason the same code is working fine with Redis v7.4.0.

mmclure-msft commented 2 months ago

With a version of your code I have this almost working the same as Redis for both redis.status_reply and redis.error_reply - the only difference is:

127.0.0.1:6379> EVAL "return { ok = 'Success' }" 0
Success
127.0.0.1:6379> EVAL "return { err = 'Failure' }" 0
(error) Failure
127.0.0.1:6379> EVAL "return redis.status_reply('Success')" 0
Success
127.0.0.1:6379> EVAL "return redis.error_reply('Failure')" 0
(error) Failure

in Redis vs.

172.24.80.1:6379> EVAL "return { ok = 'Success' }" 0
"Success"
(5.56s)
172.24.80.1:6379> EVAL "return { err = 'Failure' }" 0
(error) Failure
(4.28s)
172.24.80.1:6379> EVAL "return redis.status_reply('Success')" 0
"Success"
(3.62s)
172.24.80.1:6379> EVAL "return redis.error_reply('Failure')" 0
(error) Failure
(4.11s)

in Garnet (ignore the timings - that's just because I had breakpoints enabled). The difference is Success vs. "Success" and I haven't yet found where that's coming from and whether it's a problem or not.

mmclure-msft commented 2 months ago

Added PR #645 to handle this correctly.

danielchristianschroeter commented 2 months ago

Thank you! This looks good, and the blocking issue appears to be resolved.

However, I've encountered a new undesired behavior: I'm getting a "lock not held" error during the Release, although this doesn't cause any blocking behavior. I previously also set "LuaTransactionMode" to "false." From what I can tell, the lock seems to be acquired successfully. Even after several attempts, this behavior cannot be reproduced with the Redis v7.4.0. Unfortunately, I don't see any hint in the logs of Garnet.

Here’s an example code:

func (rs *RedisStorage) Unlock(ctx context.Context, name string) error {

    key := rs.prefixLock(name)

    // load and delete lock from sync.Map
    if syncMapLock, loaded := rs.locks.LoadAndDelete(key); loaded {

        // type assertion for Redis lock
        if lock, ok := syncMapLock.(*redislock.Lock); ok {

            // release the Redis lock
            if err := lock.Release(ctx); err != nil {
                return fmt.Errorf("Unable to release lock for %s: %v", key, err)
            }
        }
    }

    return nil
}

For reference, here’s the related function in redislock: func (l *Lock) Release(ctx context.Context) error {

mmclure-msft commented 2 months ago

I'm pretty sure the issue is with this line in release.lua: https://github.com/bsm/redislock/blob/bfd385b776d1e069eca5cb6b6cc41d635cbb4e55/release.lua#L5 because unpack isn't implemented. There's still a lot of work to do to get Garnet up to parity with Redis in terms of LUA scripting. I'll chip away at some of it but it's not going to be a short-term thing.

mmclure-msft commented 2 months ago

Actually, unpack should be implemented: https://github.com/microsoft/garnet/blob/b92c3dc57e75e7fc88fa1891f81304e481f342e1/libs/server/Lua/LuaRunner.cs#L68 Will dig a little deeper.

badrishc commented 2 months ago

For obtain.lua, I am hitting an index out of bounds issue with the ipairs command. Are you not hitting this issue?

This seems to be an open issue in NLua, as per this: https://github.com/NLua/NLua/issues/472

badrishc commented 2 months ago

This unit test works: https://github.com/microsoft/garnet/blob/badrishc/lua-more-fns/test/Garnet.test/LuaScriptTests.cs#L352

But if I replace "a" with "KEYS" which is populated with 1 based offsets, the ipairs fails. This is what obtain.lua does:

local setArgs = {}
for _, key in ipairs(KEYS) do
    table.insert(setArgs, key)
    table.insert(setArgs, ARGV[1])
end
badrishc commented 2 months ago

I worked around the ipairs issue by adding an extra null at the end of the KEYS string array in this PR: https://github.com/microsoft/garnet/pull/655 so that works, but unpack(KEYS) does not work as it throws a NLua exception: attempt to get length of a System.String[]

badrishc commented 2 months ago

It seems a more robust solution would be to make KEYS and ARGV LuaTable types rather than C# string arrays. That way, all Lua functions on them would be mostly guaranteed to work.

badrishc commented 1 month ago

The linked PR https://github.com/microsoft/garnet/pull/672 should support most existing Lua scripts - let me know if you see any issues.

danielchristianschroeter commented 1 month ago

Thanks! Unfortunately, I still see the reported issue with "lock not held" from comment https://github.com/microsoft/garnet/issues/635#issuecomment-2333495398

badrishc commented 1 month ago

Good news - this is now fixed in the latest commit to my linked PR. The example from https://github.com/bsm/redislock?tab=readme-ov-file#examples now works unmodified. If your code still does not work, please provide a stand-alone example similar to the redislock example I linked, so that I can run it directly using go run ..

danielchristianschroeter commented 1 month ago

@badrishc, I wanted to follow up and let you know that I've tested the latest commit, and I'm happy to report that it's working very smoothly without any issues now! The improvements you've made to handle Lua scripting in Garnet are fantastic, and I'm confident that many users will appreciate the capability to seamlessly replace Lua Redis implementations with Garnet.

A huge thank you to both you and @mmclure-msft for your tremendous effort and time invested in resolving these issues. It’s wonderful to see the progress that's been made!

Let's close this issue if the PR is merged.

danielchristianschroeter commented 1 month ago

@badrishc Args, wait a minute. I tested now also https://github.com/bsm/redislock?tab=readme-ov-file#examples

It seems, this part is not working as expected:

    // Sleep and check the remaining TTL.
    time.Sleep(50 * time.Millisecond)
    if ttl, err := lock.TTL(ctx); err != nil {
        log.Fatalln(err)
    } else if ttl > 0 {
        fmt.Println("Yay, I still have my lock!")
    }

Response with Garnet:

go run .
I have a lock!
Now, my lock has expired!

Response with Redis (Docker image redis:latest):

 go run .
I have a lock!
Yay, I still have my lock!
Now, my lock has expired!

Update: It looks like Garnet is setting the TTL to zero immediately after obtaining the lock, which is not the expected behavior. This discrepancy could be due to differences in how Garnet handles TTL settings compared to Redis.

Slightly modified main.go with additional logging:

package main

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/bsm/redislock"
    "github.com/redis/go-redis/v9"
)

func main() {
    start := time.Now()

    // Connect to redis.
    client := redis.NewClient(&redis.Options{
        Network: "tcp",
        Addr:    "127.0.0.1:6380", // Update Port for Garnet/Redis Endpoint
    })
    defer client.Close()
    fmt.Printf("Redis connection time: %v\n", time.Since(start))

    // Create a new lock client.
    locker := redislock.New(client)
    fmt.Printf("Lock client creation time: %v\n", time.Since(start))

    ctx := context.Background()

    // Try to obtain lock.
    lockStart := time.Now()
    lock, err := locker.Obtain(ctx, "my-key", 300*time.Millisecond, nil) // Increased lock duration
    if err == redislock.ErrNotObtained {
        fmt.Println("Could not obtain lock!")
    } else if err != nil {
        log.Fatalln(err)
    }
    fmt.Printf("Lock obtain time: %v\n", time.Since(lockStart))

    // Don't forget to defer Release.
    defer lock.Release(ctx)
    fmt.Println("I have a lock!")

    // Sleep and check the remaining TTL.
    time.Sleep(50 * time.Millisecond)
    ttlStart := time.Now()
    if ttl, err := lock.TTL(ctx); err != nil {
        log.Fatalln(err)
    } else if ttl > 0 {
        fmt.Printf("Yay, I still have my lock! TTL: %v\n", ttl)
    } else {
        fmt.Println("Lock TTL is zero!")
    }
    fmt.Printf("TTL check time: %v\n", time.Since(ttlStart))

    // Extend my lock.
    refreshStart := time.Now()
    if err := lock.Refresh(ctx, 300*time.Millisecond, nil); err != nil {
        log.Fatalln(err)
    }
    fmt.Printf("Lock refresh time: %v\n", time.Since(refreshStart))

    // Sleep a little longer, then check.
    time.Sleep(100 * time.Millisecond)
    expireCheckStart := time.Now()
    if ttl, err := lock.TTL(ctx); err != nil {
        log.Fatalln(err)
    } else if ttl == 0 {
        fmt.Println("Now, my lock has expired!")
    } else {
        fmt.Printf("Lock still valid! TTL: %v\n", ttl)
    }
    fmt.Printf("Final TTL check time: %v\n", time.Since(expireCheckStart))

    fmt.Printf("Total execution time: %v\n", time.Since(start))
}

With Garnet:

go run .
Redis connection time: 0s
Lock client creation time: 0s
Lock obtain time: 4.149ms
I have a lock!
Lock TTL is zero!
TTL check time: 1.8605ms
Lock refresh time: 2.6271ms
Lock still valid! TTL: 198ms
Final TTL check time: 1.1119ms
Total execution time: 160.9905ms

With Redis:

go run .
Redis connection time: 0s
Lock client creation time: 0s
Lock obtain time: 2.6355ms
I have a lock!
Yay, I still have my lock! TTL: 249ms
TTL check time: 945µs
Lock refresh time: 502.7µs
Lock still valid! TTL: 198ms
Final TTL check time: 1.0737ms
Total execution time: 156.6583ms
badrishc commented 1 month ago

Thank you for the simple clean repro! Fixed in the latest commit to the open PR. It was just a small corner case where we did not pass the extra SET parameters (such as expiration) from the script to the storage API.