secondlife / viewer

🖥️ Second Life's official client
GNU Lesser General Public License v2.1
206 stars 53 forks source link

Lua: require\\fiber.lua stack overflow ((null)) #2237

Open AtlasLinden opened 1 month ago

AtlasLinden commented 1 month ago

Environment

Second Life Release 7.1.9.10286089860 (64bit) Release Notes

You are at 121.0, 110.0, 23.1 in Pasta Cake located at simhost-0a7de8a08a5d18b0e.agni SLURL: http://maps.secondlife.com/secondlife/Pasta%20Cake/121/110/23 (global coordinates 332,153.0, 305,518.0, 23.1) Second Life Server 2024-06-11.9458617693 Release Notes

CPU: 12th Gen Intel(R) Core(TM) i9-12900HK (2918.4 MHz) Memory: 32440 MB OS Version: Microsoft Windows 10 64-bit (Build 19044.3086) Graphics Card Vendor: NVIDIA Corporation Graphics Card: NVIDIA GeForce RTX 3050 Ti Laptop GPU/PCIe/SSE2

Windows Graphics Driver Version: 31.0.15.4659 OpenGL Version: 4.6.0 NVIDIA 546.59

Window size: 1278x1368 Font Size Adjustment: 96pt UI Scaling: 1 Draw distance: 128m Bandwidth: 10000kbit/s LOD factor: 1.75 Render quality: 5 Texture memory: 4095MB Disk cache: Max size 1638.4 MB (36.6% used)

J2C Decoder Version: KDU v7.10.4 Audio Driver Version: FMOD Studio 2.02.20 Dullahan: 1.14.0.202310131404 CEF: 118.4.1+g3dd6078+chromium-118.0.5993.54 Chromium: 118.0.5993.54 LibVLC Version: 3.0.16 Voice Server Version: Vivox 4.10.0000.32327.5fc3fe7c.5942f08

Packets Lost: 0/1,148 (0.0%) August 09 2024 06:35:38

Description

This error was noticed in the logs after noticing a lua scripted floater stop updating. WARNING #Lua# llcommon/lua_function.cpp(517) LuaState::~LuaState : atexit() function error: [string "C:\\Program Files\\SecondLifeLua\\scripts/lua/require\\fiber.lua"]:326: stack overflow ((null))

Reproduction steps

Not the most ideal steps but it does reproduce fairly consistently with the following:

  1. Run the test_luafloater_speedometer.lua script
  2. Leave it running for a few minutes
  3. Use other apps outside of the viewer in the meantime
  4. Return to the viewer, move your avatar and observe that the speedometer has stopped updating
nat-goodspeed commented 1 month ago

I was able to repro without even switching to other apps, just poking around in the viewer for a few minutes.

nat-goodspeed commented 1 month ago

I added code to include a stack trace in the message when any atexit() function fails with an error, producing:

2024-08-09T19:24:40Z WARNING #Lua# llcommon/lua_function.cpp(529) ~LuaState : atexit() function error: [string "/Users/nat/linden/viewer-lua-main/build-darwin-x86_64/newview/RelWithDebInfo/Second Life Test.app/Contents/Resources/scripts/lua/require/fiber.lua"]:328: stack overflow ((null))
[string "/Users/nat/linden/viewer-lua-main/build-darwin-x86_64/newview/RelWithDebInfo/Second Life Test.app/Contents/Resources/scripts/lua/require/fiber.lua"]:253 function scheduler
[string "/Users/nat/linden/viewer-lua-main/build-darwin-x86_64/newview/RelWithDebInfo/Second Life Test.app/Contents/Resources/scripts/lua/require/fiber.lua"]:328 function run

which is perplexing. I assumed the stack in question was the function-call stack, but with only those two entries, that seems highly unlikely.

nat-goodspeed commented 1 month ago

fwiw fiber.lua line 253 (in function scheduler()) is the line:

others = next(ready)

which simply tests whether the ready list is or is not empty. next() is a Lua builtin that returns the first (key, value) pair if ready isn't empty, or nil if it is.

I'm perplexed as to how that could result in a data stack overflow.

Google searches for the "stack overflow ((null))" message have not yet been fruitful.

nat-goodspeed commented 1 month ago

Poking around in the Luau source, I find the message here:

void luaL_checkstack(lua_State* L, int space, const char* mes)
{
    if (!lua_checkstack(L, space))
        luaL_error(L, "stack overflow (%s)", mes);
}

where the passed mes is (as usual) nullptr. lua_checkstack() is here:

int lua_checkstack(lua_State* L, int size)
{
    int res = 1;
    if (size > LUAI_MAXCSTACK || (L->top - L->base + size) > LUAI_MAXCSTACK)
        res = 0; // stack overflow
    else if (size > 0)
    {
        luaD_checkstack(L, size);
        expandstacklimit(L, L->top + size);
    }
    return res;
}

In a local build of 3p-luau, I added debugging output to lua_checkstack(), producing this when the error occurs:

==================== Stack overflow
LUAI_MAXCSTACK = 8000
requested size = 1
stack span = 13686, stacksize = 13691
function base = 8, top = 7996
nested C calls = 1
==================== end stack overflow

I'm not sure how to diagnose where the problem is. Apparently some function is pushing 7988 stack items, such that requesting a single item more hits the overflow condition. (In my local edit, I subtracted 4 from the max in case it might be useful to call another Lua function (print()?) to examine stack entries ... but I don't want to review almost 8000 of them.)

nat-goodspeed commented 1 month ago

Passing __FUNCTION__ to luaL_checkstack() actually identified at least the specific call that hits the error: it's in our helper function lua_rawgetfield(). To simplify future debugging, I added a lluau_checkstack(L, n) macro that implicitly passes __FUNCTION__ to the underlying luaL_checkstack() function, and replaced all our luaL_checkstack(L, n, nullptr) calls with lluau_checkstack(L, n).

As lua_rawgetfield() is pretty recent, the only existing caller is lluau::check_interrupts_counter().

Knowing the location let me wrap the lua_rawgetfield() call in try / catch, with a statement in the catch block on which I could set a breakpoint. When I hit that breakpoint, I was able to examine the C++ call stack.

To my surprise, the C++ call stack looked perfectly ordinary. I was expecting infinite recursion on somebody's function call stack, but neither Lua nor C++ report such.

Of course hitting this error in check_interrupts_counter(), which is called every Lua engine "interrupt," makes one wonder if the problem might be that check_interrupts_counter() is itself performing too many Lua operations, and it's being reentered by another interrupt. I tried setting and checking a recursion flag, such that check_interrupts_counter() returns immediately if a previous instance hasn't yet returned. But there are relatively few ways to associate such a flag with a specific lua_State; I chose to make another Registry entry. That only led to Lua data stack overflow trying to access the recursion flag.

nat-goodspeed commented 4 weeks ago

Please test with this viewer build.

nat-goodspeed commented 4 weeks ago

I didn't see an issue status meaning "ready for QA," but this is. It's not yet "integrated" because I still don't have approval for the bug-fix PR.