kyren / piccolo

An experimental stackless Lua VM implemented in pure Rust
Creative Commons Zero v1.0 Universal
1.62k stars 59 forks source link

Unexpected panic: `thread finalization was missed` #45

Closed jrobsonchase closed 4 months ago

jrobsonchase commented 4 months ago

Unfortunately, this one is way more vague than my previous report. I've yet to figure out exactly how I manged to cause it or to reproduce it reliably. I actually stumbled across #44 while trying to repro this one. The best I can do is provide a description of what I'm attempting in the hopes that some of the details might be relevant. I don't have a stack trace on hand, but the panic is coming from this expect.

I'm working on building a bevy integration. Scripts are loaded module-style and are expected to return a table of functions, which is stashed and stored in a bevy resource which associates it with the AssetId it was loaded from. Since piccolo is !Send, it relies heavily on thread-locals. The Lua environment, a StashedExecutor, and the StashedTables loaded from the compiled scripts are all kept in thread-local data. If the lua system detects that either the StashedTable for a particular asset is missing, or the timestamp for the table is older than that of the most recent asset reload, the module is re-evaluated. On every Update, for each loaded script, it looks up the update method in its table and executes it.

The StashedExecutor is created once alongside the Lua environment and kept around for all operations. Loading a script from the compiled asset (a CompiledPrototype) is done by creating a FunctionPrototype, then a Closure, and finally a Function with which the Executor is restarted, then stepped to completion. The Table result is then taken and stashed. When running the "update" function, the function is looked up from the table, used to restart the executor, and then Lua::execute is used to drive execution.

It seems like the panic was happening shortly after I had edited the script and seen it reload, but not immeditely. My suspicion is that the Executor got left in an odd state after one or more load or execution failures on one thread, but not all threads, so the panic is somewhat nondeterministic as bevy schedules the system to different threads. But it should be reproducible if I can figure out what the conditions actually are, since I'm not using unsafe anywhere and things should be deterministic within each individual thread.

kyren commented 4 months ago

This has to be a bug in the finalization logic in Lua, and there must be pointers being swept without Finalizers::finalize being called, I'm looking into it.

kyren commented 4 months ago

Just so you know what to look for, (naively at least) Executor should have nothing at all to do with this.

The logic is this: Every Thread that is created is registered in a list (in Finalizers), and then just before that Thread is collected, Thread::reset is called on it. The reason we do this is so that we can keep upvalue references to threads weakly, and if a thread were to be garbage collected anyway, the upvalue just automatically becomes closed. This is similar to what PUC-Rio Lua and its many forks do.

Do you happen to be panicking and then catching those panics anywhere? Do you happen to call Lua::gc_collect manually? Neither of these things is wrong at all I'm just trying to narrow it down.

kyren commented 4 months ago

Can you do an experiment for me and see if this commit fixes the problem? a6d2d83008fb62a4dbb1e663da1c1914b536c4ff

I made this commit somewhat confident that I understood what might be happening and I think I actually... don't understand, and now I'm wondering if I need to revert that commit.

My working theory is that there is a Thread that is somehow not "dead" (in the gc-arena sense), but after another thread is reset, it becomes dead somehow. This shouldn't happen though... because the only things that lose Gc ownership are uhh... already dead threads which can't keep another thread alive. All of the upvalues that the thread holds strengthen their ownership of something, not loosen it, so I just don't get it.

If you can possibly try to reproduce the issue and then try to reproduce the issue again with a6d2d83008fb62a4dbb1e663da1c1914b536c4ff and see if it helps, that would be amazing.

kyren commented 4 months ago

oops, closed on accident sorry!

kyren commented 4 months ago

Oh I'm stupid, I know what the problem is.

The problem was not the logic before, it was that after finalization, the arena is still in the marking phase, which means that the user can still come in and make a Thread dead before the next sweep cycle 🤦‍♀️

What needs to happen is that finalization and proceeding to the sweeping phase must happen as one unit of work, and it isn't.

kyren commented 4 months ago

Okay, I think 1cdbca46b20748e93980e5f2efe53de9c927a2e4 is the actual fix here, and in fact I don't even think the previous commit would have fixed the issue except maybe accidentally (as it's a sort of race condition).

The one downside is that (for now) it means that piccolo is using a gc-arena branch again, until https://github.com/kyren/gc-arena/pull/96 is merged.

jrobsonchase commented 4 months ago

Well, I switched computers yesterday and after fiddling with various things for several hours, I never actually saw it again :sweat_smile:

I should have the previous state of my code on my other machine though, at which point I was seeing the panic semi-regularly. I'll go grab it and try to repro, and then give your fix a try.

jrobsonchase commented 4 months ago

Yep, that fixes it! I also managed to reliably reproduce it with this:

use piccolo::{
    compiler::interning::BasicInterner,
    *,
};

const SOURCE: &'static str = r#"
local module = {}

function module.update_global()
  print("Hello, lua!")
end

-- typo: causes a module evaluation error
return moule
"#;

fn main() {
    let mut lua = Lua::full();

    // Does't matter if the script is pre-compiled or Closure::load'ed - both result in the panic.
    let mut interner = BasicInterner::default();
    let parsed = compiler::parse_chunk(SOURCE.as_bytes(), &mut interner).expect("parsed");
    let compiled = compiler::compile_chunk(&parsed, &mut interner).expect("compiled");

    let mut tries = 50;
    let _module = loop {
        if tries == 0 {
            // Never hit - a panic is encountered after 8 failed attempts.
            println!("giving up!");
            return;
        }
        tries -= 1;

        let module = lua.try_enter(|ctx| {
            let fn_proto = FunctionPrototype::from_compiled_map_strings(
                &ctx,
                String::from_slice(&ctx, "<source>"),
                &compiled,
                |s| String::from_slice(&ctx, s.as_ref()),
            );
            let closure = Closure::new(&ctx, fn_proto, Some(ctx.globals()))?;

            let exec = Executor::start(ctx, closure.into(), ());
            loop {
                let mut fuel = Fuel::with(4096);

                if exec.step(ctx, &mut fuel) {
                    break;
                }
            }

            let module = exec.take_result::<Table>(ctx)??;
            Ok(ctx.stash(module))
        });

        match module {
            Ok(m) => break m,
            Err(e) => {
                println!("error loading script: {e}");
                continue;
            }
        }
    };

    println!("loaded successfully!");
}

I was actually wrong about the state of my project when I was seeing the panic - I wasn't stashing the executor and was actually creating a new one every time for script loading. Using a stashed one actually hid the issue, which is why I wasn't seeing it yesterday.

kyren commented 4 months ago

Great work btw, these are really really good finds, keep em coming if you got em!