godotengine / godot

Godot Engine – Multi-platform 2D and 3D game engine
https://godotengine.org
MIT License
91.04k stars 21.18k forks source link

Dangling OmniLight happening randomly #41360

Closed Zylann closed 2 years ago

Zylann commented 4 years ago

Godot 3.2.2, GLES3, Windows 10 64 bits

Continuing from https://github.com/Zylann/voxelgame/issues/71

I was hesitant to post this one, but after investigating I think it's a bug in the renderer.

I have a game project in which I create explosions, each with an OmniLight inside. They get freed after 3 seconds, after lighting the terrain and a bunch of debris I spawn around. After spawning several explosions, the following error happens, and starts spamming:

ERROR: RID_Owner<struct RasterizerSceneGLES3::LightInstance>::getornull: Condition "!id_map.has(p_rid.get_data())" is true. Returned: 0

It happens randomly. Can be after 10 seconds of spawning explosions every so often, 1 minute, or 5 minutes, but consistently happens in a relatively short time, the same way.

After inspection, it seems Godot is pairing lights with mesh instances, but sometimes forgets to unpair. So dangling lights remain attached forever, causing error spams as long as I look at any affected mesh.

EDIT: simpler repro ==> https://github.com/godotengine/godot/issues/41360#issuecomment-675743179 Unfortunately I don't have a simple repro, only my whole project, as I didnt manage to reproduce the problem in a simpler scene so far. Steps:

clayjohn commented 4 years ago

Could this be multithreading issue? Is anything happening in multiple threads, and/or is the renderer set to multithreaded?

Zylann commented 4 years ago

Although threads are going on to manage voxels, the renderer is set to single/safe and is not used from any of those threads.

Zylann commented 4 years ago

Just managed to reproduce it in a simpler project. It's still a rather complicated repro and involves randomness (one of the many situations causes the bug, I just dont know which, so I shoot large), but at least it doesn't involve the voxel module:

OmnilightBug.zip

Launch main scene, start shooting around randomly using LMB. At some point the errors will start printing. Reproduction rate is similar, it can take a few minutes of trying.

lawnjelly commented 4 years ago

I've tried this both with the octree fix and the old octree, and haven't been able to reproduce in either.

I think given that it happens on @Zylann's machine, it would be best for him to download #41122 and see if this fixes the issue on his side.

jocamar commented 4 years ago

I have this same issue happening to me. I have a project with @Zylann 's terrain addon and I'm instantiating a bunch of bullets, each with an OmniLight, which disappear after a couple of seconds. After some time of firing the bullets I start to get a bunch of errors which brings my game to a crawl:

ERROR: RID_Owner<struct RasterizerSceneGLES3::LightInstance>::getornull: Condition "!id_map.has(p_rid.get_data())" is true. Returned: 0
   At: D:\godot-3.2.2-stable\core/rid.h:160

I tried applying the changes from the suggested fix (#41122) but the issue persists. I'm probably going to try to pool my bullets to fix this but it would be nice to get a fix.

TokisanGames commented 3 years ago

Most likely caused by RID design, dangling RID references, and free() spam by SpatialEditor. Probably fixed by #54650

Zylann commented 3 years ago

This happens in game, not in editor, so it cannot be caused by SpatialEditor. Maybe there is another rogue dangling RID elsewhere though.

TokisanGames commented 3 years ago

OK. But it should be fixed by the PR which enforces self cleaning rids.

Zylann commented 3 years ago

What happens to cleaning RIDs in a case like this?

RID a = something_valid;
RID b = a;
free(a);
// b?
lawnjelly commented 3 years ago

What happens to cleaning RIDs in a case like this?

RID a = something_valid;
RID b = a;
free(a);
// b?

Yes I also noticed this in https://github.com/godotengine/godot/pull/54650#issuecomment-962410538 (marked as second thing)

TokisanGames commented 3 years ago

What happens to cleaning RIDs in a case like this?

RID a = something_valid;
RID b = a;
free(a);
// b?

The same thing that happens to regular pointers, b is dangling.

Here, the object memory is free, a is reset to null, b is dangling. Previously both a and b were dangling, and bad things happen if you free dangling pointers. This change isn't to make a smart pointer system. It's to provide access be able to reset the pointer, and nullify the RID you sent to free.

If the 4.x system is smarter, I'm not opposed to back porting it. I haven't looked at it yet

TokisanGames commented 2 years ago

I've found a few more dangling RIDs in the engine and fixed them in my PR.

Lawnjelly made an RID patch #54907 where you can build in RID handles w/ or w/o tracking to find leaks. However, it's not turned on by default. Maybe testing with this will help. Otherwise, all you can do is ensure you're clearing your own RIDs.

Also in release mode, the id_map red/black trees are compiled out, so it may be less likely to cause an issue with disappearing meshes due to dangling RIDs.

akien-mga commented 2 years ago

Would be good to test if this was solved by either @lawnjelly or @tinmanjuggernaut's fixes included in 3.4.1 and 3.5. If not, then indeed a custom build with RID handles tracking could help debug further.

lawnjelly commented 2 years ago

I've just checked the OmniLight_Bug project above and it isn't fixed, however it is being caught by the rids=tracked_handles build. I'll see if I can work out what it causing it.

E 0:00:14.582   handle_get_or_null: RID get_or_null, revision is incorrect, possible dangling RID. RID id=614 [ rev 1 ], PE [ rev 4 ] visual_instance.cpp line 159 ( prev visual_instance.cpp line 159 )
  <C++ Error>   Condition "pe.revision != p_rid._revision" is true. Returned: nullptr
  <C++ Source>  core/rid_handle.cpp:244 @ handle_get_or_null()

This is the error source BTW.

Got it. (Or the problem source). I've found the problem and can fix it, but still tracking down where it is being first introduced, as there is a chain of events involved (for the best fix). Quite an evasive bug! :grinning:

lawnjelly commented 2 years ago

This is a really odd bug, and seems very timing related (possibly a race condition but not sure yet).

There is a dangling RID, but this is a symptom at the end of a long chain of events, rather than the cause. The dangling RID is in VisualServerScene, line 2653.

            if (geom->lighting_dirty) {
                int l = 0;
                //only called when lights AABB enter/exit this geometry
                ins->light_instances.resize(geom->lighting.size());

                for (List<Instance *>::Element *E = geom->lighting.front(); E; E = E->next()) {
                    InstanceLightData *light = static_cast<InstanceLightData *>(E->get()->base_data);

                    ins->light_instances.write[l++] = light->instance;
                }

                geom->lighting_dirty = false;
            }

This code basically updates a list of lights affecting a piece of geometry when this lighting_dirty flag is set.

The problem is that this flag is not set when it should be. If we add an:

            else
            {
                DEV_ASSERT(ins->light_instances.size() == geom->lighting.size());
            }

This triggers just before the error happens - the dirty flag has got out of sync with the lights, and there is one extra dangling RID in the unupdated list ins->light_instances.

(We can incidentally fix the bug by making the condition:

if (geom->lighting_dirty || (ins->light_instances.size() != geom->lighting.size()))

but that is fixing the symptom rather than cause).

It turns out the dirty flag is only set / cleared in the pairing / unpairing callbacks from the BVH, which suggests maybe a problem there. However the problem also occurs with the octree (indeed this issue predates the BVH), and it seems less likely that both independently contain the same bug. So I'm now also suspecting there is misuse going on of the spatial partitioning.

Next I have been reducing the complexity of the reproduction project. It contains a lot of extra unnecessary stuff so I've reduced it quite a bit, and it now seems to go down without a few seconds of firing: OmnilightBug_Cutdown3.zip

It turns out there are two things which seem to need to happen together to cause the conditions for it to occur. There is a timer in rocket_explosion.gd which seems to need to call queue_free(). This is presumably the cause of the light deletion.

This interacts with a random event in _process in omnilight_bug.gd. Specifically alter_mesh is called, which replaces one of the meshes on an instance (at random) with another random sphere mesh.

My best guess is that during the handover between one mesh and the other, something is not being updated which is causing the lights to get out of sync (perhaps an update missing to the BVH / octree?).

Quite a challenge! That's about all I can manage today. :grin:

akien-mga commented 2 years ago

Fixed by #55813.