Closed Zylann closed 2 years ago
Could this be multithreading issue? Is anything happening in multiple threads, and/or is the renderer set to multithreaded?
Although threads are going on to manage voxels, the renderer is set to single/safe
and is not used from any of those threads.
Just managed to reproduce it in a simpler project. It's still a rather complicated repro and involves randomness (one of the many situations causes the bug, I just dont know which, so I shoot large), but at least it doesn't involve the voxel module:
Launch main scene, start shooting around randomly using LMB. At some point the errors will start printing. Reproduction rate is similar, it can take a few minutes of trying.
I've tried this both with the octree fix and the old octree, and haven't been able to reproduce in either.
I think given that it happens on @Zylann's machine, it would be best for him to download #41122 and see if this fixes the issue on his side.
I have this same issue happening to me. I have a project with @Zylann 's terrain addon and I'm instantiating a bunch of bullets, each with an OmniLight, which disappear after a couple of seconds. After some time of firing the bullets I start to get a bunch of errors which brings my game to a crawl:
ERROR: RID_Owner<struct RasterizerSceneGLES3::LightInstance>::getornull: Condition "!id_map.has(p_rid.get_data())" is true. Returned: 0
At: D:\godot-3.2.2-stable\core/rid.h:160
I tried applying the changes from the suggested fix (#41122) but the issue persists. I'm probably going to try to pool my bullets to fix this but it would be nice to get a fix.
Most likely caused by RID design, dangling RID references, and free() spam by SpatialEditor. Probably fixed by #54650
This happens in game, not in editor, so it cannot be caused by SpatialEditor. Maybe there is another rogue dangling RID elsewhere though.
OK. But it should be fixed by the PR which enforces self cleaning rids.
What happens to cleaning RIDs in a case like this?
RID a = something_valid;
RID b = a;
free(a);
// b?
What happens to cleaning RIDs in a case like this?
RID a = something_valid; RID b = a; free(a); // b?
Yes I also noticed this in https://github.com/godotengine/godot/pull/54650#issuecomment-962410538 (marked as second thing)
What happens to cleaning RIDs in a case like this?
RID a = something_valid; RID b = a; free(a); // b?
The same thing that happens to regular pointers, b is dangling.
Here, the object memory is free, a is reset to null, b is dangling. Previously both a and b were dangling, and bad things happen if you free dangling pointers. This change isn't to make a smart pointer system. It's to provide access be able to reset the pointer, and nullify the RID you sent to free.
If the 4.x system is smarter, I'm not opposed to back porting it. I haven't looked at it yet
I've found a few more dangling RIDs in the engine and fixed them in my PR.
Lawnjelly made an RID patch #54907 where you can build in RID handles w/ or w/o tracking to find leaks. However, it's not turned on by default. Maybe testing with this will help. Otherwise, all you can do is ensure you're clearing your own RIDs.
Also in release mode, the id_map red/black trees are compiled out, so it may be less likely to cause an issue with disappearing meshes due to dangling RIDs.
Would be good to test if this was solved by either @lawnjelly or @tinmanjuggernaut's fixes included in 3.4.1 and 3.5. If not, then indeed a custom build with RID handles tracking could help debug further.
I've just checked the OmniLight_Bug
project above and it isn't fixed, however it is being caught by the rids=tracked_handles
build. I'll see if I can work out what it causing it.
E 0:00:14.582 handle_get_or_null: RID get_or_null, revision is incorrect, possible dangling RID. RID id=614 [ rev 1 ], PE [ rev 4 ] visual_instance.cpp line 159 ( prev visual_instance.cpp line 159 )
<C++ Error> Condition "pe.revision != p_rid._revision" is true. Returned: nullptr
<C++ Source> core/rid_handle.cpp:244 @ handle_get_or_null()
This is the error source BTW.
Got it. (Or the problem source). I've found the problem and can fix it, but still tracking down where it is being first introduced, as there is a chain of events involved (for the best fix). Quite an evasive bug! :grinning:
This is a really odd bug, and seems very timing related (possibly a race condition but not sure yet).
There is a dangling RID, but this is a symptom at the end of a long chain of events, rather than the cause. The dangling RID is in VisualServerScene, line 2653.
if (geom->lighting_dirty) {
int l = 0;
//only called when lights AABB enter/exit this geometry
ins->light_instances.resize(geom->lighting.size());
for (List<Instance *>::Element *E = geom->lighting.front(); E; E = E->next()) {
InstanceLightData *light = static_cast<InstanceLightData *>(E->get()->base_data);
ins->light_instances.write[l++] = light->instance;
}
geom->lighting_dirty = false;
}
This code basically updates a list of lights affecting a piece of geometry when this lighting_dirty
flag is set.
The problem is that this flag is not set when it should be. If we add an:
else
{
DEV_ASSERT(ins->light_instances.size() == geom->lighting.size());
}
This triggers just before the error happens - the dirty flag has got out of sync with the lights, and there is one extra dangling RID in the unupdated list ins->light_instances
.
(We can incidentally fix the bug by making the condition:
if (geom->lighting_dirty || (ins->light_instances.size() != geom->lighting.size()))
but that is fixing the symptom rather than cause).
It turns out the dirty flag is only set / cleared in the pairing / unpairing callbacks from the BVH, which suggests maybe a problem there. However the problem also occurs with the octree (indeed this issue predates the BVH), and it seems less likely that both independently contain the same bug. So I'm now also suspecting there is misuse going on of the spatial partitioning.
Next I have been reducing the complexity of the reproduction project. It contains a lot of extra unnecessary stuff so I've reduced it quite a bit, and it now seems to go down without a few seconds of firing: OmnilightBug_Cutdown3.zip
It turns out there are two things which seem to need to happen together to cause the conditions for it to occur. There is a timer in rocket_explosion.gd
which seems to need to call queue_free()
. This is presumably the cause of the light deletion.
This interacts with a random event in _process
in omnilight_bug.gd
. Specifically alter_mesh
is called, which replaces one of the meshes on an instance (at random) with another random sphere mesh.
My best guess is that during the handover between one mesh and the other, something is not being updated which is causing the lights to get out of sync (perhaps an update missing to the BVH / octree?).
Quite a challenge! That's about all I can manage today. :grin:
Fixed by #55813.
Godot 3.2.2, GLES3, Windows 10 64 bits
Continuing from https://github.com/Zylann/voxelgame/issues/71
I was hesitant to post this one, but after investigating I think it's a bug in the renderer.
I have a game project in which I create explosions, each with an OmniLight inside. They get freed after 3 seconds, after lighting the terrain and a bunch of debris I spawn around. After spawning several explosions, the following error happens, and starts spamming:
It happens randomly. Can be after 10 seconds of spawning explosions every so often, 1 minute, or 5 minutes, but consistently happens in a relatively short time, the same way.
After inspection, it seems Godot is pairing lights with mesh instances, but sometimes forgets to unpair. So dangling lights remain attached forever, causing error spams as long as I look at any affected mesh.
EDIT: simpler repro ==> https://github.com/godotengine/godot/issues/41360#issuecomment-675743179
Unfortunately I don't have a simple repro, only my whole project, as I didnt manage to reproduce the problem in a simpler scene so far.Steps:voxel
) and compile Godot with it (note, this module doesnt do anything with lights, I'm pretty confident it doesnt affect the bug)res://blocky_game/blocky_game.tscn