sezero / quakespasm

QuakeSpasm -- A modern, cross-platform Quake game engine based on FitzQuake.
https://sourceforge.net/projects/quakespasm/
GNU General Public License v2.0
244 stars 97 forks source link

Crash in `SV_AreaTriggerEdicts()` function #64

Closed alexey-lysiuk closed 1 year ago

alexey-lysiuk commented 1 year ago

The given crash is very rare, but it happened from time to time on different maps/mods. Recently, I managed to reproduce a bit more reliably.

The crash occurs consistently at the beginning of Ranger Dynamics level, rm_dynamics, from Re:Mobilize. Here is a stripped down version of the mod rm1.0-stripped.zip with saved game included.

Steps to reproduce

Callstack

SV_AreaTriggerEdicts(ent=0x000000015e656510, node=0x0000000100599648, list=0x00000001263db4e0, listcount=0x00007ff7bfefeb68, listspace=525) at world.c:294:13
SV_AreaTriggerEdicts(ent=0x000000015e656510, node=0x00000001005995d8, list=0x00000001263db4e0, listcount=0x00007ff7bfefeb68, listspace=525) at world.c:322:3
SV_AreaTriggerEdicts(ent=0x000000015e656510, node=0x00000001005995a0, list=0x00000001263db4e0, listcount=0x00007ff7bfefeb68, listspace=525) at world.c:320:3
SV_AreaTriggerEdicts(ent=0x000000015e656510, node=0x0000000100599568, list=0x00000001263db4e0, listcount=0x00007ff7bfefeb68, listspace=525) at world.c:320:3
SV_AreaTriggerEdicts(ent=0x000000015e656510, node=0x0000000100599530, list=0x00000001263db4e0, listcount=0x00007ff7bfefeb68, listspace=525) at world.c:320:3
SV_TouchLinks(ent=0x000000015e656510) at world.c:347:2
SV_LinkEdict(ent=0x000000015e656510, touch_triggers=true) at world.c:502:3
SV_StepDirection(ent=0x000000015e656510, yaw=2.3561945, dist=1) at sv_move.c:256:3
SV_MoveToGoal at sv_move.c:414:3
PR_ExecuteProgram(fnum=1404) at pr_exec.c:616:4
SV_RunThink(ent=0x000000015e656510) at sv_phys.c:144:2
SV_Physics_Step(ent=0x000000015e656510) at sv_phys.c:1160:2
SV_Physics at sv_phys.c:1218:4
Host_ServerFrame at host.c:650:3
_Host_Frame(time=0.00200000009) at host.c:723:3
Host_Frame(time=0.00200000009) at host.c:786:3

Entity class was monster_army, but sometimes it crashes on spark as well. Optimization and platform have no effect on it, reproduced with 64-bit Intel macOS and with 32-bit VS2022 debug builds. Removal of this srand() call seems to make its reproduction more deterministic.

sezero commented 1 year ago

Does it ever happen with ironwail or vkquake?

CC: @ericwa, @temx, @andrei-drexler, @Shpoike, @Novum

alexey-lysiuk commented 1 year ago

~~Didn't manage to reproduce it with vkquake so far, regardless of protocol 666 or 999. Reproduced it with ironwail, both protocol versions.~~

Reproduced the crash with vkquake (after a bunch of tries) and ironwail.

sezero commented 1 year ago

The following seems to cure the issue, but I'm not sure it's correct. Waiting for others' comments.

diff --git a/Quake/world.c b/Quake/world.c
index 475cef6..791bfd5 100644
--- a/Quake/world.c
+++ b/Quake/world.c
@@ -289,7 +289,7 @@ SV_AreaTriggerEdicts ( edict_t *ent, areanode_t *node, edict_t **list, int *list
    edict_t     *touch;

 // touch linked edicts
-   for (l = node->trigger_edicts.next ; l != &node->trigger_edicts ; l = next)
+   for (l = node->trigger_edicts.next ; l != NULL && l != &node->trigger_edicts ; l = next)
    {
        next = l->next;
        touch = EDICT_FROM_AREA(l);
@@ -793,7 +793,7 @@ void SV_ClipToLinks ( areanode_t *node, moveclip_t *clip )
    trace_t     trace;

 // touch linked edicts
-   for (l = node->solid_edicts.next ; l != &node->solid_edicts ; l = next)
+   for (l = node->solid_edicts.next ; l != NULL && l != &node->solid_edicts ; l = next)
    {
        next = l->next;
        touch = EDICT_FROM_AREA(l);
alexey-lysiuk commented 1 year ago

The main reason to submit this issue is questionable validity of trigger_edicts and solid_edicts linked lists. According to for conditions, nodes must form a loop. This issue definitely needs someone who knows engine internals well.

sezero commented 1 year ago

Well I'm not comfortable with the patch at all. Still waiting for others' comments.

Shpoike commented 1 year ago

if an ent is linked then its next/prev values will NEVER be null. So if you've got nulls then its not linked and shouldn't have been found there anyway. QS adopted QSS's fix to fully solve the bugs with any relinking inside trigger touch events so it shouldn't be possible for it to be iterating over lists that are changed while still iterating (the historical source of all related bugs).

Basically, you should simply never have any nulls in there. Just hacking past the segfault is going to leave you with a more obscure bug that gives weird inconsistent behaviour (eg random triggers not working, or whatever). Without doing any actual debugging, there's a good chance that its some sort of memory corruption. Poking around, there is a memset (ent, 0, pr_edict_size); line inside SV_LoadGame_f that's clearing the entire ent without unlinking it first, which is a strong candidate as the source of nulls - eg an ent is killtargetted and saved such that the loadgame spawns it (saved games do an initial spawn for eg precaches+makestatic+ambientsoud), links it, then clears it as part of restoring freed-ent state (side note: remove only clears some fields, nor all of them - this means it should be possible to read stale values for those fields until the ent is cleared by spawn() to become something else, but saved games will wipe all that lingering data, but w/e evil hacks are evil anyway).

Extra info: The area nodes form a fixed-depth kdtree that the engine generates in an attempt to split the ents into smaller lists so it doesn't have to iterate every single one every single time. Down side: If an entity manages to link itself at the centre of the map then it'll be inserted into the root node itself and thus checked against EVERY time something is relinked etc (often thrice at a time thanks to setmodel+setsize+setorigin...) Fragile: if you rewrite this stuff like DP or FTE did to try to work around the previous problem (ie: when xonotic goes out of its way to spawn 2000 invisible ents right on the root node) then you will find ents can get pulled out in a different order vs other engines, which results in eg trigger_push entities getting touched in a different order and with each stomping on the player's velocity you end up with a different trigger's velocity being used instead of the velocity you expected, basically breaking areas where wind tunnel ents overlap.

alexey-lysiuk commented 1 year ago

It seems like the issue is boils down to this fragment of Host_Loadgame_f() function.

In the linked code, the last line sets sv.num_edicts to the number of edicts parsed from saved game. When this number is bigger than the value of sv.num_edicts, "new" edicts are handled correctly, i.e. memset() + ED_ParseEdict() + SV_LinkEdict(). However, when the number of parsed edicts is smaller than the value of sv.num_edicts, "old" edicts are not freed nor unlinked while their memory can be reused by newly spawned entities.

I think, these "old" edicts should be explicitly cleared right before assignment of a new value to sv.num_edicts.

for (int i = entnum; i < sv.num_edicts; i++)
    ED_Free(EDICT_NUM(i));

Just to avoid the given crash, it's enough to do SV_UnlinkEdict() instead of ED_Free(), so I would like to ask someone familiar with this code to take a closer look.

sezero commented 1 year ago

@ericwa, @andrei-drexler?

alexey-lysiuk commented 1 year ago

I made PR #75 with the proposed fix. Maybe this helps with its review.