ioquake / ioq3

The ioquake3 community effort to continue supporting/developing id's Quake III Arena
https://ioquake3.org/
GNU General Public License v2.0

Dedicated server running on ARM64 and ARM causes the client to freeze #570

Open Parker1200 opened 2 years ago

Parker1200 commented 2 years ago

At first everything seemed fine on AArch64, but after a few weeks of using the dedicated server, unfortunately it is obvious that ARM64 is struggling with issues that give players a lot of headache and confusion.

OS: Ubuntu 22.04 AArch64
Version: ioq3 1.36_GIT_6d748965-2022-03-21 linux-arm64 Jul 31 2022

Plus: Latest Manjaro, cross-compiled binaries that run via QEMU on ARM64.

The client always runs on x86_64 and is built from the newest Git source.

On the c_phobia map, when you hit the upper half of the wall in one of the two round rooms, the client will freeze. I tested it on OpenArena (configured it to work) and Quake 3. OpenArena is worse because the client stops completely and I can't exit using the menu. In Quake 3, you get stuck in a wall. The symptoms changed when I compiled a dedicated server without -ffast-math. Now in Quake 3, the player dies immediately after touching a wall. OpenArena continues to lock up.

I tested it on ARM64 VPS and on ARM using a Raspberry Pi 2B. I was also stuck in a wall on Rpi2.

I even tested the ARM64 binary with QEMU. Same problems. The same ioquake3 dedicated server compiled on x86_64 has zero problems, everything works stably, no client side lockups.

I thought changing pmove_float to 0 would change something, but it had no effect.

I'm starting to feel very uncomfortable running my server on ARM64. People experience similar random bugs and freezes on other maps as well. However, on c_phobia it is easy to reproduce this bug. The map is unplayable.

I wonder if these errors can be fixed at all or do I need to shut down the server now and try to find an x86_64 solution?

Link to the map: https://lvlworld.com/download/id:55

Screenshot of a frozen game, menu still working in Q3: q3_arm64_bug

Parker1200 commented 2 years ago

I found a temporary workaround for this problem. I ran binaries for an x86 processor using Box86. The performance is pretty good, absolutely amazing for an x86 emulator. There are no problems with the c_phobia map. I only encountered one crash when switching maps while using Box86.

I also tried Box64, but the dedicated server, despite using x86_64 binaries, had problems similar to the native ARM64 code. In OpenArena, the player dies immediately on touching a wall on the c_phobia map. Besides, a dedicated server running via Box64 cannot load the native game code, so only QVM remains.

Edit: Unfortunately, there are too many crashes during intermission. I just had another one. Box86 is not an option.

Edit # 2: Another observation was that c_phobia only worked properly with the native game code under Box86, not with QVM.

wtfbbqhax commented 2 years ago

It sounds like you've got some serious architecture issues. It can be a lot of effort, but you're going to need to pull in a debugger (gdb or lldb) to debug what exactly is happening to those clients.

Parker1200 commented 2 years ago

Well, I haven't used any debugger, but I have been able to track this problem by other means.

Some time ago I managed to make a hotfix for the OpenArena mod by modifying the mod's game code. Seems to be working properly so far for this particular mod. This is a different project, so let's go back to ioquake3.

The problem is how NaN is handled. It completely escaped my attention as my main target was ARM64, but the problem even occurs on x86_64 with native game code (vm_game set to 0) and QVM! I'm testing on Manjaro. It does not appear if I am using the QVM that came with the game.

For example, in the ioquake3 source code, g_trigger.c line 178 reads: if ( !time ) {. If time is NaN, the body of that if statement should execute. However, when time is NaN and the body is not executed, invalid jumppads are registered on c_phobia, which causes the client to crash or sends the player to the next galaxy.

This should be fixed and tested on all platforms. The c_phobia map is perfect for testing. I just made a fix for the OpenArena mod by wrapping the sqrt function. I'm only good with scripting languages, so I don't think I can write a proper patch for ioquake3, especially for such a convoluted problem.

There is another map that causes problems called "odam": https://lvlworld.com/review/id:396 Here every jumppad is actually valid and registered, but before wrapping the sqrt function I experienced random crashes in the OpenArena mod.

Edit: Fixed my example.

timangus commented 2 years ago

If time is a NaN, you've got bigger problems. Coercing a NaN (or even a float) to a bool is not a good thing to do and is probably undefined behaviour. if(time <= 0.0f || isnan(time)) is probably a better check here. Ultimately it looks possibly like there is a NaN in the map data, which is bleeding into the game code.
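
For reference, a minimal sketch comparing the two checks discussed here. In standard C, !time is defined as equivalent to (0 == time), so under IEEE 754 semantics (that is, without -ffast-math) the result for NaN is well-defined and false, which would explain why the free-entity branch is skipped. The function names below are hypothetical and are not part of the ioquake3 source.

```c
#include <math.h>   /* NAN, isnan */

/* Same shape as the check in g_trigger.c: true only when time == 0.
   A NaN compares unequal to 0, so this returns 0 for NaN. */
int check_as_written(float time) {
    return !time;
}

/* The check suggested in the comment above: also true for negative
   values and for NaN. */
int check_guarded(float time) {
    return time <= 0.0f || isnan(time);
}
```

With these definitions, check_as_written(NAN) is 0 while check_guarded(NAN) is 1, so only the second form would treat a NaN time as an invalid jumppad.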

Parker1200 commented 2 years ago

This would only fix the c_phobia map. However, after I wrapped the sqrt function and changed each NaN to 0.0 using isnan(), bugs and crashes ended in the odam map. I'm 98% sure.

timangus commented 2 years ago

The best course of action here is to determine where the NaN is coming from, and dealing with it as early as possible.

Parker1200 commented 2 years ago

I injected code into the AimAtTarget function to print the height value. The relevant line is: time = sqrt( height / ( .5 * gravity ) ); When height is negative, the result is NaN. Many jumppads on c_phobia have a negative height and therefore should be turned off by the game when everything works as intended.

Things are getting weirder on odam. When I load the map, the last jumppad can have a different height value. I load the map multiple times and the value changes. Sometimes it is 134.5 and sometimes -113.5. I don't know if this is a bug or if the map does some sort of randomization and switches between the two values.

Now I'm not sure if I mixed up anything while testing my hotfix, but changing that single line also seems to fix the odam map, apart from the fact that sometimes the jumppad doesn't work at all. I'm not entirely sure. I have to assume I was wrong, sorry for the bad information. I still feel more comfortable with wrapping the entire sqrt function for my personal use. It's hard to test because that odam jumppad is weird anyway. It's an invisible point in the air that helps the player get to quad damage.

This is still an architectural problem. Playing with the original qagame.qvm that came with Quake 3 does not cause any problems on x86-64, but is buggy on AArch64. Therefore, those who play the game self-hosted on ARM64 will have to avoid the original QVM even after the fix.

timangus commented 2 years ago

Sounds to me like height being negative is the real bug here. Should it not be calculated as height = fabs(ent->s.origin[2] - origin[2]);?

Parker1200 commented 2 years ago

After that change, the jumppad on the odam map (when its height is 113.5) pushes the player in the opposite direction from the quad damage.

Now there are certainly many more jumppads on c_phobia. I doubt that this is what the author of the map intended. Touching some of them just makes a very loud noise. I have never seen the Quake 3 engine interpret map data like this in the case of c_phobia map.

timangus commented 2 years ago

Honestly it just sounds like the map is broken, or at least is using the jump pad entities in a way that wasn't originally intended. The code assumes that height is never negative, but with this map it seems that it is. The reason it "works" in the QVM world is that it's a system that is very relaxed about what it'll accept; presumably it is not returning a NaN when you take the square root of a negative number. You could try something like the following:

@@ -174,7 +174,11 @@ void AimAtTarget( gentity_t *self ) {

        height = ent->s.origin[2] - origin[2];
        gravity = g_gravity.value;
-       time = sqrt( height / ( .5 * gravity ) );
+       if ( height < 0.0 ) {
+               time = -sqrt( -height / ( .5 * gravity ) );
+       } else {
+               time = sqrt( height / ( .5 * gravity ) );
+       }
        if ( !time ) {
                G_FreeEntity( self );
                return;

Parker1200 commented 2 years ago

I tested the patch and this actually makes more sense:

    gravity = g_gravity.value;
    if ( height <= 0.0 ) {
        time = 0.0;
    } else {
        time = sqrt( height / ( .5 * gravity ) );
    }
    if ( !time ) {

I think this is the only desired behavior.

This patch can serve as a quick and good fix, but it will not affect the QVM from pak8.pk3, which works fine on x86_64 but not on ARM64. So only the x86_64 usage scenarios will all be fine after applying this patch. By the way, this would mean that even with the QVM game code from pak8.pk3, NaN is still returned when a negative number is passed to sqrt.

Interestingly, Ratmod for OpenArena has pretty much the same code, but doesn't need this patch at all on the x86_64 architecture. It compiles and works fine. On ARM64 architecture it has this problem. However, Ratmod doesn't support ARM64 out of the box, so q_platform.h had to be modified.

Maybe there are some compiler flags that can fix the way the game code works, and maybe even fix qagame.qvm from pak8.pk3 on the ARM64 architecture? In the case of Ratmod for OpenArena, I couldn't find flags on ARM64 that achieve this, so I started modifying the game code. I also had to remove -ffast-math when compiling Ratmod on ARM64, at least for isnan() to work.
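
Some background on the isnan() observation: in GCC, -ffast-math enables -ffinite-math-only, which lets the compiler assume that NaNs never occur, so an isnan() call may be optimized away. One workaround sometimes used is to inspect the IEEE 754 bit pattern directly, as in the sketch below. Both helper names are hypothetical, and the sketch assumes float is a 32-bit IEEE 754 single. It only covers detection; it does not change how the rest of the arithmetic behaves under -ffast-math.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: build a float from a raw 32-bit pattern. */
float float_from_bits(uint32_t bits) {
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Hypothetical helper: returns 1 if f encodes a NaN, that is, all
   exponent bits set and a non-zero mantissa. Uses integer tests only,
   so it does not rely on floating-point comparison semantics. */
int float_bits_are_nan(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (bits & 0x7F800000u) == 0x7F800000u
        && (bits & 0x007FFFFFu) != 0u;
}
```

For example, float_bits_are_nan(float_from_bits(0x7FC00000u)) is 1 for a quiet NaN, while the infinity pattern 0x7F800000 and ordinary values such as -113.5f give 0.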

In my experience, compiler flags can affect how NaN is treated even on x86_64.

Links to Ratmod if needed for comparison and testing: https://github.com/rdntcntrl/ratarena_release https://github.com/rdntcntrl/ratoa_gamecode

How to use Ratmod with ioquake3: https://ratmod.github.io/faq.html

This simply cannot be fixed by just modifying the g_trigger.c file. Either some compiler flags can fix this, or else you need to find another way to keep qagame.qvm always working correctly on all platforms. Unless it is impossible for some reason. I don't know exactly how QVM works.

ec- commented 1 year ago