ptitSeb / box64

Box64 - Linux Userspace x86_64 Emulator with a twist, targeted at ARM64 Linux devices
https://box86.org
MIT License
3.76k stars 269 forks source link

RimWorld 1.2/1.3 - Random crashes #110

Closed KweezyCode closed 1 year ago

KweezyCode commented 3 years ago

Mono path[0] = '/home/kweezy/Desktop/RimWorld/RimWorldLinux_Data/Managed' Mono config path = '/home/kweezy/Desktop/RimWorld/RimWorldLinux_Data/MonoBleedingEdge/etc' 0xa04860a0: Stack is corrupted, aborting ==== CPU Registers ==== RAX=207119f200bbe6c3 RCX=0000000000000000 RDX=0000ffffb393d440 RBX=0000ffffb393d3b0 RSP=0000ffffb393d398 FLAGS=--S--PC RBP=0000ffffb393d448 RSI=0000ffffb393d448 RDI=0000ffffb393d3b0 R8=0000000000000001 R9=0000000000000010 R10=0000000000000000 R11=0000000000000000 R12=0000ffffb393d440 R13=00000000a2cde790 R14=0000ffffb393d498 R15=00000000a2cde780 RIP=00000000a04a6913 ======== Stack ======== Stack is from FFFFB393D448 to FFFFB393D398 Old IP: A04860A0 LIBGL: Shuting down 59973|0xa04a9000: Unimplemented Opcode (00) F0 1F EA A0 00 00 00 00 90 72 49 A0 00 00 00

dinchamion commented 3 years ago

Same for me, on a Pi4 8GB with Ubuntu MATE.

KweezyCode commented 3 years ago

Same for me, on a Pi4 8GB with Ubuntu MATE.

as i see, developer ignores just this issue, or i must be blind

ptitSeb commented 3 years ago

The opocde liste is an invalid one. The issue os not the opecode, but the "Stack is corrupted" message that happened before.

I have fixed some the stack corruption error, but not all it seems.

KweezyCode commented 3 years ago

problem still exists, OS: Ubuntu 21.04 Desktop dump.txt dump1.txt

ptitSeb commented 3 years ago

Ok Relevent part on dump1 is

29907|0xffff969ffd5a: Calling my_pthread_mutex_init (/usr/local/bin/box64)(0xFFFF96D4E960, 0x0, 0x1, ...) => return 0x0
29907|0xffff96a16db8: Calling __libc_calloc (/lib/aarch64-linux-gnu/libc.so.6)(0x1, 0x10, 0x1, ...) => return 0x8C61CCF0
29907|0xffff96a0eba8: Calling sem_init (/lib/aarch64-linux-gnu/libpthread.so.0)(0x8C61AC30, 0x0, 0x0, ...) => return 0x0
29907|0xffff96a0e8f9: Calling pthread_setspecific (/lib/aarch64-linux-gnu/libpthread.so.0)(0x19, 0x8C61ABD0, 0x0, ...) => return 0x0
29907|0xffff96a11c41: Calling my_pthread_attr_init (/usr/local/bin/box64)(0xFFFFB03AB3E0, 0xFFFFB03AB470, 0x0, ...) => return 0x0
29907|0xffff96a11c7b: Calling pthread_self (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFFB03AB3E0, 0xFFFFB03AB470, 0x0, ...) => return 0xFFFFB08BB870
29907|0xffff96a11c86: Calling pthread_getattr_np (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFFB08BB870, 0xFFFFB03AB3E0, 0x0, ...) => return 0x0
29907|0xffff96a11cc9: Calling my_pthread_attr_getstack (/usr/local/bin/box64)(0xFFFFB03AB3E0, 0xFFFFB03AB478, 0xFFFFB03AB470, ...) => return 0x0
29907|0xffff96a11d06: Calling my_pthread_attr_destroy (/usr/local/bin/box64)(0xFFFFB03AB3E0, 0xFFFFB03AB478, 0xFFFFB03AB470, ...) => return 0x0
29907|0xffff96a11d50: Calling my___stack_chk_fail (/usr/local/bin/box64)(0xFFFFB03AB3E0, 0xFFFFB03AB478, 0xFFFFB03AB470, ...) =>0x8a0350a0: Stack is corrupted, aborting
==== CPU Registers ====
RAX=10d5e6b23c171300 RCX=0000ffff96a0e8dd RDX=0000ffffb03ab470 RBX=0000ffffb03ab3e0 RSP=0000ffffb03ab3c8 FLAGS=--S--PC
RBP=0000ffffb03ab478 RSI=0000ffffb03ab478 RDI=0000ffffb03ab3e0  R8=0000000000000001  R9=0000000000000010 
R10=0000000000000000 R11=0000000000000000 R12=0000ffffb03ab470 R13=000000008c60dc70 R14=0000ffffb03ab4c8 
R15=000000008c60dc60 RIP=000000008a055913 
======== Stack ========
Stack is from FFFFB03AB478 to FFFFB03AB3C8
Old IP: 8A0350A0
 return 0x10D5E6B23C171300
 return 0x0
End of X86 run (0x8a036130), RIP=0xffff96a11d50, Stack=0xffffb03ab3d0, unimp=0, emu->fork=0, emu->uc_link=(nil), emu->quit=1

Seems to be related to the my_pthread_attr_getstack(...) call. I'll check later.

mogery commented 2 years ago

This is probably fixed by #135. Could you try building a fresh box64 and testing again?

KweezyCode commented 2 years ago

now i see black window that instantly closes dump.txt dumpfromgame.txt (about 38mb): https://ufile.io/c2denjit

mogery commented 2 years ago

You need to run the game with the MESA_GL_VERSION_OVERRIDE=3.2 environment variable and the -force-glcore32 command line argument.

KweezyCode commented 2 years ago

wow, i see game loading screen but then it crashes with Aborted (core dumped) :( do you need log?

mogery commented 2 years ago

do you need log?

Yup. STDOUT, Player.log, and a reading from cat /proc/$(pgrep RimWorld)/maps while the loading screen is present.

dinchamion commented 2 years ago

Same as @KweezyPlay, it starts up, loads for a while, and then crashes. dinchamion@tranquility-pi:~/apps/rimworld/data/noarch/game$ MESA_GL_VERSION_OVERRIDE=3.2 ./start_RimWorld_openglfix.sh Dynarec for ARM64, with extension: ASIMD CRC32 PageSize:4096 Box64 with Dynarec v0.1.5 36bfcd2 built on Oct 18 2021 19:06:15 Using default BOX64_LD_LIBRARY_PATH: ./:lib/:lib64/:x86_64/:bin64/:libs64/ Using default BOX64_PATH: ./:bin/ Counted 64 Env var Looking for ./RimWorldLinux argv[1]="-logfile" argv[2]="/tmp/rimworld_log_1000" Using emulated UnityPlayer.so Using native(wrapped) libm.so.6 Using emulated /lib/x86_64-linux-gnu/libgcc_s.so.1 Using native(wrapped) libpthread.so.0 Using native(wrapped) libc.so.6 Using native(wrapped) ld-linux-x86-64.so.2 Using native(wrapped) librt.so.1 Using native(wrapped) libdl.so.2 Set current directory to /home/dinchamion/apps/rimworld/data/noarch/game Found path: /home/dinchamion/apps/rimworld/data/noarch/game/RimWorldLinux Using native(wrapped) libX11.so.6 Using native(wrapped) libXext.so.6 Using native(wrapped) libxcb.so.1 Using native(wrapped) libXau.so.6 Using native(wrapped) libXdmcp.so.6 Using native(wrapped) libXcursor.so.1 Using native(wrapped) libXinerama.so.1 Using native(wrapped) libXi.so.6 Using native(wrapped) libXrandr.so.2 Using native(wrapped) libXrender.so.1 Using native(wrapped) libXss.so.1 Using native(wrapped) libXxf86vm.so.1 Using native(wrapped) libGL.so.1 Using native(wrapped) libudev.so.1 ./start_RimWorld_openglfix.sh: line 39: 13181 Aborted (core dumped) ./$GAMEFILE $LOG $FORCEGL $@

and

dinchamion@tranquility-pi:~/apps/box64/build$ cat /proc/$(pgrep RimWorld)/maps cat: /proc/13168: Is a directory cat: 13181/maps: No such file or directory

Log: rimworld_log_1000.txt

Don't know how to do STDOUT. :-/

KweezyCode commented 2 years ago

for no reason, when i used BOX64_LOG=2 it loaded. let me test it for stability.

KweezyCode commented 2 years ago

P_20211018_222347_1

mogery commented 2 years ago

for no reason, when i used BOX64_LOG=2 it loaded. let me test it for stability.

Please try it a couple times again. This might be memory mapping related.

dinchamion commented 2 years ago

Have been trying it a few times. Got it running twice, but otherwise crash:

RimWorld 1.3.3117 rev664 Error loading needed lib /home/dinchamion/apps/rimworld/data/noarch/game/RimWorldLinux_Data/Managed/System.Xml.Linq.dll.so Warning: Cannot dlopen("/home/dinchamion/apps/rimworld/data/noarch/game/RimWorldLinux_Data/Managed/System.Xml.Linq.dll.so"/0x8605c320, 101) Error loading needed lib /home/dinchamion/apps/rimworld/data/noarch/game/RimWorldLinux_Data/Managed/mono/aot-cache/amd64/System.Xml.Linq.dll.so Warning: Cannot dlopen("/home/dinchamion/apps/rimworld/data/noarch/game/RimWorldLinux_Data/Managed/mono/aot-cache/amd64/System.Xml.Linq.dll.so"/0x86c2e9a0, 101) double free or corruption (fasttop) 18495|0x64952d8a: Unimplemented Opcode (DF) FE 97 60 02 40 F9 5C DD FE 97 E3 FF FF 17 1F

rimworld-crash.txt

Using BOX64_LOG=2 outright makes it hang without even getting to the loading screen. Both with and without the Royalty files placed in the Data directory. (I don't have the latest expansion yet.)

Gonna try to test it a bit more, and if I luck into making it run, will try to get some logs.

Screenshot at 2021-10-18 20-02-45 Screenshot at 2021-10-18 20-13-57

mogery commented 2 years ago

Using BOX64_LOG=2 outright makes it hang without even getting to the loading screen.

I'm pretty sure it doesn't hang, preload just gets slowed down by all the writing, so you're gonna have to wait a couple minutes. It'd be awesome if you could get me a Player.log with LOG=2.

ptitSeb commented 2 years ago

So it's probably a timing issue if detailled logs makes it run.

dinchamion commented 2 years ago

I'm pretty sure it doesn't hang, preload just gets slowed down by all the writing, so you're gonna have to wait a couple minutes. It'd be awesome if you could get me a Player.log with LOG=2.

You're right there, I've let it run for about an hour or so, and it did get to the loading screen. I had to kill it, but could perhaps let it run as long as it takes... it did however seem to be hanging there at that point. (There's an ellipsis loading animation, which was frozen.)

Thing is, Player.log is 1.7GB. :-D

dinchamion@tranquility-pi:~/apps/box64/build$ ls -l /home/dinchamion/.config/unity3d/Ludeon\ Studios/RimWorld\ by\ Ludeon\ Studios/ total 1631204 drwxrwxr-x 2 dinchamion dinchamion 4096 Oct 18 20:13 Config -rw-rw-r-- 1 dinchamion dinchamion 1670331479 Oct 18 21:52 Player.log -rw-rw-r-- 1 dinchamion dinchamion 841 Oct 18 21:41 prefs drwxrwxr-x 2 dinchamion dinchamion 4096 Oct 18 20:14 Saves drwxrwxr-x 2 dinchamion dinchamion 4096 Oct 18 20:02 Scenarios

Not sure how large it'd grow if I just let it run for hours. I have space, but not infinite. :-) And also above a certain size I would have an issue uploading it to someplace.

I grabbed the tail after killing it:

dinchamion@tranquility-pi:~/apps/box64/build$ tail /home/dinchamion/.config/unity3d/Ludeon\ Studios/RimWorld\ by\ Ludeon\ Studios/Player.log 20228|0xfffeecaf6d3b: Calling __sched_yield (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFE27DE7080, 0x0, 0xFFFEECE1FD68, ...) => return 0x0 20228|0xfffeecaf6d3b: Calling __sched_yield (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFE27DE7080, 0x0, 0xFFFEECE1FD68, ...) => return 0x0 20228|0xfffeecaf6d3b: Calling __sched_yield (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFE27DE7080, 0x0, 0xFFFEECE1FD68, ...) => return 0x0 20228|0xfffeecaf6d3b: Calling __sched_yield (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFE27DE7080, 0x0, 0xFFFEECE1FD68, ...) => return 0x0 20228|0xfffeecaf6d3b: Calling __sched_yield (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFE27DE7080, 0x0, 0xFFFEECE1FD68, ...) => return 0x0 20228|0xfffeecaf6d3b: Calling __sched_yield (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFE27DE7080, 0x0, 0xFFFEECE1FD68, ...) => return 0x0 20228|0xfffeecaf6d67: Calling nanosleep (/lib/aarch64-linux-gnu/libc.so.6)(0xFFFE25C5A4C0, 0x0, 0xFFFEECE1FD68, ...) =>20226|0xfffeecaf1415: Calling pthread_self (/lib/aarch64-linux-gnu/libc.so.6)(0x0, 0xA, 0xA2, ...) => return 0xFFFE263820C0 20226|0xfffeecaf148b: Calling my_pthread_kill (/usr/local/bin/box64)(0xFFFFA7DB0870, 0x1E, 0xA2, ...) => return 0x0 Sigactionhanlder for signal #30 called (jump to 0xfffeecaf5ac2/???(0xfffeecaf5ac2)) 20004|signal function handler 0xfffeecaf5ac2 called, RSP=0xffffa6fa78a8

Maybe you see something from this. Let me know if you want me to run it longer, or upload this 1.7gig log to Dropbox, or something else.

I'd love to figure this out. I miss playing Rimworld.

mogery commented 2 years ago

about an hour or so

Hmm. Ok. That's an issue for trying over and over again. I'll think about how to approach this.

mogery commented 2 years ago

Ok, so I only care about logs if the game crashes, but the game only crashed after the long waiting for preload.

I'll try to get some RimWorld binaries and test on my own time.

@dinchamion What version Ubuntu MATE are you using?

KweezyCode commented 2 years ago

i got it running only once, can't run game anymore for no reason. i am using Ubuntu 21 (the latest). I can send you binaries

mogery commented 2 years ago

i got it running only once, can't run game anymore for no reason. i am using Ubuntu 21 (the latest). I can send you binaries

Please send Player.log with LOG=3 and a memory map. I'll try to get binaries a legit way. We'll see.

dinchamion commented 2 years ago

Current testing status:

I've uploaded a player.log from the crash (with no loglevel) and the "idle" state I killed after a while (with loglevel 3): https://www.dropbox.com/sh/bhkan3u8t115hhw/AADlce1rifWNYzGilarGtvoca?dl=0

I don't know how to do a memory map, you'll need to be slow and patient with me. :-P

mogery commented 2 years ago

I've uploaded a player.log from the crash (with no loglevel) and the "idle" state I killed after a while (with loglevel 3): https://www.dropbox.com/sh/bhkan3u8t115hhw/AADlce1rifWNYzGilarGtvoca?dl=0

Unfortunately I can't really deduce anything from that, only crash logs are useful to me. I'll need to make the game crash with a debugger running on a machine that's local to me.

I'll see what I can do when I get home in a couple hours.

Are you also running Ubuntu 21.04?

dinchamion commented 2 years ago

Are you also running Ubuntu 21.04?

Yes, 21.04 on a Pi4 8GB.

I've left the game running (or whatever it's doing) in the morning, so we'll see if I have anything useful when I get home. (I'm hoping it'd crash instead of just hanging there, to get a crash log.)

mogery commented 2 years ago

Could anyone zip up their game folder and send it to me to mo.geryy@gmail.com? It would let me debug on my own terms and would be amazing

mogery commented 2 years ago

Alright, cool, got it, working on it. Thanks.

mogery commented 2 years ago

Ok, something very weird is going on. Sometimes FreeDynablock is doublefreeing (doublefree or corruption (fasttop)). Sometimes Run gets R_RIP = 0, crashing the whole process. Sometimes mono_os_mutex_lock gets a bad mutex and gets EINVAL from pthread_mutex_lock. Sometimes the game's XML parsing just fails with abruptly ended strings or NullReferenceExceptions. Sometimes Mono misdetects a class as having weak fields, and then tries to register it in the GC, which fails, because the GC has no support for them. Sometimes FillBlock64's call to malloc fails with corrupted double-linked list. Sometimes the game starts completely fine. Thoughts, @ptitSeb? Smells like some sort of memory corruption or memory mapping issues to me.

ptitSeb commented 2 years ago

Yes, it's a memory corruption! It may be because a wrong function signature? Or a structure thta have a different size in x86_64 vs ARM64 but is not adjusted in box64 yet.

mogery commented 2 years ago

Yes, it's a memory corruption! It may be because a wrong function signature?

Wrong function signature as in one that's wrapped?

ptitSeb commented 2 years ago

yes. It's just wild guess of why memory would get corrupted.

bbbruni commented 2 years ago

@ptitSeb I think it's worth it to note that custommem.c (with 4e5473150a9e76d427c3946a115f8ede554579b5) in box86 had potentially introduced a JIT-related regression before box64 was released. WheezyE had a SIGBUS error when trying to run .NET application with that commit and had no SIGBUS error without it.

https://github.com/ptitSeb/box86/issues/217#issuecomment-896907358

Could you clarify why you reduced PROT_DYNAREC from 0x10000 to 0x80 in that commit, please?

ptitSeb commented 2 years ago

I reduce PROT_DYNAREC value to have protection tracking done on uint8_t instead of uint32_t. I have seen any linux PROT_* flags higher the 0x04 (READ, WRITE, EXEC...) so I did that reduction to save some space.

ptitSeb commented 2 years ago

SIGBUS could be due to a compiler optimisation, using some NEON register on an unaligned address?

bbbruni commented 2 years ago

Please post instructions for WheezyE on how to disable compiler optimizations in that github thread so that he can try to avoid SIGBUS.

bbbruni commented 2 years ago

For sure, it' worth trying to launch 32-bit version of Rimworld in box86 with disabled compiler optimizations to see if it has the same issue.

KweezyCode commented 2 years ago

no way, new rimworld is only 64 bit. alpha and beta were 32/64

bbbruni commented 2 years ago

Does 64-bit beta have the same issue that the fresh rimworld?

KweezyCode commented 2 years ago

i know, that 32bit beta has the same issue with random crashes at loading

bbbruni commented 2 years ago

i know, that 32bit beta has the same issue with random crashes at loading

Please note, it wants testing in box86 with disabled compiler optimizations

bbbruni commented 2 years ago

@KweezyPlay @dinchamion @mogery

Please retest. Sebastien reworked custommem.c on November 5 in https://github.com/ptitSeb/box64/commit/f73e4193f032713529e26c4537e5fac44147c040

dinchamion commented 2 years ago

Still crashing unfortunately, with:

Debug info from gdb:

* Assertion: should not be reached at class-accessors.c:73

FillBlock at 0x6a140 triggered a segfault, cancelling
*** stack smashing detected ***: terminated

rimworld_log_1000.log

ptitSeb commented 2 years ago

Yes, it does crash randomly for me. But now it launch, most of the time, for me. It start like in 4 out 5 times, and I'm able to play for somet time. But after a few autosaves, it eventualy crash.

bbbruni commented 2 years ago

Yes, it does crash randomly for me. But now it launch, most of the time, for me. It start like in 4 out 5 times, and I'm able to play for somet time. But after a few autosaves, it eventualy crash.

I saw you made memprot variable in custommem.c volatile in https://github.com/ptitSeb/box64/commit/79d7e9f8eb4c702fe5558798bce79468dd7a0888. I'm a Java/C# dev and know volatile is a very tricky keyword serving as a half memory barrier in multhreading purposes in those languages. In C, volatile is usually is a way to forbid optimisations.

I am not sure I understand why you decided to use volatile. Do you want an effect of memory barriers to prevent instructions reordering by the compiler/CPU? Or want to prevent a simultaneous access of multiple threads to memprot variable?

ptitSeb commented 2 years ago

the volatile is there to prevent C compiler to do optimisation, because that variable may be changed be another thread. The issue here is that many thread are going on, with a JIT in the middle that is tranforming C# bytecode to assembly, and all this need to be handle by box64 own Dynarec, that need to track what block of code needs to be discarded (because the code itself as changed), and was code can (or must) be keeped. Multi-Thread + JIT is tricky to emulate with a Dynarec...

bbbruni commented 2 years ago

the volatile is there to prevent C compiler to do optimisation, because that variable may be changed be another thread. The issue here is that many thread are going on, with a JIT in the middle that is tranforming C# bytecode to assembly, and all this need to be handle by box64 own Dynarec, that need to track what block of code needs to be discarded (because the code itself as changed), and was code can (or must) be keeped. Multi-Thread + JIT is tricky to emulate with a Dynarec...

Maybe it's worth inserting more trace instructions in cutommem source?

From dinchamion's log, it's unclear in which of transformations mentioned by you (I mean, in which lock) a segfault took place.

ptitSeb commented 2 years ago

I'm not sure were to search. That's the issue with multi-task problem, they don't trigger where the root cause is. Maybe I'll try to add some Memory Barrier later, like with box86. Maybe that will help...

ptitSeb commented 2 years ago

I have added the memory barrier. You can retry @dinchamion

dinchamion commented 2 years ago

Looks to be starting up reliably now. :-)

Need to do more testing and actually playing it more, but so far so good. (The errors on startup are probably due to my savegame being vanilla without expansions.)

Screenshot at 2021-11-16 20-49-34 Screenshot at 2021-11-16 20-49-42

bbbruni commented 2 years ago

Looks to be starting up reliably now. :-)

Need to do more testing and actually playing it more, but so far so good. (The errors on startup are probably due to my savegame being vanilla without expansions.)

Could you please say if you had a crash at 6a1b9e050a488cb5b2e1217fd3a01e484c3d7bb7?