rakslice / macemu

Basilisk II and SheepShaver Macintosh emulators
0 stars 0 forks source link

ss x64 on various current distros kanjitalk755: hang with SDL2 default opengl render driver #28

Closed rakslice closed 4 years ago

rakslice commented 4 years ago

Summary:

In ubuntu 20.04 amd64 on a dual core with HT machine, kanjitalk755's fork hangs indefinitely at a black screen on launch, when using the "opengl" SDL 2 renderer. As a workaround use you can use the SDL renderer "software" instead of the default "opengl" one. As of commit a01387b1af07eb4217a3ea9b6561125dbd1c060c, the preferences file parameter sdlrender works for this purpose so you don't have to make a code change.

Previous first investigation steps:

From a cursory look with gdb it is hanging in the various spin_lock()s in spcflags.hpp, although this may just be what a readily identifiable thread is doing at the time.

The upstream cebix/macemu code doesn't have this issue on the same machine.

rakslice commented 4 years ago

Note that kanjitalk755 works okay on linux for amd64 generally, such as the random debian 9.5 amd64 vm I have around, even with only a single processor.

rakslice commented 4 years ago

sdl2 in 20.04 is 2.0.10+dfsg1-3

rakslice commented 4 years ago

Same issue with sdl2 2.0.10 built from source.

Note that cebix/macemu has a sort of similar hang before starting in 20.04 if audio is enabled in the config, despite its message about disabling audio because the devices weren't found in that case

rakslice commented 4 years ago

Some old commits that appear in the commit log of kanjitalk755's branch don't have this problem, but maybe they just appear there by virtue of being merge sources that don't really represent the state of the history of the branch

rakslice commented 4 years ago

Perhaps it would be worth getting to grips with some of the transient build problems in the history of the branch to usefully bisect it, for instance the undefined SIGSEGV_FAULT_HANDLER_ARGLIST error of e.g. f96c92ad5178b29d096d87bee0826196c0f7df3d

rakslice commented 4 years ago

164935017b97eb2f1d39e566bd05ee428bf9099c -> ok f170a527b21de23178666ffb6be693d4996e8272 -> failing to build

rakslice commented 4 years ago

Okay, so part of the build failure is that some autoconf checks that should pass are failing because they just test build sigsegv.cpp but it doesn't build for reasons unrelated to the check, e.g. because of some mach specific code not in a mach-specific define case, as well as missing the include of stdint.h. There are still more problems affecting f170a52, but for f96c92ad5178b29d096d87bee0826196c0f7df3d it is enough to gets us through the build and it doesn't have this issue!

f96c92ad5178b29d096d87bee0826196c0f7df3d -> ok

rakslice commented 4 years ago

This kind of autoconf checks used in the project, where it just tries different flags on big huge source files, are kind of brittle for development for this reason... if you're absolutely sure the code works, they are an okay way to save time I guess, but considering that code can fail to compile for a variety of reasons while doing development, they are a very poor substitute compared to just having narrow feature check code specifically intended for such checks

rakslice commented 4 years ago

52fe2290fe5816ad67aa2796331773c2c5208ee3 is the first bad commit

rakslice commented 4 years ago

This is just the first bad build with default config options by virtue of it being the first one where SDL2 is enabled. If I explicitly configure 52fe2290fe5816ad67aa2796331773c2c5208ee3 with --enable-sdl-video=no --enable-sdl-audio=no [and work around the flag bug in main_unix.cpp], it works fine

rakslice commented 4 years ago

After a fair amount of shotgun debugging my way through various parts of the SDL driver, I found that changing the empty render driver hint in video_sdl2.cpp to request software rendering, as was already done for the WIN32 case, made it start working immediately on my test system:

SDL_SetHint(SDL_HINT_RENDER_DRIVER, "software");

rakslice commented 4 years ago

I should probably follow up on this as it will likely come up with 20.04 hitting wider adoption soon

rakslice commented 4 years ago

Note that just commenting out the empty hint entirely so that there is no hint doesn't make the problem go away.

rakslice commented 4 years ago

In the failing config with opengl renderer, redraw_func is being called at reasonable ~16ms intervals, but the VideoVBL calls aren't happening. Does this mean the emulated Mac doesn't think video is in an active state? Is this because some value it has received from initialization to that point is not good, or else something else has gone wrong that is holding up the emulated Mac (what could it be)?

rakslice commented 4 years ago

I had the same thing happen on Arch x64 2020.06.01.

Notably both of these distros seem to have the same kind of 'modern' permissions changes where e.g. you need to run gdb as root

kanjitalk755 commented 4 years ago

This issue was fixed by 1da8385. And had existed since 2005 (dd2b9a9).

rakslice commented 4 years ago

@kanjitalk755 Curious. Did dd2b9a95d5c51866c5fb9571c9dfd3ec8e11a246 introduce it? If so, how? Something to do with the alignment change of regs?

rakslice commented 4 years ago

Oh, I see, by virtue of spcflags being inside powerpc_registers and the instance of it actually located at regs_ptr() never getting initialized there

rakslice commented 4 years ago

@kanjitalk755 It just occurs to me that initializing that regs_ptr() seems like a job for placement new. I wonder if there are more of these things in the codebase that are effectively c-style casts of a confusingly-typed buffer?

kanjitalk755 commented 4 years ago

Can you explain in detail?

rakslice commented 4 years ago

From the old dd2b9a95d5c51866c5fb9571c9dfd3ec8e11a246 change:

In class powerpc_cpu:

Before:

    powerpc_registers regs;

After:

    struct {
        powerpc_registers regs;
        uint8 pad[16];
    } _regs;

    // Make sure the calculation of the current offset makes use of
    // 'this' as this could make it simplified at compile-time
    powerpc_registers *regs_ptr() const         { return (powerpc_registers *)((char *)&_regs.regs + (16 - (((char *)&_regs.regs - (char *)this) % 16))); }

In the case where regs_ptr() is shifted over from the one at regs, the constructor for powerpc_registers is never run on the shifted-over instance. That instance doesn't come from anywhere, it is just a C-style cast (powerpc_registers *) applied to memory that was already part of an object. Since the powerpc_registers constructor never runs on it, this also means that the constructors for the members of the powerpc_registers aren't called, and those members' members, etc.

In 1da83854b0e67ac5061db878c27fc6247ad2362b you fix the basic_spcflags' mask and lock members potentially not being initialized, since the basic_spcflags constructor was potentially not called, by moving the initialization to a separate member function basic_spcflags::init() and explicitly calling it in the powerpc_cpu constructor.

Another way to fix that would be to just make sure the powerpc_registers constructor actually runs on the memory location we use it at:

#include <new>

powerpc_cpu::powerpc_cpu() {
    new (regs_ptr()) powerpc_registers();

...

This has the benefit that it also initializes everything else in powerpc_registers, without having to create a solution for initializing each member, though I don't know if not being initialized is functionally a problem for anything else in there.

With regs still in existence, regs' constructor also runs, implicitly at the start of the powerpc_cpu constructor, but since it runs first it shouldn't clobber anything set by the one we actually want. (See https://gist.github.com/rakslice/3dbd4fc89964a37b4da3dd5a4fc93ef1)

It would seem reasonable to ask: Since we found this problem, does the codebase have other problems of this form?

The general form of this problem is that some memory that was part of an existing object was cast as a pointer to a different class, and then used, without the exact underlying memory range being initialized/constructed as an instance of that class first by any code. There are so many casts happening everywhere in this codebase; it's not trivial to verify each.

kanjitalk755 commented 4 years ago

Thank you for the explanation.

dd2b9a9 implemented the alignment without relying to compiler options before it was standardized, but failed as a result. The compiler before C++11 cannot be used, but I think the following implementation is straightforward.

https://github.com/kanjitalk755/macemu/tree/test_alignas

C-style cast (powerpc_registers *) is still dangerous, but it's common.

It would seem reasonable to ask: Since we found this problem, does the codebase have other problems of this form?

If the project is development phase, we may need to do a thorough code review. However, in the maintenance phase it may be sufficient to handle the bugs found.