jackaudio / jack2

jack2 codebase
GNU General Public License v2.0
2.16k stars 371 forks source link

jack_transport_locate crashes on macOS / Apple Silicon / HomeBrew #749

Open cme opened 3 years ago

cme commented 3 years ago

Describe the bug

Use of the jack_transport_locate function seems to reliably cause a crash in clients on Apple Silicon, with the current Homebrew packages of JACK.

This can be seen just be running jack_test -v, the process will crash after calling jack_transport_locate()

Environment

Steps To Reproduce

% jack_test -v

Expected vs. actual behavior

Expected behaviour (from a Linux ARM 32-bit host):

Transport is rolling...
Test jack_transport_locate while rolling
Transport current frame = 4096
Transport is rolling...
Do jack_transport_locate
Locating.... frame = 5120
Transport is rolling...
Locating.... frame = 5120
Transport is rolling...
Locating.... frame = 6144
Transport is rolling...
Locating.... frame = 6144
Transport is rolling...
sync callback : Releasing status : now ready...

Observed behaviour:

Transport is rolling...
Test jack_transport_locate while rolling
Transport current frame = 6144
Transport is rolling...
Do jack_transport_locate
zsh: bus error  jack_test -v
falkTX commented 3 years ago

Can you get a (back)trace?

cme commented 3 years ago

The crash is in CAS (in macosx/JackAtomic_os.h ) under JackTransportEngine::GenerateUniqueID(), and the faulting instruction is a CASAL instruction (generated by the __sync_bool_compare_and_swap intrinsic added in #685 )

The memory address looks valid and can be read and written via the debugger, although it's in shared memory which should probably be mapped with MAP_HASSEMAPHORE but isn't. I suspected this might be a problem, but adding the flag doesn't seem to help.

I can't seem to find any reference that mentions any new restrictions or requirements on semaphores in shared memory with these instrinsics or instructions on AArch64 macOS.

falkTX commented 3 years ago

the code was crashing in a similar path before. and was fixed by offseting some struct members so memory was aligned to 32bit (or 64bits not sure now)

this might need something like that too.

falkTX commented 3 years ago

Odd enough, using something like https://kx.studio/Applications:Carla I remember the transport controls behaving normal, no crashes from that.

cme commented 3 years ago

Alignment had been my first thought, but it seemed that some CAS calls were misaligned without raising an alignment exception, so I discounted that thought.

But after a bit more digging, it is indeed an alignment fault. It looks like memory that's potentially shared with another core will raise an alignment fault on misaligned compare-and-swap (which should be expected, because the architecture can't guarantee atomicity for misaligned acesses) but for local memory the OS is letting it get away with it without raising an exception.

I think these struct members do really need to be naturally aligned. If I'm understanding correctly, the only reason they're packed on macOS is because x86_64 and aarch64 get the same structures that way to share memory between native & Rosetta2 processes?

Theoretically we should be able to force alignment on anything that's accessed atomically, but the __packed__ attribute on containing structs seems to make contained structs lose their aligment requirements. That may be a Clang bug.

cme commented 3 years ago

I've let this sit for a while since I've been busy with other things.

Specifying the alignment for members of a packed inherently doesn't work as 'packed' structs inherently have an alignment requirement of only byte-alignment. This breaks atomicity guarantees on ARM, and having packed structs is anyway undesirable for performance reasons on most architectures as well. I'd like to fix this, but before I do so I wanted to run the options by interested parties.

There are two potential options I see:

  1. Remove use of packed structs and use other means (standard sized ints, field oldering and explicit padding) to ensure that shared-memory structures have exactly the same layout whether compiled for 32 bit or 64 bit.
    • this would also involve adding some test framework to verify the format is as expected

or:

  1. encapsulate atomically-accessed counters into something that ensures atomicity without the same alignment constraints

I think #1 is more desirable as it should also give some performance improvement as well, but would likely be more work than #2. Hence why I want to check in before embarking on such a project!

Just to double-check my understanding, JACK is required to use shared-libraries so that client and server both pick up their definitions of structures from the same place. Is there any other mechanism for keeping things in sync? Is there any guarantee that a system with both 32-bit and 64-bit libraries will have the same versions (and hence structure definitions)?

Thanks!

cme commented 3 years ago

Perhaps this warrants a more general discussion, perhaps I should take this to the mailing list?

falkTX commented 3 years ago

It is fine to discuss it here.

It is hard to know the right answer, because any change has implications for long-term future. So I am very reserved when it comes to changing stuff like the struct padding.

The theory behind packing is that we want:

  1. compatibility between different architectures of the same base system (e.g. linux 32bit and linux 64bit)
  2. compatibility across different OSes, mainly for use as network packets (encoded as little-endian afaik)
  3. compatibility across compilers, like mingw and msvc on windows

I dont think all those points are achieved right now though, specially no. 2.

If we keep the packing active, but place the item(s) that require specific alignment as the first member in the structs, would that work?

cme commented 3 years ago

The structures are used directly for network transfer? I hadn't realised that. That would mean that changing the data formats at all will break compatability across network uses. I guess I will also review for which structures are used for network and which for shared memory.

Adding test framework for the structure layouts has the effect of building an (executable) specification of the data formats, so that seems pretty desirable.

Placing things with alignment requirements first doesn't work in the general case becase the containing structs themselves don't have alignment requirements, but might happen to work if it's possible to arrange it that way. It's fragile though, and can be broken by things like a struct containing more than one member struct which has alignment requirements.

But it's worth trying since it seems a smaller amount of work! The fragility could be mitigated by a small amount of selective assertions at call sites for CAS that assert the struct offset just for these members.

cme commented 3 years ago

Following uses of atomically accesed structures and adding some checks, I have changed a few field orders and now seem to have working JACK on Apple Silicon (including Hydrogen).

Are there any other access controls or API version information that need to be updated on changing the struct layout? I didn't find anything.

falkTX commented 3 years ago

I recall there is a global ABI version macro.. we need to bump this.

falkTX commented 3 years ago

Done in 13d5eccba44d615767892c6c93cb73c9b15587f0

So if a client is running on an old jack server somehow (jack updated while clients running or different versions of jack connected via network) an error is thrown.

We can consider this fixed, right?

(v1.9.19 to be tagged in around 10 days)

cme commented 3 years ago

We can consider this fixed, right?

Awesome. That should take care of it yup! I'll leave you to close since I don't know what process you follow for testing and verification. Thanks!