Open slouken opened 1 year ago
The rwlock ones are likely me having set up the metadata for RWlock api itself incorrectly, so it thinks we're releasing the lock incorrectly when we aren't. I could be wrong, though.
@slouken, when you get a moment, can you make sure I didn't botch the function declaration in include/SDL3/SDL_mutex.h? I just sort of guessed at how to use those macros, and I think this is a false positive because it doesn't understand how RWlock is meant to work, because I've specified this incorrectly in some way.
(Specifically, I think it expects there to be a seperate "unlock something we previously locked for writing" function based on current declarations, but both reading and writing use the same unlock function atm.)
/home/slouken/projects/SDL/src/audio/SDL_audio.c:256:5: warning: releasing mutex 'current_audio.device_list_lock' using shared access, expected exclusive access [-Wthread-safety-analysis]
SDL_UnlockRWLock(current_audio.device_list_lock);
^
/home/slouken/projects/SDL/src/audio/SDL_audio.c:247:5: note: mutex acquired here
SDL_LockRWLockForWriting(current_audio.device_list_lock);
^
I just sort of guessed at how to use those macros, and I think this is a false positive because it doesn't understand how RWlock is meant to work, because I've specified this incorrectly in some way.
I read the Clang docs, and figured out the proper attribute in b16165a33fdd993d0801d9e5cef79182c3f51530. Still need to sort through some warnings, but this cleaned up a bunch of them.
Okay, so here's a conundrum:
The clang attributes don't allow conditional locks...locking a mutex must work no matter what, unless it's a "try lock," where it will get upset if you don't check the return value to make sure the lock succeeded.
That means code like this, (a cutdown version of something currently in SDL_properties.c), triggers a warning:
if (SDL_LockRWLockForWriting(SDL_properties_lock) == 0) {
do_some_stuff_while_holding_the_write_lock();
SDL_UnlockRWLock(SDL_properties_lock);
}
do_some_stuff_that_doesnt_require_a_lock();
}
The do_some_stuff_that_doesnt_require_a_lock();
line triggers this warning:
/home/icculus/projects/SDL/src/SDL_properties.c:138:9: warning: mutex 'SDL_properties_lock' is not held on every path through here [-Wthread-safety-analysis]
This warning is a little confusing, but it's upset because it thinks we only released the lock in one branch, because it assumes we own the lock no matter what if we call that function and doesn't care what its return value says.
Now, we can mark all our mutex functions as try-locks (and technically they are, because strictly speaking the API says it can return an error without locking the mutex), but we sure don't treat them as such in SDL: there are 205 references to SDL_LockMutex inside SDL itself, and exactly three of them check the return value (all in the generic RWlock implementation).
The new SDL_properties code checks the return value when locking its RWlock and takes defensive action, ironically triggering compiler warnings about thread safety.
Maybe we should change the mutex and rwlock APIs to have locks always succeed if they were successfully created (and continue to pretend to work if the lock is NULL), and remove the return value, except on try-locks, which only "fail" if the lock is currently held by another thread.
And if a mutex fails to lock at the system level when the lock is valid, we just...assert and otherwise carry on...? What sort of situation could make a valid mutex fail to lock?
As for the original audio warnings, this is what's killing me:
static SDL_AudioDevice *ObtainPhysicalAudioDevice(SDL_AudioDeviceID devid);
(and ObtainLogicalAudioDevice, too.)
I can't see a way to annotate this to say "this returns an object that we have locked" and the analyzer isn't smart enough to deduce this itself, even when it has access to the implementation code.
I went through the first dozen or so of these warnings in the audio code, and all of the warnings are due to this pattern. It did not, however, flag any of the FIXMEs I have for extremely-unlikely-but-possible race conditions.
It seems like this tool has a long way to go, still.
I think having the lock functions return void, or just ignoring errors in lock and unlock is fine.
The clang thread-safety analysis doesn't handle flow across function call/return boundaries, it strictly checks lock, code, unlock. I think this is possibly by design. It's much harder to prove that multi-thread code is safe when locks are held across function return boundary. This gave me pause myself when I saw you doing that, because it's going to be hard to track down issues in new code that isn't aware that it needs to unlock after calling.
I've found that if you follow the rules, it's really good at telling you when your code is thread-safe. I'm betting that the race conditions you're seeing it miss are because you're bending the rules that it's trying to enforce, or not annotating in a way that it expects. :) Do you have any examples?
This gave me pause myself when I saw you doing that, because it's going to be hard to track down issues in new code that isn't aware that it needs to unlock after calling.
It's relying on the word "Obtain" to mean something to the developer here; we could call it FindDeviceByIdAndLockIt() if that's better. :)
(I could also add a "Release" function to match it instead of just having the Obtain caller explicitly unlock returned_device_pointer->lock
, which might help.)
It could work with this tool mostly as-is if there were a way to annotate "the acquired resource is the return value" but I couldn't see a way to do that. There's SDL_RETURN_CAPABILITY, but that wants a specific object.
The simplest (but also probably ugliest) change that would probably make the tool happy is to remove the Obtain*() functions and change them to GetPhysicalDeviceFromId() or whatever, and then require the caller to explicitly lock the returned device, but that risks the opposite problem in new code: now they have to be aware they need to explicitly lock the device.
Do you have any examples?
The biggest FIXME in this regard is closing an audio device...it can't hold the device lock, because if it holds that lock, because it has to WaitThread() on the device thread that also might be waiting on that lock. But in the time that you release the lock to do the WaitThread, six other things (probably won't, but) could grab the lock and cause havoc.
It's fixable, with some effort. But it's way too complex a system for this tool (or maybe any tool) to find through analysis, especially because the half of the code that will cause the bug is the application calling into the API, and it isn't available to this tool. :)
I think having the lock functions return void, or just ignoring errors in lock and unlock is fine.
I'm going to take a run at making these void tomorrow and see if it sucks, I'll report back.
The simplest (but also probably ugliest) change that would probably make the tool happy is to remove the Obtain*() functions and change them to GetPhysicalDeviceFromId() or whatever, and then require the caller to explicitly lock the returned device, but that risks the opposite problem in new code: now they have to be aware they need to explicitly lock the device.
That would be the easiest to verify, both with eyeballs and with clang. You can assert that the lock is held in the function, and I think there's even notation for the assertion function that will make clang happy. You should annotate the data that is protected by the lock, so clang knows what is protected and can validate that the lock is held for that data. You can see an example of this in the joystick code.
Do you have any examples?
The biggest FIXME in this regard is closing an audio device...it can't hold the device lock, because if it holds that lock, because it has to WaitThread() on the device thread that also might be waiting on that lock. But in the time that you release the lock to do the WaitThread, six other things (probably won't, but) could grab the lock and cause havoc.
Yeah, I would just mark that function as ignored by thread analysis. I haven't found a good pattern for closing/destroying objects that isn't filled with race conditions, so I just document that you can't destroy an object until all threads are done touching them. This is a little tricky because it sounds like the application isn't directly destroying it, but the ref counting you're doing might help with that.
If you enable thread-safety checking on the audio code, a whole pile of warnings come up.
You can enable thread-safety checks by cleaning the build folder and setting these environment variables before running cmake:
Here's the current output: