Windows test instability - hard to reproduce

brotskydotcom commented 6 months ago

I've noticed that the tests will occasionally fail on Windows. This is not repeatable (sometime it happens twice in a row, and sometimes it doesn't happen once in a hundred runs), and I'm not getting reports of failures from the field.

It most often happens in the explicitly multi-threaded tests (test_simultaneous_create_set_move and test_simultaneous_independent_create_set). I've seen it happen in the single-threaded tests as well (test_empty_service_and_user and test_update), but only when the single-threaded tests are being run in a multi-thread test environment, never when --test-threads=1

Almost all the tests set a password and then immediately read it back and delete it. The nature of the failure is that the read or delete gets a NoEntry failure. (When the delete gets a NoEntry failure, there is a following read that also fails because it expects to get no entry but finds one.)

The most common failure situation is that an entry is created and set on one thread and then passed into another where it is read and deleted. If I run the tests single-threaded then I can consistently get test_simultaneous_create_set_move to fail at threading.rs line 75 where the first thing that happens in the spawned thread is the read. But if I run the tests multi-threaded this rarely fails.

It feels like there is some sort of race condition between the threads in the keyring process and the threads in the windows credential manager, so that when the keyring process makes a call there's a thread in wincred that gets created to handle that call, and threads in wincred may run out of order relative to when they are started (so the read is serialized before the set even though they weren't spawned that way).

I don't see this happen on platforms other than windows. All help appreciated.

brotskydotcom commented 6 months ago

@russellbanks I would be interested in your take on this. It certainly seems to be thread-related, because if I move the set of threading.rs line 73 one line down into the closure (see this link) and I run single-threaded I never get errors. So that suggests that calls made on the same entry from different threads to the windows credential manager are not synchronized with the execution of the calling threads.

But then again: I also get errors in basic tests from time to time if I run the tests multi-threaded. But the basic tests never make calls to the same entry from different threads.

I find the whole situation very puzzling.

brotskydotcom commented 6 months ago

It's probably worth noting that I'm running my windows tests on a simulation (Parallels Windows 11 on a Mac M1 machine). So it would be good to know if these tests also fail on windows hardware.

russellbanks commented 6 months ago

It's probably worth noting that I'm running my windows tests on a simulation (Parallels Windows 11 on a Mac M1 machine). So it would be good to know if these tests also fail on windows hardware.

I've been able to reproduce this on a Windows machine by rerunning the tests until it fails. It's confusing because they pass most of the time.

ReactorScram commented 5 months ago

https://github.com/ReactorScram/keyring-rs-repro/actions/runs/7630294543/job/20785648983

It fails a little different in CI than on my laptop. Maybe the CI CPUs are faster or slower or something, but it does seem to fail pretty often.

brotskydotcom commented 5 months ago

It's pretty clear that the Windows test failures are all related to setting a credential on one thread and then immediately accessing that credential on a different thread. While this feels, to me, like it's clearly a bug in the Windows credential manager, it's also one that's unlikely to affect keyring clients adversely, as this should be a very uncommon usage pattern and certainly one that clients can avoid.

I have submitted PR #164 which removes the problematic behavior from the Windows integration tests, and adds a feature which can be used to run the tests. I would appreciate if @ReactorScram and @russellbanks could retry the tests in their environment to confirm that this does "fix" the tests.

I will see if I can report the underlying behavior as a bug in the Window credential manager - any help in how to do that appreciated! :)

Once I've heard back from some others that this fixes the tests for them, I will release it and re-close this bug.

ReactorScram commented 5 months ago

@brotskydotcom But in my case I do set_password and get_password on the same credential and within the same thread, and it still failed https://github.com/ReactorScram/keyring-rs-repro/blob/main/src/main.rs

I think this only changes how the bug gets reported to Microsoft, and how to document it. If it can happen in a single thread, then two library authors could read the keyring-rs documentation and do everything right within their own code and their own thread, and then an app using both libraries in separate threads could cause them to fail even though they're both individually correct.

I speculated in a deleted comment that maybe Windows disables its page cache for the vault files for security, which can cause it to read stale data from disk in some scenarios. (AIUI most OS page caches take responsibility for making files look consistent even if they aren't flushed out to disk, and I'm guessing disabling the page cache is a code path that's almost never exercised)

I have no idea how to report bugs to Microsoft, though. If nothing else I'd say open an issue on the windows-rs Github and let them redirect you, at least Github is probably checked regularly :P

brotskydotcom commented 5 months ago

But in my case I do set_password and get_password on the same credential and within the same thread, and it still failed ReactorScram/keyring-rs-repro@main/src/main.rs

Your tests are doing a bit more than just set, get, and delete on the same credential; they are actually re-creating the (rust-level) entry on every call. I'm not sure why this would make a difference, but in my testing it absolutely does. If I simply use the same keyring entry I can easily set/get/delete/get thousands of times on the same thread, and even simultaneously on different threads, without failures.

I've updated the tests on all platforms to do multiple set/get/delete/get cycles on each platform (fewer on linux, because there are capacity limits in the inter-process calls to dbus for secret-service, and to the kernel for keyutils). I'm going to release this version and I'd be very interested in what happens if you update your regression tests not to create a new entry on each call.

ReactorScram commented 5 months ago

@brotskydotcom That's interesting. I hoisted the new_with_target call and they still fail on my dev laptop and on CI. (I'm still on keyring 2.3.1 for these) https://github.com/ReactorScram/keyring-rs-repro/actions/runs/7699494265

I ran the official tests for keyring 252f43c3e74 on the same laptop and they passed.

Just for science, I tried putting a SeqCst fence in front of each call, but it still failed.

brotskydotcom commented 5 months ago

@ReactorScram Thanks for spending so much time looking at this! Why is it always Windows where this stuff happens? I'm very glad to hear the tests are working for you in 2.3.2. I will try to report this problem to Microsoft.

hwchen / keyring-rs

Windows test instability - hard to reproduce #163