pubgrub-rs / pubgrub

PubGrub version solving algorithm implemented in Rust
https://pubgrub-rs.github.io/pubgrub/pubgrub/
Mozilla Public License 2.0

provoke bugs that rely on iteration order of hash maps/sets #181

Closed BurntSushi closed 5 months ago

BurntSushi commented 5 months ago

This PR is not meant to be merged.

This PR hacks in a hash map/set wrapper that forcefully shuffles iteration order with the goal of provoking bugs that result from depending on a particular iteration order. Namely, hash maps/sets in Rust's standard library don't have a defined iteration order. With that said, I think that in practice, their order is somewhat stable/predictable. But in some code we're working on at Astral (that isn't public just yet), we've noticed some non-deterministic test failures that we believe might be coming from pubgrub. This PR is a test of that theory by forcing the issue and more aggressively shuffling iteration order.
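For context, a minimal sketch of what such a shuffling wrapper could look like (hypothetical, using the rand crate; not the exact code in this PR):

use std::collections::HashMap;
use std::hash::Hash;

use rand::seq::SliceRandom;

// Hypothetical sketch, not the exact code in this PR: a map wrapper that
// re-shuffles its entries on every iteration, so any caller that silently
// depends on a particular order fails loudly instead of intermittently.
struct ShuffledMap<K, V> {
    inner: HashMap<K, V>,
}

impl<K: Eq + Hash, V> ShuffledMap<K, V> {
    fn new() -> Self {
        ShuffledMap { inner: HashMap::new() }
    }

    fn insert(&mut self, key: K, value: V) -> Option<V> {
        self.inner.insert(key, value)
    }

    // Collect the entries and shuffle them, so the order differs on every call.
    fn iter(&self) -> impl Iterator<Item = (&K, &V)> + '_ {
        let mut entries: Vec<(&K, &V)> = self.inner.iter().collect();
        entries.shuffle(&mut rand::thread_rng());
        entries.into_iter()
    }
}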

With this PR, I have been able to find one consistently failing test: prop_same_on_repeated_runs. I'm hopeful it will show up in CI too. But the failure output looks like this for me locally:

---- prop_same_on_repeated_runs stdout ----
proptest: FileFailurePersistence::SourceParallel set, but failed to find lib.rs or main.rs
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 327: NumberVersion(12), 164: NumberVersion(12), 350: NumberVersion(8), 531: NumberVersion(1)} }
 right: Map { map: {648: NumberVersion(11), 327: NumberVersion(7), 164: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 82: NumberVersion(7), 327: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
 right: Map { map: {648: NumberVersion(11), 82: NumberVersion(12), 327: NumberVersion(12), 350: NumberVersion(8), 531: NumberVersion(1)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 327: NumberVersion(7), 41: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
 right: Map { map: {648: NumberVersion(11), 327: NumberVersion(12), 41: NumberVersion(12), 350: NumberVersion(8), 531: NumberVersion(1)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 327: NumberVersion(7), 20: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
 right: Map { map: {648: NumberVersion(11), 327: NumberVersion(12), 20: NumberVersion(12), 350: NumberVersion(8), 531: NumberVersion(1)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 10: NumberVersion(12), 327: NumberVersion(12), 350: NumberVersion(8), 531: NumberVersion(1)} }
 right: Map { map: {648: NumberVersion(11), 10: NumberVersion(7), 327: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 8: NumberVersion(7), 327: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
 right: Map { map: {648: NumberVersion(11), 8: NumberVersion(12), 327: NumberVersion(12), 350: NumberVersion(8), 531: NumberVersion(1)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 327: NumberVersion(7), 7: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
 right: Map { map: {648: NumberVersion(11), 327: NumberVersion(12), 7: NumberVersion(12), 350: NumberVersion(8), 531: NumberVersion(1)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:460:39:
assertion `left == right` failed
  left: Map { map: {648: NumberVersion(11), 350: NumberVersion(8), 327: NumberVersion(12), 6: NumberVersion(12), 531: NumberVersion(1)} }
 right: Map { map: {648: NumberVersion(11), 6: NumberVersion(7), 327: NumberVersion(7), 350: NumberVersion(10), 531: NumberVersion(12)} }
thread 'prop_same_on_repeated_runs' panicked at tests/proptest.rs:397:1:
Test failed: assertion failed: `(left == right)`
  left: `"Because there is no version of 327 in <0 | >0 and 327 0 depends on 7 ∅, 327 * is forbidden.\nAnd because 568 0 depends on 327, 568 0 is forbidden."`,
 right: `"Because 327 0 depends on 7 ∅ and there is no version of 327 in <0 | >0, 327 * is forbidden.\nAnd because 568 0 depends on 327, 568 0 is forbidden."` at tests/proptest.rs:462; minimal failing input: (dependency_provider, cases) = (OfflineDependencyProvider { dependencies: Map { map: {568: {NumberVersion(0): Map { map: {7: Range { segments: [(Unbounded, Unbounded)] }, 327: Range { segments: [(Unbounded, Unbounded)] }} }, NumberVersion(6): Map { map: {} }}, 327: {NumberVersion(0): Map { map: {6: Range { segments: [(Unbounded, Excluded(NumberVersion(1)))] }, 7: Range { segments: [] }} }}, 7: {NumberVersion(0): Map { map: {} }}, 6: {NumberVersion(0): Map { map: {} }, NumberVersion(4): Map { map: {} }}} } }, [(568, NumberVersion(0)), (568, NumberVersion(6))])
        successes: 0
        local rejects: 0
        global rejects: 0

I don't know much about pubgrub internals, so I haven't really been able to investigate the root cause of this failure. What do you think?

mpizenberg commented 5 months ago

One thing we put some effort into when designing this lib is removing, as much as possible, anything that would produce non-reproducible effects. As part of that effort, I remember we chose hash functions that are explicitly deterministic. If we recently introduced one that is not deterministic, it may be an oversight?

The test you highlight, prop_same_on_repeated_runs, is here to catch such regressions. I'll let Jacob complement this info, as I haven't followed the recent changes as closely as I hoped.

mpizenberg commented 5 months ago

Also, the errors you see are, I believe, stating that the failure cases are not the same. So if you explicitly introduced hash maps with shuffled iteration order, it makes sense that the dependency trees in an error case can be different. Indeed, there can be different reasons why a set of dependencies might be incompatible, and different paths may produce different errors. So I interpret the error you got as normal behavior of that failsafe test, which detects that something non-reproducible is happening. But it doesn't mean the solver failed when it should have succeeded. It just means it failed differently. At least that's my interpretation.

BurntSushi commented 5 months ago

But it doesn't mean the solver failed when it should have succeeded. It just means it failed differently. At least that's my interpretation.

I believe that interpretation roughly matches the kind of failure we're seeing on our non-public code. That is, something about how the solver failed can change from run-to-run, which in turn changes the error message presented to the end user.

Eh2406 commented 5 months ago

First off it's an honor to receive this PR/bug report from you! I have long been an admirer of your work.

As you noticed, it is much harder to test code that has non-determinism. So for our own sanity, we try to make sure pubgrub is deterministic if its inputs are deterministic. The prop_same_on_repeated_runs test is our way of checking whether we have added non-determinism, so I'm glad to see it is failing when you replaced our intentionally deterministic hasher with an intentionally nondeterministic one.

If any of the other tests were failing with this change, I would be much more concerned. We should not have the correctness of the output depend on the iteration order of hash maps. We had such a bug in one of our early releases, and the non-determinism was very helpful in sussing it out. It would be really nice to have the hasher available in our public API, so that we could have an "is deterministic" test using fxhash and an "error does not depend on hashing" test using std's randomized hasher. But the complexity of the public API is not worth it. (I'm open to other suggestions about having our cake and eating it too.)
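A rough sketch of what "the hasher in the public API" could look like (hypothetical, not the actual pubgrub API): a solver-like state made generic over the BuildHasher, so tests could instantiate it with a deterministic hasher or std's randomized one.

use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::{BuildHasher, Hash};

// Hypothetical sketch, not actual pubgrub API: state generic over the build
// hasher, so a test suite could plug in FxHasher (deterministic) or
// RandomState (randomized) without changing the rest of the code.
struct SolverState<P: Eq + Hash, S: BuildHasher = RandomState> {
    decisions: HashMap<P, u32, S>,
}

impl<P: Eq + Hash, S: BuildHasher + Default> SolverState<P, S> {
    fn new() -> Self {
        SolverState { decisions: HashMap::default() }
    }

    fn decide(&mut self, package: P, version: u32) {
        self.decisions.insert(package, version);
    }
}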

If you are seeing nondeterministic behavior, then let's extend our tests to find it and keep us from reintroducing it.

BurntSushi commented 5 months ago

The prop_same_on_repeated_runs test is our way of checking whether we have added non-determinism, so I'm glad to see it is failing when you replaced our intentionally deterministic hasher with an intentionally nondeterministic one.

I guess the way I look at this is that this PR exposes some kind of non-deterministic behavior that is already present in the library. It "just" makes it much easier to see. AFAIK, even if you're using a deterministic hash function, that doesn't give you any guarantees about the iteration order of a hash map/set. The order is still arbitrary.
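As an aside, a generic way to remove that dependence wherever order matters (a general pattern, not something pubgrub does today) is to sort before iterating:

use std::collections::HashMap;

// General pattern, independent of the hasher: whenever iteration order matters
// (error messages, snapshots, logs), sort the entries instead of relying on
// the map's arbitrary order.
fn sorted_entries<K: Ord + Clone, V: Clone>(map: &HashMap<K, V>) -> Vec<(K, V)> {
    let mut entries: Vec<(K, V)> = map.iter().map(|(k, v)| (k.clone(), v.clone())).collect();
    entries.sort_by(|a, b| a.0.cmp(&b.0));
    entries
}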

I believe there are some uses of std::collections::HashSet (with the default hasher). I'll put up a PR that replaces those with FxHashSet as a starting point.

BurntSushi commented 5 months ago

OK, I opened https://github.com/pubgrub-rs/pubgrub/pull/182 for that last change.

Eh2406 commented 5 months ago

I think your paragraphs are in contradiction with each other. FxHashSet is a type alias for std's HashSet with FxHasher. If we cannot rely on the iteration order of std's HashMap/HashSet (even with a deterministic hasher), then how does FxHashSet help?
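For reference, a sketch of what those aliases amount to (roughly how the rustc-hash crate defines them): the collection types are still std's, only the build hasher changes, so the iteration order remains formally unspecified.

use std::collections::{HashMap, HashSet};
use std::hash::BuildHasherDefault;

use rustc_hash::FxHasher;

// Roughly the rustc-hash aliases: std collections with a different
// (deterministic) build hasher plugged in.
type FxHashMap<K, V> = HashMap<K, V, BuildHasherDefault<FxHasher>>;
type FxHashSet<T> = HashSet<T, BuildHasherDefault<FxHasher>>;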

BurntSushi commented 5 months ago

I think your paragraphs are in contradiction with each other. FxHashSet is a type alias for std's HashSet with FxHasher. If we cannot rely on the iteration order of std's HashMap/HashSet (even with a deterministic hasher), then how does FxHashSet help?

It's a shot-in-the-dark band-aid that makes the library consistently use FxHasher everywhere. For example, maybe the non-determinism we're seeing on our end is coming from the use of std::collections::HashSet (with the RandomState hasher) and not from anything using FxHasher. If that's true, maybe switching from RandomState to FxHasher in the last few remaining places it is used will, in some sense, fix the issue we're having, even if it doesn't fix the grander issue of "can't rely on hash map/set iteration order regardless of hashing function."
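For illustration (assuming nothing about pubgrub's internals), the distinction looks like this: the same insertions iterate in whatever order the hasher dictates, and with std's RandomState that order can change from one run of the program to the next.

use std::collections::HashSet;

fn main() {
    let packages: HashSet<&str> = ["albatross", "bluebird", "crow", "dove"]
        .into_iter()
        .collect();

    // With the default RandomState hasher this order can differ between runs
    // of the program; with a fixed hasher like FxHasher it is stable in
    // practice, but still not something std promises.
    for package in &packages {
        println!("{package}");
    }
}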

To be clear, I have no clue whether the last remaining uses of RandomState in pubgrub are the source of the non-determinism we're seeing.

mpizenberg commented 5 months ago

Places where I can easily see the order of the solver changing are:

- the prioritization of packages,
- the dependency retrieval that returns a hash map https://github.com/pubgrub-rs/pubgrub/blob/4a483710a32318ece97e0de54096147291a3fc5b/src/solver.rs#L258C8-L258C24

If the first one is the cause of the problem, then you should have a look at the prioritization code in your private repo. If the second one is the cause of the problem, then it's likely a hash map iteration order issue.

I believed the iteration order would stay the same, but you're certainly right that a deterministic hasher does not necessarily mean a deterministic iteration order; I don't know whether they are related.

If that's indeed the problem, I think we could fix it by using an indexed map. Would be nice to confirm that's the source.
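A minimal sketch of that idea using the indexmap crate (an assumption about the fix, not something pubgrub currently does): IndexMap iterates in insertion order, independent of the hasher.

use indexmap::IndexMap;

fn main() {
    // IndexMap preserves insertion order, so two runs that perform the same
    // insertions iterate identically even if the hasher were randomized.
    let mut versions: IndexMap<&str, u32> = IndexMap::new();
    versions.insert("albatross", 3);
    versions.insert("bluebird", 1);
    versions.insert("crow", 2);

    for (package, version) in &versions {
        println!("{package} == {version}");
    }
}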

mpizenberg commented 5 months ago

Also @Eh2406 if the unit propagation buffer is turned into a hash map (with non-deterministic order), it may have the same effect.

zanieb commented 5 months ago

If it's helpful, here's the diff for the snapshot for one of our flaking tests

      And because albatross==3.0.0 depends on bluebird==3.0.0 we can conclude that any of:␊
           albatross<2.0.0␊
           albatross>2.0.0␊
       depends on one of:␊
-          bluebird<=1.0.0␊
-          bluebird>=3.0.0␊
+          bluebird==1.0.0␊
+          bluebird==3.0.0␊
 ␊
       And because you require one of:␊
           albatross<2.0.0␊
           albatross>2.0.0␊
-      and you require bluebird>=2.0.0,<3.0.0, we can conclude that the requirements are unsatisfiable.
+      and you require bluebird>=2.0.0,<3.0.0, we can conclude that the requirements are unsatisfiable.␊

Eh2406 commented 5 months ago

can't rely on hash map/set iteration order regardless of hashing function.

I think you're technically correct, the best kind of correct. std probably makes no guarantees about its iteration order, and it's allowed to add a "shuffle before iterating" as an implementation detail without regard to which hasher is used. Given that we don't rely on iteration order for correctness, only for ease of testability, I feel happy leaving that in "theoretical problem land" until bigger harms are demonstrated or std actually adds that code.

That is to say, if there is a reason to reduce our reliance on the iteration order, wonderful (performance, an available bug demonstrating inconsistency, correctness, code clarity...). But without one of those, I'm still willing to trust that things are deterministic enough.

To be clear, I have no clue whether the last remaining uses of RandomState in pubgrub are the source of the non-determinism we're seeing.

Can I suggest running repeatedly with extensive logging until you see two different results, and then comparing where the logs first diverge? Even more thoroughly you could record traces with rr until you have two different outcomes, and step through until the recordings diverge.

the dependency retrieval that returns a hash map https://github.com/pubgrub-rs/pubgrub/blob/4a483710a32318ece97e0de54096147291a3fc5b/src/solver.rs#L258C8-L258C24

After many layers of type aliases the map is https://github.com/pubgrub-rs/pubgrub/blob/release/src/type_aliases.rs#L6 so that's unlikely to be the problem.

If that's indeed the problem, I think we could fix it by using an indexed map. Would be nice to confirm that's the source.

That is a very good suggestion: make a branch changing all Hash* to use IndexMap, and see if your nondeterminism goes away.

if the unit propagation buffer is turned into a hash map (with non-deterministic order), it may have the same effect.

Yes. Definitely. If that code goes non-deterministic, testing is going to be very painful.

If it's helpful, here's the diff for the snapshot for one of our flaking tests

That is really interesting. Is it possible to diff the DerivationTree? That could help isolate if the difference is happening in resolution or if it's happening in reporting the error.

BurntSushi commented 5 months ago

Can I suggest running repeatedly with extensive logging until you see two different results, and then comparing where the logs first diverge?

Yeah that was the first thing I tried. But it is spectacularly difficult to reproduce unfortunately. I've tried repeating the flaky tests (on our end) 10,000 times in multiple different ways, but can't force it to fail. It just randomly fails in CI or locally. In my experience, the only pattern I tend to see is that it fails when I run tests for the first time after I've "been away" for a while. (e.g., When I start work in the morning.) But that could be a phantom.

That's when I switched gears to this method with the hope it would provoke something.

My next step is probably to indeed add some kind of extra logging about the internal state of things when the test fails. (With the derivation tree indeed being something I'd want to record.) But this will require some non-trivial work on our end, I think, just because of how the test works. So I'm basically trying the "easier" things first to track it down.

The other thing on my mind to try is to take the approach in this PR and make it possible to toggle at a finer level of granularity. i.e., Replace individual hashmaps one-at-a-time with hashmaps using explicitly random order until I provoke the test failure. For example, with this PR change, I can provoke a test failure on our internal code reliably. (Although the diff in that case is larger than the one @zanieb shared above.)

zanieb commented 5 months ago

@Eh2406

That is really interesting. Is it possible to diff the DerivationTree? That could help isolate if the difference is happening in resolution or if it's happening in reporting the error.

I added a debug print of the derivation tree to our snapshots and it looks like it does not change! I think this suggests the non-determinism is either in simplify or in walking the tree? I find the trailing newline in the diff really interesting: where could that come from?

Eh2406 commented 5 months ago

It's good to hear that it's not in resolution proper. That's code I know pretty well, and I couldn't think of what could cause this. I believe you have improved both the error reporting and the simplify code from what is in this repo.

The new line is indeed intriguing. Nothing in this repo generates "we can conclude that", so I'm flying real blind. Do you have more than one code path that could generate that? Do you have code paths that generate or remove new lines? Are you joining the output from a vec of strings, and maybe there's an empty string at the end?
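For illustration of that last guess, a trailing empty element is enough to produce the extra newline when joining:

fn main() {
    // A Vec of lines that ends with an empty string picks up a trailing
    // separator when joined, which would show up as the extra "␊" in the diff.
    let with_trailing = vec!["line one", "line two", ""];
    let without_trailing = vec!["line one", "line two"];

    assert_eq!(with_trailing.join("\n"), "line one\nline two\n");
    assert_eq!(without_trailing.join("\n"), "line one\nline two");
}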

The changed version requirements do make me suspicious of simplify. The requirements may or may not be the same depending on what versions are available. Are there versions below 1 or higher than 3? I know you made improvements to simplify, but I'm reasonably sure that this repo's implementation is deterministic as long as the inputs are. Could it be the iterator you provide to simplify? Perhaps it's not sorted, or it sorted differently between builds?

Now that we know it's not the resolver proper, perhaps add logging to every function call involved in error reporting? Perhaps even log every argument to important functions? This sure is a tricky one.

zanieb commented 5 months ago

This bug was internal to our resolver.