Open cheshire opened 10 months ago
Hi, @nluehr @cheshire
I would like to ask whether this issue has been resolved. I changed `::absl::flat_hash_map<Key, Value>;` to `::absl::btree_map<Key, Value>;` to fix the iteration order of `StableHashMap`.
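For reference, here is a minimal sketch of why an ordered map gives a stable iteration order. It is not the actual XLA code: `StableMap` and `KeysInIterationOrder` are hypothetical names, and `std::map` stands in for `absl::btree_map` so the snippet builds without Abseil. Both containers iterate in sorted key order, whereas the iteration order of `absl::flat_hash_map` is unspecified.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the alias discussed above. std::map replaces
// absl::btree_map here (an assumption made so the sketch needs no Abseil);
// both iterate in sorted key order.
template <typename Key, typename Value>
using StableMap = std::map<Key, Value>;

// Returns the keys in the order the map iterates over them. For an ordered
// map this depends only on the keys, not on the order they were inserted.
std::vector<std::string> KeysInIterationOrder(
    const StableMap<std::string, int>& m) {
  std::vector<std::string> keys;
  for (const auto& kv : m) {
    keys.push_back(kv.first);
  }
  return keys;
}
```

Note that this yields sorted-key order rather than insertion order, so any code that relied on insertion order would see different (though still deterministic) behavior.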
I've observed additional non-deterministic behavior during the autosharding pass in the OSS version of XLA. Could you explain why the solver parameter string is specifically restricted to `PLATFORM_GOOGLE`? Additionally, are there any other known instances of non-determinism associated with the autosharding pass that I should be aware of?
For a simple DNN that consists of 3 MLP layers, I've noticed that each time I execute the autosharding pass, the sharding strategy for the layers shows minor variations. For instance, on certain runs, both `fc3` and `fc1` are sharded across four devices, while on other occasions, only `fc3` is sharded.
@pratikfegade thoughts?
I don't see anything that's obviously going wrong here. While there can be other sources of non-determinism as mentioned above, the solver is a big one. Could we verify that the solver is reaching completion and optimality in the above case? Meanwhile, I can run the OSS version to try to repro the non-determinism myself.
@mmoffitt for visibility as well
We should be able to remove those `PLATFORM_GOOGLE` guards ... that should definitely help with the determinism issues.
I'll attempt to update the code today.
In order to support autosharding in OSS, we need to fix the non-determinism issues from using `absl::flat_hash_map`. The iteration order for `StableHashMap` in `auto_sharding_strategy.h` needs to become deterministic (internally it follows the insertion order, backed by a linked hash map).
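To illustrate the insertion-order behavior mentioned above, here is a rough sketch of a linked hash map built only from the standard library. `InsertionOrderedMap` and `InsertionOrder` are hypothetical names for illustration and are not the internal implementation: a `std::list` holds entries in insertion order, and a `std::unordered_map` indexes into that list for lookup.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical insertion-ordered map (a sketch, not the XLA type).
template <typename Key, typename Value>
class InsertionOrderedMap {
 public:
  using Entry = std::pair<const Key, Value>;
  using iterator = typename std::list<Entry>::iterator;
  using const_iterator = typename std::list<Entry>::const_iterator;

  InsertionOrderedMap() = default;
  // Copying is disabled: the index stores iterators into entries_, which a
  // member-wise copy would leave pointing at the source object's list.
  InsertionOrderedMap(const InsertionOrderedMap&) = delete;
  InsertionOrderedMap& operator=(const InsertionOrderedMap&) = delete;

  // Returns the value for `key`. A missing key is appended at the end with a
  // value-initialized Value; an existing key keeps its original position.
  Value& operator[](const Key& key) {
    auto it = index_.find(key);
    if (it == index_.end()) {
      entries_.emplace_back(key, Value());
      it = index_.emplace(key, std::prev(entries_.end())).first;
    }
    return it->second->second;
  }

  bool contains(const Key& key) const { return index_.count(key) > 0; }
  std::size_t size() const { return entries_.size(); }

  // Iteration walks the list only, so it never depends on hash ordering.
  iterator begin() { return entries_.begin(); }
  iterator end() { return entries_.end(); }
  const_iterator begin() const { return entries_.begin(); }
  const_iterator end() const { return entries_.end(); }

 private:
  std::list<Entry> entries_;                 // entries in insertion order
  std::unordered_map<Key, iterator> index_;  // key -> position in entries_
};

// Returns the keys in iteration order, which here is insertion order.
std::vector<std::string> InsertionOrder(
    const InsertionOrderedMap<std::string, int>& m) {
  std::vector<std::string> keys;
  for (const auto& entry : m) {
    keys.push_back(entry.first);
  }
  return keys;
}
```

The list is used because `std::list` iterators stay valid as further elements are appended, so the stored positions never need to be rebuilt. The cost is an extra allocation per entry compared with a flat hash map.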