pantsbuild / pants

The Pants Build System
https://www.pantsbuild.org
Apache License 2.0
3.31k stars 636 forks source link

Optimize Pants's data structures by not sorting as much (`FrozenOrderedSet` and `FrozenDict`) #14719

Open Eric-Arellano opened 2 years ago

Eric-Arellano commented 2 years ago

As the author of both FrozenDict and FrozenOrderedSet, I think I was misguided in thinking that deterministic ordering actually matters for the engine. Per https://github.com/pantsbuild/pants/pull/14715#discussion_r820997077, I believe that all we need is a deterministic __eq__ and __hash__ - the underlying ordering of the elements should be irrelevant and abstracted from the engine.

the underlying ordering of the elements should be irrelevant and abstracted from the engine.

The only place I think ordering should matter is things like error messages, if we want to alphabetize for example. But we can call sorted() at the callsites.

--

Because both FrozenDict and FrozenOrderedSet consider the order of elements in their __eq__, we have to eagerly call sort in lots of places, which is wasted computation. For example:

https://github.com/pantsbuild/pants/blob/94921a8edb9b142dff471784aef9da22b58cd0ab/src/python/pants/backend/python/dependency_inference/module_mapper.py#L136-L139

We must still use FrozenDict rather than dict so that it's hashable, but I don't think we need to care anymore about ordering.

I wonder if we even need OrderedSet and FrozenOrderedSet...can we use set and frozenset? Even if we keep those classes, can we stop using them use much and do the simpler + faster thing of stdlib?

--

To implement this change, I recommend:

  1. Fix the __eq__ implementations for FrozenDict and {Frozen,}OrderedSet
  2. Audit calls to sorted() and the class property sort_input, remove when possible.
jyggen commented 2 years ago

I've started to look into this one.

Am I wrong thinking that __eq__ for FrozenDict and {Frozen,}OrderedSet can simply be self._data == other._data and self._items == other._items respectively since Python supports dict comparisons natively?

FrozenDict also implements __lt__, and < on dicts isn't supported - so if that needs to be optimized as well we need another solution for that. Counter(self.items()) < Counter(other.items()) seems consistent regardless of key order, so maybe that's an option.

Eric-Arellano commented 2 years ago

I've started to look into this one.

🎉

Am I wrong thinking that eq for FrozenDict and {Frozen,}OrderedSet can simply be self._data == other._data and self._items == other._items respectively since Python supports dict comparisons natively?

Not at all wrong. That's I think what we should do.

FrozenDict also implements lt

Hm I don't remember why we did this. I think so that we can sort dataclasses that compose FrozenDicts. But maybe that's not necessary? I'd recommend:

  1. do whatever will keep semantics identical for now for __lt__, even if not optimal
  2. fix __eq__
  3. then, in a new PR, try changing the __lt__ semantics and see how CI does. Possibly get rid of __lt__
Eric-Arellano commented 2 years ago

Thinking more about this...order should probably matter for FrozenOrderedSet! It's in the name! Instead, we should be using frozenset rather than FrozenOrderedSet most of the time.

Ordering should not matter for FrozenDict.

stuhood commented 2 years ago

Instead, we should be using frozenset rather than FrozenOrderedSet most of the time.

Maybe? See https://github.com/pantsbuild/pants/issues/14195#issuecomment-1114087237 re: sets.

If something needs to iterate over a set, it should use FrozenOrderedSet. And assuming that the inputs to that FrozenOrderedSet are deterministic (because they themselves were not produced by iterating over non-Ordered sets), then you're good to not sort the inputs.