python / cpython

The Python programming language
https://www.python.org

Add fast path to `deepcopy()` for empty list/tuple/dict/set #121192

Open lgeiger opened 3 months ago

lgeiger commented 3 months ago

deepcopy() can be surprisingly slow when called with empty containers like lists, tuples, dicts, sets or frozensets.

Adding a fast path for this case, similar to #114266, would speed up such calls by roughly 4x–28x while having little impact on the default path and adding little complexity. With such a patch, the following benchmarking script shows a significant speedup compared to main:

import pyperf
runner = pyperf.Runner()

setup = """
import copy

a = {"list": [1, 2, 3, 4], "t": (1, 2, 3), "str": "hello", "subdict": {"a": True}}

t = ()
fs = frozenset()
l = []
s = set()
d = {}
deep = [[], (), {}, set(), frozenset()]
"""

runner.timeit(name="deepcopy dict", stmt="b = copy.deepcopy(a)", setup=setup)
runner.timeit(name="deepcopy empty tuple", stmt="b = copy.deepcopy(t)", setup=setup)
runner.timeit(name="deepcopy empty frozenset", stmt="b = copy.deepcopy(fs)", setup=setup)
runner.timeit(name="deepcopy empty list", stmt="b = copy.deepcopy(l)", setup=setup)
runner.timeit(name="deepcopy empty set", stmt="b = copy.deepcopy(s)", setup=setup)
runner.timeit(name="deepcopy empty dict", stmt="b = copy.deepcopy(d)", setup=setup)
runner.timeit(name="deepcopy multiple empty containers", stmt="b = copy.deepcopy(deep)", setup=setup)
deepcopy dict: Mean +- std dev: [baseline] 1.86 us +- 0.06 us -> [optimize-empty-copy] 2.02 us +- 0.02 us: 1.09x slower
deepcopy empty tuple: Mean +- std dev: [baseline] 285 ns +- 2 ns -> [optimize-empty-copy] 48.4 ns +- 0.9 ns: 5.89x faster
deepcopy empty frozenset: Mean +- std dev: [baseline] 1.47 us +- 0.11 us -> [optimize-empty-copy] 49.9 ns +- 1.5 ns: 29.44x faster
deepcopy empty list: Mean +- std dev: [baseline] 323 ns +- 2 ns -> [optimize-empty-copy] 82.7 ns +- 2.5 ns: 3.91x faster
deepcopy empty set: Mean +- std dev: [baseline] 1.46 us +- 0.10 us -> [optimize-empty-copy] 85.4 ns +- 4.9 ns: 17.04x faster
deepcopy empty dict: Mean +- std dev: [baseline] 326 ns +- 4 ns -> [optimize-empty-copy] 83.3 ns +- 2.6 ns: 3.91x faster
deepcopy multiple empty containers: Mean +- std dev: [baseline] 4.13 us +- 0.04 us -> [optimize-empty-copy] 1.16 us +- 0.02 us: 3.56x faster

Geometric mean: 5.48x faster
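A minimal sketch of the kind of fast path being proposed (hypothetical and simplified; the names and structure are illustrative, not the actual patch to `copy.py`):

```python
# Hypothetical sketch of a deepcopy() fast path for empty builtin
# containers; not the actual CPython patch.
import copy

def deepcopy_with_fast_path(x, memo=None):
    cls = type(x)
    # Empty immutable containers can be returned as-is, since no caller
    # can mutate them.
    if cls in (tuple, frozenset) and not x:
        return x
    # Empty mutable containers just need a fresh empty instance.
    if cls in (list, dict, set) and not x:
        return cls()
    # Everything else falls back to the regular deepcopy machinery.
    return copy.deepcopy(x, memo)
```

The point of the split is that mutable containers must come back as distinct objects, while immutable empties can be shared safely.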

This might conflict with @eendebakpt's efforts in #91610, or the fast path could be something worth adding to the proposed C version as well.

For context, I noticed this when using pydantic models with mutable default values where pydantic would deep copy the default value upon class instantiation. E.g.:

class Foo(pydantic.BaseModel):
    bar: list[int] = []

To be fair, the proper fix in this case is not to use a mutable default value in pydantic at all, but to switch to pydantic.Field(default_factory=list), as dataclasses do, which is much faster. But there may be other scenarios where deepcopying empty containers is common.
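For comparison, a hedged sketch of the dataclasses behavior mentioned above: dataclasses refuse a bare mutable default (`bar: list = []` raises ValueError at class definition) and push you toward default_factory, which builds a fresh list per instance instead of deep-copying a shared default.

```python
from dataclasses import dataclass, field

@dataclass
class Foo:
    # A bare mutable default like `bar: list = []` would raise
    # ValueError here; default_factory constructs a new list per
    # instance, with no deepcopy involved.
    bar: list = field(default_factory=list)

f1, f2 = Foo(), Foo()
f1.bar.append(1)
# f1 and f2 do not share state: f1.bar == [1], f2.bar == []
```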

I'm happy to make a PR unless it conflicts with the efforts going on in #91610.

Linked PRs

eendebakpt commented 3 months ago

@lgeiger No worries about conflicts with my PR, I can rebase if needed. I will add benchmarks for empty containers as well (I never imagined they would actually matter).

More important is perhaps whether your patch slows down the other cases. Do you have numbers for that? And perhaps also a benchmark for the pydantic models.

lgeiger commented 3 months ago

More important is perhaps whether your patch slows down the other cases. Do you have numbers for that?

I extended the above benchmarks to add a case that doesn't include any empty iterables. It does make this case about 1.09x slower. I'm not sure what level of noise to expect from such a microbenchmark, though, since I'm currently running it on an M3 MacBook on battery power.

And perhaps also a benchmark for the pydantic models.

Unfortunately I can't add a proper benchmark for the pydantic case, since pydantic doesn't build with 3.14 yet. But here's a comparison of the example code on 3.12, which should illustrate the problem:

import pyperf
runner = pyperf.Runner()

setup = """
from pydantic import BaseModel, Field

class Foo(BaseModel):
    bar: list[int] = []
    baz: dict[str, int] = {}

class FooField(BaseModel):
    bar: list[int] = Field(default_factory=list)
    baz: dict[str, int] = Field(default_factory=dict)
"""

runner.timeit(name="init pydantic defaults", stmt="Foo()", setup=setup)
runner.timeit(name="init pydantic default factory", stmt="FooField()", setup=setup)
init pydantic defaults: Mean +- std dev: 1.19 us +- 0.01 us
init pydantic default factory: Mean +- std dev: 351 ns +- 12 ns

Profiling only the Foo() calls yields the following flamegraph, whereas FooField() never needs to call deepcopy():

[flamegraph screenshot]

lgeiger commented 3 months ago

I could reduce the slowdown of the default case to 1.06x with a slightly simpler version of the patch that performs even better for empty lists, sets and dicts but is a bit slower for tuples and frozensets (although still much faster than the baseline). Let me know if you'd like me to open a PR; it may be easier to discuss with the concrete code.

+------------------------------------+----------+------------------------+--------------------------+
| Benchmark                          | baseline | optimize-empty-copy    | less-optimize-empty-copy |
+====================================+==========+========================+==========================+
| deepcopy dict                      | 1.86 us  | 2.01 us: 1.09x slower  | 1.97 us: 1.06x slower    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty tuple               | 285 ns   | 48.7 ns: 5.85x faster  | 55.0 ns: 5.18x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty frozenset           | 1.47 us  | 49.7 ns: 29.55x faster | 73.8 ns: 19.92x faster   |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty list                | 323 ns   | 81.8 ns: 3.95x faster  | 75.1 ns: 4.30x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty set                 | 1.46 us  | 84.7 ns: 17.18x faster | 77.6 ns: 18.75x faster   |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy empty dict                | 326 ns   | 84.8 ns: 3.84x faster  | 78.0 ns: 4.18x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| deepcopy multiple empty containers | 4.13 us  | 1.17 us: 3.51x faster  | 1.32 us: 3.14x faster    |
+------------------------------------+----------+------------------------+--------------------------+
| Geometric mean                     | (ref)    | 5.47x faster           | 5.20x faster             |
+------------------------------------+----------+------------------------+--------------------------+
pochmann3 commented 3 months ago

Why check len(x) == 0 instead of not x?

lgeiger commented 3 months ago

Why check len(x) == 0 instead of not x?

Good idea, changed in ba7298c7ff9801ccac64a3d127b279de5b3524dc. This makes it slightly faster as well. In my quick benchmark the PR now makes the default case only 1.03x slower (see the updated numbers in the PR).
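For illustration, the two checks are equivalent for the builtin containers involved, so the swap is purely a micro-optimization (a sketch, not code from the commit):

```python
# `not x` and `len(x) == 0` agree for all builtin containers handled by
# the fast path, but `not x` skips the global len() lookup, the function
# call, and the comparison, making the emptiness check cheaper.
for x in ([], [1], (), (1,), {}, {"a": 1}, set(), {1}, frozenset()):
    assert (len(x) == 0) == (not x)
```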

sobolevn commented 3 months ago

I think that a 4x to 20x speedup for this quite common case is a good enough reason to accept a 1.09x slowdown on the non-empty path.

I am +1 on this.

corona10 commented 3 months ago

cc @serhiy-storchaka

corona10 commented 3 months ago

@sobolevn Out of curiosity, what is the typical case for deepcopy: non-empty containers or empty containers? I think the slowdown is not cheap if non-empty containers are deepcopied more frequently in the real world.

lgeiger commented 3 months ago

I think the slowdown is not cheap if non-empty containers are deepcopied more frequently in the real world.

@corona10 I updated #121193 and no longer see any measurable performance regression in the common case (see the updated numbers in the PR description).