pypy / pypy

PyPy is a very fast and compliant implementation of the Python language.
https://pypy.org

aarch64: cProfile overhead is very high on aarch64 compared to x86 #4945

Open wash-amzn opened 5 months ago

wash-amzn commented 5 months ago

Given a dummy program (test.py):

import datetime
i = 0
while i < 2**20:
  i = i + 1
  datetime.datetime(2024, 1, 2, 3, 4, 5)

There is a ~5x slowdown on x86 versus a ~250x slowdown on aarch64 when using cProfile. The runs below use the Docker container for PyPy 7.3.16 (Python 3.10.14).

root@5a4b5a544a95:/# arch
x86_64

root@5a4b5a544a95:/# time pypy test.py

real    0m0.077s
user    0m0.026s
sys 0m0.026s

root@5a4b5a544a95:/# pypy -m cProfile -s time test.py | head -n 15
         14681013 function calls (14680999 primitive calls) in 0.346 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1048579    0.078    0.000    0.274    0.000 datetime.py:1588(__new__)
        1    0.077    0.077    0.354    0.354 test.py:1(<module>)
  7340067    0.053    0.000    0.053    0.000 {built-in function _operator.index}
  1048581    0.051    0.000    0.082    0.000 datetime.py:392(_check_date_fields)
  1048581    0.051    0.000    0.081    0.000 datetime.py:405(_check_time_fields)
  2097331    0.017    0.000    0.017    0.000 {built-in function isinstance}
  1048581    0.008    0.000    0.008    0.000 datetime.py:57(_days_in_month)
  1048581    0.008    0.000    0.008    0.000 datetime.py:422(_check_tzinfo_arg)
        1    0.001    0.001    0.001    0.001 {built-in function marshal.loads}
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:770(create_module)
root@42758cf4bb3c:/# arch
aarch64

root@42758cf4bb3c:/# time pypy3 test.py

real    0m0.043s
user    0m0.033s
sys 0m0.011s

root@42758cf4bb3c:/# pypy3 -m cProfile -s time test.py | head -n 15
         14681013 function calls (14680999 primitive calls) in 11.151 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  1048579    2.699    0.000   11.039    0.000 datetime.py:1588(__new__)
  7340067    2.637    0.000    2.637    0.000 {built-in function _operator.index}
  1048581    1.910    0.000    3.417    0.000 datetime.py:392(_check_date_fields)
  1048581    1.908    0.000    3.415    0.000 datetime.py:405(_check_time_fields)
  2097331    0.754    0.000    0.754    0.000 {built-in function isinstance}
        1    0.486    0.486   11.528   11.528 test.py:1(<module>)
  1048581    0.377    0.000    0.377    0.000 datetime.py:57(_days_in_month)
  1048581    0.377    0.000    0.377    0.000 datetime.py:422(_check_tzinfo_arg)
        1    0.001    0.001    0.001    0.001 {built-in function marshal.loads}
        1    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:770(create_module)

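The overhead ratios above can also be measured in-process, without shelling out to `time`; a minimal sketch of that comparison (the workload size and timing scaffolding are mine, not from the original report, and the resulting ratio will vary by platform and interpreter):

```python
import cProfile
import datetime
import time

def workload(n=50_000):
    # scaled-down version of the loop in test.py
    i = 0
    while i < n:
        i = i + 1
        datetime.datetime(2024, 1, 2, 3, 4, 5)

# un-profiled baseline
t0 = time.perf_counter()
workload()
plain = time.perf_counter() - t0

# same workload run under cProfile
prof = cProfile.Profile()
t0 = time.perf_counter()
prof.runcall(workload)
profiled = time.perf_counter() - t0

print(f"plain: {plain:.3f}s  profiled: {profiled:.3f}s  "
      f"slowdown: {profiled / plain:.1f}x")
```

Running this under both interpreters on both architectures gives the same kind of slowdown figure as the shell sessions, minus interpreter startup time.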
Compare this with CPython (I used the 3.12.3 container), where there is almost no slowdown at all (to be fair, its non-profiled execution time is an order of magnitude higher than PyPy's):

root@be01f2efb0f0:/# arch
x86_64

root@be01f2efb0f0:/# time python test.py

real    0m0.405s
user    0m0.395s
sys 0m0.010s

root@be01f2efb0f0:/# time python -m cProfile -s time test.py | head -n 15
         479 function calls (472 primitive calls) in 0.451 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.450    0.450    0.451    0.451 test.py:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method _imp.create_dynamic}
      5/2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:480(_call_with_frames_removed)
        7    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1593(find_spec)
       10    0.000    0.000    0.000    0.000 {built-in method posix.stat}
       32    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:126(_path_join)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1240(_find_spec)
      2/1    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:1349(_find_and_load)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
root@4e8150585108:/# arch
aarch64

root@4e8150585108:/# time python test.py

real    0m0.349s
user    0m0.349s
sys 0m0.000s

root@4e8150585108:/# python -m cProfile -s time test.py | head -n 15
         479 function calls (472 primitive calls) in 0.452 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.451    0.451    0.452    0.452 test.py:1(<module>)
        1    0.000    0.000    0.000    0.000 {built-in method _imp.create_dynamic}
       10    0.000    0.000    0.000    0.000 {built-in method posix.stat}
        7    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1593(find_spec)
       32    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:126(_path_join)
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap>:1240(_find_spec)
      2/1    0.000    0.000    0.001    0.001 <frozen importlib._bootstrap>:1349(_find_and_load)
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {built-in method _io.open_code}
        2    0.000    0.000    0.000    0.000 <frozen importlib._bootstrap_external>:1491(_get_spec)

For a larger application I was attempting to profile with cProfile, I not only saw the slower execution on aarch64 but also large differences in the relative ordering of the top functions. That ordering might have been legitimate, but I don't feel comfortable trusting the numbers given the large difference in profiling overhead between x86 and aarch64.

I don't expect that I will have the time to track this down, so I at least wanted to get this posted in case someone else does.
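Much of cProfile's cost is per-event dispatch into the profiler on every call and return, so a do-nothing `sys.setprofile` hook gives a rough lower bound on that dispatch overhead, separate from what `_lsprof` itself does; a sketch (note that on PyPy, installing a profile hook also changes what the JIT can do, so this separates the two costs only imperfectly):

```python
import datetime
import sys
import time

def workload(n=50_000):
    for _ in range(n):
        datetime.datetime(2024, 1, 2, 3, 4, 5)

# baseline without any profiling hook installed
t0 = time.perf_counter()
workload()
plain = time.perf_counter() - t0

# same workload with a do-nothing profile hook; the remaining
# slowdown is (mostly) pure per-event dispatch cost
def noop(frame, event, arg):
    pass

sys.setprofile(noop)
try:
    t0 = time.perf_counter()
    workload()
    hooked = time.perf_counter() - t0
finally:
    sys.setprofile(None)

print(f"dispatch-only slowdown: {hooked / plain:.1f}x")
```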

mattip commented 4 months ago

Are you using an aarch64 host, or are you hosting the aarch64 docker image on an x86_64 machine? We did fix an issue (#4939) with the GC nursery size on aarch64; perhaps the detection code is still failing under emulation. Can you report the nursery size on both runs (from the comment)?

pypy3.9 -c 'import gc; print(gc.get_stats().nursery_size)'
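A guarded variant of that one-liner, so the same snippet can be pasted into either interpreter for comparison (`nursery_size` is a PyPy-specific statistic; CPython's `gc.get_stats()` returns per-generation dicts and has no nursery concept):

```python
import gc
import sys

if sys.implementation.name == "pypy":
    # PyPy exposes the nursery size on its gc statistics object
    nursery = gc.get_stats().nursery_size
    print("nursery size:", nursery)
else:
    # CPython has no nursery; report that instead of failing
    nursery = None
    print("not running on PyPy; nursery_size is unavailable")
```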
washley commented 4 months ago

This is on a real aarch64 machine (Graviton3). This happens with or without the improved nursery size.