python / cpython

The Python programming language
https://www.python.org
Other
62.16k stars 29.88k forks source link

CPython profiler broken with TensorFlow 2.17.0 code in Python 3.12.1+ #122029

Open Yingqin-SU opened 1 month ago

Yingqin-SU commented 1 month ago

Bug report

Bug description:

Lately, I've been testing IA code on various Python interpreters and their corresponding profilers across multiple platforms. After multiple attempts, I've noticed that CPython profilers consistently fail to analyze the following code.

import tensorflow as tf

mnist = tf.keras.datasets.mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train = tf.keras.utils.normalize(x_train, axis=1)
x_test = tf.keras.utils.normalize(x_test, axis=1)

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))
model.add(tf.keras.layers.Dense(10, activation=tf.nn.softmax))

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

model.fit(x_train, y_train, epochs=2)

val_loss, val_acc = model.evaluate(x_test, y_test)
print(val_loss, val_acc)

model.save("epic_num_reader.keras")

predictions = model.predict([x_test])

print(predictions)

I've been testing an IA code on different Python interpreters and their respective profilers across multiple platforms. Specifically, I've been working with a simple TensorFlow+Keras code that classifies number image inputs. Interestingly, I found that the code works well with IntelPython, which uses Python 3.9.19 as its latest version. When I tested the code on multiple versions of CPython, I noticed that the profiler works well and returns information for CPython versions less than 3.12.1. However, since CPython 3.12.1, the code crashes with an error.

2024-07-19 14:53:51.996897: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-07-19 14:53:51.997294: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-19 14:53:51.999400: I external/local_xla/xla/tsl/cuda/cudart_stub.cc:32] Could not find cuda drivers on your machine, GPU will not be used.
2024-07-19 14:53:52.005233: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-07-19 14:53:52.014843: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-07-19 14:53:52.017590: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-07-19 14:53:52.025009: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 615, in <module>
    main()
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 604, in main
    runctx(code, globs, None, options.outfile, options.sort)
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 101, in runctx
    return _Utils(Profile).runctx(statement, globals, locals, filename, sort)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 64, in runctx
    prof.runctx(statement, globals, locals)
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 424, in runctx
    exec(cmd, globals, locals)
  File "mnist_number.py", line 1, in <module>
    import tensorflow as tf
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/__init__.py", line 47, in <module>
    from tensorflow._api.v2 import __internal__
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/_api/v2/__internal__/__init__.py", line 8, in <module>
    from tensorflow._api.v2.__internal__ import autograph
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/_api/v2/__internal__/autograph/__init__.py", line 8, in <module>
    from tensorflow.python.autograph.core.ag_ctx import control_status_ctx # line: 34
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/autograph/core/ag_ctx.py", line 21, in <module>
    from tensorflow.python.autograph.utils import ag_logging
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/autograph/utils/__init__.py", line 17, in <module>
    from tensorflow.python.autograph.utils.context_managers import control_dependency_on_returns
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/autograph/utils/context_managers.py", line 19, in <module>
    from tensorflow.python.framework import ops
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/ops.py", line 50, in <module>
    from tensorflow.python.eager import context
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/eager/context.py", line 37, in <module>
    from tensorflow.python.eager import execute
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/eager/execute.py", line 21, in <module>
    from tensorflow.python.framework import dtypes
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 308, in <module>
    resource = DType(types_pb2.DT_RESOURCE)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 81, in __init__
    self._handle_data = handle_data
    ^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 209, in trace_dispatch_i
    if self.dispatch[event](self, frame, t):
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/.pyenv/versions/3.12.1/lib/python3.12/profile.py", line 293, in trace_dispatch_return
    assert frame is self.cur[-2].f_back, ("Bad return", self.cur[-3])
AssertionError: ('Bad return', ('/home/user/.pyenv/versions/3.12.1/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', 1, '<module>'))

I ran the code on my laptop, which has a Tiger Lake architecture and no NVIDIA GPU or Tensor Cores. As I didn't recompile the TensorFlow library. Therefore, it's expected to see warning messages related to the lack of AVX512 and GPU acceleration.

Test Environnement: WSL2 Ubuntu 20.04

For Python versions prior to 3.12.1, I only received warning messages, and the profiler worked as expected. However, since upgrading to 3.12.1, I've started encountering AssertError issues. Interestingly, I've compared the profile.py file between versions 3.12.0 and 3.12.1, and they appear to be identical. It's possible that the introduction of PEP 695 in Python 3.12 is causing this occasional error.

While waiting for your response, I wish you a good day.

Aaron SU

CPython versions tested on:

3.9, 3.10, 3.11, 3.12

Operating systems tested on:

Linux, Windows

Linked PRs

gaogaotiantian commented 1 month ago

Spent a few hours in this issue and I think I know at least part of the issue. The fundamental cause of the assertion failure was that a function call was not recorded, but the return was.

import sys
from tensorflow import DType
from tensorflow.core.framework import types_pb2

def trace(frame, event, arg):
    if event == "call" or event == "c_call":
        print(event, frame, frame.f_code)
    elif event == "return" or event == "c_return" or event == "c_exception":
        print(event, frame, frame.f_code)

sys.setprofile(trace)
DType(types_pb2.DT_RESOURCE)
sys.setprofile(None)

Will generate (subtract warnings)

call <frame at 0x7fed2f4039f0, file '/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', line 73, code __init__> <code object __init__ at 0x7fecdeaec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73>
c_return <frame at 0x7fed2f4039f0, file '/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', line 74, code __init__> <code object __init__ at 0x7fecdeaec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73>
return <frame at 0x7fed2f4039f0, file '/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py', line 81, code __init__> <code object __init__ at 0x7fecdeaec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73>
c_call <frame at 0x7fecd70d9900, file '/home/gaogaotiantian/programs/bilibili_video/scrabble.py', line 33, code <module>> <code object <module> at 0x211a330, file "/home/gaogaotiantian/programs/bilibili_video/scrabble.py", line 1>

A call event is missing before the c_return event. That's because our tracing system does not consider that as a C call (notice this is not specific to profile module, I'm using sys.settrace for the demonstration).

The reason is that, it's a method with a C im_func - I don't think this happens in CPython, but maybe for legacy reasons that's supported.

You can catch the relevant events with sys.monitoring:

import sys
from tensorflow import DType
from tensorflow.core.framework import types_pb2

E = sys.monitoring.events

sys.monitoring.use_tool_id(0, "test")

def call_callback(code, instruction_offset, callable, arg0):
    print("call", code, instruction_offset, callable, arg0)

def return_callback(code, instruction_offset, callable, arg0):
    print("return", code, instruction_offset, callable, arg0)

sys.monitoring.register_callback(0, E.CALL, call_callback)
sys.monitoring.register_callback(0, E.C_RETURN, return_callback)

sys.monitoring.set_events(0, E.CALL)
DType(types_pb2.DT_RESOURCE)
sys.monitoring.set_events(0, E.NO_EVENTS)
call <code object <module> at 0x287e330, file "/home/gaogaotiantian/programs/bilibili_video/scrabble.py", line 1> 422 <class 'tensorflow.python.framework.dtypes.DType'> 20
call <code object __init__ at 0x7f99b1eec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73> 18 <class 'super'> <object object at 0xb977f0>
return <code object __init__ at 0x7f99b1eec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73> 18 <class 'super'> <object object at 0xb977f0>
call <code object __init__ at 0x7f99b1eec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73> 24 <bound method PyCapsule.__init__ of tf.unknown dtype enum (54238092)_ref> 20
return <code object __init__ at 0x7f99b1eec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73> 24 <built-in method __init__ of PyCapsule object at 0x7f99b1f77330> <dtype: 'resource'>
return <code object <module> at 0x287e330, file "/home/gaogaotiantian/programs/bilibili_video/scrabble.py", line 1> 422 <class 'tensorflow.python.framework.dtypes.DType'> 20
call <code object <module> at 0x287e330, file "/home/gaogaotiantian/programs/bilibili_video/scrabble.py", line 1> 498 <built-in function set_events> 0

Notice the call event

call <code object __init__ at 0x7f99b1eec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73> 24 <bound method PyCapsule.__init__ of tf.unknown dtype enum (54238092)_ref> 20

and it's correspoding c_return event

return <code object __init__ at 0x7f99b1eec620, file "/home/gaogaotiantian/programs/bilibili_video/venv3.12/lib/python3.12/site-packages/tensorflow/python/framework/dtypes.py", line 73> 24 <built-in method __init__ of PyCapsule object at 0x7f99b1f77330> <dtype: 'resource'>

They have different callables - because CALL instruction stripped the method after sending call event, before c_return event. call event believed that this is a method so sys.settrace won't generate any events, but return believed it is a C function so it generated the event.

Even though we kind of knew the cause of the issue, fixing it is not trivial. The most important thing - we can easily repro this without 3rd party libraries, because CPython does not do this anymore.

I don't know the exact next step, so maybe @markshannon can give some suggestions?

gaogaotiantian commented 1 month ago

I found a pure CPython repro and I'll try to fix this.

markshannon commented 1 month ago

@gaogaotiantian Thanks for doing the triaging and analysis.

We definitely shouldn't be claiming we are calling one object and then calling another. One fix would be to fire the event after we've deconstructed the bound method. I'm not sure how intrusive that would be, though.

Maybe would should apply your fix to 3.12 and try the more intrusive fix for 3.13 onward?

gaogaotiantian commented 1 month ago

Right, so there are multiple conflicts in this issue.

  1. For sys.monitoring (the fundamental layer)
    • Do we consider the method to be the object we are calling, or the function hidden behind it? We got the function behind the bound method before calling it in CALL, but not CALL_FUNCTION_EX - I think we'll get different behavior there.
    • We need to emit CALL event (CALL can be referring to both the sys.monitoring event or the instruction here) even for Python functions or methods. Do we want to emit the actual function (not the bound one) in this event?
  2. For sys.setprofile
    • I believe for backwards compatible reason, we emit call event (omg another call) on the actual C function when it's a method descriptor, so that's where we explicitly did this kind of things.

Overall, I believe we need clear rules for both

markshannon commented 1 month ago

Do we consider the method to be the object we are calling, or the function hidden behind it?

I don't think it matters that much. These two are basically the same thing:

What is important is that the events match what actually happens.

If we make sure that sys.monitoring is consistent, then sys.setprofile and sys.settrace automatically do the right thing. Likewise, CALL and C_RETURN will act as expected if the callable passed to sys.monitoring is the actual object that is called.

Generally, we want to unpack bound methods first, because it keeps the C stack use consistent regardless of how functions are called. That way a slight change in a call doesn't blow the stack. But I don't think we want to make CALL_FUNCITON_EX even larger just for this special case.

markshannon commented 1 month ago

https://github.com/python/cpython/pull/122177 handles the CALL instruction, but we'll need another PR for CALL_KW and maybe a third for CALL_FUNCTION_EX. Overall, that is probably too big a change for 3.12 or 3.13.

@gaogaotiantian do you want to backport https://github.com/python/cpython/pull/122072 to 3.13 and 3.12 and we'll do the more intrusive fix for 3.14 onwards?

gaogaotiantian commented 1 month ago

Okay I can backport #122072 to 3.13 and 3.12.

gaogaotiantian commented 1 month ago

Oops, Greg merged the PR. I thought Mark meant to rebase the change to 3.13 and leave 3.14 alone so I left it there to work on it later. I accidentally request a lot of reviews because I naively tried to re-target the PR to 3.13.

However, I don't think it's that bad. We can still have the more reasonable 3.14 fix in the interpreter instead of in sys.setprofile. We have the test case from the PR and we can remove the extra code in this PR if everything else is ready.