tensorflow / profiler

A profiling and performance analysis tool for TensorFlow
Apache License 2.0

ProfilerPluginLoader fails due to protobuf versions #609

Open Inquisitive-ME opened 1 year ago

Inquisitive-ME commented 1 year ago

Using the latest versions available from pip, I get the following error:

E0421 08:52:53.103637 140219640803328 application.py:125] Failed to load plugin ProfilePluginLoader.load; ignoring it.
Traceback (most recent call last):
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard/backend/application.py", line 123, in TensorBoardWSGIApp
    plugin = loader.load(context)
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/profile_plugin_loader.py", line 75, in load
    from tensorboard_plugin_profile import profile_plugin
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 36, in <module>
    from tensorboard_plugin_profile.convert import raw_to_tool_data as convert
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/convert/raw_to_tool_data.py", line 29, in <module>
    from tensorboard_plugin_profile.convert import input_pipeline_proto_to_gviz
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/convert/input_pipeline_proto_to_gviz.py", line 28, in <module>
    from tensorboard_plugin_profile.protobuf import input_pipeline_pb2
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/protobuf/input_pipeline_pb2.py", line 17, in <module>
    from tensorboard_plugin_profile.protobuf import diagnostics_pb2 as plugin_dot_tensorboard__plugin__profile_dot_protobuf_dot_diagnostics__pb2
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/tensorboard_plugin_profile/protobuf/diagnostics_pb2.py", line 36, in <module>
    _descriptor.FieldDescriptor(
  File "/home/richard/.virtualenvs/deep_learning/lib/python3.10/site-packages/google/protobuf/descriptor.py", line 561, in __new__
    _message.Message._CheckCalledFromGeneratedFile()
TypeError: Descriptors cannot not be created directly.
If this call came from a _pb2.py file, your generated code is out of date and must be regenerated with protoc >= 3.19.0.
If you cannot immediately regenerate your protos, some other possible workarounds are:

  1. Downgrade the protobuf package to 3.20.x or lower.
  2. Set PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python (but this will use pure-Python parsing and will be much slower).

It seems like the profiler plugin is incompatible with the latest tensorflow and tensorboard.
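
To check whether an environment is affected, one can compare the installed protobuf runtime against the 3.20.x ceiling named in the error message. The following is a minimal sketch and not part of the plugin: the helper names `rejects_legacy_stubs` and `installed_protobuf_version` are hypothetical, and it assumes that any runtime newer than 3.20.x enforces the check, as the message implies.

```python
from importlib import metadata
from typing import Optional


def rejects_legacy_stubs(version: str) -> bool:
    """Return True if this protobuf runtime version is newer than 3.20.x.

    Assumption taken from the error message: runtimes up to 3.20.x still
    accept stubs generated by an old protoc, later ones raise the TypeError.
    """
    parts = version.split(".")
    major, minor = int(parts[0]), int(parts[1])
    return (major, minor) > (3, 20)


def installed_protobuf_version() -> Optional[str]:
    """Look up the installed protobuf distribution, or None if it is absent."""
    try:
        return metadata.version("protobuf")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_protobuf_version()
    if found is None:
        print("protobuf is not installed")
    elif rejects_legacy_stubs(found):
        print(f"protobuf {found}: stubs from an old protoc will fail to import")
    else:
        print(f"protobuf {found}: legacy stubs should still load")
```

This only reports the situation; it does not change which protobuf version is installed.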

rdbis commented 1 year ago

Same observation on my setup: a clean install of Ubuntu 22.04.2 with tensorboard 2.12.2, and tensorflow and CUDA installed according to the official tensorflow instructions: https://www.tensorflow.org/install/pip?hl=en#linux

Setting PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python does not help either:

W0428 20:31:10.595216 139654082852416 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'

No profile data was found.

marcosfelt commented 1 year ago

I have the same issue!

rdbis commented 1 year ago

Ok, I think I found the root cause of the problem. It is caused by a bug in the Bazel configuration files. All profiler protobuf stubs are generated using the ancient protobuf package (3.8.0), which makes them incompatible with the protobuf stubs from tensorboard/tensorflow, as those are generated with a newer protobuf package (>= 3.19.6). Tensorboard has an explicit dependency that loads protobuf 3.19.6 for stub generation. Such a dependency is missing in the Bazel configuration for the profiler. Instead, the profiler depends on tensorflow 2.1.0, which loads protobuf 3.8.0 in tensorflow/workspace.bzl:

# 310ba5ee72661c081129eb878c1bbcec936b20f0 is based on 3.8.0 with a fix for protobuf.bzl.

PROTOBUF_URLS = [
    "https://storage.googleapis.com/mirror.tensorflow.org/github.com/protocolbuffers/protobuf/archive/310ba5ee72661c081129eb878c1bbcec936b20f0.tar.gz",
    "https://github.com/protocolbuffers/protobuf/archive/310ba5ee72661c081129eb878c1bbcec936b20f0.tar.gz",
]
PROTOBUF_SHA256 = "b9e92f9af8819bbbc514e2902aec860415b70209f31dfc8c4fa72515a5df9d59"
PROTOBUF_STRIP_PREFIX = "protobuf-310ba5ee72661c081129eb878c1bbcec936b20f0"

This makes the tensorflow profiler incompatible with all tensorboard/tensorflow releases based on protobuf >= 3.19.0.
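
Because importing the affected stubs raises the TypeError, one way to see which generated files are involved is to scan their source text instead. The sketch below is only a rough heuristic: it flags `*_pb2.py` files containing the `_descriptor.FieldDescriptor(` call that appears in the traceback earlier in this thread, under the assumption that stubs from a newer protoc no longer build descriptors this way. The function names and the marker constant are hypothetical.

```python
from pathlib import Path

# Marker taken from the traceback in this thread: the failing stub builds
# its descriptors by calling _descriptor.FieldDescriptor(...) directly.
LEGACY_MARKER = "_descriptor.FieldDescriptor("


def looks_like_legacy_stub(source: str) -> bool:
    """Heuristic: True if the generated code constructs descriptors directly."""
    return LEGACY_MARKER in source


def find_legacy_stubs(package_dir: str) -> list:
    """Return sorted paths of *_pb2.py files under package_dir that match."""
    hits = []
    for path in Path(package_dir).rglob("*_pb2.py"):
        text = path.read_text(encoding="utf-8", errors="replace")
        if looks_like_legacy_stub(text):
            hits.append(str(path))
    return sorted(hits)
```

Pointing `find_legacy_stubs` at the installed `tensorboard_plugin_profile` directory in site-packages would list the stubs that need regenerating, if the assumption holds.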

cliveverghese commented 1 year ago

https://github.com/tensorflow/profiler/pull/636 fixes this issue. You can verify that the change works by downloading tbp-nightly.

rdbis commented 1 year ago

Thanks, it looks like the fix solves the protobuf compatibility problem. However, I still cannot see any profile data in the browser. The log from tensorboard/profiler shows the following:

NOTE: Using experimental fast data loading logic. To disable, pass "--load_fast=false" and report issues on GitHub. More details: https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.14.0a20230604 at http://localhost:6006/ (Press CTRL+C to quit)
W0608 18:27:59.310396 140681651652160 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:31:01.143541 140681441916480 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:33:49.505171 140681358022208 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:35:26.398867 140681358022208 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:35:32.885195 140681358022208 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'
W0608 19:48:56.000323 140681525810752 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'

cliveverghese commented 1 year ago

Hi,

Could you provide information regarding the version of the packages installed on your system?

I don't see a possible error condition in the logs provided. Do you see any errors in the browser console?

rdbis commented 1 year ago

Sure, I can recreate this problem with the latest nightlies:

tf-nightly - 2.14.0.dev20230609
tb-nightly - 2.14.0a20230609
tbp-nightly - 2.14.0a20230609

Log from tensorboard:

2023-06-09 21:37:04.990730: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-06-09 21:37:05.429546: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

NOTE: Using experimental fast data loading logic. To disable, pass "--load_fast=false" and report issues on GitHub. More details: https://github.com/tensorflow/tensorboard/issues/4784

Serving TensorBoard on localhost; to expose to the network, use a proxy or pass --bind_all
TensorBoard 2.14.0a20230609 at http://localhost:6006/ (Press CTRL+C to quit)
W0609 21:50:52.116963 140076514244160 security_validator.py:60] In 3.0, this warning will become an error:
Illegal Content-Security-Policy for script-src: 'unsafe-inline'
Illegal Content-Security-Policy for script-src-elem: 'unsafe-inline'

Here is the tf execution log from my app:

2023-06-09 21:41:26.653283: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1637] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 19414 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3090, pci bus id: 0000:01:00.0, compute capability: 8.6
2023-06-09 21:41:26.853218: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-06-09 21:41:26.853243: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
2023-06-09 21:41:26.853269: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1671] Profiler found 1 GPUs
2023-06-09 21:41:26.859587: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-06-09 21:41:26.859639: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
Epoch 1/2
2023-06-09 21:41:28.795830: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:434] Loaded cuDNN version 8902
2023-06-09 21:41:29.981827: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:606] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2023-06-09 21:41:30.143119: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fb678a86a50 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-06-09 21:41:30.143143: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): NVIDIA GeForce RTX 3090, Compute Capability 8.6
2023-06-09 21:41:30.155607: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var MLIR_CRASH_REPRODUCER_DIRECTORY to enable.
2023-06-09 21:41:30.276619: I ./tensorflow/compiler/jit/device_compiler.h:186] Compiled cluster using XLA! This line is logged at most once for the lifetime of the process.
499/924 [===============>..............] - ETA: 1:55 - loss: 397827.8438 - accuracy: 0.2960
2023-06-09 21:43:46.324071: I tensorflow/tsl/profiler/lib/profiler_session.cc:104] Profiler session initializing.
2023-06-09 21:43:46.324092: I tensorflow/tsl/profiler/lib/profiler_session.cc:119] Profiler session started.
519/924 [===============>..............] - ETA: 1:49 - loss: 382497.2812 - accuracy: 0.2991
2023-06-09 21:43:52.028953: I tensorflow/tsl/profiler/lib/profiler_session.cc:70] Profiler session collecting data.
2023-06-09 21:43:52.030085: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_tracer.cc:1805] CUPTI activity buffer flushed
2023-06-09 21:43:52.047298: I tensorflow/compiler/xla/backends/profiler/gpu/cupti_collector.cc:541] GpuTracer has collected 2475 callback api events and 2450 activity events.
2023-06-09 21:43:52.061061: I tensorflow/tsl/profiler/lib/profiler_session.cc:131] Profiler session tear down.
2023-06-09 21:43:52.066437: I tensorflow/tsl/profiler/rpc/client/save_profile.cc:144] Collecting XSpace to repository: /home/jozef/logs/20230609-214008/plugins/profile/2023_06_09_21_43_52/jozef-desktop.xplane.pb
924/924 [==============================] - 256s 273ms/step - loss: 1571177.1250 - accuracy: 0.3012
Epoch 2/2
924/924 [==============================] - 253s 274ms/step - loss: 28949560.0000 - accuracy: 0.2877

This is the list of files created in the log directory during program execution:

./plugins/profile/2023_06_09_21_43_52/jozef-desktop.xplane.pb
./train/events.out.tfevents.1686339687.jozef-desktop.6744.0.v2

In the browser, the profiler tab shows the message "No profile data was found."

cliveverghese commented 1 year ago

What is the logdir specified when starting tensorboard?

rdbis commented 1 year ago

tensorboard --logdir ~/logs

cliveverghese commented 1 year ago

This seems like an issue with the logdir path. It should be /home/jozef/logs/20230609-214008, since the tensorflow execution receives that as the logdir for the profiling request.

You could try running tensorboard --logdir /home/jozef/logs/20230609-214008
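
To find which directory to pass as --logdir, one could search a parent directory for profile traces and walk back up to the directory that contains `plugins/profile`. This is a minimal sketch, assuming the layout seen in the log above (`<logdir>/plugins/profile/<run>/<host>.xplane.pb`); the function name `find_profile_logdirs` is hypothetical and this is not part of tensorboard.

```python
from pathlib import Path


def find_profile_logdirs(root: str) -> list:
    """Return sorted directories under root that hold plugins/profile data.

    Layout assumed from the log in this thread:
    <logdir>/plugins/profile/<run>/<host>.xplane.pb
    """
    logdirs = set()
    for trace in Path(root).rglob("*.xplane.pb"):
        run_dir = trace.parent              # .../plugins/profile/<run>
        profile_dir = run_dir.parent        # .../plugins/profile
        plugins_dir = profile_dir.parent    # .../plugins
        # Skip traces that are not stored in the expected layout.
        if profile_dir.name == "profile" and plugins_dir.name == "plugins":
            logdirs.add(str(plugins_dir.parent))
    return sorted(logdirs)


if __name__ == "__main__":
    for logdir in find_profile_logdirs(str(Path.home() / "logs")):
        print(f"tensorboard --logdir {logdir}")
```

For the files listed in this thread, searching ~/logs would be expected to yield the 20230609-214008 directory.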

rdbis commented 1 year ago

Wow, with tensorboard --logdir /home/jozef/logs/20230609-214008 it works like a charm. Thanks for this workaround. :+1: So it looks like there is an issue with how the logdir parameter is handled. Tensorboard correctly lists all the collected profile runs in its browser interface, but selecting a specific one there does not work right now. To make it work, the path to the specific profiler run must be passed to tensorboard as the logdir, right? And must tensorboard be restarted with a new logdir every time new profile data is collected?

pritamdodeja commented 1 year ago

@cliveverghese I think this is indicative of a broader issue starting in 2.12 (possibly earlier) where the location of the profile data has changed. This is breaking profiler in tensorboard. When I manually copy the files to the right place, tensorboard profiler works as expected.