tensorflow / profiler

A profiling and performance analysis tool for TensorFlow
Apache License 2.0

No step marker observed and hence the step time is unknown #578

Open pritamdodeja opened 1 year ago

pritamdodeja commented 1 year ago

Consider Stack Overflow for getting support using TensorBoard—they have a larger community with better searchability:

https://stackoverflow.com/questions/tagged/tensorboard

Do not use this template for setup, installation, or configuration issues. Instead, use the “installation problem” issue template:

https://github.com/tensorflow/tensorboard/issues/new?template=installation_problem.md

To report a problem with TensorBoard itself, please fill out the remainder of this template.

Environment information (required)

Please run diagnose_tensorboard.py (link below) in the same environment from which you normally run TensorFlow/TensorBoard, and paste the output here:

https://raw.githubusercontent.com/tensorflow/tensorboard/master/tensorboard/tools/diagnose_tensorboard.py

Diagnostics

Diagnostics output

``````
--- check: autoidentify
INFO: diagnose_tensorboard.py version 516a2f9433ba4f9c3a4fdb0f89735870eda054a1

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='71d6fe811d18', release='6.0.5-200.fc36.x86_64', version='#1 SMP PREEMPT_DYNAMIC Wed Oct 26 15:55:21 UTC 2022', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.11.0
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
INFO: installed: tensorflow-estimator==2.11.0
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.11.0'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '2.11.0'
INFO: tensorflow.__git_version__: 'v2.11.0-rc2-17-gd5b57ca93e5'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.8/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.6.1'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
socket.SOCK_STREAM =
socket.AI_ADDRCONFIG =
socket.AI_PASSIVE =
Loopback flags:
Loopback infos: [(, , 6, '', ('127.0.0.1', 0))]
Wildcard flags:
Wildcard infos: [(, , 6, '', ('0.0.0.0', 0)), (, , 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): '71d6fe811d18'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=805882112, st_dev=51, st_nlink=2, st_uid=0, st_gid=0, st_size=6, st_atime=1677293427, st_mtime=1677293598, st_ctime=1677293598)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.8/dist-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==1.3.0
anyio==3.6.2
argon2-cffi==21.3.0
argon2-cffi-bindings==21.2.0
asttokens==2.1.0
astunparse==1.6.3
attrs==22.1.0
backcall==0.2.0
beautifulsoup4==4.11.1
bleach==5.0.1
cachetools==5.2.0
certifi==2022.9.24
cffi==1.15.1
charset-normalizer==2.1.1
contourpy==1.0.6
cycler==0.11.0
debugpy==1.6.3
decorator==5.1.1
defusedxml==0.7.1
entrypoints==0.4
executing==1.2.0
fastjsonschema==2.16.2
flatbuffers==22.10.26
fonttools==4.38.0
gast==0.4.0
google-auth==2.14.1
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.50.0
gviz-api==1.10.0
h5py==3.7.0
idna==3.4
importlib-metadata==5.0.0
importlib-resources==5.10.0
ipykernel==5.1.1
ipython==8.6.0
ipython-genutils==0.2.0
ipywidgets==8.0.2
jedi==0.17.2
Jinja2==3.1.2
jsonschema==4.17.0
jupyter==1.0.0
jupyter-client==7.4.7
jupyter-console==6.4.4
jupyter-core==5.0.0
jupyter-http-over-ws==0.0.8
jupyter-server==1.23.2
jupyterlab-pygments==0.2.2
jupyterlab-widgets==3.0.3
keras==2.11.0
kiwisolver==1.4.4
libclang==14.0.6
Markdown==3.4.1
MarkupSafe==2.1.1
matplotlib==3.6.2
matplotlib-inline==0.1.6
mistune==2.0.4
nbclassic==0.4.8
nbclient==0.7.0
nbconvert==7.2.5
nbformat==4.4.0
nest-asyncio==1.5.6
notebook==6.5.2
notebook-shim==0.2.2
numpy==1.23.4
oauthlib==3.2.2
opt-einsum==3.3.0
packaging==21.3
pandocfilters==1.5.0
parso==0.7.1
pexpect==4.8.0
pickleshare==0.7.5
Pillow==9.3.0
pip==20.2.4
pkgutil-resolve-name==1.3.10
platformdirs==2.5.4
prometheus-client==0.15.0
prompt-toolkit==3.0.32
protobuf==3.19.6
psutil==5.9.4
ptyprocess==0.7.0
pure-eval==0.2.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycparser==2.21
Pygments==2.13.0
pyparsing==3.0.9
pyrsistent==0.19.2
python-dateutil==2.8.2
pyzmq==24.0.1
qtconsole==5.4.0
QtPy==2.3.0
requests==2.28.1
requests-oauthlib==1.3.1
rsa==4.9
Send2Trash==1.8.0
setuptools==65.5.1
six==1.16.0
sniffio==1.3.0
soupsieve==2.3.2.post1
stack-data==0.6.1
tensorboard==2.11.0
tensorboard-data-server==0.6.1
tensorboard-plugin-profile==2.11.1
tensorboard-plugin-wit==1.8.1
tensorflow-cpu==2.11.0
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.27.0
termcolor==2.1.0
terminado==0.17.0
tinycss2==1.2.1
tornado==6.2
traitlets==5.5.0
typing-extensions==4.4.0
urllib3==1.26.12
wcwidth==0.2.5
webencodings==0.5.1
websocket-client==1.4.2
Werkzeug==2.2.2
wheel==0.34.2
widgetsnbextension==4.0.3
wrapt==1.14.1
zipp==3.10.0
``````

Next steps

No action items identified. Please copy ALL of the above output,
including the lines containing only backticks, into your GitHub issue
or comment. Be sure to redact any sensitive information.
For browser-related issues, please additionally specify:

image

Issue description

I am running a very standard example of the TensorBoard callback (code below) and getting the "No step marker observed" issue.

import tensorflow as tf
import datetime
mnist = tf.keras.datasets.mnist

(x_train, y_train),(x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

def create_model():
  return tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28), name='layers_flatten'),
    tf.keras.layers.Dense(512, activation='relu', name='layers_dense'),
    tf.keras.layers.Dropout(0.2, name='layers_dropout'),
    tf.keras.layers.Dense(10, activation='softmax', name='layers_dense_2')
  ])

model = create_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

log_dir = "logs/fit/" + datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=log_dir, histogram_freq=1, profile_batch=(1,50))

model.fit(x=x_train, 
          y=y_train, 
          epochs=5, 
          validation_data=(x_test, y_test), 
          callbacks=[tensorboard_callback])

Please describe the bug as clearly as possible. How can we reproduce the problem without additional resources (including external data files and proprietary Python modules)?

Step markers are either not getting logged by Keras or are not being read by tensorboard. I would expect this information to be logged so that I can use the module for optimizing tf.data usage. The environment this is run in is a standard tensorflow docker container, with the only additional package installed being tensorboard_plugin_profile.

@foxik has suggested this is a protobuf version issue and that upgrading to 3.20.3 fixed a similar issue for him. It didn't fix it for me; I am attaching the logs from both versions, before and after the upgrade. I originally opened the issue at https://github.com/tensorflow/tensorboard/issues/6210 - @bmd3k asked me to recreate it here with all the information consolidated.

logs.oldprotobuf 2.zip

logs.protobuf.3.20.3.zip

foxik commented 1 year ago

Hi,

I retried my experiment and I actually did a slightly different thing -- I installed tensorflow==2.12.0rc0 (which brought tensorboard==2.12.0) and then tensorboard-plugin-profile==2.11.1, and finally downgraded to protobuf==3.20.3. This allows me to open profile runs created by both TF 2.11 and TF 2.12.0rc0.

JustASquid commented 1 year ago

I'm running into the same issue. The workaround suggested by @foxik didn't work for me either. Are there any suggestions for other workarounds? Profiling is currently not possible for our model, and trying to figure out which custom layer is causing the issue is not feasible.

pritamdodeja commented 1 year ago

@JustASquid do you have the flexibility to run on a slightly older version of tf*? I was able to get this to work by doing that. I can share my config with you later today in case that's a viable option.

JustASquid commented 1 year ago

@pritamdodeja I did a run in version 2.10. I no longer get this warning, but the training step markers are wrong:

image

This run was with profile_batch="10,20" with a 30-batch epoch.

Could this be related to the same issue?

pritamdodeja commented 1 year ago

> @pritamdodeja I did a run in version 2.10, I no longer get this warning, but the training step markers are wrong:
>
> image
>
> This run was with profile_batch="10,20" with a 30-batch epoch.
>
> Could this be related to the same issue?

@JustASquid the original symptom I faced was that the profiler wasn't available, with the message related to the step markers shown in the screenshot above. Do you see that the profiler is available to you in tensorboard? Try running the reproducible example I have put as a snippet above and see what results you get.

JustASquid commented 1 year ago

@pritamdodeja to clarify, the warning no longer shows up after downgrading from Tensorflow 2.11 to Tensorflow 2.10 for the training run.

But the issue now is that the step numbers are all wrong, as you can see from the x-axis, which shows incorrect step numbers and the very strange "spiking". Could this be related to #266, perhaps?

pritamdodeja commented 1 year ago

@JustASquid It looks like the same issue to me. I don't know enough about protocol buffers yet to debug it effectively, though. If/when that changes, I will post back here with an update.

pritamdodeja commented 1 year ago

@JustASquid I just tested this issue on the following configuration and it's still broken. Things are actually worse now, as you cannot go back to an older tf version because of the cudnn dependency :( The profiler no longer shows up. If I get the time, I'm going to do a deep dive on the tensorboard profiler and protocol buffers. I'm using the latest protobuf but setting

export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python

$ pip freeze | grep tensor
tensorboard==2.14.0
tensorboard-data-server==0.7.1
tensorboard-plugin-profile==2.13.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.13.0
tensorflow-data-validation==1.13.0
tensorflow-estimator==2.13.0
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.33.0
tensorflow-metadata==1.13.1
tensorflow-model-analysis==0.44.0
tensorflow-serving-api==2.12.2
tensorflow-transform==1.13.0
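
For anyone comparing environments in this thread, the same information can be gathered from Python. This is only a sketch using the standard library: `profiler_environment` is a hypothetical helper name, and the keyword filter simply mirrors the `grep tensor` above with `protobuf` added. As far as I know, the protobuf implementation variable is read when protobuf is first imported, so it would need to be set before importing tensorflow or tensorboard.

```python
import os
from importlib import metadata


def profiler_environment(keywords=("tensor", "protobuf")):
    """Collect installed package versions whose name contains a keyword.

    Hypothetical helper: roughly `pip freeze | grep tensor`, plus the
    current value of PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION (None if unset).
    """
    versions = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and any(keyword in name.lower() for keyword in keywords):
            versions[name] = dist.version
    return {
        "packages": dict(sorted(versions.items())),
        "protobuf_impl": os.environ.get("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION"),
    }


if __name__ == "__main__":
    info = profiler_environment()
    for name, version in info["packages"].items():
        print(f"{name}=={version}")
    print("PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION =", info["protobuf_impl"])
```

Pasting this output alongside a report makes it easier to tell whether two people are actually running the same tensorflow / plugin / protobuf combination.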

pritamdodeja commented 1 year ago

I was able to understand why this is happening. The profiler is writing its data to a different place in the hierarchy. Once that is fixed, and the profile duration is long enough, the step marker issue goes away for me. I will provide details in the next day or so.

pritamdodeja commented 1 year ago

@JustASquid @foxik Here is my understanding of the possible cause of this:

Let's say you usually run tensorboard --logdir model_run to start tensorboard

tensorboard expects plugins/profile to exist in model_run/<run number>/<train|validation>

Starting with tensorflow 2.12 (possibly earlier) plugins/profile is instead appearing at model_run/<run number>

As a result, tensorboard does not see the profile data and does not activate the profiler in the UI. You can rectify this manually by copying the data using

cp -Rpv ../plugins .

in model_run/<run number>/<train|validation>

and refresh tensorboard, it should start seeing the profiler.
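
The manual copy above can also be scripted. The following is a minimal sketch, assuming the layout described in this comment (a `plugins` directory directly under the run directory, with `train` and/or `validation` sub-directories next to it). The function name `copy_profile_into_subruns` is hypothetical, not part of tensorboard or tensorflow.

```python
import shutil
from pathlib import Path


def copy_profile_into_subruns(run_dir, subruns=("train", "validation")):
    """Copy run_dir/plugins into each existing sub-run directory.

    Rough Python equivalent of running `cp -Rpv ../plugins .` inside
    run_dir/train and run_dir/validation. Returns the destinations written.
    """
    run_dir = Path(run_dir)
    source = run_dir / "plugins"
    if not source.is_dir():
        return []  # nothing to copy: no profile data at the run level

    copied = []
    for name in subruns:
        subrun = run_dir / name
        if not subrun.is_dir():
            continue  # this run has no such sub-run
        destination = subrun / "plugins"
        # dirs_exist_ok (Python 3.8+) lets a repeated run merge instead of failing
        shutil.copytree(source, destination, dirs_exist_ok=True)
        copied.append(destination)
    return copied
```

For example, `copy_profile_into_subruns("model_run/<run number>")` with the real run directory substituted in, followed by a refresh of tensorboard. The original `plugins` directory is left in place, so the copy can be deleted again if it does not help.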

If I had to guess what introduced the change/error, I would say it's somewhere in the vicinity of

tensorflow/tensorflow/core/profiler/convert/xplane_to_tools_data.cc

More specifically, in the tensorflow repo, I suspect the following might be helpful to figure out what exactly broke this

git diff 7a500e 4d4873 tensorflow/tensorflow/core/profiler/convert/xplane_to_tools_data.h

My use case is in the context of a tfx pipeline, but I believe this applies to other use cases where profiling is happening. Your log_dir and hierarchy will likely be different, but in relative terms the problem should be the same.
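
Since the hierarchy differs between setups, a quick way to see where the profile data actually landed is to search the log directory for `plugins/profile`. This is a sketch only: `find_profile_dirs` is a hypothetical helper, and it reports the layout without judging which location tensorboard expects.

```python
from pathlib import Path


def find_profile_dirs(logdir):
    """Return every plugins/profile directory under logdir, relative to it."""
    logdir = Path(logdir)
    return sorted(
        path.relative_to(logdir)
        for path in logdir.rglob("profile")
        if path.is_dir() and path.parent.name == "plugins"
    )


if __name__ == "__main__":
    # "model_run" is the example logdir name used in this comment.
    for relative_path in find_profile_dirs("model_run"):
        print(relative_path)
```

If the paths printed sit directly under the run directory rather than under `train` or `validation`, that matches the situation described above.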

Gaura commented 1 year ago

Hello,

Thanks for raising and discussing the issue. I am facing the same issue. Could you tell me if this is resolved?

Thanks.

stellarpower commented 6 months ago

In my case, I can obtain stats for example code similar to what @pritamdodeja provided above, but not when I switch to my own loss function, which I am trying to debug (and which runs okay). I get the impression the core profiler is not outputting those markers, as they don't appear to be in the protobuf file, so I have opened here.