tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

With Azure Blob Storage, I always get "No dashboards are active for the current data set." #5638

Open ben-omji opened 2 years ago

ben-omji commented 2 years ago

Environment information (required)

Diagnostics

Diagnostics output

```
--- check: autoidentify
INFO: diagnose_tensorboard.py version e43767ef2b648d0d5d57c00f38ccbd38390e38da

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=10, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='a601721015eb', release='5.4.0-74-generic', version='#83-Ubuntu SMP Sat May 8 02:35:39 UTC 2021', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.8.0
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
INFO: installed: tf-estimator-nightly==2.8.0.dev2021122109
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.8.0'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '2.8.0'
INFO: tensorflow.__git_version__: 'v2.8.0-rc1-32-g3f878cff5b6'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.8/dist-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.6.1'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
socket.SOCK_STREAM =
socket.AI_ADDRCONFIG =
socket.AI_PASSIVE =
Loopback flags:
Loopback infos: [(, , 6, '', ('127.0.0.1', 0))]
Wildcard flags:
Wildcard infos: [(, , 6, '', ('0.0.0.0', 0)), (, , 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'a601721015eb'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: os.stat(...): os.stat_result(st_mode=16895, st_ino=73738, st_dev=88, st_nlink=2, st_uid=0, st_gid=0, st_size=2, st_atime=1648610374, st_mtime=1648613527, st_ctime=1648613527)
INFO: mode: 0o40777

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.8/dist-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==1.0.0
astunparse==1.6.3
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.11
flatbuffers==2.0
gast==0.5.3
google-auth==2.6.0
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
grpcio==1.43.0
h5py==3.6.0
idna==3.3
importlib-metadata==4.10.1
keras==2.8.0
Keras-Preprocessing==1.1.2
libclang==13.0.0
Markdown==3.3.6
numpy==1.22.1
oauthlib==3.2.0
opt-einsum==3.3.0
pip==20.2.4
protobuf==3.19.4
pyasn1==0.4.8
pyasn1-modules==0.2.8
requests==2.27.1
requests-oauthlib==1.3.1
rsa==4.8
setuptools==60.7.0
six==1.16.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-cpu==2.8.0
tensorflow-io==0.24.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
typing-extensions==4.0.1
urllib3==1.26.8
Werkzeug==2.0.2
wheel==0.34.2
wrapt==1.13.3
zipp==3.7.0
```

Issue description

I can list the model checkpoint directory using TensorFlow's gfile API together with tensorflow-io.

tf.io.gfile.listdir('az://rndstoragesample/containersample/efficientdet-finetune/ckpt')
['best_objective.txt', 'checkpoint', 'config.yaml', 'events.out.tfevents.1648598968.my-pipeline-ckh9c-3384085500', 'events.out.tfevents.1648609048.my-pipeline-ks2sj-2126585855', 'graph.pbtxt', 'model.ckpt-0.data-00000-of-00001' ... ... ]

But when I pass the Azure Blob Storage path to TensorBoard as the logdir, I always get "No dashboards are active for the current data set."

root@a601721015eb:~# tensorboard --logdir az://rndstoragesample/containersample/efficientdet-finetune/ckpt --bind_all
TensorBoard 2.8.0 at http://a601721015eb:6006/ (Press CTRL+C to quit)
W0330 03:53:26.889187 140443207591680 projector_plugin.py:489] Failed reading "az://rndstoragesample/containersample/efficientdet-finetune/ckpt/model.ckpt-178"


Reproduction steps

I installed tensorflow-io with pip and set the access key in an environment variable:

pip install tensorflow-io
export TF_AZURE_STORAGE_KEY="<my-key>"

Then I checked the connection to the storage account with a Python script:

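import tensorflow as tf
import tensorflow_io  # importing this makes az:// paths usable with tf.io.gfile

account_name = 'rndstoragesample'
pathname = 'az://{}/aztest'.format(account_name)
# Create a test container, then list the checkpoint directory.
tf.io.gfile.mkdir(pathname)
tf.io.gfile.listdir('az://rndstoragesample/containersample/efficientdet-finetune/ckpt')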

The connection looks good, so I tried pointing TensorBoard at the blob directory:

tensorboard --logdir az://rndstoragesample/containersample/efficientdet-finetune/ckpt --bind_all

I got the log below and ended up on the "No dashboards are active for the current data set." page.

TensorBoard 2.8.0 at http://a601721015eb:6006/ (Press CTRL+C to quit)
W0330 03:53:26.889187 140443207591680 projector_plugin.py:489] Failed reading "az://rndstoragesample/containersample/efficientdet-finetune/ckpt/model.ckpt-178"

Does anyone have an idea what could be causing this problem?

Thanks in advance.
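
One quick sanity check for a "No dashboards are active" page is whether the directory listing contains anything that looks like an event file. The sketch below assumes that TensorBoard's directory scan recognizes event files by the substring `tfevents` in the file name; `find_event_files` is a hypothetical helper written for illustration, not a TensorBoard function, and the list is the (truncated) listing quoted in the issue description.

```python
import posixpath


def find_event_files(names):
    """Return the entries whose base name contains 'tfevents'.

    Assumption: this mirrors how the logdir scan decides that a file is an
    event file. It is an illustration, not TensorBoard's own code.
    """
    return [n for n in names if "tfevents" in posixpath.basename(n)]


# Part of the listing from the issue description (the original is truncated).
listing = [
    "best_objective.txt",
    "checkpoint",
    "config.yaml",
    "events.out.tfevents.1648598968.my-pipeline-ckh9c-3384085500",
    "events.out.tfevents.1648609048.my-pipeline-ks2sj-2126585855",
    "graph.pbtxt",
    "model.ckpt-0.data-00000-of-00001",
]

event_files = find_event_files(listing)
print(len(event_files), "event file(s) found")
for name in event_files:
    print(" ", name)
```

If the listing contains event files, as it does here, a missing dashboard suggests that the files are not being found or read through the `az://` path, not that they are absent.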

japie1235813 commented 2 years ago

Without being able to reproduce the case myself, I can only offer a couple of guesses: the error occurs in this line, when TF tries to load the checkpoints

ben-omji commented 2 years ago

@japie1235813 Thank you for your comment. As you said, I checked my checkpoint file, and it contains model.ckpt-178 as the model_checkpoint_path. Then I checked that model.ckpt-178 exists in .../ckpt/. It really is there; I just left out the files listed after model.ckpt-0.data-00000-of-00001 in my log.

In fact, with AWS S3 I get everything I expect in TensorBoard. With Azure Blob Storage I get nothing, even though the code is exactly the same apart from the storage access information.
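
The check described above can be sketched in a few lines. In the current TensorFlow checkpoint format, a name such as `model.ckpt-178` is a prefix and not a single file: the data lives in `model.ckpt-178.index` and `model.ckpt-178.data-*` (plus `.meta` when written by a TF1 `Saver`). The sample `checkpoint` file contents and the directory listing below are hypothetical stand-ins, since the thread does not show them, and both helper names are made up for illustration.

```python
import re


def latest_checkpoint_prefix(state_text):
    """Extract model_checkpoint_path from the text of a 'checkpoint' file."""
    match = re.search(
        r'^model_checkpoint_path:\s*"([^"]+)"', state_text, re.MULTILINE
    )
    return match.group(1) if match else None


def checkpoint_files(prefix, names):
    """Return the files in `names` that belong to the checkpoint `prefix`."""
    return sorted(
        n
        for n in names
        if n in (prefix + ".index", prefix + ".meta")
        or n.startswith(prefix + ".data-")
    )


# Hypothetical contents of the 'checkpoint' state file (not from the thread).
sample_state = '''model_checkpoint_path: "model.ckpt-178"
all_model_checkpoint_paths: "model.ckpt-0"
all_model_checkpoint_paths: "model.ckpt-178"
'''

# Hypothetical directory listing (not from the thread).
sample_listing = [
    "checkpoint",
    "model.ckpt-0.data-00000-of-00001",
    "model.ckpt-0.index",
    "model.ckpt-178.data-00000-of-00001",
    "model.ckpt-178.index",
    "model.ckpt-178.meta",
]

prefix = latest_checkpoint_prefix(sample_state)
print(prefix)
print(checkpoint_files(prefix, sample_listing))
```

With a real bucket, the listing would come from `tf.io.gfile.listdir` and the state text from reading the `checkpoint` file with `tf.io.gfile.GFile`.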

japie1235813 commented 2 years ago

Thanks for providing the information.

ben-omji commented 2 years ago

Hi, this is the log after I added "az://" to https://github.com/tensorflow/tensorboard/blob/master/tensorboard/util/io_util.py#L20 and built it with bazel build tensorboard:tensorboard

root@a601721015eb:~/tensorboard# ./bazel-bin/tensorboard/tensorboard --logdir az://rndstoragesample/containersample2/efficientdet-finetune/ckpt --bind_all --verbosity=1
I0404 01:20:52.332980 139840485025600 program.py:489] Note: --load_fast behavior only supports local and GCS (gs://) paths; falling back to slower Python-only load path.
I0404 01:20:52.333134 139840485025600 plugin_event_multiplexer.py:106] Event Multiplexer initializing.
I0404 01:20:52.333190 139840485025600 plugin_event_multiplexer.py:126] Event Multiplexer done initializing
I0404 01:20:52.362562 139840485025600 data_ingester.py:128] Launching reload in a daemon thread
I0404 01:20:52.363056 139838087120640 data_ingester.py:102] TensorBoard reload process beginning
I0404 01:20:52.363506 139838087120640 plugin_event_multiplexer.py:203] Starting AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
TensorBoard 2.9.0a0 at http://a601721015eb:6006/ (Press CTRL+C to quit)
I0404 01:20:57.545112 139838087120640 plugin_event_multiplexer.py:209] Done with AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:20:57.545384 139838087120640 data_ingester.py:105] TensorBoard reload process: Reload the whole Multiplexer
I0404 01:20:57.545490 139838087120640 plugin_event_multiplexer.py:214] Beginning EventMultiplexer.Reload()
I0404 01:20:57.545643 139838087120640 plugin_event_multiplexer.py:257] Reloading runs serially (one after another) on the main thread.
I0404 01:20:57.545749 139838087120640 plugin_event_multiplexer.py:267] Finished with EventMultiplexer.Reload()
I0404 01:20:57.545852 139838087120640 data_ingester.py:110] TensorBoard done reloading. Load took 5.183 secs
I0404 01:21:02.551295 139838087120640 data_ingester.py:102] TensorBoard reload process beginning
I0404 01:21:02.551645 139838087120640 plugin_event_multiplexer.py:203] Starting AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:07.626314 139838087120640 plugin_event_multiplexer.py:209] Done with AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:07.626531 139838087120640 data_ingester.py:105] TensorBoard reload process: Reload the whole Multiplexer
I0404 01:21:07.626597 139838087120640 plugin_event_multiplexer.py:214] Beginning EventMultiplexer.Reload()
I0404 01:21:07.626689 139838087120640 plugin_event_multiplexer.py:257] Reloading runs serially (one after another) on the main thread.
I0404 01:21:07.626765 139838087120640 plugin_event_multiplexer.py:267] Finished with EventMultiplexer.Reload()
I0404 01:21:07.626820 139838087120640 data_ingester.py:110] TensorBoard done reloading. Load took 5.076 secs
I0404 01:21:10.935636 139838078727936 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:10] "GET / HTTP/1.1" 200 -
I0404 01:21:10.965151 139838078727936 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:10] "GET /index.js?_file_hash=07fcc25b HTTP/1.1" 200 -
I0404 01:21:10.968884 139838070335232 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:10] "GET /font-roboto/oMMgfZMQthOryQo9n22dcuvvDin1pK8aKteLpeZ5c0A.woff2 HTTP/1.1" 200 -
I0404 01:21:11.520380 139838078727936 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /icon_bundle.svg HTTP/1.1" 200 -
I0404 01:21:11.554227 139838078727936 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /font-roboto/RxZJdnzeo3R5zSexge8UUZBw1xU1rKptJj_0jans920.woff2 HTTP/1.1" 200 -
I0404 01:21:11.561268 139838070335232 application.py:435] Plugin listing: is_active() for scalars took 0.000 seconds
I0404 01:21:11.561539 139838070335232 application.py:435] Plugin listing: is_active() for custom_scalars took 0.000 seconds
I0404 01:21:11.561725 139838070335232 application.py:435] Plugin listing: is_active() for images took 0.000 seconds
I0404 01:21:11.561880 139838070335232 application.py:435] Plugin listing: is_active() for audio took 0.000 seconds
I0404 01:21:11.562028 139838070335232 application.py:435] Plugin listing: is_active() for debugger-v2 took 0.000 seconds
I0404 01:21:11.562177 139838070335232 application.py:435] Plugin listing: is_active() for graphs took 0.000 seconds
I0404 01:21:11.562328 139838070335232 application.py:435] Plugin listing: is_active() for distributions took 0.000 seconds
I0404 01:21:11.562465 139838070335232 application.py:435] Plugin listing: is_active() for histograms took 0.000 seconds
I0404 01:21:11.562606 139838070335232 application.py:435] Plugin listing: is_active() for text took 0.000 seconds
I0404 01:21:11.562738 139838070335232 application.py:435] Plugin listing: is_active() for pr_curves took 0.000 seconds
I0404 01:21:11.562874 139838070335232 application.py:435] Plugin listing: is_active() for profile_redirect took 0.000 seconds
I0404 01:21:11.563013 139838070335232 application.py:435] Plugin listing: is_active() for hparams took 0.000 seconds
I0404 01:21:11.563209 139838070335232 application.py:435] Plugin listing: is_active() for mesh took 0.000 seconds
I0404 01:21:11.563377 139838070335232 application.py:435] Plugin listing: is_active() for timeseries took 0.000 seconds
I0404 01:21:11.564099 139838070335232 application.py:435] Plugin listing: is_active() for projector took 0.001 seconds
I0404 01:21:11.564291 139838070335232 application.py:435] Plugin listing: is_active() for whatif took 0.000 seconds
I0404 01:21:11.566493 139838070335232 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /data/plugins_listing HTTP/1.1" 200 -
I0404 01:21:11.569489 139838053549824 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /data/environment HTTP/1.1" 200 -
I0404 01:21:11.571566 139838045157120 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /data/runs HTTP/1.1" 200 -
I0404 01:21:11.621406 139838045157120 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /data/runs HTTP/1.1" 200 -
I0404 01:21:11.622396 139838070335232 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /data/environment HTTP/1.1" 200 -
I0404 01:21:11.735285 139838045157120 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /font-roboto/d-6IYplOFocCacKzxwXSOJBw1xU1rKptJj_0jans920.woff2 HTTP/1.1" 200 -
I0404 01:21:11.736663 139838070335232 _internal.py:225] ::ffff:192.168.15.87 - - [04/Apr/2022 01:21:11] "GET /font-roboto/vPcynSL0qHq_6dX7lKVByXYhjbSpvc47ee6xR_80Hnw.woff2 HTTP/1.1" 200 -
I0404 01:21:12.631995 139838087120640 data_ingester.py:102] TensorBoard reload process beginning
I0404 01:21:12.632247 139838087120640 plugin_event_multiplexer.py:203] Starting AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:17.706724 139838087120640 plugin_event_multiplexer.py:209] Done with AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:17.707039 139838087120640 data_ingester.py:105] TensorBoard reload process: Reload the whole Multiplexer
I0404 01:21:17.707193 139838087120640 plugin_event_multiplexer.py:214] Beginning EventMultiplexer.Reload()
I0404 01:21:17.707406 139838087120640 plugin_event_multiplexer.py:257] Reloading runs serially (one after another) on the main thread.
I0404 01:21:17.707566 139838087120640 plugin_event_multiplexer.py:267] Finished with EventMultiplexer.Reload()
I0404 01:21:17.707705 139838087120640 data_ingester.py:110] TensorBoard done reloading. Load took 5.076 secs
I0404 01:21:22.712980 139838087120640 data_ingester.py:102] TensorBoard reload process beginning
I0404 01:21:22.713310 139838087120640 plugin_event_multiplexer.py:203] Starting AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:22.720707 139838087120640 plugin_event_multiplexer.py:209] Done with AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:22.720935 139838087120640 data_ingester.py:105] TensorBoard reload process: Reload the whole Multiplexer
I0404 01:21:22.721066 139838087120640 plugin_event_multiplexer.py:214] Beginning EventMultiplexer.Reload()
I0404 01:21:22.721239 139838087120640 plugin_event_multiplexer.py:257] Reloading runs serially (one after another) on the main thread.
I0404 01:21:22.721373 139838087120640 plugin_event_multiplexer.py:267] Finished with EventMultiplexer.Reload()
I0404 01:21:22.721486 139838087120640 data_ingester.py:110] TensorBoard done reloading. Load took 0.009 secs
W0404 01:21:26.998456 139838061942528 projector_plugin.py:489] Failed reading "az://rndstoragesample/containersample2/efficientdet-finetune/ckpt/model.ckpt-1607"
I0404 01:21:27.726734 139838087120640 data_ingester.py:102] TensorBoard reload process beginning
I0404 01:21:27.727046 139838087120640 plugin_event_multiplexer.py:203] Starting AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:32.892737 139838087120640 plugin_event_multiplexer.py:209] Done with AddRunsFromDirectory: az://rndstoragesample/containersample2/efficientdet-finetune/ckpt
I0404 01:21:32.892977 139838087120640 data_ingester.py:105] TensorBoard reload process: Reload the whole Multiplexer
I0404 01:21:32.893059 139838087120640 plugin_event_multiplexer.py:214] Beginning EventMultiplexer.Reload()
I0404 01:21:32.893168 139838087120640 plugin_event_multiplexer.py:257] Reloading runs serially (one after another) on the main thread.
I0404 01:21:32.893251 139838087120640 plugin_event_multiplexer.py:267] Finished with EventMultiplexer.Reload()
I0404 01:21:32.893320 139838087120640 data_ingester.py:110] TensorBoard done reloading. Load took 5.167 secs

Unfortunately, it doesn't work for me. I also tried to find anything unexpected in the source code related to this function, but I could not find anything suspicious.
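
For readers following along, the change described earlier in this comment (adding "az://" to `io_util.py`) amounts to something like the sketch below. It assumes that the referenced line belongs to a prefix test that decides whether a path is treated as a cloud path and therefore joined with `/`. The names `CLOUD_PREFIXES`, `is_cloud_path`, and `path_separator` are illustrative and are not copied from TensorBoard.

```python
import os

# Assumption: the prefixes already recognized, plus the added "az://".
CLOUD_PREFIXES = ("gs://", "s3://", "/cns/", "az://")


def is_cloud_path(path):
    """Return True if `path` starts with one of the known cloud prefixes."""
    return path.startswith(CLOUD_PREFIXES)


def path_separator(path):
    """Cloud paths always use '/', local paths use the OS separator."""
    return "/" if is_cloud_path(path) else os.sep


logdir = "az://rndstoragesample/containersample2/efficientdet-finetune/ckpt"
print(is_cloud_path(logdir))
print(path_separator(logdir))
```

A check of this kind only affects how paths are joined. It does not change whether the underlying filesystem can list or read the files, which may be why the patch alone made no visible difference.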

We have moved to AWS for now because of the urgency of this project. However, if you have any more ideas about this issue, I'll try them to help resolve it.
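
A verbose log like the one above is easier to scan once it is reduced to warnings and reload timings. The following is a minimal sketch that assumes every line follows the layout visible in the log (severity letter, date, time, thread id, `file:line]`, message). The regular expressions and the `summarize` helper are written for this illustration only, and the three sample lines are copied from the log.

```python
import re

# Layout assumed from the log above: "I0404 01:21:22.721486 <thread> file.py:110] msg"
LOG_LINE = re.compile(
    r"^(?P<level>[IWEF])(?P<date>\d{4}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) "
    r"(?P<thread>\d+) (?P<source>[^\s:]+):(?P<line>\d+)\] (?P<message>.*)$"
)
LOAD_TIME = re.compile(r"Load took ([\d.]+) secs")


def summarize(lines):
    """Collect warning/error messages and reload durations from log lines."""
    warnings, load_times = [], []
    for raw in lines:
        match = LOG_LINE.match(raw.strip())
        if not match:
            continue  # skip lines that are not in the assumed layout
        if match["level"] in "WEF":
            warnings.append(match["message"])
        took = LOAD_TIME.search(match["message"])
        if took:
            load_times.append(float(took.group(1)))
    return warnings, load_times


sample = [
    "I0404 01:21:22.721486 139838087120640 data_ingester.py:110] TensorBoard done reloading. Load took 0.009 secs",
    'W0404 01:21:26.998456 139838061942528 projector_plugin.py:489] Failed reading "az://rndstoragesample/containersample2/efficientdet-finetune/ckpt/model.ckpt-1607"',
    "I0404 01:21:32.893320 139838087120640 data_ingester.py:110] TensorBoard done reloading. Load took 5.167 secs",
]

warnings, load_times = summarize(sample)
print(warnings)
print(load_times)
```

Applied to the full log, this surfaces the single projector warning and the reload durations, including the 0.009 second reload that stands out from the others at about 5 seconds.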