tensorflow / tensorboard

TensorFlow's Visualization Toolkit
Apache License 2.0

Fast data loading feedback (`--load_fast=true`; “RustBoard”) #4784

Open wchargin opened 3 years ago

wchargin commented 3 years ago

This thread is for tracking feedback about TensorBoard’s experimental mode for fast data loading. Typical speedups range from 100× to 400×.

Who should try this: Anyone who’s found TensorBoard’s data loading to be slower than they’d like.

Who shouldn’t try this: Windows users (for now).

Feedback: Feedback form, or reply on this thread.

Try it out

To try this out, please uninstall all copies of TensorBoard and then install the latest version of tb-nightly:

pip uninstall -y tensorboard tb-nightly &&
pip install tb-nightly  # must have at least tb-nightly==2.5.0a20210316

Then, invoke TensorBoard with the --load_fast=true flag:

tensorboard --logdir /path/to/logs --load_fast true

Use TensorBoard as you usually would. It should work the same way, just faster.

Feedback

You can respond to this anonymous Google Form, or reply on this thread, or open a new issue. Let us know: did it work? how much faster was it? any suggestions or requests?

Known issues

We know about these, but please let us know if they matter for you, so that we can prioritize working on them:

FAQ

What does “data loading” include?

It includes time spent reading files in your logdir. It does not include time spent painting charts on the frontend.

What is the --load_fast flag?

Pass --load_fast=true to tell TensorBoard to use a new data loading mechanism, which is generally hundreds of times faster.

Is --load_fast=true right for me?

Currently, this mode is supported on Linux and macOS. If you are interested in using it on other platforms, ping @wchargin and I’ll show you how to build it.

Most features of TensorBoard are expected to work with the new data loading mechanism. All standard TensorBoard dashboards (scalars, images, etc.) should work, and flags like --reload_interval should work, too. You can use logdirs on local disk or on GCS buckets (public or private).

Do I need to have TensorFlow installed?

No.

What’s happening under the hood?

Instead of crawling your logdir in a mixture of Python and C++ code with a lot of locking, cross-language marshalling, and slow data manipulation in Python, we read the data in a dedicated subprocess. This program is written in Rust and is optimized for concurrent reading and serving. More design details here.

tgolsson commented 3 years ago

Hello!

Very much interested in this, as we currently maintain a custom entrypoint to make Tensorboard work at all with our data sizes. Unfortunately, I can't get this to work anywhere. Using the latest nightly docker image I get the following error:

root@15bc33cc211f:/# tensorboard --logdir foobar --load_fast=true
Error: Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" }
Traceback (most recent call last):
  File "/usr/local/bin/tensorboard", line 8, in <module>
    sys.exit(run_main())
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/main.py", line 46, in run_main
    app.run(tensorboard.main, flags_parser=tensorboard.configure)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 267, in main
    return runner(self.flags) or 0
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 283, in _run_serve_subcommand
    server = self._make_server()
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 433, in _make_server
    (data_provider, deprecated_multiplexer) = self._make_data_provider()
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 425, in _make_data_provider
    ingester.start()
  File "/usr/local/lib/python3.6/dist-packages/tensorboard/data/server_ingester.py", line 150, in start
    % popen.poll()
RuntimeError: Data server exited with 1; check stderr for details

Presumably it tries to bind some port that's already in use by another process; unfortunately it doesn't say which one.

Also, it doesn't seem to work with logdir_spec, only logdir. This isn't a huge pain, but the error message just states that I didn't pass logdir -- it should probably explicitly state that load_fast and logdir_spec are incompatible.

wchargin commented 3 years ago

@tgolsson: Hi; thank you for your feedback! I hadn’t looked into Docker at all. We bind to port 0, which requests an arbitrary free port from the OS, so it looks like it’s not a port issue but an IPv6 host issue. I’ve filed #4801 and will take a look. I’ve posted therein what I think should be a workaround, in case you’re interested in that sort of thing.

edit: Fixed in #4804; confirmed fix in Docker nightlies.
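
As a rough illustration of the port 0 behavior mentioned above: binding a socket to port 0 asks the OS to choose a free port, which the program can then read back. This is a minimal Python sketch of that general technique, not the data server's actual code (which is written in Rust); the function name `bind_free_port` and the default host are assumptions made for the example.

```python
import socket


def bind_free_port(host: str = "127.0.0.1") -> int:
    """Bind an IPv4 TCP socket to port 0 and return the port the OS assigned.

    `host` defaults to the IPv4 loopback address; this default is an
    assumption for the sketch, not something TensorBoard prescribes.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Port 0 means "pick any free port for me".
        sock.bind((host, 0))
        # getsockname() returns (address, port) for AF_INET sockets.
        return sock.getsockname()[1]
```

The returned port is only reserved while the socket stays open, so a real server would keep the socket bound and report the port to its parent. The logs later in this thread show the data server doing this by writing the port to a port file.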

Also, it doesn't seem to work with logdir_spec, only logdir. This isn't a huge pain, but the error message just states that I didn't pass logdir -- it should probably explicitly state that load_fast and logdir_spec are incompatible.

Yep. As of #4794, if you use --load_fast=auto, we’ll automatically detect unsupported invocations (including --logdir_spec) and fall back to the old codepaths. I can also try to make the error more explicit particularly for --logdir_spec. Filed #4802.

This is super helpful feedback; thank you.

brychcy commented 3 years ago

With tensorboard-plugin-profile (2.4.0) installed, I'm getting errors in the log:

Exception in thread DynamicProfilePluginIsActiveThread:
Traceback (most recent call last):
  File "/Users/till/homebrew2/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/till/homebrew2/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/till/tfnightly-py3.8/lib/python3.8/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 311, in compute_is_active
    self._is_active = any(self.generate_run_to_tools())
  File "/Users/till/tfnightly-py3.8/lib/python3.8/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 693, in generate_run_to_tools
    plugin_assets = self.multiplexer.PluginAssets(PLUGIN_NAME)
AttributeError: 'NoneType' object has no attribute 'PluginAssets'

(They disappear with --load_fast=false)

wchargin commented 3 years ago

Hi @brychcy—thanks! Yes, this is true. The profile plugin uses non-standard approaches to load its data and so won’t work out of the box with --load_fast. I’ll see if we can get it to work, but in the meantime you’ll have to either pass --load_fast=false (if you want to use the profile plugin) or uninstall the profile plugin package (if you don’t care about it and want to silence the errors).

Added a note to the “Known issues” section; thank you!

wchargin commented 3 years ago

@brychcy: I’ve sent the profiler folks a patch: https://github.com/tensorflow/profiler/issues/298

Their build appears to be pretty broken, so I’m not sure how long it will take them to integrate this and push a release.

tgolsson commented 3 years ago

@wchargin Not quite feedback, but I'm wondering if there's any thoughts on multi-directory Rustboard (--logdir dir_a,dir_b in old syntax)? I started doing the work but figured I might ask in case it was intentionally removed or there's a WIP somewhere I'm not seeing.

wchargin commented 3 years ago

@tgolsson: Good question! I was thinking of instead supporting a more general mechanism that also resolves requests like #1708. Imagine something like:

$ tensorboard daemon start
$ tensorboard daemon add dir_a
$ tensorboard --daemon --bind_all
$ tensorboard daemon add dir_b

That is, you could add or remove log directories at runtime without having to relaunch TensorBoard or discard existing loading progress, and also in a way that naturally supports remote filesystems and doesn't require setting up symlink trees.

Opened #4923 to track this, and would be happy to hear your thoughts.

Raphtor commented 3 years ago

I am getting a lot of warnings about too many open files -- is there a way to reduce or cap the number of open file descriptors?

[2021-05-11T14:31:46Z WARN rustboard_core::run] Failed to open event file EventFileBuf("[RUN NAME]"): Os { code: 24, kind: Other, message: "Too many open files" }

I don't have that many runs (~2000), so it shouldn't really be an issue. Using lsof to count the number of open FDs shows over 12k being used...

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
   6210 tokio-run
   6210 Reloader-
   1035 StdinWatc
   1035 server
   1035 Reloader
    184 gmain
    168 gdbus
    134 grpc_glob
     85 bash
     80 snapd

Compared to <500 in "slow" mode.

>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
    427 tensorboa
    184 gmain
    168 gdbus
     85 bash
     80 snapd
     72 systemd
     71 screen
     52 dconf\x20
     51 dbus-daem
     48 llvmpipe-

In my case, the "slow" mode actually loads files faster since it doesn't run into this issue.

wchargin commented 3 years ago

@Raphtor: interesting, thank you! Both the old and new codepaths keep an open fd for each event file, so I had considered this but expected it not to be a big problem. Let’s follow up in #4955.
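
The "Too many open files" error (OS error code 24) is raised when a process hits its per-process limit on open file descriptors. Below is a minimal Python sketch, assuming a Unix-like system, for inspecting that limit and estimating whether one descriptor per event file would fit. The helper names and the `headroom` value are illustrative assumptions, not part of TensorBoard.

```python
import resource
from typing import Tuple


def fd_limits() -> Tuple[int, int]:
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def fits_within_limit(num_event_files: int, soft_limit: int, headroom: int = 64) -> bool:
    """Check whether one descriptor per event file stays under the soft limit.

    `headroom` is an arbitrary allowance (an assumption of this sketch) for
    sockets, log files, and other descriptors the process needs.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return num_event_files + headroom <= soft_limit
```

For example, `fits_within_limit(2000, fd_limits()[0])` gives a quick indication of whether roughly 2000 event files could be held open under the current soft limit. It does not account for a process holding several descriptors per file.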

sjincho commented 3 years ago

Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.

It works fine if I set --load_fast=false.

8bitmp3 commented 3 years ago

Fast data loading may be causing issues with the profiler https://github.com/tensorflow/profiler/issues/344 (one of several issues mentioning this problem recently) - a possible solution for now is to switch it off with %tensorboard --logdir=logs --load_fast=false cc @Terranlee @Jimicy @yisitu

8bitmp3 commented 3 years ago

Update: try the latest Profiler plugin v2.5 (pip install tensorboard_plugin_profile (or tensorboard_plugin_profile==2.5.0)). Then, launch (e.g. %tensorboard --logdir=logs without the --load_fast switch) and select Profiler. Thanks @yisitu 👍

yisitu commented 3 years ago

You're welcome, happy to help!

jstremme commented 3 years ago

Anyone else landing here because they're following instructions from this link regarding using Tensorboard in AzureML?

yisitu commented 3 years ago

Closing as the issue has been resolved after I have released tensorboard_plugin_profile 2.5.0.

stephanwlee commented 3 years ago

Ah, we would like to keep this issue open to solicit more feedback on the feature. Reopening.

yisitu commented 3 years ago

I see, I'll assign it back to you to track the feature.

yoshipon commented 2 years ago

Hi, thank you for building this awesome function!

Is it possible to restrict the data server to communicate with only one TensorBoard process? I would appreciate it if this feature were supported, because the current data server seems to be accessible by any user on a shared server, even though TensorBoard itself can have a simple passcode by specifying --path_prefix.

an-ivanov commented 2 years ago

I've got TensorFlow output data and explore it with TensorBoard as scalars. Usually I use the RELATIVE mode of the Horizontal Axis and the graphs are displayed well. But with the --load_fast true option the graphs show the data as points (not varying along the X axis) instead of curves. The WALL mode shows only points as well. An example of my data is attached. train.zip

GeorgePearse commented 2 years ago

Hi @tgolsson really curious about your implementation for large datasets as I'm trying to get tensorboard running for a few 100K. Have you just changed the hard coded limit (100K) in the typescript and rebuilt? What other changes have you made?

tgolsson commented 2 years ago

I'm not sure what that limit is for, but I've never heard of it unfortunately. Our problem was related to having too much regular logging data (scalars, histograms/distributions, images) leading to an infinite queue of "refreshes" because they wouldn't finish before retries.

GeorgePearse commented 2 years ago

Sorry @tgolsson, keep forgetting the number of other components to Tensorboard. My problems are specific to the embedding projector, but I guess that's not what you've had to solve. Thanks anyway!

Jiayuan-Gu commented 2 years ago

Hi, I have encountered a similar issue as #5116 when I use --load_fast=true (implicitly by default). The tf events are stored at a shared file system. ReadRecordError will lead to a termination of updating the latest event for those runs. When I use --load_fast=false, except for slow loading, there are no problems.

lemairecarl commented 2 years ago

Hi, on all compute clusters using our software stack, RustBoard hangs indefinitely at startup and has to be killed (with sigkill, sigterm isn't sufficient i.e. CTRL+C doesn't work). It does not reach the point where it prints something like TensorBoard x.y.z at http://0.0.0.0:PORT/ (Press CTRL+C to quit).

Here is a sample output using -v 1:

(env) [user01@login1 8]$ tensorboard --logdir ~/projects/def-sponsor00/$USER/out --host 0.0.0.0 --port 0 -v 1
TensorFlow installation not found - running with reduced feature set.
I0210 16:25:15.920281 139757372876608 server_ingester.py:290] Server binary (from Python package v0.6.1): /home/user01/env/lib/python3.8/site-packages/tensorboard_data_server/bin/server
I0210 16:25:15.922316 139757372876608 server_ingester.py:138] Spawning data server: ['/home/user01/env/lib/python3.8/site-packages/tensorboard_data_server/bin/server', '--logdir=/home/user01/projects/def-sponsor00/user01/out', '--reload=5', '--samples-per-plugin=', '--port=0', '--port-file=/tmp/tensorboard_data_server_rd4w992q/port', '--die-after-stdin', '--error-file=/tmp/tensorboard_data_server_rd4w992q/startup_error', '--verbose', '--verbose']
[2022-02-10T16:25:15Z DEBUG rustboard_core::cli] Parsed options: Opts { logdir: "/home/user01/projects/def-sponsor00/user01/out", host: "localhost", port: 0, reload: Loop { delay: 5s }, verbosity: 2, die_after_stdin: true, port_file: Some("/tmp/tensorboard_data_server_rd4w992q/port"), error_file: Some("/tmp/tensorboard_data_server_rd4w992q/startup_error"), checksum: false, no_checksum: false, samples_per_plugin: PluginSamplingHint({}) }
I0210 16:25:15.936901 139757372876608 server_ingester.py:160] Polling for data server port (attempt 0)
I0210 16:25:15.938199 139757372876608 server_ingester.py:162] Port file contents: None
[2022-02-10T16:25:15Z TRACE mio::poll] registering event source with poller: token=Token(0), interests=READABLE | WRITABLE
[2022-02-10T16:25:15Z INFO  rustboard_core::cli] Wrote port "36186" to /tmp/tensorboard_data_server_rd4w992q/port
[2022-02-10T16:25:15Z INFO  rustboard_core::cli] Starting load cycle
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Starting load for run "7/lightning_logs/version_7"
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Starting load for run "8/lightning_logs/version_8"
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Finished load for run "8/lightning_logs/version_8" (1.99877ms)
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Finished load for run "7/lightning_logs/version_7" (3.682665ms)
[2022-02-10T16:25:15Z INFO  rustboard_core::cli] Finished load cycle (8.249155ms)
I0210 16:25:16.439103 139757372876608 server_ingester.py:160] Polling for data server port (attempt 1)
I0210 16:25:16.439600 139757372876608 server_ingester.py:162] Port file contents: '36186\n'
[2022-02-10T16:25:20Z INFO  rustboard_core::cli] Starting load cycle
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Starting load for run "7/lightning_logs/version_7"
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Starting load for run "8/lightning_logs/version_8"
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Finished load for run "7/lightning_logs/version_7" (26.968µs)
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Finished load for run "8/lightning_logs/version_8" (22.896µs)
[2022-02-10T16:25:20Z INFO  rustboard_core::cli] Finished load cycle (6.097394ms)
< The last 6 lines repeat indefinitely >

I was wondering if we could use an environment variable to set load_fast=false by default on our clusters.
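
One possible stopgap, sketched here under stated assumptions, is a small wrapper that appends a default `--load_fast` value when the user has not passed one. The environment variable name `TB_DEFAULT_LOAD_FAST` is made up for this example (TensorBoard does not read it), and the helper names are likewise hypothetical.

```python
import os
from typing import List, Mapping, Sequence

# Hypothetical variable name chosen for this sketch; TensorBoard does not read it.
ENV_VAR = "TB_DEFAULT_LOAD_FAST"


def with_default_load_fast(args: Sequence[str], env: Mapping[str, str]) -> List[str]:
    """Return args with --load_fast=<default> appended if the flag is absent."""
    result = list(args)
    already_set = any(
        arg == "--load_fast" or arg.startswith("--load_fast=") for arg in result
    )
    default = env.get(ENV_VAR)
    if already_set or not default:
        # Respect an explicit user choice, or do nothing if no default is set.
        return result
    return result + ["--load_fast=" + default]


def main(argv: Sequence[str]) -> None:
    """Replace the current process with tensorboard, applying the default."""
    command = ["tensorboard"] + with_default_load_fast(argv, os.environ)
    os.execvp(command[0], command)
```

A site could install a script that calls `main(sys.argv[1:])` and set `TB_DEFAULT_LOAD_FAST=false` cluster-wide. An explicit `--load_fast` on the command line still takes precedence.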

mhdadk commented 2 years ago

I've had trouble with the --load_fast=true flag. When running tensorboard without setting --load_fast=false, I eventually start getting the following message repeated indefinitely (I've redacted directory names and usernames as XXX):

[2022-08-06T17:40:14Z WARN  rustboard_core::run] Failed to open event file EventFileBuf("XXX/20220806_040526/20220806_040526/events.out.tfevents.1659773299.XXX.XXX.XXX"): Os { code: 24, kind: Other, message: "Too many open files" }

When I get this message, Tensorboard fails to launch. However, I no longer get this message, and Tensorboard launches normally, if I pass --load_fast=false while launching Tensorboard.

drmeerkat commented 2 years ago

I am having trouble with --load_fast on an old server. I don't have access to GLibc-2.18 or above so I have to use patchelf. After a clean install of tensorboard in Python 3.8, when I call tensorboard --logdir . --load_fast=true

/home/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /home/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server)

Then I patched server as follows,

patchelf --set-interpreter ~/scratch/mylib/glibc-2.18/lib/ld-linux-x86-64.so.2 --set-rpath ~/scratch/mylib/glibc-2.18/lib/ ~/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server

tensorboard won't complain about libc.so.6 not found anymore. Instead, it now gives me this,

TensorFlow installation not found - running with reduced feature set.
Could not start data server: failed to bind to ("localhost", 0): failed to lookup address information: Name or service not known.

Do you have any idea what could be going wrong? Thank you!

Update: I am able to use it if I install TensorBoard with conda via conda install -c conda-forge tensorboard tensorboard-data-server, since conda handles all the dependencies for me. But it would be nice if the pip-installed one also worked.

Corwinpro commented 2 years ago

Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.

It works fine if I set --load_fast=false.

Can this be considered a bug? Is there a workaround to use --load_fast=true under GKE?

Thank you!

Corwinpro commented 2 years ago

Hi! I think I have a vague understanding of how the original issue can be solved. If someone could help me a bit with the last push, I believe we should be able to use --load_fast=true under GKE.

bmd3k commented 2 years ago

Hi Corwinpro, there was some discussion about this and your PR. There are a couple folks willing to help you shepherd this into the repo. I've created a new issue for this specific error here:

https://github.com/tensorflow/tensorboard/issues/5934

Thanks for your patience and your contribution!

samos123 commented 1 year ago

authentication via default service account is indeed not working when using logdir in 2.8.0, we had to run with --load_fast=false to get it to work. Any plans to support default service account credentials? Also why was this experimental feature turned on by default?

Corwinpro commented 1 year ago

authentication via default service account is indeed not working when using logdir in 2.8.0, we had to run with --load_fast=false to get it to work. Any plans to support default service account credentials? Also why was this experimental feature turned on by default?

Hi, would you mind sharing a bit more information? I might be able to help, but I would need to know how to reproduce your issue. (I am replying here because I contributed to a similar issue in the past, but of course it is up to the repo owners to make the decision.) Thank you!

samos123 commented 1 year ago

We have a fairly exotic setup, but you might be able to reproduce it by creating a GCE VM with a custom service account that has GCS permissions, then running tensorboard --logdir gcs://your-bucket --load_fast=True. This will automatically use the credentials from the GCE metadata server and should result in permission errors. Try the same with --load_fast=False and it works with default Service Account credentials.

Corwinpro commented 1 year ago

@samos123 I assume you meant GKE... The error should not be there as I thought I fixed that. Could you please check which server version you are using? I guess something like rustboard --version. There was a release a few weeks ago but that is only applicable for tf>=2.12 IIUC

samos123 commented 1 year ago

GKE + Workload Identity would use a similar mechanism and I would expect it to have the same issue. We were using 2.8.0. Could you share the code where the authentication happens with --load_fast=True? I would be able to pinpoint whether it would work with our custom setup.

Corwinpro commented 1 year ago

@samos123 sorry for confusion, I didn't know that the GCE abbreviation exists.

The PR was #5939 , in particular it gets a GCP Access Token using the gcp_auth::AuthenticationManager (gcp_auth is a 3rd party crate) in tensorboard/data/server/gcs/auth.rs. Overall, I'd try to see if gcp_auth works for your setup.

mueller91 commented 1 year ago

On my Ubuntu 20.04.6 LTS Nvidia A-100 DGX, I cannot get fast loading to work:

Could not start data server: exited with 1; check stderr for details. Try with --load_fast=false and report issues on GitHub. Details: https://github.com/tensorflow/tensorboard/issues/4784

that is all that I get.

Corwinpro commented 1 year ago

@samos123 @mueller91 Can you try https://github.com/tensorflow/tensorboard/releases/tag/2.12.0 or above?

mueller91 commented 1 year ago

@Corwinpro

Does not change it. GLIBC missing might be responsible? However, I have installed it via apt install glibc-source.

[...]
Successfully installed tensorboard-2.14.0
> tensorboard --logdir=. --bind_all --load_fast=true                                                                                           (tensorboard) 
TensorFlow installation not found - running with reduced feature set.
[...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by [...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server)
[...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by [...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server)

Could not start data server: exited with 1; check stderr for details.
    Try with --load_fast=false and report issues on GitHub. Details:
    https://github.com/tensorflow/tensorboard/issues/4784

wookayin commented 1 year ago

IMPORTANT UPDATE (after #6578): As of tensorboard_data_server==0.7.2 for tensorboard 2.15+, GLIBC 2.29 or higher is required. The pre-built wheel shipped with tensorboard >= 2.12 (tensorboard_data_server == 0.7, 0.7.1), downloaded from PyPI, will require GLIBC version 2.34 or higher.

On Ubuntu 20.04 Linux machines where glibc version is 2.31, the rustboard server will fail to launch, trying to find glibc 2.32 - 2.34. Ubuntu 22.04 will be fine, as it's shipped with GLIBC 2.35.

TensorFlow installation not found - running with reduced feature set.
$CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by $CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server)
$CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by $CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server)
$CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by $CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server)
Could not start data server: exited with 1; check stderr for details.

Workaround: On ~Ubuntu 20.04~ or other old systems where GLIBC version is too old, use tensorboard == 2.11 (and tensorboard_data_server == 0.6.1).

FYI, how to figure out the GLIBC version on the system:

$ ldd --version | grep GLIB
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
$ cat /etc/lsb-release | grep DESCRIPTION
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
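
The same check can be sketched in Python with the standard library's `platform.libc_ver()`, which reports the C library the running interpreter is linked against. This is a minimal sketch only. The helper names are assumptions, and the minimum versions (2.29 and 2.34) are simply the figures quoted in the note above, not values read from the installed package.

```python
import platform
import re
from typing import Optional, Tuple

# Minimum GLIBC versions as quoted in this thread (not queried from the package).
MIN_GLIBC_DATA_SERVER_0_7_2 = (2, 29)
MIN_GLIBC_DATA_SERVER_0_7 = (2, 34)


def glibc_version() -> Optional[Tuple[int, ...]]:
    """Return the glibc version as a tuple of ints, or None if not on glibc."""
    lib, version = platform.libc_ver()
    if lib != "glibc" or not version:
        # For example macOS, or a Linux system using a different C library.
        return None
    return tuple(int(part) for part in re.findall(r"\d+", version))


def meets_minimum(found: Optional[Tuple[int, ...]], required: Tuple[int, ...]) -> bool:
    """True if a glibc version was found and it is at least `required`."""
    return found is not None and found >= required
```

On the Ubuntu 20.04 machine described here (GLIBC 2.31), `meets_minimum(glibc_version(), MIN_GLIBC_DATA_SERVER_0_7)` would be expected to return `False`, matching the failure shown above.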

Verifying that tensorboard_data_server>=0.7 is built against too high a version of GLIBC:

$ objdump -T $(python -c "from tensorboard_data_server import server_binary; print(server_binary())")  | grep GLIBC
...
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.34  pthread_create
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.34  __libc_start_main

I'd like to kindly ask the tensorboard team to lower the GLIBC requirement in future releases. I will open an issue if needed. -> #6578

bmd3k commented 1 year ago

@wookayin . Thanks for flagging. Yes, please open a new issue!

profPlum commented 11 months ago

@wchargin

Currently, this mode is supported on Linux and macOS.

Hello, I'm very excited for this feature as tensorboard's speed has been a big pain point so far. BUT when I try to use it, it tells me it's not supported on MacOS:

Option --load_fast=true not available: TensorBoard data server not supported on this platform.

You say it is supported on MacOS though, so what's going on here? I've got MacBookPro17,1; Apple M1 chips; MacOS Ventura, version 13.4.1; tb-nightly Version: 2.15.0a20231013; tf-nightly-macos Version: 2.16.0.dev20231013.

P.S. I've gotten same results using non-nightly tensorflow-macos & no tensorflow at all. Also I followed your instructions exactly to uninstall tensorboard & tb_nightly before reinstalling tb_nightly.

wchargin commented 11 months ago

@profPlum: Hazarding a guess:

Apple M1 chips

That's probably your problem. The tensorboard-data-server package currently ships macOS wheels for x86-64 but not for arm64.

If interested, you can build it yourself easily. I just tested it on my laptop from scratch and had it running in three minutes. Here's how:

  1. If you don't already have a recent version of the Rust toolchain, install it from https://www.rust-lang.org/.

  2. Clone this repository (TensorBoard) into, say, ~/git/tensorboard.

  3. In the clone, change into the tensorboard/data/server/ directory.

  4. Run cargo build --release. This will build a data server binary at target/release/rustboard.

  5. Set the TENSORBOARD_DATA_SERVER_BINARY environment variable to the full path to that binary: e.g.,

    export TENSORBOARD_DATA_SERVER_BINARY=~/git/tensorboard/tensorboard/data/server/target/release/rustboard

    (edit: fixed var name)

  6. Change directories out of the TensorBoard repository to avoid Python import issues, then launch tensorboard with --load_fast true.

If you want to double-check that it's using the data server, you can navigate to http://localhost:6006/data/environment and see whether the debug.data_provider field lists a GrpcDataProvider (fast) or a MultiplexerDataProvider (slow). Or, you can set the environment variable RUST_LOG=debug to see the data server logs.
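
The check against /data/environment can also be scripted. The sketch below is a hedged illustration: it assumes the endpoint returns JSON shaped like `{"debug": {"data_provider": "..."}}`, which is inferred from the `debug.data_provider` field named above and not verified against the TensorBoard source. The function names and the default base URL are hypothetical.

```python
import json
import urllib.request


def classify_data_provider(environment: dict) -> str:
    """Map the debug.data_provider value to 'fast', 'slow', or 'unknown'.

    Assumes `environment` is the parsed JSON from /data/environment with a
    nested "debug" object; that shape is an assumption of this sketch.
    """
    debug = environment.get("debug")
    if not isinstance(debug, dict):
        return "unknown"
    provider = str(debug.get("data_provider", ""))
    if "GrpcDataProvider" in provider:
        return "fast"
    if "MultiplexerDataProvider" in provider:
        return "slow"
    return "unknown"


def fetch_environment(base_url: str = "http://localhost:6006") -> dict:
    """Fetch and parse /data/environment from a running TensorBoard."""
    with urllib.request.urlopen(base_url + "/data/environment", timeout=5) as resp:
        return json.load(resp)
```

With TensorBoard running locally, `classify_data_provider(fetch_environment())` would be expected to return `"fast"` when the data server is in use.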

(I don't currently work on TensorBoard, so consider this not an official response but just a community member who at one point knew this part of the code very well. :-) )

profPlum commented 11 months ago

@wchargin Thanks I appreciate the help! (& I'll let you know if it works) Do you think it is likely that TB devs will give official support to M1 chips soon?

profPlum commented 11 months ago

@wchargin Hi again, I tried your instructions verbatim and it says roughly the same:

TensorFlow installation not found - running with reduced feature set.
Option --load_fast=true not available: TensorBoard data server not supported on this platform.

But to clarify: did you want me to launch the original (pip) tensorboard again? That point confused me and it is what I did, but I'm not sure if it's what you meant.

P.S. With: fresh install of tb_nightly==2.15.0a20231019 & cargo version: 1.73.0 (9c4383fb5 2023-08-26). Also I got same results on a linux docker container.

Frn1nd0 commented 11 months ago

@wchargin Hi, I got an issue when running this:

%load_ext tensorboard
%tensorboard --logdir output

It shows google interface with:

  1. That’s an error. That’s all we know.

Could you please guide me with this? Thanks

bmd3k commented 11 months ago

@Frn1nd0 , your issue is unrelated to fast data loading. Instead you have run into a recent regression with compatibility with Chrome. The Colab team have been investigating. We expect them to keep us updated at the following issue:

https://github.com/googlecolab/colabtools/issues/3990

Frn1nd0 commented 11 months ago

@bmd3k Thanks for the clarification, appreciate that! Hope they can fix this soon.

wookayin commented 11 months ago

Update: #6578 is fixed; as of tensorboard 2.15 GLIBC minimum requirement is 2.29 (compatible with Ubuntu 20.04)

wchargin commented 10 months ago

@profPlum: Oops, sorry, I wrote the environment variable wrong: it should be TENSORBOARD_DATA_SERVER_BINARY. Maybe try again thus?

did you want me to launch the original (pip) tensorboard again?

Yes.

davidxia commented 7 months ago

Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.

It works fine if I set --load_fast=false.

Is this still a bug in the recent versions? I can repro with version 2.11.2.