Open wchargin opened 3 years ago
Hello!
Very much interested in this, as we currently maintain a custom entrypoint to make Tensorboard work at all with our data sizes. Unfortunately, I can't get this to work anywhere. Using the latest nightly docker image I get the following error:
root@15bc33cc211f:/# tensorboard --logdir foobar --load_fast=true
Error: Os { code: 99, kind: AddrNotAvailable, message: "Cannot assign requested address" }
Traceback (most recent call last):
File "/usr/local/bin/tensorboard", line 8, in <module>
sys.exit(run_main())
File "/usr/local/lib/python3.6/dist-packages/tensorboard/main.py", line 46, in run_main
app.run(tensorboard.main, flags_parser=tensorboard.configure)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 303, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 251, in _run_main
sys.exit(main(argv))
File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 267, in main
return runner(self.flags) or 0
File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 283, in _run_serve_subcommand
server = self._make_server()
File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 433, in _make_server
(data_provider, deprecated_multiplexer) = self._make_data_provider()
File "/usr/local/lib/python3.6/dist-packages/tensorboard/program.py", line 425, in _make_data_provider
ingester.start()
File "/usr/local/lib/python3.6/dist-packages/tensorboard/data/server_ingester.py", line 150, in start
% popen.poll()
RuntimeError: Data server exited with 1; check stderr for details
Presumably it tries to bind some port that's already in use by another process; unfortunately it doesn't say which one.
Also, it doesn't seem to work with logdir_spec, only logdir. This isn't a huge pain, but the error message just states that I didn't pass logdir -- it should probably explicitly state that load_fast and logdir_spec are incompatible.
@tgolsson: Hi; thank you for your feedback! I hadn’t looked into Docker at all. We bind to port 0, which requests an arbitrary free port from the OS, so it looks like it’s not a port issue but an IPv6 host issue. I’ve filed #4801 and will take a look. I’ve posted therein what I think should be a workaround, in case you’re interested in that sort of thing.
edit: Fixed in #4804; confirmed fix in Docker nightlies.
Also, it doesn't seem to work with logdir_spec, only logdir. This isn't a huge pain, but the error message just states that I didn't pass logdir -- it should probably explicitly state that load_fast and logdir_spec are incompatible.
Yep. As of #4794, if you use --load_fast=auto, we’ll automatically detect unsupported invocations (including --logdir_spec) and fall back to the old codepaths. I can also try to make the error more explicit, particularly for --logdir_spec. Filed #4802.
This is super helpful feedback; thank you.
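For anyone puzzled by the "port 0" remark in this reply: binding to port 0 is a standard sockets convention for asking the OS to choose any free port, which is why an "address not available" error points at the host address rather than at a port clash. A minimal Python sketch of the convention follows. It is not the data server's actual Rust code; the helper name bind_ephemeral and the choice of 127.0.0.1 are assumptions made for illustration.

```python
import socket
from typing import Tuple


def bind_ephemeral(host: str = "127.0.0.1") -> Tuple[socket.socket, int]:
    """Bind a TCP socket to port 0 and return it with the port the OS chose."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.bind((host, 0))  # port 0 means "pick any free port for me"
    sock.listen()
    # getsockname() reports the concrete port that was assigned.
    return sock, sock.getsockname()[1]


sock, port = bind_ephemeral()
print(f"OS assigned port {port}")
sock.close()
```

The assigned port differs from run to run, which is presumably why the data server reports it back to the Python process through a port file, as the -v 1 logs later in this thread show.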
With tensorboard-plugin-profile (2.4.0) installed, I'm getting errors in the log:
Exception in thread DynamicProfilePluginIsActiveThread:
Traceback (most recent call last):
File "/Users/till/homebrew2/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/till/homebrew2/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/till/tfnightly-py3.8/lib/python3.8/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 311, in compute_is_active
self._is_active = any(self.generate_run_to_tools())
File "/Users/till/tfnightly-py3.8/lib/python3.8/site-packages/tensorboard_plugin_profile/profile_plugin.py", line 693, in generate_run_to_tools
plugin_assets = self.multiplexer.PluginAssets(PLUGIN_NAME)
AttributeError: 'NoneType' object has no attribute 'PluginAssets'
(They disappear with --load_fast=false)
Hi @brychcy, thanks! Yes, this is true. The profile plugin uses non-standard approaches to load its data and so won’t work out of the box with --load_fast. I’ll see if we can get it to work, but in the meantime you’ll have to either pass --load_fast=false (if you want to use the profile plugin) or uninstall the profile plugin package (if you don’t care about it and want to silence the errors).
Added a note to the “Known issues” section; thank you!
@brychcy: I’ve sent the profiler folks a patch: https://github.com/tensorflow/profiler/issues/298
Their build appears to be pretty broken, so I’m not sure how long it will take them to integrate this and push a release.
@wchargin Not quite feedback, but I'm wondering if there's any thoughts on multi-directory Rustboard (--logdir dir_a,dir_b in old syntax)? I started doing the work but figured I might ask in case it was intentionally removed or there's a WIP somewhere I'm not seeing.
@tgolsson: Good question! I was thinking of instead supporting a more general mechanism that also resolves requests like #1708. Imagine something like:
$ tensorboard daemon start
$ tensorboard daemon add dir_a
$ tensorboard --daemon --bind_all
$ tensorboard daemon add dir_b
That is, you could add or remove log directories at runtime without having to relaunch TensorBoard or discarding existing loading progress, and also in a way that naturally supports remote filesystems and doesn't require setting up symlink trees.
Opened #4923 to track this, and would be happy to hear your thoughts.
I am getting a lot of warnings about too many open files -- is there a way to reduce or cap the number of open file descriptors?
2021-05-11T14:31:46Z WARN rustboard_core::run] Failed to open event file EventFileBuf("[RUN NAME]"): Os { code: 24, kind: Other, message: "Too many open files" }
I don't have that many runs (~2000), so it shouldn't really be an issue. Using lsof to count the number of open FDs shows over 12k being used...
>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
6210 tokio-run
6210 Reloader-
1035 StdinWatc
1035 server
1035 Reloader
184 gmain
168 gdbus
134 grpc_glob
85 bash
80 snapd
Compared to <500 in "slow" mode.
>> lsof | awk '{print $1}' | sort | uniq -c | sort -r -n | head
427 tensorboa
184 gmain
168 gdbus
85 bash
80 snapd
72 systemd
71 screen
52 dconf\x20
51 dbus-daem
48 llvmpipe-
In my case, the "slow" mode actually loads files faster since it doesn't run into this issue.
@Raphtor: interesting, thank you! Both the old and new codepaths keep an open fd for each event file, so I had considered this but expected it not to be a big problem. Let’s follow up in #4955.
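A possible stopgap while that is investigated, offered as a sketch rather than a confirmed fix: a process's open-file limit is inherited by its children, so a small launcher can raise the soft RLIMIT_NOFILE limit before starting TensorBoard. The helper name and the target of 65536 below are assumptions, the resource module is Unix-only, and the machine's hard limit still applies.

```python
import resource


def raise_nofile_soft_limit(target: int) -> int:
    """Raise the soft open-file limit toward `target`; return the resulting soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to raise
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)  # the soft limit cannot exceed the hard limit
    if target > soft:
        try:
            resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
        except (ValueError, OSError):
            pass  # some platforms reject large values; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


new_soft = raise_nofile_soft_limit(65536)
print(f"soft open-file limit is now {new_soft}")
# A launcher would start TensorBoard after this point, for example with
# subprocess.run(["tensorboard", "--logdir", "logs"]), so that the child
# process (and the data server it spawns) inherits the raised limit.
```

Whether this is enough for a few thousand runs depends on how many descriptors the data server keeps open per event file, which is the question being followed up in #4955.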
Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.
It works fine if I set --load_fast=false.
Fast data loading may be causing issues with the profiler https://github.com/tensorflow/profiler/issues/344 (one of several issues mentioning this problem recently) - a possible solution for now is to switch it off with %tensorboard --logdir=logs --load_fast=false
cc @Terranlee @Jimicy @yisitu
Update: try the latest Profiler plugin v2.5 (pip install tensorboard_plugin_profile (or tensorboard_plugin_profile==2.5.0)). Then, launch (e.g. %tensorboard --logdir=logs without the --load_fast switch) and select Profiler. Thanks @yisitu 👍
You're welcome, happy to help!
Anyone else landing here because they're following instructions from this link regarding using Tensorboard in AzureML?
Closing as the issue has been resolved after I have released tensorboard_plugin_profile 2.5.0.
Ah, we would like to keep this issue open to solicit more feedback on the feature. Reopening.
I see, I'll assign it back to you to track the feature.
Hi, thank you for building this awesome function!
Is it possible to restrict the data server to communicate with only one TensorBoard process? I would appreciate it if this feature is supported because the current data server seems to be accessible by any users on a shared server though TensorBoard itself can have a simple passcode by specifying --path_prefix.
I've got tensorflow output data and explore it with tensorboard as scalars. Usually I make use of the RELATIVE mode of Horizontal Axis and the graphs are displayed well. But with the --load_fast true option the graphs show the data as points (not varying along the X axis) instead of curves. The WALL mode shows only points as well. An example of my data is attached. train.zip
Hi @tgolsson really curious about your implementation for large datasets as I'm trying to get tensorboard running for a few 100K. Have you just changed the hard coded limit (100K) in the typescript and rebuilt? What other changes have you made?
I'm not sure what that limit is for, but I've never heard of it unfortunately. Our problem was related to having too much regular logging data (scalars, histograms/distributions, images) leading to an infinite queue of "refreshes" because they wouldn't finish before retries.
Sorry @tgolsson, keep forgetting the number of other components to Tensorboard. My problems are specific to the embedding projector, but I guess that's not what you've had to solve. Thanks anyway!
Hi, I have encountered a similar issue as #5116 when I use --load_fast=true (implicitly by default). The tf events are stored at a shared file system. ReadRecordError will lead to a termination of updating the latest event for those runs. When I use --load_fast=false, except for slow loading, there are no problems.
Hi, on all compute clusters using our software stack, RustBoard hangs indefinitely at startup and has to be killed (with sigkill, sigterm isn't sufficient i.e. CTRL+C doesn't work). It does not reach the point where it prints something like TensorBoard x.y.z at http://0.0.0.0:PORT/ (Press CTRL+C to quit).
Here is a sample output using -v 1:
(env) [user01@login1 8]$ tensorboard --logdir ~/projects/def-sponsor00/$USER/out --host 0.0.0.0 --port 0 -v 1
TensorFlow installation not found - running with reduced feature set.
I0210 16:25:15.920281 139757372876608 server_ingester.py:290] Server binary (from Python package v0.6.1): /home/user01/env/lib/python3.8/site-packages/tensorboard_data_server/bin/server
I0210 16:25:15.922316 139757372876608 server_ingester.py:138] Spawning data server: ['/home/user01/env/lib/python3.8/site-packages/tensorboard_data_server/bin/server', '--logdir=/home/user01/projects/def-sponsor00/user01/out', '--reload=5', '--samples-per-plugin=', '--port=0', '--port-file=/tmp/tensorboard_data_server_rd4w992q/port', '--die-after-stdin', '--error-file=/tmp/tensorboard_data_server_rd4w992q/startup_error', '--verbose', '--verbose']
[2022-02-10T16:25:15Z DEBUG rustboard_core::cli] Parsed options: Opts { logdir: "/home/user01/projects/def-sponsor00/user01/out", host: "localhost", port: 0, reload: Loop { delay: 5s }, verbosity: 2, die_after_stdin: true, port_file: Some("/tmp/tensorboard_data_server_rd4w992q/port"), error_file: Some("/tmp/tensorboard_data_server_rd4w992q/startup_error"), checksum: false, no_checksum: false, samples_per_plugin: PluginSamplingHint({}) }
I0210 16:25:15.936901 139757372876608 server_ingester.py:160] Polling for data server port (attempt 0)
I0210 16:25:15.938199 139757372876608 server_ingester.py:162] Port file contents: None
[2022-02-10T16:25:15Z TRACE mio::poll] registering event source with poller: token=Token(0), interests=READABLE | WRITABLE
[2022-02-10T16:25:15Z INFO rustboard_core::cli] Wrote port "36186" to /tmp/tensorboard_data_server_rd4w992q/port
[2022-02-10T16:25:15Z INFO rustboard_core::cli] Starting load cycle
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Starting load for run "7/lightning_logs/version_7"
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Starting load for run "8/lightning_logs/version_8"
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Finished load for run "8/lightning_logs/version_8" (1.99877ms)
[2022-02-10T16:25:15Z DEBUG rustboard_core::run] Finished load for run "7/lightning_logs/version_7" (3.682665ms)
[2022-02-10T16:25:15Z INFO rustboard_core::cli] Finished load cycle (8.249155ms)
I0210 16:25:16.439103 139757372876608 server_ingester.py:160] Polling for data server port (attempt 1)
I0210 16:25:16.439600 139757372876608 server_ingester.py:162] Port file contents: '36186\n'
[2022-02-10T16:25:20Z INFO rustboard_core::cli] Starting load cycle
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Starting load for run "7/lightning_logs/version_7"
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Starting load for run "8/lightning_logs/version_8"
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Finished load for run "7/lightning_logs/version_7" (26.968µs)
[2022-02-10T16:25:20Z DEBUG rustboard_core::run] Finished load for run "8/lightning_logs/version_8" (22.896µs)
[2022-02-10T16:25:20Z INFO rustboard_core::cli] Finished load cycle (6.097394ms)
< The last 6 lines repeat indefinitely >
I was wondering if we could use an environment variable to set load_fast=false by default on our clusters.
I've had trouble with the --load_fast=true flag. When running tensorboard without setting --load_fast=false, I eventually start getting the following message repeated indefinitely (I've redacted directory names and usernames as XXX):
[2022-08-06T17:40:14Z WARN rustboard_core::run] Failed to open event file EventFileBuf("XXX/20220806_040526/20220806_040526/events.out.tfevents.1659773299.XXX.XXX.XXX"): Os { code: 24, kind: Other, message: "Too many open files" }
When I get this message, Tensorboard fails to launch. However, I no longer get this message, and Tensorboard launches normally, if I pass --load_fast=false while launching Tensorboard.
I am having trouble with --load_fast on an old server. I don't have access to GLibc-2.18 or above so I have to use patchelf. After a clean install of tensorboard in Python 3.8, when I call tensorboard --logdir . --load_fast=true, I get:
/home/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server: /lib64/libc.so.6: version `GLIBC_2.18' not found (required by /home/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server)
Then I patched server as follows,
patchelf --set-interpreter ~/scratch/mylib/glibc-2.18/lib/ld-linux-x86-64.so.2 --set-rpath ~/scratch/mylib/glibc-2.18/lib/ ~/.conda/envs/th/lib/python3.8/site-packages/tensorboard_data_server/bin/server
tensorboard won't complain about libc.so.6 not found anymore. Instead, it now gives me this,
TensorFlow installation not found - running with reduced feature set.
Could not start data server: failed to bind to ("localhost", 0): failed to lookup address information: Name or service not known.
Do you have any idea what could be going wrong? Thank you!
Update: well, I am able to use it if I install tfboard with conda via conda install -c conda-forge tensorboard tensorboard-data-server, since conda handles all the dependencies for me. But it would be nice if the pip-installed one also worked.
Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.
It works fine if I set --load_fast=false.
Can this be considered to be a bug? Is there a workaround to use --load_fast=true under GKE?
Thank you!
Hi! I think I have a vague understanding of how the original issue can be solved. If someone could help me a bit with the last push, I believe we should be able to use --load_fast=true under GKE.
Hi Corwinpro, there was some discussion about this and your PR. There are a couple folks willing to help you shepherd this into the repo. I've created a new issue for this specific error here:
https://github.com/tensorflow/tensorboard/issues/5934
Thanks for your patience and your contribution!
authentication via default service account is indeed not working when using logdir in 2.8.0, we had to run with --load_fast=false to get it to work. Any plans to support default service account credentials? Also why was this experimental feature turned on by default?
authentication via default service account is indeed not working when using logdir in 2.8.0, we had to run with --load_fast=false to get it to work. Any plans to support default service account credentials? Also why was this experimental feature turned on by default?
Hi, would you mind sharing a bit more information? I might be able to help, but I would need to know how to reproduce your issue. (I am replying here because I contributed to a similar issue in the past, but of course it is up to the repo owners to make the decision). Thank you!
We have a fairly exotic setup, but you might be able to reproduce it by creating a GCE VM with a custom service account that has GCS permissions, then running tensorboard --logdir gcs://your-bucket --load_fast=True. This will automatically use the credentials from the GCE metadata server and should result in permission errors. Try the same with --load_fast=False and it works with default Service Account credentials.
@samos123 I assume you meant GKE... The error should not be there as I thought I fixed that. Could you please check which server version you are using? I guess something like rustboard --version. There was a release a few weeks ago but that is only applicable for tf>=2.12 IIUC
GKE + Workload Identity would use a similar mechanism and I would expect it to have the same issue. We were using 2.8.0. Could you share the code where the authentication happens with --load_fast=True? I would be able to pinpoint whether it would work with our custom setup.
@samos123 sorry for confusion, I didn't know that the GCE abbreviation exists.
The PR was #5939, in particular it gets a GCP Access Token using the gcp_auth::AuthenticationManager (gcp_auth is a 3rd party crate) in tensorboard/data/server/gcs/auth.rs. Overall, I'd try to see if gcp_auth works for your setup.
On my ubuntu 20.04.6LTS Nvidia A-100 DGX, i cannot get fast loading to work:
Could not start data server: exited with 1; check stderr for details. Try with --load_fast=false and report issues on GitHub. Details: https://github.com/tensorflow/tensorboard/issues/4784
that is all that I get.
@samos123 @mueller91 Can you try https://github.com/tensorflow/tensorboard/releases/tag/2.12.0 or above?
@Corwinpro
Does not change it.
GLIBC missing might be responsible? However, I have installed it via apt install glibc-source.
[...]
Successfully installed tensorboard-2.14.0
> tensorboard --logdir=. --bind_all --load_fast=true (tensorboard)
TensorFlow installation not found - running with reduced feature set.
[...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by [...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server)
[...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by [...]anaconda3/envs/tensorboard/lib/python3.10/site-packages/tensorboard_data_server/bin/server)
Could not start data server: exited with 1; check stderr for details.
Try with --load_fast=false and report issues on GitHub. Details:
https://github.com/tensorflow/tensorboard/issues/4784
[!IMPORTANT] UPDATE after #6578: As of tensorboard_data_server==0.7.2 for tensorboard 2.15+, GLIBC 2.29 or higher is required. The pre-built wheel shipped with tensorboard >= 2.12 (tensorboard_data_server == 0.7, 0.7.1), downloaded from PyPI, will require GLIBC version 2.34 or higher.
On Ubuntu 20.04 Linux machines where glibc version is 2.31, the rustboard server will fail to launch, trying to find glibc 2.32 - 2.34. Ubuntu 22.04 will be fine, as it's shipped with GLIBC 2.35.
TensorFlow installation not found - running with reduced feature set.
$CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.33' not found (required by $CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server)
$CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by $CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server)
$CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by $CONDA_PREFIX/lib/python3.11/site-packages/tensorboard_data_server/bin/server)
Could not start data server: exited with 1; check stderr for details.
Workaround: On ~Ubuntu 20.04~ or other old systems where GLIBC version is too old, use tensorboard == 2.11 (and tensorboard_data_server == 0.6.1).
FYI, how to figure out the GLIBC version on the system:
$ ldd --version | grep GLIB
ldd (Ubuntu GLIBC 2.31-0ubuntu9.9) 2.31
$ cat /etc/lsb-release | grep DESCRIPTION
DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
Verifying that tensorboard_data_server>=0.7 is built against too new a version of GLIBC:
$ objdump -T $(python -c "from tensorboard_data_server import server_binary; print(server_binary())") | grep GLIBC
...
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.34 pthread_create
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.34 __libc_start_main
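The same comparison can be scripted. A rough sketch, assuming the objdump output format shown above: it extracts the highest GLIBC_x.y symbol version the binary asks for and compares it with the running system's glibc as reported by Python's platform.libc_ver(). The function names are made up for this example.

```python
import platform
import re
from typing import Optional, Tuple

Version = Tuple[int, ...]

_GLIBC_SYMBOL = re.compile(r"GLIBC_(\d+(?:\.\d+)+)")


def parse_version(text: str) -> Version:
    """Turn '2.34' into (2, 34) so that 2.9 sorts below 2.34."""
    return tuple(int(part) for part in text.split("."))


def max_required_glibc(objdump_output: str) -> Optional[Version]:
    """Highest GLIBC symbol version mentioned in `objdump -T` output, if any."""
    versions = [parse_version(v) for v in _GLIBC_SYMBOL.findall(objdump_output)]
    return max(versions) if versions else None


def system_glibc() -> Optional[Version]:
    """glibc version of the running interpreter's libc, or None if not glibc."""
    name, version = platform.libc_ver()
    match = re.match(r"\d+(?:\.\d+)*", version)
    if name != "glibc" or match is None:
        return None
    return parse_version(match.group(0))


# Two lines copied from the objdump output above.
SAMPLE = """\
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.34 pthread_create
0000000000000000 DF *UND* 0000000000000000 GLIBC_2.34 __libc_start_main
"""

required = max_required_glibc(SAMPLE)
available = system_glibc()
print("required:", required)  # required: (2, 34)
if available is None:
    print("could not detect glibc on this system")
else:
    print("available:", available, "ok" if available >= required else "too old")
```

Versions are compared as integer tuples rather than strings or floats, since a plain string comparison would rank 2.9 above 2.34.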
I'd like to kindly ask the tensorboard team to lower the GLIBC requirement in future releases. I will open an issue if needed. -> #6578
@wookayin . Thanks for flagging. Yes, please open a new issue!
@wchargin
Currently, this mode is supported on Linux and macOS.
Hello, I'm very excited for this feature as tensorboard's speed has been a big pain point so far. BUT when I try to use it, it tells me it's not supported on MacOS:
Option --load_fast=true not available: TensorBoard data server not supported on this platform.
You say it is supported on MacOS though, so what's going on here? I've got MacBookPro17,1; Apple M1 chips; MacOS Ventura, version 13.4.1; tb-nightly Version: 2.15.0a20231013; tf-nightly-macos Version: 2.16.0.dev20231013.
P.S. I've gotten same results using non-nightly tensorflow-macos & no tensorflow at all. Also I followed your instructions exactly to uninstall tensorboard & tb_nightly before reinstalling tb_nightly.
@profPlum: Hazarding a guess:
Apple M1 chips
That's probably your problem. The tensorboard-data-server package currently ships macOS wheels for x86-64 but not for arm64.
If interested, you can build it yourself easily. I just tested it on my laptop from scratch and had it running in three minutes. Here's how:
1. If you don't already have a recent version of the Rust toolchain, install it from https://www.rust-lang.org/.
2. Clone this repository (TensorBoard) into, say, ~/git/tensorboard.
3. In the clone, change into the tensorboard/data/server/ directory.
4. Run cargo build --release. This will build a data server binary into target/release/rustboard/.
5. Set the TENSORBOARD_DATA_SERVER_BINARY environment variable to the full path to that binary: e.g.,
   export TENSORBOARD_DATA_SERVER_BINARY=~/git/tensorboard/tensorboard/data/server/target/release/rustboard
   (edit: fixed var name)
6. Change directories out of the TensorBoard repository to avoid Python import issues, then launch tensorboard with --load_fast true.
If you want to double-check that it's using the data server, you can navigate to http://localhost:6006/data/environment and see whether the debug.data_provider field lists a GrpcDataProvider (fast) or a MultiplexerDataProvider (slow). Or, you can set the environment variable RUST_LOG=debug to see the data server logs.
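That check can also be scripted. A small sketch, assuming the endpoint returns JSON in which debug.data_provider is a nested key holding a string that names the provider class. The exact response shape is an assumption here, as are the function names and the default port 6006 taken from the URL above.

```python
import json
from urllib.request import urlopen


def data_provider_kind(environment: dict) -> str:
    """Classify the data provider named in a /data/environment response."""
    debug = environment.get("debug")
    provider = str(debug.get("data_provider", "")) if isinstance(debug, dict) else ""
    if "GrpcDataProvider" in provider:
        return "fast"
    if "MultiplexerDataProvider" in provider:
        return "slow"
    return "unknown"


def fetch_environment(base_url: str = "http://localhost:6006") -> dict:
    """Fetch the environment JSON from a running TensorBoard (not called below)."""
    with urlopen(f"{base_url}/data/environment", timeout=5) as response:
        return json.load(response)


# Hand-written sample payload; a real response has more fields.
sample = {"debug": {"data_provider": "GrpcDataProvider(addr='localhost:36186')"}}
print(data_provider_kind(sample))  # fast
```

With TensorBoard running, data_provider_kind(fetch_environment()) would report the mode in use, provided the response matches the assumed shape.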
(I don't currently work on TensorBoard, so consider this not an official response but just a community member who at one point knew this part of the code very well. :-) )
@wchargin Thanks I appreciate the help! (& I'll let you know if it works) Do you think it is likely that TB devs will give official support to M1 chips soon?
@wchargin Hi again, I tried your instructions verbatim and it says roughly the same:
TensorFlow installation not found - running with reduced feature set.
Option --load_fast=true not available: TensorBoard data server not supported on this platform.
But to clarify: did you want me to launch the original (pip) tensorboard again? That point confused me and it is what I did but I'm not sure if it's what you meant.
P.S. With: fresh install of tb_nightly==2.15.0a20231019 & cargo version: 1.73.0 (9c4383fb5 2023-08-26). Also I got same results on a linux docker container.
@wchargin Hi, I got issue when running this: %load_ext tensorboard %tensorboard --logdir output
It shows google interface with:
Could you please guide me with this? Thanks
@Frn1nd0 , your issue is unrelated to fast data loading. Instead you have run into a recent regression with compatibility with Chrome. The Colab team have been investigating. We expect them to keep us updated at the following issue:
@bmd3k Thanks for the clarification, appreciate that! Hope they can fix this soon.
Update: #6578 is fixed; as of tensorboard 2.15 GLIBC minimum requirement is 2.29 (compatible with Ubuntu 20.04)
@profPlum: Oops, sorry, I wrote the environment variable wrong: it should be TENSORBOARD_DATA_SERVER_BINARY. Maybe try again thus?
did you want me to launch the original (pip) tensorboard again?
Yes.
Using --load_fast under GKE with workload identity causes 401 Unauthorized error in rustboard_core::logdir when accessing GCS buckets.
It works fine if I set --load_fast=false.
Is this still a bug in the recent versions? I can repro with version 2.11.2.
This thread is for tracking feedback about TensorBoard’s experimental mode for fast data loading. Typical speedups range from 100× to 400×.
Who should try this: Anyone who’s found TensorBoard’s data loading to be slower than they’d like.
Who shouldn’t try this: Windows users (for now).
Feedback: Feedback form, or reply on this thread.
Try it out
To try this out, please uninstall all copies of TensorBoard and then install the latest version of tb-nightly:
Then, invoke TensorBoard with the --load_fast=true flag:
Use TensorBoard as you usually would. It should work the same way, just faster.
Feedback
You can respond to this anonymous Google Form, or reply on this thread, or open a new issue. Let us know: did it work? how much faster was it? any suggestions or requests?
Known issues
We know about these, but please let us know if they matter for you, so that we can prioritize working on them:
FAQ
What does “data loading” include?
It includes time spent reading files in your logdir. It does not include time spent painting charts on the frontend.
What is the --load_fast flag?
Pass --load_fast=true to tell TensorBoard to use a new data loading mechanism, which is generally hundreds of times faster.
Is --load_fast=true right for me?
Currently, this mode is supported on Linux and macOS. If you are interested in using it on other platforms, ping @wchargin and I’ll show you how to build it.
Most features of TensorBoard are expected to work with the new data loading mechanism. All standard TensorBoard dashboards (scalars, images, etc.) should work, and flags like --reload_interval should work, too. You can use logdirs on local disk or on GCS buckets (public or private).
Do I need to have TensorFlow installed?
No.
What’s happening under the hood?
Instead of crawling your logdir in a mixture of Python and C++ code with a lot of locking, cross-language marshalling, and slow data manipulation in Python, we read the data in a dedicated subprocess. This program is written in Rust and is optimized for concurrent reading and serving. More design details here.