rcruzgar closed this issue 3 years ago.
Thanks for the report and sorry you are having trouble! I'm able to see your experiment, but I only see the graph. Looking in our logs it doesn't look like there are any failures associated with this experiment. It sounds like you are saying that the upload also fails when you try to upload using a local logdir. Is it possible for you to share this logdir with us so we can try to reproduce the error on our end?
If not, can you try running tensorboard locally on the logdir, and let us know if it works?
tensorboard --logdir=$LOG_DIR
Hi @bileschi, thanks for the fast reply.
First of all, regarding my comment on the images and TensorBoard.dev, I have seen that the development has been paused (https://github.com/tensorflow/tensorboard/issues/3585#issuecomment-860702799).
You can see the tfevents files I point to under folder issue here: https://github.com/rcruzgar/github_uploads/tree/master/issue
events.out.tfevents.1626708229.ip-10-0-185-123.eu-central-1.compute.internal
Inside folder issue there is another subdirectory called eval_0, containing: events.out.tfevents.1626708811.ip-10-0-185-123.eu-central-1.compute.internal
I have noticed that pointing to the subdirectory successfully creates a TensorBoard.dev experiment (https://tensorboard.dev/experiment/22i1FrDGRt2fwPo0lr51dA/):
tensorboard dev upload --logdir ./issue/eval_0
However, when I point only to the parent folder issue, as in
tensorboard dev upload --logdir ./issue/
nothing is uploaded (https://tensorboard.dev/experiment/yVnKXlEgTVaRHYPjXUqYCw/).
Is it not possible to upload a logdir that contains subfolders to TensorBoard.dev?
Cheers, R.
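(For context: the uploader is expected to discover event files recursively under the logdir, with each subdirectory treated as a separate run. A minimal stdlib sketch of that kind of discovery is below; the helper name is hypothetical and this is not TensorBoard's actual code.)

```python
import os

def find_event_files(logdir):
    """Recursively collect tfevents files, grouped by run subdirectory.

    Illustrative sketch of recursive logdir discovery; the real uploader's
    logic lives in tensorboard.uploader and differs in detail.
    """
    runs = {}
    for root, _dirs, files in os.walk(logdir):
        events = [f for f in files if "tfevents" in f]
        if events:
            run = os.path.relpath(root, logdir)  # "." for the root run
            runs[run] = sorted(os.path.join(root, f) for f in events)
    return runs
```

For a layout like issue/ containing one event file plus an eval_0/ subdirectory, this returns two runs: "." and "eval_0" — so both the root and the subdirectory should be picked up when pointing at the parent folder.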
Yeah, pointing at the root folder should work. Thanks for sharing your logdir. Let me check it out and see if I can replicate on my end.
OK, I can replicate your issue. Digging in to see if I can find the root cause. I suspect the uploader is tripping over the images even though it's supposed to ignore them.
$ tensorboard dev upload --logdir . --one_shot
2021-08-05 11:09:52.375661: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64
2021-08-05 11:09:52.375682: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/QJkQIgS8QZed0MZFwA8lQw/
[2021-08-05T11:09:53] Started scanning logdir.
[2021-08-05T11:09:57] Total uploaded: 0 scalars, 0 tensors, 1 binary objects (8.9 MB)
[2021-08-05T11:09:57] Done scanning logdir.
Done. View your TensorBoard at https://tensorboard.dev/experiment/QJkQIgS8QZed0MZFwA8lQw/
Traceback (most recent call last):
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/bin/tensorboard", line 8, in <module>
sys.exit(run_main())
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/main.py", line 46, in run_main
app.run(tensorboard.main, flags_parser=tensorboard.configure)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/program.py", line 276, in main
return runner(self.flags) or 0
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader_subcommand.py", line 654, in run
return _run(flags, self._experiment_url_callback)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader_subcommand.py", line 124, in _run
intent.execute(server_info, channel)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader_subcommand.py", line 444, in execute
uploader.start_uploading()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 213, in start_uploading
self._upload_once()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 228, in _upload_once
self._request_sender.send_requests(run_to_events)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 438, in send_requests
self._blob_request_sender.add_event(
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 1011, in add_event
self.flush()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 1041, in flush
sent_blobs += self._send_blob(
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 1107, in _send_blob
for response in self._api.WriteBlob(request_iterator):
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "Received RST_STREAM with error code 2"
debug_error_string = "{"created":"@1628176197.620893403","description":"Error received from peer ipv4:34.95.66.171:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Received RST_STREAM with error code 2","grpc_status":13}"
>
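(Background on the error above: RST_STREAM with error code 2 is HTTP/2's INTERNAL_ERROR, which gRPC surfaces as StatusCode.INTERNAL, i.e. grpc_status 13. A generic retry wrapper for transient RPC failures might look like the sketch below — purely illustrative, not the uploader's actual behavior.)

```python
import time

def call_with_retries(rpc, max_attempts=3, base_delay=0.01,
                      retriable=(RuntimeError,)):
    """Retry a callable that may fail transiently (e.g. an RPC stream
    reset by RST_STREAM). Illustrative sketch only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except retriable:
            if attempt == max_attempts:
                raise
            # Exponential backoff between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Note that retrying would not help here, since the failure turned out to be deterministic rather than transient.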
In the meantime, as a workaround, you can restrict the upload to an allowlist of plugins (sorry, no images) like so:
$ tensorboard dev upload --logdir . --one_shot --plugins scalars
2021-08-05 11:14:04.693247: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64
2021-08-05 11:14:04.693268: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/5yAlqr5OQ9KJCb4XGMXzuQ/
[2021-08-05T11:14:05] Started scanning logdir.
[2021-08-05T11:14:08] Total uploaded: 57 scalars, 0 tensors, 0 binary objects
[2021-08-05T11:14:08] Done scanning logdir.
Done. View your TensorBoard at https://tensorboard.dev/experiment/5yAlqr5OQ9KJCb4XGMXzuQ/
FYI, to others on the TensorBoard team: I noticed that the upload fails when the plugins setting includes graphs, as in --plugins scalars,graphs. This might point at the root cause.
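(The narrowing-down above can be done systematically by enabling plugins one at a time. A small sketch, where upload_ok is a hypothetical probe that runs a trial upload with a given allowlist and reports success:)

```python
def first_failing_plugin(plugins, upload_ok):
    """Identify which plugin in an allowlist breaks the upload by
    enabling plugins cumulatively, one at a time.

    upload_ok: callable taking a plugin list, returning True on success.
    Hypothetical probe, for illustration only.
    """
    enabled = []
    for plugin in plugins:
        if not upload_ok(enabled + [plugin]):
            return plugin
        enabled.append(plugin)
    return None  # no plugin caused a failure
```

With plugins = ["scalars", "graphs", "images"], a probe that fails whenever graphs is enabled would return "graphs", matching the observation above.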
OK, narrowing down what's going on here: it appears that the event file in the root dir contains more than one graph. Local TensorBoard handles this cleanly by displaying only the most recent one. TensorBoard.dev, however, throws an error.
I found this by exploring the event file with the EventAccumulator in Python:
>>> import tensorboard
>>> from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
>>> path="/usr/local/google/home/bileschi/proj/issue_5188/github_uploads/issue_clone"
>>> event_acc=EventAccumulator(path)
>>> event_acc.Reload()
Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
<tensorboard.backend.event_processing.event_accumulator.EventAccumulator object at 0x7efdf8842ac0>
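(The "overwriting the graph with the newest event" behavior the warning describes can be illustrated with a small stand-in; the (wall_time, kind, payload) tuples below are hypothetical, not real tf.Event protos.)

```python
def newest_graph(events):
    """Keep only the most recent graph event from a run, mirroring the
    EventAccumulator's overwrite-with-newest behavior.

    events: iterable of hypothetical (wall_time, kind, payload) tuples.
    Illustrative stand-in, not TensorBoard's actual code.
    """
    newest = None
    for wall_time, kind, payload in events:
        if kind == "graph" and (newest is None or wall_time >= newest[0]):
            newest = (wall_time, payload)
    return None if newest is None else newest[1]
```

A run containing two graph events at wall times 1.0 and 3.0 would thus display only the later one locally, whereas TensorBoard.dev's uploader errored out instead of deduplicating.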
Allow me to update the name of this bug to reflect the underlying issue. In the meantime, if you don't need to view the graphs, you can use the workaround above, specifying --plugins=scalars. If you do need the graphs, we can see what surgery we can do.
Thank you, your solution seems to work for one experiment, even when specifying a logdir in an AWS S3 bucket:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir s3://bucket_name/subfolder --plugins=scalars
Just to mention that the --one_shot flag doesn't work when using data from S3 as the logdir.
I can do this without trouble, even visualizing the images:
tensorboard --logdir_spec=NAME_EXP1:./issue/,NAME_EXP2:./issue2
(I have uploaded another log to https://github.com/rcruzgar/github_uploads as issue2)
Using data from S3 as well:
AWS_REGION=eu-central-1 tensorboard --logdir_spec=NAME_EXP1:s3://bucket_name1/subfolder,NAME_EXP2:s3://bucket_name2/subfolder
The problem comes when I want to use tensorboard dev upload with data from S3 and two logdirs. I have tried the following options, with no success:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir=NAME_EXP1:s3://bucket_name1/subfolder,NAME_EXP2:s3://bucket_name2/subfolder --plugins=scalars
This doesn't upload anything: https://tensorboard.dev/experiment/1u6OwPkjRdG4jbgfe2Glug/. It seems that --logdir doesn't accept more than one logdir, so I tried --logdir_spec:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir_spec=NAME_EXP1:s3://bucket_name1/subfolder,NAME_EXP2:s3://bucket_name2/subfolder --plugins=scalars
But it seems that dev upload doesn't recognize --logdir_spec:
tensorboard: error: unrecognized arguments: --logdir_spec=
Do you have any recommendation? My fallback would be an automatic job that downloads the data from AWS S3 first, but I would prefer to run tensorboard dev upload directly on the S3 data.
Awesome, I'm glad we are making a little progress here. And thanks for clearly walking through your analysis. It sounds like we have two TensorBoard.dev issues here. I have created two new issues to track those independently.
- --one_shot does not work with S3 (#5205)
- --logdir_spec is not supported (#5207)

I'm not sure what we can do to solve your immediate problem, where you would like to view two separate S3 logdirs in the same TensorBoard.dev view. In the short term it is probably best to download the data locally and upload from there. Note that we are unable to support S3 as cleanly as we support other filesystems, due to some peculiarities in its behavior (see #4786, #4255, pull/38203). TensorBoard's ability to read from S3 is a community-supported contribution (we don't ourselves test it or verify that it works).
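(The download-locally-then-upload workaround boils down to two commands. A small sketch that builds them is below; the aws s3 sync spelling and the helper name are assumptions to adapt to your environment.)

```python
def workaround_cmds(s3_uri, local_dir, region="eu-central-1",
                    plugins="scalars"):
    """Build the 'download locally, then upload' command pair.

    Hypothetical helper: returns the environment and the two argv lists
    to pass to e.g. subprocess.run. The AWS CLI and flag spellings are
    assumptions, not verified TensorBoard/AWS documentation.
    """
    env = {"AWS_REGION": region}
    sync = ["aws", "s3", "sync", s3_uri, local_dir]
    upload = ["tensorboard", "dev", "upload",
              "--logdir", local_dir,
              "--plugins", plugins,
              "--one_shot"]
    return env, sync, upload
```

Since the upload runs against a local directory, --one_shot works here even though it does not work when reading directly from S3.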
Thank you, @bileschi . I will then download it locally at the moment.
Cheers.
This should now be fixed.
The root cause was an issue in an older version of nginx, which was swallowing our custom error signal in the event of a duplicate blob upload. The fix was to update nginx.
Can you please test again to make sure it works, @rcruzgar ? If there is still a problem, please re-open this issue.
Thanks!
Hi @bileschi ,
It works now without specifying the --plugins flag.
Thanks!
Environment information
Issue description
Hi! I have successfully created TensorBoard dashboards using data stored in AWS S3 buckets this way:
AWS_REGION=eu-central-1 tensorboard --logdir s3://bucket_name/subfolder/
Even for a few experiments, using the --logdir_spec flag like this:
AWS_REGION=eu-central-1 tensorboard --logdir_spec=NAME_EXP1:s3://bucket_name/subfolder1/,NAME_EXP2:s3://bucket_name/subfolder2/
I have tried to create a TensorBoard.dev experiment to share with people, at least with one experiment, although ideally I would like to compare two at the same time:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir s3://bucket_name/subfolder/
However, I get the following error message:
Only the model graph has been uploaded, not the scalars or the images. I assume the AWS S3 credentials are properly set, because they work for a normal TensorBoard.
If I run a normal TensorBoard, without the dev upload option, I get what I want:
EDIT: I have to say that tensorboard dev upload --logdir doesn't work with local data either, producing the same error.
Do you have any idea of what could be happening and how to solve this?
Thank you! Best regards, Rubén.