rcruzgar closed this issue 3 years ago.
Thanks for the report and sorry you are having trouble! I'm able to see your experiment, but I only see the graph. Looking in our logs it doesn't look like there are any failures associated with this experiment. It sounds like you are saying that the upload also fails when you try to upload using a local logdir. Is it possible for you to share this logdir with us so we can try to reproduce the error on our end?
If not, can you try running tensorboard locally on the logdir, and let us know if it works?
tensorboard --logdir=$LOG_DIR
Hi @bileschi, thanks for the fast reply.
First of all, regarding my comment on the images and TensorBoard.dev, I have seen that the development has been paused (https://github.com/tensorflow/tensorboard/issues/3585#issuecomment-860702799).
You can see the tfevents files I point to under folder issue here: https://github.com/rcruzgar/github_uploads/tree/master/issue
events.out.tfevents.1626708229.ip-10-0-185-123.eu-central-1.compute.internal
Inside folder issue there is another subdirectory called eval_0, containing: events.out.tfevents.1626708811.ip-10-0-185-123.eu-central-1.compute.internal
I have noticed that pointing to the subdirectory successfully creates a TensorBoard.dev experiment (https://tensorboard.dev/experiment/22i1FrDGRt2fwPo0lr51dA/):
tensorboard dev upload --logdir ./issue/eval_0
However, when I point only to the parent folder issue, as in
tensorboard dev upload --logdir ./issue/
nothing is uploaded (https://tensorboard.dev/experiment/yVnKXlEgTVaRHYPjXUqYCw/).
Is it not possible to upload a logdir that contains subfolders to TensorBoard.dev?
Cheers, R.
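(For context: the uploader is expected to discover event files recursively under the logdir, with each subdirectory treated as a separate run. A minimal stdlib sketch of that kind of discovery is below; the helper name is hypothetical and this is not TensorBoard's actual code.)

```python
import os

def find_event_files(logdir):
    """Recursively collect tfevents files, grouped by run subdirectory.

    Illustrative sketch of recursive logdir discovery; the real uploader's
    logic lives in tensorboard.uploader and differs in detail.
    """
    runs = {}
    for root, _dirs, files in os.walk(logdir):
        events = [f for f in files if "tfevents" in f]
        if events:
            run = os.path.relpath(root, logdir)  # "." for the root run
            runs[run] = sorted(os.path.join(root, f) for f in events)
    return runs
```

For a layout like issue/ containing one event file plus an eval_0/ subdirectory, this returns two runs: "." and "eval_0" — so both the root and the subdirectory should be picked up when pointing at the parent folder.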
Yeah, pointing at the root folder should work. Thanks for sharing your logdir. Let me check it out and see if I can replicate on my end.
OK, I can replicate your issue. Digging in to see if I can find the root cause. I suspect the uploader is tripping over the images even though it's supposed to ignore them.
$ tensorboard dev upload --logdir . --one_shot
2021-08-05 11:09:52.375661: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64
2021-08-05 11:09:52.375682: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/QJkQIgS8QZed0MZFwA8lQw/
[2021-08-05T11:09:53] Started scanning logdir.
[2021-08-05T11:09:57] Total uploaded: 0 scalars, 0 tensors, 1 binary objects (8.9 MB)
[2021-08-05T11:09:57] Done scanning logdir.
Done. View your TensorBoard at https://tensorboard.dev/experiment/QJkQIgS8QZed0MZFwA8lQw/
Traceback (most recent call last):
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/bin/tensorboard", line 8, in <module>
sys.exit(run_main())
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/main.py", line 46, in run_main
app.run(tensorboard.main, flags_parser=tensorboard.configure)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/program.py", line 276, in main
return runner(self.flags) or 0
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader_subcommand.py", line 654, in run
return _run(flags, self._experiment_url_callback)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader_subcommand.py", line 124, in _run
intent.execute(server_info, channel)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader_subcommand.py", line 444, in execute
uploader.start_uploading()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 213, in start_uploading
self._upload_once()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 228, in _upload_once
self._request_sender.send_requests(run_to_events)
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 438, in send_requests
self._blob_request_sender.add_event(
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 1011, in add_event
self.flush()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 1041, in flush
sent_blobs += self._send_blob(
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/tensorboard/uploader/uploader.py", line 1107, in _send_blob
for response in self._api.WriteBlob(request_iterator):
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/google/home/bileschi/virtualenv/tensorflow-20210805-py3.9/lib/python3.9/site-packages/grpc/_channel.py", line 803, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.INTERNAL
details = "Received RST_STREAM with error code 2"
debug_error_string = "{"created":"@1628176197.620893403","description":"Error received from peer ipv4:34.95.66.171:443","file":"src/core/lib/surface/call.cc","file_line":1062,"grpc_message":"Received RST_STREAM with error code 2","grpc_status":13}"
>
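(Background on the error above: RST_STREAM with error code 2 is HTTP/2's INTERNAL_ERROR, which gRPC surfaces as StatusCode.INTERNAL, i.e. grpc_status 13. A generic retry wrapper for transient RPC failures might look like the sketch below — purely illustrative, not the uploader's actual behavior.)

```python
import time

def call_with_retries(rpc, max_attempts=3, base_delay=0.01,
                      retriable=(RuntimeError,)):
    """Retry a callable that may fail transiently (e.g. an RPC stream
    reset by RST_STREAM). Illustrative sketch only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return rpc()
        except retriable:
            if attempt == max_attempts:
                raise
            # Exponential backoff between attempts.
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Note that retrying would not help here, since the failure turned out to be deterministic rather than transient.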
In the meantime, as a workaround, you can restrict the upload to an allowlist of plugins (sorry, no images) like so:
$ tensorboard dev upload --logdir . --one_shot --plugins scalars
2021-08-05 11:14:04.693247: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda/lib64
2021-08-05 11:14:04.693268: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
New experiment created. View your TensorBoard at: https://tensorboard.dev/experiment/5yAlqr5OQ9KJCb4XGMXzuQ/
[2021-08-05T11:14:05] Started scanning logdir.
[2021-08-05T11:14:08] Total uploaded: 57 scalars, 0 tensors, 0 binary objects
[2021-08-05T11:14:08] Done scanning logdir.
Done. View your TensorBoard at https://tensorboard.dev/experiment/5yAlqr5OQ9KJCb4XGMXzuQ/
FYI, to others on the TensorBoard team: I noticed that the upload fails when the plugins setting includes graphs, as in --plugins scalars,graphs. This might point at the root cause.
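(The narrowing-down above can be done systematically by enabling plugins one at a time. A small sketch, where upload_ok is a hypothetical probe that runs a trial upload with a given allowlist and reports success:)

```python
def first_failing_plugin(plugins, upload_ok):
    """Identify which plugin in an allowlist breaks the upload by
    enabling plugins cumulatively, one at a time.

    upload_ok: callable taking a plugin list, returning True on success.
    Hypothetical probe, for illustration only.
    """
    enabled = []
    for plugin in plugins:
        if not upload_ok(enabled + [plugin]):
            return plugin
        enabled.append(plugin)
    return None  # no plugin caused a failure
```

With plugins = ["scalars", "graphs", "images"], a probe that fails whenever graphs is enabled would return "graphs", matching the observation above.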
OK, narrowing down what's going on here: it appears that the event file in the root dir contains more than one graph. Local TensorBoard handles this cleanly by displaying only the most recent one. TensorBoard.dev, however, throws an error.
I found this by exploring the event file with the EventAccumulator in Python:
>>> import tensorboard
>>> from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
>>> path="/usr/local/google/home/bileschi/proj/issue_5188/github_uploads/issue_clone"
>>> event_acc=EventAccumulator(path)
>>> event_acc.Reload()
Found more than one graph event per run, or there was a metagraph containing a graph_def, as well as one or more graph events. Overwriting the graph with the newest event.
Found more than one metagraph event per run. Overwriting the metagraph with the newest event.
<tensorboard.backend.event_processing.event_accumulator.EventAccumulator object at 0x7efdf8842ac0>
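(The "overwriting the graph with the newest event" behavior the warning describes can be illustrated with a small stand-in; the (wall_time, kind, payload) tuples below are hypothetical, not real tf.Event protos.)

```python
def newest_graph(events):
    """Keep only the most recent graph event from a run, mirroring the
    EventAccumulator's overwrite-with-newest behavior.

    events: iterable of hypothetical (wall_time, kind, payload) tuples.
    Illustrative stand-in, not TensorBoard's actual code.
    """
    newest = None
    for wall_time, kind, payload in events:
        if kind == "graph" and (newest is None or wall_time >= newest[0]):
            newest = (wall_time, payload)
    return None if newest is None else newest[1]
```

A run containing two graph events at wall times 1.0 and 3.0 would thus display only the later one locally, whereas TensorBoard.dev's uploader errored out instead of deduplicating.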
Allow me to update the name of this bug to reflect the underlying issue. In the meantime, if you don't need to view the graphs, you can use the workaround above, specifying --plugins=scalars. If you do need the graphs, we can see what surgery we can do.
Thank you, your solution seems to work for one experiment, even when specifying a logdir in an AWS S3 bucket:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir s3://bucket_name/subfolder --plugins=scalars
Just to mention that the --one_shot flag doesn't work when using data from S3 as the logdir.
I can do this without trouble, even visualizing the images:
tensorboard --logdir_spec=NAME_EXP1:./issue/,NAME_EXP2:./issue2
(I have uploaded another log to https://github.com/rcruzgar/github_uploads as issue2)
Using data from S3 as well:
AWS_REGION=eu-central-1 tensorboard --logdir_spec=NAME_EXP1:s3://bucket_name1/subfolder,NAME_EXP2:s3://bucket_name2/subfolder
The problem comes when I want to use tensorboard dev upload with data from S3 and two logdirs. I have tried the following options, with no success:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir=NAME_EXP1:s3://bucket_name1/subfolder,NAME_EXP2:s3://bucket_name2/subfolder --plugins=scalars
This doesn't upload anything: https://tensorboard.dev/experiment/1u6OwPkjRdG4jbgfe2Glug/. It seems that --logdir doesn't accept more than one logdir, so I tried --logdir_spec:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir_spec=NAME_EXP1:s3://bucket_name1/subfolder,NAME_EXP2:s3://bucket_name2/subfolder --plugins=scalars
But it seems that dev upload doesn't recognize --logdir_spec:
tensorboard: error: unrecognized arguments: --logdir_spec=
Do you have any recommendation? My fallback would be an automatic job that downloads the data from AWS S3 first, but I would prefer to run tensorboard dev upload directly on the S3 data.
Awesome, I'm glad we are making a little progress here. And thanks for clearly walking through your analysis. It sounds like we have two TensorBoard.dev issues here. I have created two new issues to track those independently.
- --one_shot does not work with S3 (#5205)
- --logdir_spec is not supported (#5207)

I'm not sure what we can do to solve your immediate problem, where you would like to view two separate S3 logdirs in the same TensorBoard.dev view. In the short term it is probably best to download the data locally and upload from there. Note that we are unable to support S3 as cleanly as we support other filesystems, due to some peculiarities in its behavior (see #4786, #4255, pull/38203). TensorBoard's ability to read from S3 is a community-supported contribution (we don't ourselves test it or verify that it works).
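(The download-locally-then-upload workaround boils down to two commands. A small sketch that builds them is below; the aws s3 sync spelling and the helper name are assumptions to adapt to your environment.)

```python
def workaround_cmds(s3_uri, local_dir, region="eu-central-1",
                    plugins="scalars"):
    """Build the 'download locally, then upload' command pair.

    Hypothetical helper: returns the environment and the two argv lists
    to pass to e.g. subprocess.run. The AWS CLI and flag spellings are
    assumptions, not verified TensorBoard/AWS documentation.
    """
    env = {"AWS_REGION": region}
    sync = ["aws", "s3", "sync", s3_uri, local_dir]
    upload = ["tensorboard", "dev", "upload",
              "--logdir", local_dir,
              "--plugins", plugins,
              "--one_shot"]
    return env, sync, upload
```

Since the upload runs against a local directory, --one_shot works here even though it does not work when reading directly from S3.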
Thank you, @bileschi . I will then download it locally at the moment.
Cheers.
This should now be fixed.
The root cause was an issue in an older version of nginx, which was swallowing our custom error signal in the event of a duplicate blob upload. The fix was to update nginx.
Can you please test again to make sure it works, @rcruzgar ? If there is still a problem, please re-open this issue.
Thanks!
Hi @bileschi ,
It works now without specifying the --plugins flag.
Thanks!
Environment information
Issue description
Hi! I have successfully created TensorBoard dashboards using data stored in AWS S3 buckets this way:
AWS_REGION=eu-central-1 tensorboard --logdir s3://bucket_name/subfolder/
Even for a few experiments, using the --logdir_spec flag like this:
AWS_REGION=eu-central-1 tensorboard --logdir_spec=NAME_EXP1:s3://bucket_name/subfolder1/,NAME_EXP2:s3://bucket_name/subfolder2/
I have tried to create a TensorBoard.dev experiment to share with people, at least with one experiment, although ideally I would like to compare two at the same time:
AWS_REGION=eu-central-1 tensorboard dev upload --logdir s3://bucket_name/subfolder/
However, I get the following error message:
Only the model graph has been uploaded, not the scalars or the images. I assume the AWS S3 credentials are properly set, because they work for a normal TensorBoard.
If I run a normal TensorBoard, without the dev upload option, I get what I want:
EDIT: I have to say that tensorboard dev upload --logdir doesn't work with local data either, producing the same error.
Do you have any idea of what could be happening and how to solve this?
Thank you! Best regards, Rubén.