No reloads when using S3 compatible storage

AlonKellner commented 8 months ago

In addition to this issue, I posted a question on Stack Overflow.

Environment information

Diagnostics

Diagnostics output

```
--- check: autoidentify
INFO: diagnose_tensorboard.py version df7af2c6fc0e4c4a5b47aeae078bc7ad95777ffa

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=10, micro=13, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='c5ff1db54ce4', release='5.15.133.1-microsoft-standard-WSL2', version='#1 SMP Thu Oct 5 21:02:42 UTC 2023', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.15.1
WARNING: no installation among: ['tensorflow', 'tensorflow-gpu', 'tf-nightly', 'tf-nightly-2.0-preview', 'tf-nightly-gpu', 'tf-nightly-gpu-2.0-preview']
WARNING: no installation among: ['tensorflow-estimator', 'tensorflow-estimator-2.0-preview', 'tf-estimator-nightly']
INFO: installed: tensorboard-data-server==0.7.2

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.15.1'

--- check: tensorflow_python_version
Traceback (most recent call last):
  File "//diagnose_tensorboard.py", line 511, in main
    suggestions.extend(check())
  File "//diagnose_tensorboard.py", line 81, in wrapper
    result = fn()
  File "//diagnose_tensorboard.py", line 267, in tensorflow_python_version
    import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.10/site-packages/tensorboard_data_server/bin/server'
INFO: data server binary version: b'rustboard 0.7.2'

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC =
socket.SOCK_STREAM =
socket.AI_ADDRCONFIG =
socket.AI_PASSIVE =
Loopback flags:
Loopback infos: [(, , 6, '', ('127.0.0.1', 0))]
Wildcard flags:
Wildcard infos: [(, , 6, '', ('0.0.0.0', 0)), (, , 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): 'c5ff1db54ce4'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: .tensorboard-info directory does not exist

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.10/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==2.0.0
aiobotocore==2.9.0
aiohttp==3.9.1
aioitertools==0.11.0
aiosignal==1.3.1
async-timeout==4.0.3
attrs==23.1.0
botocore==1.33.13
cachetools==5.3.2
certifi==2023.11.17
charset-normalizer==3.3.2
filelock==3.13.1
frozenlist==1.4.1
fsspec==2023.12.2
google-auth==2.25.2
google-auth-oauthlib==1.2.0
grpcio==1.60.0
idna==3.6
Jinja2==3.1.2
jmespath==1.0.1
lightning==2.1.3
lightning-utilities==0.10.0
Markdown==3.5.1
MarkupSafe==2.1.3
mpmath==1.3.0
multidict==6.0.4
networkx==3.2.1
numpy==1.26.2
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.3.101
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
packaging==23.2
pip==23.0.1
protobuf==4.23.4
pyasn1==0.5.1
pyasn1-modules==0.3.0
python-dateutil==2.8.2
pytorch-lightning==2.1.3
PyYAML==6.0.1
requests==2.31.0
requests-oauthlib==1.3.1
rsa==4.9
s3fs==2023.12.2
setuptools==65.5.1
six==1.16.0
sympy==1.12
tensorboard==2.15.1
tensorboard-data-server==0.7.2
tensorflow-io==0.35.0
tensorflow-io-gcs-filesystem==0.35.0
torch==2.1.2
torchmetrics==1.2.1
tqdm==4.66.1
triton==2.1.0
typing_extensions==4.9.0
urllib3==2.0.7
Werkzeug==3.0.1
wheel==0.42.0
wrapt==1.16.0
yarl==1.9.4
```

Issue description

Here is a repository with a full reproduction of my issue: https://github.com/AlonKellner/s3-tensorboard-issue-reproduction

When using TensorBoard with S3-compatible storage, only the first experiments that the server comes across are shown in the UI.
All experiments that are present during startup are shown fully. If no experiment is present during startup, the first detected experiment is shown only partially.
After an experiment is detected and shown, no further steps or experiments are reloaded or shown.
When using the --reload_task process option, no experiment is shown whatsoever.

I have personally reproduced this unexpected behavior with both Ceph (an on-prem instance) and MinIO (a local Docker image; see the reproduction repo).

The expected behavior is that any new experiment written to the S3-compatible storage is loaded in the UI when pressing the reload button, along with any new steps in that experiment.
Also, I expect this behavior to work correctly with the --reload_multifile=true option.
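For readers who want to try the flag combinations mentioned above, here is a minimal sketch that assembles the command lines. The helper name `tensorboard_command` is hypothetical and `s3://tensorboard` is a placeholder log directory; only the `--logdir`, `--reload_multifile`, and `--reload_task` flags are used.

```python
from typing import List, Optional


def tensorboard_command(
    logdir: str,
    reload_multifile: bool = False,
    reload_task: Optional[str] = None,
) -> List[str]:
    """Build a TensorBoard command line (hypothetical helper, for illustration)."""
    cmd = ["tensorboard", f"--logdir={logdir}"]
    if reload_multifile:
        # Ask TensorBoard to poll all event files in a run directory.
        cmd.append("--reload_multifile=true")
    if reload_task is not None:
        # For example "process", the value reported above as showing no runs.
        cmd.append(f"--reload_task={reload_task}")
    return cmd


if __name__ == "__main__":
    # Placeholder bucket name; substitute your own log directory.
    print(" ".join(tensorboard_command("s3://tensorboard", reload_multifile=True)))
    print(" ".join(tensorboard_command("s3://tensorboard", reload_task="process")))
```

The returned list can be passed directly to `subprocess.run` to launch the server with the chosen options.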

Workarounds are also welcome, thanks :)

arcra commented 8 months ago

Can you clarify what you mean by "all experiments"? Are you referring to the runs? Could you share a screenshot, to make that clearer?

Do you see any errors in the console logs?

Could it be something related or similar to what is reported in #6713?

AlonKellner commented 8 months ago

> Can you clarify what you mean by "all experiments"? Are you referring to the runs? Could you share a screenshot, to make that clearer?

Yes, sorry for using the wrong terminology, when I wrote "experiments" I was referring to "runs".
As for screenshots, I've added six of them to the reproduction repo. Together they explain the full problem, so I won't share all of them here. Here is the one from step-3, when there are 2 runs in MinIO but only 1 run is visible in TensorBoard: (screenshot: step-3)

> Do you see any errors in the console logs?

No, the only console logs are:

TensorFlow installation not found - running with reduced feature set.
TensorBoard 2.15.1 at http://2666274e9da3:6006/ (Press CTRL+C to quit)

> Could it be something related or similar to what is reported in #6713?

It does not seem like it, since I do not see any errors.

arcra commented 8 months ago

> Yes, sorry for using the wrong terminology, when I wrote "experiments" I was referring to "runs".

No worries, I just wanted to make sure I understand the issue correctly.

I believe it's (similarly to #6713) an issue with our "no-TF compatibility" implementation of the GFile interface, particularly the support for the S3 files. I believe a workaround might be to install tensorflow, so it would use the TF implementation. If you do, please confirm whether that solves the problem for you.

Unfortunately, we don't have the bandwidth to investigate this in more detail. Our compat support for the S3 filesystem is best-effort.

AlonKellner commented 7 months ago

I tried installing tensorflow, but then I got this error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/lightning/fabric/loggers/tensorboard.py", line 208, in log_metrics
    self.experiment.add_scalar(k, v, step)
  File "/usr/local/lib/python3.10/site-packages/lightning/fabric/loggers/logger.py", line 118, in experiment
    return fn(self)
  File "/usr/local/lib/python3.10/site-packages/lightning/fabric/loggers/tensorboard.py", line 191, in experiment
    self._experiment = SummaryWriter(log_dir=self.log_dir, **self._kwargs)
  File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 243, in __init__
    self._get_file_writer()
  File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 273, in _get_file_writer
    self.file_writer = FileWriter(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/tensorboard/writer.py", line 72, in __init__
    self.event_writer = EventFileWriter(
  File "/usr/local/lib/python3.10/site-packages/tensorboard/summary/writer/event_file_writer.py", line 72, in __init__
    tf.io.gfile.makedirs(logdir)
  File "/usr/local/lib/python3.10/site-packages/tensorflow/python/lib/io/file_io.py", line 513, in recursive_create_dir_v2
    _pywrap_file_io.RecursivelyCreateDir(compat.path_to_bytes(path))
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme 's3' not implemented (file: 's3://tensorboard/test-test/version_0')

My workaround is a bad one: I wrote a simple bash script that restarts TensorBoard every minute, so it reloads all runs every minute, which works for my use case.
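The original bash script is not included in this thread. As an illustration only, the same restart-every-minute idea could be sketched in Python as below. The function name `restart_periodically` and its parameters are hypothetical, and the command shown in the trailing comment uses a placeholder log directory.

```python
import subprocess
from typing import Optional, Sequence


def restart_periodically(
    cmd: Sequence[str],
    interval_s: float = 60.0,
    max_restarts: Optional[int] = None,
) -> int:
    """Start cmd, stop it after interval_s seconds, then start it again.

    Returns how many times the process was started. With max_restarts=None
    the loop runs until interrupted (for example with CTRL+C).
    """
    starts = 0
    while max_restarts is None or starts < max_restarts:
        proc = subprocess.Popen(cmd)
        starts += 1
        try:
            # Returns early if the process exits by itself before the interval.
            proc.wait(timeout=interval_s)
        except subprocess.TimeoutExpired:
            pass  # Interval elapsed; the process is stopped in `finally`.
        finally:
            # Also runs on KeyboardInterrupt, so no child process is left behind.
            if proc.poll() is None:
                proc.terminate()
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()
                    proc.wait()
    return starts


# Example (placeholder log directory), restarting every 60 seconds:
# restart_periodically(["tensorboard", "--logdir=s3://tensorboard"], interval_s=60)
```

Note that every restart drops the server's in-memory state and briefly makes the UI unavailable, so this remains a stopgap rather than a fix. If the command exits immediately (for example, because of a bad flag), the loop restarts it without delay.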

arcra commented 7 months ago

Looks like there's a separate package that might provide support for that filesystem. Can you try installing tensorflow-io and see if that solves your problem?

Sources: https://discuss.tensorflow.org/t/access-s3-on-tensorflow/8633 https://blog.ukjae.io/posts/enabling-s3-filesystem-support-for-tensorflow-serving/
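Before retrying, it may help to confirm which of the two packages are importable in the environment that runs TensorBoard and the logger. The sketch below only inspects the import system; the function name `find_filesystem_packages` is hypothetical. It relies on the `tensorflow-io` distribution being importable as `tensorflow_io`. That importing `tensorflow_io` registers the `s3://` scheme with `tf.io.gfile` is an assumption here, not something verified in this thread.

```python
import importlib.util
from typing import Dict


def find_filesystem_packages() -> Dict[str, bool]:
    """Report which of the packages discussed in this thread are importable."""
    names = ("tensorflow", "tensorflow_io")
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    found = find_filesystem_packages()
    for name, available in found.items():
        print(f"{name}: {'importable' if available else 'not found'}")
    if all(found.values()):
        # Assumption: `import tensorflow_io` must run before the first
        # tf.io.gfile call so that the extra file systems are registered.
        print("Both packages found; try `import tensorflow_io` before writing logs.")
```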
