Open eisenjulian opened 4 years ago
@orionr @sanekmelnikov Should we merge these two in 1.4?
The torch.utils.tensorboard implementation uses a writer in core TensorBoard that supports GCS and S3 if TF is installed, and only S3 if it is not. Adding GCS in that second case would be great. You would need to add a new GCSFileSystem around https://github.com/tensorflow/tensorboard/blob/master/tensorboard/compat/tensorflow_stub/io/gfile.py#L206
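For readers wondering what "add a new GCSFileSystem" could look like in outline, here is a minimal stand-alone sketch of a prefix-based filesystem registry. It is loosely modeled on the idea of picking a filesystem by URL prefix; every name in it (InMemoryBucketFileSystem, register_filesystem, get_filesystem) is hypothetical and is not claimed to match TensorBoard's actual code. The in-memory dict stands in for a bucket, and a real implementation would replace it with calls to a GCS client.

```python
from typing import Dict


class InMemoryBucketFileSystem:
    """Hypothetical stand-in for a GCS-backed filesystem; blobs live in a dict."""

    def __init__(self) -> None:
        self._blobs: Dict[str, bytes] = {}

    def exists(self, path: str) -> bool:
        return path in self._blobs

    def append(self, path: str, data: bytes) -> None:
        # Event files are written incrementally, so appending is the core operation.
        self._blobs[path] = self._blobs.get(path, b"") + data

    def read(self, path: str) -> bytes:
        return self._blobs[path]


# Maps a URL prefix such as "gs" to the filesystem object that handles it.
_FILESYSTEMS: Dict[str, object] = {}


def register_filesystem(prefix: str, filesystem: object) -> None:
    _FILESYSTEMS[prefix] = filesystem


def get_filesystem(path: str):
    # Paths without "://" fall back to the empty prefix (local disk in a real setup).
    prefix = path.split("://", 1)[0] if "://" in path else ""
    if prefix not in _FILESYSTEMS:
        raise ValueError(f"No filesystem registered for prefix {prefix!r}")
    return _FILESYSTEMS[prefix]
```

With something like this in place, supporting gs:// without TF would come down to registering one more filesystem object under the "gs" prefix.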
@orionr if I do have tensorflow installed, what steps should I follow to make this work? Simply having it installed doesn't seem to do the trick: the gs://proj/... url is just interpreted as the path gs: -> proj -> ... on local disk.
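A small standard-library illustration of the symptom described above, assuming a made-up URL (gs://proj/runs/exp1 is only an example): when the string is handled as a plain path, the double slash collapses and "gs:" turns into a directory name, whereas URL parsing keeps the scheme and bucket separate.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

url = "gs://proj/runs/exp1"  # hypothetical example URL

# Treated as a POSIX path: "gs:" becomes the first directory component.
local_parts = PurePosixPath(url).parts

# Treated as a URL: scheme, bucket and object path stay separate.
parsed = urlsplit(url)


def is_remote(logdir: str) -> bool:
    """Return True if logdir carries a gs or s3 scheme (illustrative check only)."""
    return urlsplit(logdir).scheme in {"gs", "s3"}
```

If a writer never performs a check like is_remote, the logdir ends up on local disk under a directory literally named "gs:".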
@amatsukawa, is everything installed in the same conda environment or virtualenv? You should be able to confirm that TensorBoard is returning tensorflow at https://github.com/tensorflow/tensorboard/blob/master/tensorboard/compat/__init__.py#L52
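One way to check the "same environment" question from inside the interpreter that runs the training script is sketched below. It only uses the standard library; describe_environment is a hypothetical helper name, and it reports where each package would be imported from without importing it.

```python
import importlib.util
import sys


def describe_environment(packages=("tensorflow", "tensorboard")):
    """Report the interpreter in use and where each package would be loaded from."""
    report = {"python": sys.executable, "prefix": sys.prefix}
    for name in packages:
        spec = importlib.util.find_spec(name)
        if spec is None:
            report[name] = None  # not importable from this environment
        else:
            # origin can be None for namespace packages, so fall back to a label
            report[name] = spec.origin or "namespace package"
    return report


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If tensorflow shows up as None, or its path sits outside the reported prefix, the two packages are probably not installed side by side.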
After some tinkering, this seems to only work if you have tensorflow 2.1 installed, but not with tensorflow 1.14, which is what I had.
Just for folks encountering this later. After further experimentation, it seems this will work with the last v1 release of TF as well (v1.15), if you prefer to have TF1.
Would it be possible to have a version of this which doesn't require TensorFlow to be installed? Maybe an implementation with google-cloud-storage, since this is considerably lighter than TF. The existence of a backend could be checked when torch.utils.tensorboard is imported, for instance. What do you think?
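The import-time check suggested above might look roughly like the following. This is a sketch under assumptions, not how torch.utils.tensorboard behaves today: has_module and pick_gcs_backend are hypothetical names, and the only fact relied on is that the google-cloud-storage package is imported as google.cloud.storage.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if the module can be found without actually importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises if a parent package (e.g. "google") is missing
        return False


def pick_gcs_backend(probe=has_module):
    """Choose a GCS backend by availability; returns None if neither is present."""
    if probe("tensorflow"):
        return "tensorflow"
    if probe("google.cloud.storage"):
        return "google-cloud-storage"
    return None
```

The probe argument exists only so the selection logic can be exercised without installing either package.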
Any progress on this? Having to install tensorflow solely to get logging to GCS is a bit ridiculous, yet somehow S3 is supported out of the box.
I think if anybody is willing to do the work as detailed here, we'd be happy to take some PRs:
The torch.utils.tensorboard implementation uses a writer in core TensorBoard that supports GCS and S3 if TF is installed, and only S3 if it is not. Adding GCS in that second case would be great. You would need to add a new GCSFileSystem around https://github.com/tensorflow/tensorboard/blob/master/tensorboard/compat/tensorflow_stub/io/gfile.py#L206
cc @Reubend
@LarsDu, I was having the same issue with many tools that don't support S3 or GCS buckets. The solution I found on GCP is to use gcsfuse (gfuse), so that the GCS bucket is seen as a local directory.
Many GCP tools already have gcsfuse pre-installed (Vertex AI pipeline...). I don't see any performance issues when logging summaries for TensorBoard. I can imagine that something similar exists for S3 on AWS.
I wound up using tensorboardX for logging since it's much lighter weight than having a TensorBoard install with its many problematic dependencies.
Correction: tensorboardX also has problematic dependencies. I gave up and am trying to include tensorflow in my project, but this makes stakeholders incredibly nervous.
gfuse as @tarrade called out looks like a great option here @LarsDu
gcsfuse is even more problematic: there are permission settings that need to be considered for the Kubernetes cluster from which we are running experimentation. Ultimately, I ended up simply installing tensorboard + tensorflow in our project.
I've filed this issue with tensorboard: https://github.com/tensorflow/tensorboard/issues/6298
gcsfuse is probably not something we should include here since it's so specific to GCP, but PyTorch Lightning recently implemented fsspec inside of TB: https://lightning.ai/docs/pytorch/stable/common/remote_fs.html fsspec gives you support for S3, Azure, GCP, etc. "for free" by providing a generic interface to all of them. Maybe we could move the Lightning implementation here instead of it being a Lightning-specific feature?
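To make the "generic interface" point concrete, here is a minimal sketch using fsspec.open, assuming fsspec is installed. It is not the Lightning implementation. The write_bytes and read_bytes helpers are hypothetical names, and the demo uses fsspec's built-in memory:// filesystem so that it runs without credentials; gs:// or s3:// URLs would go through the same call but need the gcsfs or s3fs package respectively.

```python
try:
    import fsspec  # optional dependency
except ImportError:
    fsspec = None


def write_bytes(url: str, data: bytes) -> None:
    """Write data to any URL that fsspec has a backend for."""
    if fsspec is None:
        raise RuntimeError("fsspec is not installed")
    with fsspec.open(url, "wb") as f:
        f.write(data)


def read_bytes(url: str) -> bytes:
    """Read the full contents of a URL through fsspec."""
    if fsspec is None:
        raise RuntimeError("fsspec is not installed")
    with fsspec.open(url, "rb") as f:
        return f.read()


if __name__ == "__main__" and fsspec is not None:
    # memory:// is an in-process filesystem, used here purely for demonstration
    write_bytes("memory://demo/events.bin", b"example")
    print(read_bytes("memory://demo/events.bin"))
```

The calling code never branches on the scheme, which is what makes this approach attractive compared with one hand-written filesystem class per cloud provider.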
🚀 Feature
When creating a SummaryWriter, specifying a path in an S3 or GCS bucket should write directly to the bucket instead of the local filesystem.
Motivation
In both tensorflow and tensorboardX you can specify s3:// or gs:// paths in your logdir, which greatly simplifies distributed training and monitoring. You can also launch TensorBoard directly from your local machine or a notebook, pointing it at the bucket. That means there is no need to launch a machine to share results: the URL to the results inside the bucket is enough.
Additional context
tensorboardX implementation: https://github.com/lanpa/tensorboardX/blob/master/tensorboardX/record_writer.py#L57