pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
https://pytorch.org

[Tensorboard] Write summaries to S3 or GCS bucket #24468

Open eisenjulian opened 4 years ago

eisenjulian commented 4 years ago

🚀 Feature

When creating a SummaryWriter, specifying a path in an S3 or GCS bucket should write directly to the bucket instead of to the local filesystem.

Motivation

In both TensorFlow and tensorboardX you can specify s3:// or gs:// paths as your logdir, which greatly simplifies distributed training and monitoring. You can also launch TensorBoard directly from your local machine or a notebook and point it at the bucket, so there is no need to launch a machine to share results; you just share the URL of the results inside the bucket.

Additional context

tensorboardX implementation https://github.com/lanpa/tensorboardX/blob/master/tensorboardX/record_writer.py#L57

lanpa commented 4 years ago

https://github.com/lanpa/tensorboardX/pull/528

lanpa commented 4 years ago

@orionr @sanekmelnikov Should we merge these two in 1.4?

orionr commented 4 years ago

The torch.utils.tensorboard implementation uses a writer in core TensorBoard that supports GCS and S3 if TF is installed, and only S3 if it is not. Adding GCS in that second case would be great. You would need to add a new GCSFileSystem around https://github.com/tensorflow/tensorboard/blob/master/tensorboard/compat/tensorflow_stub/io/gfile.py#L206
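As a rough illustration of the idea above: a writer can pick a filesystem object from the path prefix, so GCS support without TF would mean registering a class for gs://. The sketch below is not TensorBoard's code. `register_fs`, `resolve_fs`, and `InMemoryBucketFS` are hypothetical names defined in the snippet itself, and a dict stands in for a real storage client.

```python
# Toy prefix-dispatch registry; NOT TensorBoard's actual code.
_REGISTRY = {}  # maps a scheme such as "gs" to a filesystem object


def register_fs(scheme, fs):
    """Associate a URL scheme (without '://') with a filesystem object."""
    _REGISTRY[scheme] = fs


def resolve_fs(path):
    """Return the filesystem for `path`; paths with no scheme use key ''."""
    scheme, sep, _ = path.partition("://")
    key = scheme if sep else ""
    if key not in _REGISTRY:
        raise ValueError(f"no filesystem registered for {path!r}")
    return _REGISTRY[key]


class InMemoryBucketFS:
    """Hypothetical stand-in for a GCS-backed filesystem.

    A real version would call a storage client instead of using a dict.
    """

    def __init__(self):
        self._blobs = {}

    def exists(self, path):
        return path in self._blobs

    def append(self, path, data):
        # Event files grow over time, so writes are modeled as appends.
        self._blobs[path] = self._blobs.get(path, b"") + data

    def read(self, path):
        return self._blobs[path]


register_fs("gs", InMemoryBucketFS())

fs = resolve_fs("gs://proj/runs/events.out")
fs.append("gs://proj/runs/events.out", b"step-1")
fs.append("gs://proj/runs/events.out", b"step-2")
```

With nothing registered for local paths in this toy version, `resolve_fs("/tmp/run")` raises ValueError. That is roughly the opposite of the behaviour reported in this thread, where an unrecognised gs:// prefix silently falls back to local disk.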

amatsukawa commented 4 years ago

@orionr if I do have tensorflow installed, what steps should I follow to make this work? Simply having it installed doesn't seem to do the trick: the gs://proj/... URL is just interpreted as the path gs: -> proj -> ... on local disk.

orionr commented 4 years ago

@amatsukawa, is everything installed in the same conda environment or virtualenv? You should be able to confirm that TensorBoard is returning tensorflow at https://github.com/tensorflow/tensorboard/blob/master/tensorboard/compat/__init__.py#L52
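A quick way to check the "same environment" question is to ask the interpreter that runs training what it can see. This is a coarser check than inspecting tensorboard.compat itself, and `describe_backend` is a hypothetical helper name; it only uses the standard library.

```python
import importlib.util
from importlib import metadata


def describe_backend():
    """Report which relevant packages this interpreter can see."""
    report = {}
    for name in ("tensorflow", "tensorboard"):
        if importlib.util.find_spec(name) is None:
            report[name] = None  # not importable from this environment
            continue
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Importable, but installed under another distribution name
            # (for example a CPU-only or nightly build).
            report[name] = "unknown"
    return report


print(describe_backend())
```

A value of None for tensorflow means the TF-backed writer cannot be picked up from this environment, whatever is installed elsewhere on the machine.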

amatsukawa commented 4 years ago

After some tinkering, this seems to only work if you have tensorflow 2.1 installed, but not with tensorflow 1.14, which is what I had.

amatsukawa commented 4 years ago

Just for folks encountering this later. After further experimentation, it seems this will work with the last v1 release of TF as well (v1.15), if you prefer to have TF1.

JulianFerry commented 4 years ago

Would it be possible to have a version of this which doesn't require TensorFlow to be installed? Maybe an implementation with google-cloud-storage, since this is considerably lighter than TF. The existence of a backend could be checked when torch.utils.tensorboard is imported, for instance. What do you think?
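A minimal sketch of the backend check suggested above, using only the standard library. The scheme-to-package mapping is an assumption for illustration, and `is_importable` and `check_logdir` are hypothetical names, not part of torch.utils.tensorboard.

```python
import importlib.util

# Assumed mapping from URL scheme to the optional package that would serve it.
OPTIONAL_BACKENDS = {
    "gs": "google.cloud.storage",
    "s3": "boto3",
}


def is_importable(module_name):
    """True if `module_name` can be located in this environment."""
    try:
        # For dotted names, find_spec imports the parent packages
        # but not the leaf module itself.
        return importlib.util.find_spec(module_name) is not None
    except ImportError:
        # Raised when a parent package (e.g. `google`) is missing.
        return False


def check_logdir(logdir):
    """Return (scheme, ok); paths without a scheme always pass."""
    scheme, sep, _ = logdir.partition("://")
    if not sep:
        return ("", True)
    module_name = OPTIONAL_BACKENDS.get(scheme)
    if module_name is None:
        return (scheme, False)
    return (scheme, is_importable(module_name))
```

Running a check like this when the writer is created would let a gs:// logdir fail with a clear message instead of being treated as a local path.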

LarsDu commented 1 year ago

Any progress on this? Having to install tensorflow solely to get logging to GCS is a bit ridiculous, yet somehow S3 is supported out of the box

orionr commented 1 year ago

I think if anybody is willing to do the work as detailed here, we'd be happy to take some PRs:

The torch.utils.tensorboard implementation uses a writer in core TensorBoard that supports GCS and S3 if TF is installed, and only S3 if it is not. Adding GCS in that second case would be great. You would need to add a new GCSFileSystem around https://github.com/tensorflow/tensorboard/blob/master/tensorboard/compat/tensorflow_stub/io/gfile.py#L206

cc @Reubend

tarrade commented 1 year ago

@LarsDu, I was having the same issue with many tools that don't support S3 or GCS buckets. The solution I found on GCP is to use gcsfuse, so that the GCS bucket is seen as a local directory.

Many GCP tools already have gcsfuse pre-installed (Vertex AI pipeline...). I don't see any performance issues when logging summaries for TensorBoard. I imagine something similar exists for S3 on AWS.

LarsDu commented 1 year ago

I wound up using tensorboardX for logging, since it's much lighter weight than a TensorBoard install with its many problematic dependencies.

LarsDu commented 1 year ago

Correction: tensorboardX also has problematic dependencies. I gave up and am trying to include tensorflow in my project, but this makes stakeholders incredibly nervous

orionr commented 1 year ago

gcsfuse, as @tarrade called out, looks like a great option here, @LarsDu

LarsDu commented 1 year ago

gcsfuse is even more problematic: there are permission settings that need to be considered for the Kubernetes cluster on which we run our experiments. Ultimately, I ended up simply installing tensorboard + tensorflow in our project.

I've filed this issue with tensorboard: https://github.com/tensorflow/tensorboard/issues/6298

Reubend commented 1 year ago

gcsfuse is probably not something we should include here since it's so specific to GCP, but PyTorch Lightning recently implemented fsspec inside of TB: https://lightning.ai/docs/pytorch/stable/common/remote_fs.html fsspec gives you support for S3, Azure, GCP, etc. "for free" by providing a generic interface to all of them. Maybe we could move the Lightning implementation here instead of it being a Lightning-specific feature?
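To show what the generic interface looks like, here is a minimal sketch that assumes fsspec is installed. It is not the Lightning implementation; `write_bytes` and `read_bytes` are hypothetical helpers, and the memory:// URL uses fsspec's built-in in-memory filesystem so that no cloud credentials are needed.

```python
def write_bytes(url, data):
    """Write `data` to `url` through whichever backend fsspec selects."""
    import fsspec  # imported lazily so this module loads without fsspec

    with fsspec.open(url, "wb") as f:
        f.write(data)


def read_bytes(url):
    """Read the full contents of `url` through fsspec."""
    import fsspec

    with fsspec.open(url, "rb") as f:
        return f.read()


def demo():
    # The same calls would take an s3:// or gs:// URL, provided the
    # matching fsspec backend package (s3fs or gcsfs) is installed.
    url = "memory://runs/exp1/events.bin"
    write_bytes(url, b"scalar-summary")
    return read_bytes(url)
```

The calling code never branches on the storage provider; only the URL scheme changes, which is the property that makes this approach attractive for a SummaryWriter logdir.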