tensorflow / profiler

A profiling and performance analysis tool for TensorFlow
Apache License 2.0

Tensorboard profiler not working well with data from gcs bucket? #372

Open miclegr opened 2 years ago

miclegr commented 2 years ago

I'm running the Keras profiling notebook on Colab and everything works fine. Then I add a cell to log in to gcloud:

from google.colab import auth
auth.authenticate_user()
project_id = 'my-project-name'
bucket_name = 'my-bucket-name'
!gcloud config set project {project_id}

and change the logging path to a GCS path:

# Create a TensorBoard callback
logs = f"gs://{bucket_name}/logs/" + datetime.now().strftime("%Y%m%d-%H%M%S")

tboard_callback = tf.keras.callbacks.TensorBoard(log_dir = logs,
                                                 histogram_freq = 1,
                                                 profile_batch = '500,520')

model.fit(ds_train,
          epochs=2,
          validation_data=ds_test,
          callbacks = [tboard_callback])

Most of the time this works fine, but a few times I've gotten the "No profile data was found." page when browsing TensorBoard, even after refreshing.

Then I launch a TensorBoard session on my local machine with the GCS path as logdir:

tensorboard --logdir=gs://my-bucket-name/logs/20211224-151008

and I always get the "No profile data was found." page when browsing into tensorboard, even after refreshing.

Finally, I download the logging data from the GCS bucket into a directory on my local machine and start TensorBoard with that local path as logdir; it always shows the profile data.
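For anyone trying the same download workaround, it can help to confirm that the local copy actually contains profile traces before starting TensorBoard. The following is only a minimal sketch: it assumes traces live under `plugins/profile/<run>/<host>.xplane.pb` somewhere below the log directory (the layout reported in this thread), and the function name `find_profile_traces` and the example path are hypothetical.

```python
from pathlib import Path


def find_profile_traces(logdir):
    """Map each profile run directory name to its sorted *.xplane.pb file names.

    Assumes traces are stored as .../plugins/profile/<run>/<host>.xplane.pb
    at any depth below `logdir`.
    """
    root = Path(logdir)
    runs = {}
    if not root.is_dir():
        # Nothing downloaded (or wrong path): report no runs instead of failing.
        return runs
    for trace in sorted(root.rglob("plugins/profile/*/*.xplane.pb")):
        runs.setdefault(trace.parent.name, []).append(trace.name)
    return runs


if __name__ == "__main__":
    # Hypothetical local path: wherever the GCS log directory was downloaded to.
    for run, traces in find_profile_traces("./logs/20211224-151008").items():
        print(run, traces)
```

An empty result would suggest the traces were never written or never copied, rather than a display problem in TensorBoard.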

Similar to #330, but not quite the same.

tensorboard 2.7.0, tensorboard_plugin_profile 2.5.0

dkondoetsy commented 2 years ago

I'm experiencing a similar issue.

In my case, tensorboard is running in a k8s pod for profiling tfserving.

Tensorboard is run with the following command:

tensorboard --host 0.0.0.0 --load_fast=false --logdir=[my_gcs_bucket]

After clicking "Capture" from the tensorboard UI and sending requests to the TFServer, the Profile page doesn't show the profile results; it's as if the capture was never run. I verified that the gcs bucket has the xplane.pb trace files.

However, if I run tensorboard locally from my laptop pointing it to the gcs bucket, tensorboard locally does show the profile:

tensorboard --logdir=[my_gcs_bucket] --load_fast=false

Tensorboard version is 2.8.0, but the same issue occurs with version 2.4.1. The issue occurs both with --load_fast=false and without that flag (default set to true). Installed the latest version of tensorboard_plugin_profile: tensorboard_plugin_profile-2.5.0-py3-none-any.whl

Any fix or debugging tips would be greatly appreciated. Thank you.

dkondoetsy commented 2 years ago

Any input on this? We are setting up tensorboard in a large-scale k8s deployment (>1000 pods), and so being able to store event logs in GCS is crucial for enabling this.

I can reproduce the issue locally in Docker with the latest serving and tensorflow images and TensorBoard 2.7.0, and am happy to send my Dockerfiles if it helps. A local Docker container runs TensorBoard with a log directory in GCS. I tried running TensorBoard both with and without the --load_fast option enabled, but nothing appears in the Profile page (or any other page) after a profile capture.

Below is a list of files in GCS produced after a profiling run. Noticed a profile-empty file in the list:

gs://[...]/tensorboard/events.out.tfevents.1643293809.9c7022a74960.profile-empty
gs://[...]/tensorboard/plugins/profile/2022_01_27_14_30_08/tfserving_8500.xplane.pb

The file sizes are:

events.out.tfevents.1643293809.9c7022a74960.profile-empty: 40B
tfserving_8500.xplane.pb: 8.8MB

Here is the output of tensorboard inspect. Strangely, there are tags but no stats shown for each tag:

Found event files in:
gs://etsy-recsys-ml-dev-data-nxsn/user/dkondo/tensorboard

These tags are in gs://etsy-recsys-ml-dev-data-nxsn/user/dkondo/tensorboard:
audio -
histograms -
images -
scalars -
tensor -
======================================================================

Event statistics for gs://etsy-recsys-ml-dev-data-nxsn/user/dkondo/tensorboard:
audio -
graph -
histograms -
images -
scalars -
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
tensor -

The Docker container has access to the GCS bucket. I verified this by exec'ing into the container and using gsutil to list and read files in the bucket. Also, tensorboard inspect works on the bucket.

If I start a separate TensorBoard instance from the command line on my laptop, pointing to the same GCS bucket like so:

tensorboard --logdir gs://[...]/tensorboard --port 6007 --load_fast false

the profile results appear [after a log reload, clicked in the upper right hand corner of the UI].

dkondoetsy commented 2 years ago

I found that the issue occurs specifically with tensorboard-plugin-profile==2.5.0 combined with tensorboard 2.4.1 and 2.7.0 (and possibly other versions), but does not occur with tensorboard-plugin-profile==2.4.0.
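To check which combination an environment is running, something like the sketch below may be useful. It relies only on the standard-library `importlib.metadata`; the set of affected versions simply encodes the observation in this comment (2.5.0 shows the problem, 2.4.0 does not) and is an assumption, not a confirmed diagnosis. The helper names are hypothetical.

```python
from importlib import metadata

# Assumption taken from this thread: plugin 2.5.0 shows the problem, 2.4.0 does not.
AFFECTED_PLUGIN_VERSIONS = {"2.5.0"}


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_affected(plugin_version):
    """True if the given plugin version is one reported as problematic here."""
    return plugin_version in AFFECTED_PLUGIN_VERSIONS


if __name__ == "__main__":
    for name in ("tensorboard", "tensorboard-plugin-profile"):
        print(name, installed_version(name))
    if is_affected(installed_version("tensorboard-plugin-profile")):
        print("tensorboard-plugin-profile 2.5.0 detected; 2.4.0 is reported to work")
```

If the check flags 2.5.0, pinning the plugin with `pip install tensorboard-plugin-profile==2.4.0` corresponds to the version that reportedly works above.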