tensorflow / profiler

A profiling and performance analysis tool for TensorFlow
Apache License 2.0

"No trace event is collected" when using tensorboard / capture_tpu_profile #380

Open lackhole opened 2 years ago

lackhole commented 2 years ago

At first I was trying to profile BERT on a Google Cloud TPU VM (v3-8 | tpu-vm-tf-2.7.0), so I followed the guide while fine-tuning BERT.

But when I press capture, it says No trace event is collected, so I thought the problem might be specific to TPU and posted a question on StackOverflow. Full log below:

Starting to trace for 1000 ms. Remaining attempt(s): 3
No trace event is collected. Automatically retrying.

Starting to trace for 1000 ms. Remaining attempt(s): 2
No trace event is collected. Automatically retrying.

Starting to trace for 1000 ms. Remaining attempt(s): 1
No trace event is collected. Automatically retrying.

Starting to trace for 1000 ms. Remaining attempt(s): 0
No trace event is collected after 4 attempt(s). Perhaps, you want to try again (with more attempts?).
Tip: increase number of attempts with --num_tracing_attempts.

After that, I thought TensorBoard itself might be the problem, so I followed the Tensorflow Serving Readme on my personal PCs (macOS 10.15 / Ubuntu 18.04) using CPU, but both got stuck with the same error: No trace event is collected. Automatically retrying.
Original issue filed at Tensorboard Issue 5517

The output from diagnose_tensorboard.py is pasted at the original issue.

Note: the Tensorboard web UI shows the toast "Capture profile successfully. Please refresh.", but it disappears after about 0.5 sec and nothing changes after a refresh.

dmmolitor commented 2 years ago

Have you tried increasing the number of tracing attempts as suggested in the log? Similarly, you can try increasing the profile duration. The potential issues section of the guide has some suggestions for what could be going wrong here and some steps to try. In particular, make sure the TPU is running before capturing the trace.

lackhole commented 2 years ago

@dmmolitor Yes, I did. Since the error persisted, I changed the TPU architecture to Node and everything worked fine. So I guess there might be a bug with the non-Node architecture, since tensorboard itself cannot even profile the CPU on my laptop, as I mentioned above. Thank you.

dmmolitor commented 2 years ago

You are welcome. If your issue is resolved, could you please close the issue?

lackhole commented 2 years ago

@dmmolitor I don't think the issue is resolved, since profiling only works on a specific architecture. I'll leave it open.

chokkyvista commented 1 year ago

I also find myself unable to replicate https://cloud.google.com/tpu/docs/profile-tpu-vm#profile_tab in order to capture profiles on TPU VMs (TPU nodes work fine as @lackhole noted).

In my case, the Tensorboard web UI says "Failed to capture profile: empty trace result." and the tensorboard server records the following errors:

I tensorflow/core/profiler/rpc/client/profiler_client.cc:113] Asynchronous gRPC Profile() to localhost:6000
I tensorflow/core/profiler/rpc/client/remote_profiler_session_manager.cc:96] Issued Profile gRPC to 1 clients
I tensorflow/core/profiler/rpc/client/profiler_client.cc:131] Waiting for completion.
E tensorflow/core/profiler/rpc/client/profiler_client.cc:154] Unavailable: failed to connect to all addresses
W tensorflow/core/profiler/rpc/client/capture_profile.cc:133] No trace event is collected from localhost:6000
W tensorflow/core/profiler/rpc/client/capture_profile.cc:145] localhost:6000 returned Unavailable: failed to connect to all addresses

This doesn't look like something that will get resolved by increasing either the number of retries or the profiling duration 🤔
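
Since the log reports "failed to connect to all addresses" for localhost:6000, a quick way to narrow this down is to check whether anything is listening on that port at all. The following is a minimal sketch using only the Python standard library; the helper name `port_is_open` is hypothetical, and the host and port defaults are simply taken from the log above.

```python
import socket


def port_is_open(host: str = "localhost", port: int = 6000, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection raises OSError (refused, unreachable, timed out) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `port_is_open()` on the machine that runs the training job returns False when nothing accepts connections on localhost:6000, which would suggest that no profiler server has been started there. A True result only shows that some process is listening, not that it is the TF profiler service.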

I also tried the command-line tool capture_tpu_profile to no avail (I think it only works with TPU nodes).

And here's my TF setup for reference -

$ python3 -m pip list | grep -E 'tensor|cloud-tpu'
cloud-tpu-client              0.10
cloud-tpu-profiler            2.4.0
tensorboard                   2.6.0
tensorboard-data-server       0.6.1
tensorboard-plugin-profile    2.11.1
tensorboard-plugin-wit        1.8.1
tensorflow                    2.6.5
tensorflow-addons             0.16.1
tensorflow-datasets           4.8.2
tensorflow-estimator          2.6.0
tensorflow-hub                0.12.0
tensorflow-io                 0.30.0
tensorflow-io-gcs-filesystem  0.30.0
tensorflow-metadata           1.12.0
tensorflow-model-optimization 0.7.3
tensorflow-text               2.6.0

chokkyvista commented 1 year ago

As it turns out, the localhost:6000 returned Unavailable: failed to connect to all addresses error above was caused by my forgetting to start the TF profiler server, which is easily fixed by adding tf.profiler.experimental.server.start(6000) to the training script.
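
For anyone hitting the same error, here is a rough sketch of where that call might go in a training script. Only `tf.profiler.experimental.server.start(6000)` comes from the fix described above; the wrapper `start_profiler_server`, the placeholder `train` function, and `main` are hypothetical names used for illustration.

```python
def start_profiler_server(port: int = 6000) -> None:
    """Start the TensorFlow profiler gRPC server so TensorBoard can request traces."""
    # Imported lazily so this helper can be defined where TensorFlow is not installed.
    import tensorflow as tf

    tf.profiler.experimental.server.start(port)


def train(num_steps: int) -> None:
    """Hypothetical stand-in for the real training loop."""
    for _ in range(num_steps):
        pass  # model.fit(...) or a custom training step would run here


def main() -> None:
    # Start the server before training so a capture request can reach it.
    start_profiler_server(6000)
    train(num_steps=1000)
```

The port passed here has to match the one in the profile service address entered in the TensorBoard capture dialog (localhost:6000 in the logs above), and the capture has to be requested while training is still running.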

I was then able to see the following output from the training session, signalling a successful profile capture ✌

I tensorflow/core/profiler/lib/profiler_session.cc:131] Profiler session initializing.
W tensorflow/core/profiler/lib/profiler_session.cc:137] Profiling is late by 25154051 nanoseconds and will start immediately.
I tensorflow/core/profiler/lib/profiler_session.cc:146] Profiler session started.
I tensorflow/core/profiler/lib/profiler_session.cc:66] Profiler session collecting data.
I tensorflow/core/profiler/rpc/profiler_service_impl.cc:67] Collecting XSpace to repository: gs://.../plugins/profile/2023_02_03_20_17_09/localhost_6000.xplane.pb
I tensorflow/core/profiler/lib/profiler_session.cc:164] Profiler session tear down.

On the tensorboard server side though, there's a new error:

W tensorflow/core/profiler/convert/xplane_to_tools_data.cc:226] Can not find tool: tool_names. Please update to the latest version of Tensorflow.

which prevented the resulting xplane.pb from being correctly parsed and displayed. Downgrading tensorboard-plugin-profile from 2.11.1 to 2.8.0 to get it more aligned with tensorboard (2.6.0) proved effective 🎉
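
To spot this kind of version skew up front, something like the sketch below could list installed packages whose major.minor version differs from tensorboard's. It uses only `importlib.metadata` from the standard library; the function names are hypothetical, and the assumption that major.minor is the right granularity to compare is mine, not something the TensorFlow docs state.

```python
from importlib import metadata


def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '2.11.1'.

    Assumes the string has at least two dot-separated numeric parts.
    """
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def report_skew(versions, reference="tensorboard"):
    """List packages whose major.minor differs from the reference package."""
    ref = versions.get(reference)
    if ref is None:
        return []
    return [
        name
        for name, ver in versions.items()
        if ver is not None and major_minor(ver) != major_minor(ref)
    ]
```

For example, `report_skew(installed_versions(["tensorflow", "tensorboard", "tensorboard-plugin-profile"]))` would flag tensorboard-plugin-profile in the setup listed earlier (2.11.1 against tensorboard 2.6.0). A flagged package is only a hint: in this thread a 2.8.0 plugin worked with tensorboard 2.6.0, so an exact major.minor match is not strictly required.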