jax-ml / jax

Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more
http://jax.readthedocs.io/
Apache License 2.0

Viewing program traces with Perfetto: `ValueError: Invalid trace folder` #13009

Closed ayaka14732 closed 1 year ago

ayaka14732 commented 1 year ago

Description

Following https://jax.readthedocs.io/en/latest/profiling.html:

import jax

with jax.profiler.trace("/tmp/jax-trace", create_perfetto_link=True):
    # Run the operations to be profiled
    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (5000, 5000))
    y = x @ x
    y.block_until_ready()

Output:

2022-10-27 23:56:51.130358: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-10-27 23:56:51.192326: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
2022-10-27 23:56:51.762358: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-10-27 23:56:51.762440: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-10-27 23:56:51.762446: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2022-10-27 23:56:51.839141: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2022-10-27 23:56:51.839247: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2022-10-27 23:56:51.839264: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
Traceback (most recent call last):
  File "/nfs_share/bart-base-jax/2.py", line 3, in <module>
    with jax.profiler.trace("/tmp/jax-trace", create_perfetto_link=True):
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ayaka/.venv310/lib/python3.10/site-packages/jax/_src/profiler.py", line 236, in trace
    stop_trace()
  File "/home/ayaka/.venv310/lib/python3.10/site-packages/jax/_src/profiler.py", line 197, in stop_trace
    abs_filename = _write_perfetto_trace_file(_profile_state.log_dir)
  File "/home/ayaka/.venv310/lib/python3.10/site-packages/jax/_src/profiler.py", line 134, in _write_perfetto_trace_file
    raise ValueError(f"Invalid trace folder: {latest_folder}")
ValueError: Invalid trace folder: /tmp/jax-trace/plugins/profile/2022_10_27_23_56_58
Traceback (most recent call last):
  File "/nfs_share/bart-base-jax/2.py", line 3, in <module>
    with jax.profiler.trace("/tmp/jax-trace", create_perfetto_link=True):
  File "/usr/lib/python3.10/contextlib.py", line 142, in __exit__
    next(self.gen)
  File "/home/ayaka/.venv310/lib/python3.10/site-packages/jax/_src/profiler.py", line 236, in trace
    stop_trace()
  File "/home/ayaka/.venv310/lib/python3.10/site-packages/jax/_src/profiler.py", line 197, in stop_trace
    abs_filename = _write_perfetto_trace_file(_profile_state.log_dir)
  File "/home/ayaka/.venv310/lib/python3.10/site-packages/jax/_src/profiler.py", line 134, in _write_perfetto_trace_file
    raise ValueError(f"Invalid trace folder: {latest_folder}")
ValueError: Invalid trace folder: /tmp/jax-trace/plugins/profile/2022_10_27_23_57_00

What jax/jaxlib version are you using?

jax v0.3.23, jaxlib v0.3.22, tensorflow v2.11.0rc1 (compatible with jaxlib)

Which accelerator(s) are you using?

TPU v4-16

Additional system info

Python 3.10.8, Linux 5.8.0-1035-gcp

NVIDIA GPU info

No response

sharadmv commented 1 year ago

I think the TensorFlow profiler is no longer dumping trace.json. I'll investigate and hopefully figure out what happened.

samuelpmish commented 1 year ago

I'm hitting this problem as well. I'm not sure if this is helpful, but when I look in the "invalid trace folder", I see only a single .pb file (and no trace.json):

...
ValueError: Invalid trace folder: /tmp/jax-trace/plugins/profile/2022_10_31_09_53_24

$ cd /tmp/jax-trace/plugins/profile/2022_10_31_09_53_24
$ ls
my_computer_name.xplane.pb
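
The check described above (an .xplane.pb file is present but no trace.json) can be scripted. The following is a minimal sketch, not part of JAX: the helper name `summarize_trace_folder` is hypothetical, and the file-name suffixes are assumptions taken from the names mentioned in this thread (`xplane.pb`, `trace.json`, `trace.json.gz`).

```python
from pathlib import Path


def summarize_trace_folder(folder):
    """Group the files in one profiler run folder by kind.

    Hypothetical helper: the suffixes below are assumptions based on the
    file names mentioned in this thread, not a documented JAX contract.
    """
    names = sorted(p.name for p in Path(folder).iterdir() if p.is_file())
    return {
        # Raw profile data written by the profiler.
        "xplane": [n for n in names if n.endswith(".xplane.pb")],
        # JSON trace that the Perfetto link is built from.
        "perfetto_json": [
            n for n in names if n.endswith(("trace.json", "trace.json.gz"))
        ],
    }


# Example call (path copied from the error message above):
# summarize_trace_folder("/tmp/jax-trace/plugins/profile/2022_10_31_09_53_24")
```

An empty `perfetto_json` list alongside a non-empty `xplane` list would match the situation reported here.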

jamesheald commented 1 year ago

I'm experiencing the same problem. Have there been any advances on this?

sharadmv commented 1 year ago

We are currently working on automatically parsing the xplane.pb file and uploading it to Perfetto. Will update this thread when it's done! We have a Thanksgiving holiday next week so hopefully we'll have something to show the week after (cc: @pschuh)

minqi commented 1 year ago

+1

akbir commented 1 year ago

+1

mvsoom commented 1 year ago

Any news?

sharadmv commented 1 year ago

Apologies for the lack of updates! Both Parker and I have been on largely nonoverlapping vacations for the last month or so.

I'm back next week so will hopefully have something for you then.

markschoene commented 1 year ago

Running into the same error. With the Perfetto flags set to False, the program executes and a single file, ...xplane.pb, is created. However, TensorBoard does not recognize the created file.

Running tensorboard --inspect --event_file=plugins/profile/2023_01_22_17_22_49/taurusi8017.xplane.pb yields the following output, which suggests that the file is empty:

======================================================================
Processing event files... (this can take a few minutes)
======================================================================

These tags are in plugins/profile/2023_01_22_17_22_49/taurusi8017.xplane.pb:
audio -
histograms -
images -
scalars -
tensor -
======================================================================

Event statistics for plugins/profile/2023_01_22_17_22_49/taurusi8017.xplane.pb:
audio -
graph -
histograms -
images -
scalars -
sessionlog:checkpoint -
sessionlog:start -
sessionlog:stop -
tensor -
======================================================================

Using

jax                               0.4.1
jaxlib                            0.4.1+cuda11.cudnn86

tensorboard                       2.9.1
tensorboard-data-server           0.6.1
tensorboard-plugin-profile        2.8.0
tensorboard-plugin-wit            1.8.1
tensorflow                        2.9.1

sharadmv commented 1 year ago

A quick update: @pschuh has made progress on reviving the old code that generated the trace.json.gz that was uploaded to Perfetto. Once that lands, and we cut a jaxlib release, Perfetto should work again!

sharadmv commented 1 year ago

> A single file is created ...xplane.pb However, Tensorboard does not recognize the created file.

This seems right: as of now, the profiler only generates the xplane.pb file. However, TensorBoard should recognize it. Did you try pointing --logdir at the log directory you passed to the profiler?
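
To illustrate the --logdir suggestion: the run folders in this thread all have the shape <logdir>/plugins/profile/<run>, so the directory to pass is three levels above the .xplane.pb file's folder. The sketch below assumes that layout; `logdir_for_run` and `tensorboard_command` are hypothetical helper names, and it also assumes the tensorboard-plugin-profile package is installed so that TensorBoard can read the profile.

```python
from pathlib import Path


def logdir_for_run(run_folder):
    """Return the log directory that contains plugins/profile/<run>.

    Assumes the <logdir>/plugins/profile/<run> layout seen in this thread.
    """
    run = Path(run_folder)
    if run.parent.name != "profile" or run.parent.parent.name != "plugins":
        raise ValueError(f"Not a plugins/profile/<run> folder: {run}")
    return run.parent.parent.parent


def tensorboard_command(run_folder):
    """Build the TensorBoard command line for a profiler run folder."""
    return ["tensorboard", f"--logdir={logdir_for_run(run_folder)}"]


# Example (path shape taken from the thread):
# tensorboard_command("/tmp/jax-trace/plugins/profile/2023_01_22_17_22_49")
```

The resulting list can be passed to `subprocess.run`, or joined with spaces and pasted into a shell.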

sharadmv commented 1 year ago

It's back! https://github.com/tensorflow/tensorflow/commit/b1dfc9285409bd9cb07f4598737450773daec573

We should be cutting a release soon so I will update the thread when that's out

minqi commented 1 year ago

Hi there, I was wondering if there are any updates on this front?

sharadmv commented 1 year ago

Ah, sorry, I forgot to update the thread. I think it should work with the latest JAX.

zigzagcai commented 10 months ago

I also ran into this problem and found that it can be resolved by compiling and reinstalling jaxlib with tensorflow>=2.12.
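
Since the fix depends on which package versions are installed, it can help to list them before and after upgrading. This is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list (taken from the packages mentioned in this thread) are assumptions for illustration.

```python
from importlib import metadata


def installed_versions(
    packages=(
        "jax",
        "jaxlib",
        "tensorflow",
        "tensorboard",
        "tensorboard-plugin-profile",
    ),
):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in the current environment.
            versions[name] = None
    return versions


# Example:
# for name, version in installed_versions().items():
#     print(f"{name:30} {version}")
```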