The XLA:GPU profiler segfaults when CUPTI initialization fails:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff0401cc7e in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
(gdb) bt
#0 0x00007fff0401cc7e in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#1 0x00007ffeff73a904 in xla::profiler::CuptiActivityBufferManager::AddCachedActivityEventsTo(xla::profiler::CuptiEventCollectorDelegate&, unsigned long, unsigned long&)
() from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#2 0x00007ffeff73355e in xla::profiler::CuptiTraceCollector::OnTracerCachedActivityBuffers(std::unique_ptr<xla::profiler::CuptiActivityBufferManager, std::default_delete<xla::profiler::CuptiActivityBufferManager> >) () from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#3 0x00007ffeff7340cd in xla::profiler::CuptiTraceCollectorImpl::Export(tensorflow::profiler::XSpace*, unsigned long) ()
from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#4 0x00007ffeff72a5c8 in xla::profiler::GpuTracer::CollectData(tensorflow::profiler::XSpace*) ()
from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#5 0x00007ffeff74d51b in tsl::profiler::ProfilerController::CollectData(tensorflow::profiler::XSpace*) ()
from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#6 0x00007ffeff74c427 in tsl::profiler::ProfilerCollection::CollectData(tensorflow::profiler::XSpace*) ()
from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#7 0x00007ffeff74bef8 in xla::profiler::PLUGIN_Profiler_CollectData(PLUGIN_Profiler_CollectData_Args*) ()
from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#8 0x00007fff49e9f2e1 in xla::profiler::PluginTracer::CollectData(tensorflow::profiler::XSpace*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#9 0x00007fff4a837efb in tsl::profiler::ProfilerController::CollectData(tensorflow::profiler::XSpace*) ()
from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#10 0x00007fff4a836e37 in tsl::profiler::ProfilerCollection::CollectData(tensorflow::profiler::XSpace*) ()
Hi,
The segfault is caused by an uninitialized activity_buffers_ here: https://cs.opensource.google/tensorflow/tensorflow/+/master:third_party/xla/xla/backends/profiler/gpu/cupti_collector.cc;drc=17cedabb755224148be9854551d4efd172af10e5;l=630
activity_buffers_ is only initialized when https://cs.opensource.google/tensorflow/tensorflow/+/master:third_party/xla/xla/backends/profiler/gpu/cupti_tracer.cc;drc=17cedabb755224148be9854551d4efd172af10e5;l=1314 is called.
When CUPTI fails to initialize, that function is never called, so the library ends up using an unconstructed object at collection time, which leads to the segfault.
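To illustrate the failure mode and one possible guard, here is a minimal, self-contained sketch. It is not the XLA code: `BufferManager`, `Tracer`, and `ExportCachedEvents` are hypothetical stand-ins, and it assumes the buffer manager is held in a `std::unique_ptr` that stays null when initialization fails (the backtrace shows a `std::unique_ptr<CuptiActivityBufferManager>` being passed to the collector, but the exact state of the object in the real code may differ).

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for CuptiActivityBufferManager: a mutex-guarded
// cache of activity records (plain ints here for brevity).
class BufferManager {
 public:
  void Cache(int event) {
    std::lock_guard<std::mutex> lock(mu_);
    events_.push_back(event);
  }

  // Appends all cached events to `out` and returns how many were moved.
  std::size_t DrainTo(std::vector<int>& out) {
    std::lock_guard<std::mutex> lock(mu_);  // Needs a constructed object.
    const std::size_t n = events_.size();
    out.insert(out.end(), events_.begin(), events_.end());
    events_.clear();
    return n;
  }

 private:
  std::mutex mu_;
  std::vector<int> events_;
};

// Hypothetical stand-in for the tracer: the buffer manager is only created
// when Enable() succeeds, mirroring the situation described in the report.
class Tracer {
 public:
  bool Enable(bool cupti_init_ok) {
    if (!cupti_init_ok) return false;  // activity_buffers_ stays null.
    activity_buffers_ = std::make_unique<BufferManager>();
    return true;
  }

  BufferManager* buffers() { return activity_buffers_.get(); }

  std::unique_ptr<BufferManager> TakeBuffers() {
    return std::move(activity_buffers_);
  }

 private:
  std::unique_ptr<BufferManager> activity_buffers_;
};

// Collector-side guard: skip the export when the tracer never created its
// buffers, instead of locking a mutex inside an object that does not exist.
std::size_t ExportCachedEvents(std::unique_ptr<BufferManager> buffers,
                               std::vector<int>& out) {
  if (buffers == nullptr) return 0;
  return buffers->DrainTo(out);
}
```

With this guard, a tracer whose `Enable(false)` call failed can still be asked for its data: `ExportCachedEvents(tracer.TakeBuffers(), out)` returns 0 and leaves `out` untouched. An alternative would be to construct the buffer manager unconditionally in the tracer's constructor; which of the two fits the real code better is a judgment call for the maintainers.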