openxla / xla

A machine learning compiler for GPUs, CPUs, and ML accelerators
Apache License 2.0

Bug: Segfault when CUPTI is not correctly initialized. #13853

Open yliu120 opened 3 months ago

yliu120 commented 3 months ago

Hi,

The XLA:GPU profiler segfaults when CUPTI initialization fails:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fff0401cc7e in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
(gdb) bt
#0  0x00007fff0401cc7e in nsync::nsync_mu_lock(nsync::nsync_mu_s_*) () from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#1  0x00007ffeff73a904 in xla::profiler::CuptiActivityBufferManager::AddCachedActivityEventsTo(xla::profiler::CuptiEventCollectorDelegate&, unsigned long, unsigned long&)
    () from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#2  0x00007ffeff73355e in xla::profiler::CuptiTraceCollector::OnTracerCachedActivityBuffers(std::unique_ptr<xla::profiler::CuptiActivityBufferManager, std::default_delete<xla::profiler::CuptiActivityBufferManager> >) () from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#3  0x00007ffeff7340cd in xla::profiler::CuptiTraceCollectorImpl::Export(tensorflow::profiler::XSpace*, unsigned long) ()
   from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#4  0x00007ffeff72a5c8 in xla::profiler::GpuTracer::CollectData(tensorflow::profiler::XSpace*) ()
   from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#5  0x00007ffeff74d51b in tsl::profiler::ProfilerController::CollectData(tensorflow::profiler::XSpace*) ()
   from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#6  0x00007ffeff74c427 in tsl::profiler::ProfilerCollection::CollectData(tensorflow::profiler::XSpace*) ()
   from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#7  0x00007ffeff74bef8 in xla::profiler::PLUGIN_Profiler_CollectData(PLUGIN_Profiler_CollectData_Args*) ()
   from /usr/local/lib/python3.10/dist-packages/jax_plugins/xla_cuda12/xla_cuda_plugin.so
#8  0x00007fff49e9f2e1 in xla::profiler::PluginTracer::CollectData(tensorflow::profiler::XSpace*) () from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#9  0x00007fff4a837efb in tsl::profiler::ProfilerController::CollectData(tensorflow::profiler::XSpace*) ()
   from /usr/local/lib/python3.10/dist-packages/jaxlib/xla_extension.so
#10 0x00007fff4a836e37 in tsl::profiler::ProfilerCollection::CollectData(tensorflow::profiler::XSpace*) ()

The segfault is caused by an uninitialized activity_buffers_ here: https://cs.opensource.google/tensorflow/tensorflow/+/master:third_party/xla/xla/backends/profiler/gpu/cupti_collector.cc;drc=17cedabb755224148be9854551d4efd172af10e5;l=630

activity_buffers_ is only initialized when https://cs.opensource.google/tensorflow/tensorflow/+/master:third_party/xla/xla/backends/profiler/gpu/cupti_tracer.cc;drc=17cedabb755224148be9854551d4efd172af10e5;l=1314 is called.

But when CUPTI fails to initialize, this function is never called. The library then uses an object that was never constructed, which leads to the segfault.
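
To illustrate the failure mode described above, here is a minimal sketch of the pattern, not the actual XLA code or a proposed patch. `BufferManager`, `Tracer`, `Enable`, `Record`, and `Collect` are hypothetical names invented for this example. The manager is only constructed on the successful-initialization path, so the collection path has to check for a missing manager before taking its mutex:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a mutex-guarded activity buffer manager.
class BufferManager {
 public:
  void Add(int event) {
    std::lock_guard<std::mutex> lock(mu_);
    events_.push_back(event);
  }

  // Moves all cached events into `out` and returns how many were moved.
  std::size_t Drain(std::vector<int>& out) {
    std::lock_guard<std::mutex> lock(mu_);
    const std::size_t n = events_.size();
    out.insert(out.end(), events_.begin(), events_.end());
    events_.clear();
    return n;
  }

 private:
  std::mutex mu_;
  std::vector<int> events_;
};

// Hypothetical tracer: the buffer manager exists only after Enable succeeds.
class Tracer {
 public:
  // `init_ok` models whether the underlying profiling library initialized.
  bool Enable(bool init_ok) {
    if (!init_ok) return false;  // buffers_ stays null on this path
    buffers_ = std::make_unique<BufferManager>();
    return true;
  }

  void Record(int event) {
    if (buffers_) buffers_->Add(event);
  }

  // Guarded collection: without this null check, Drain would lock a mutex
  // inside an object that was never constructed.
  std::size_t Collect(std::vector<int>& out) {
    if (!buffers_) return 0;
    return buffers_->Drain(out);
  }

 private:
  std::unique_ptr<BufferManager> buffers_;
};
```

With this guard, calling `Collect` on a `Tracer` whose `Enable(false)` returned false simply reports zero events instead of dereferencing a null manager. Whether the real fix belongs in the collector, the tracer, or the caller is a separate question that this sketch does not answer.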

cheshire commented 3 months ago

Are there repro steps?