StableHLO PyBind11 modules links directly to MLIR C++ libraries, can break downstream usage

christopherbate commented 1 month ago

What happened?

The StablehloPythonExtension module is built differently than other MLIR PyBind libraries. This seems to open up opportunity for a class of bugs which can be a real headache to solve. Apologies for the winding explanation below; I've tried to describe the issue as I understand it (but could be wrong on various points, I'm not an expert on the MLIR PyBind11 bindings):

No MLIR dialect PyBind11 library in upstream MLIR seems to reference mlir:: namespace symbols directly (outside of mlir::python, which is special) like how the Stablehlo Python Python module is doing here: https://github.com/openxla/stablehlo/blob/main/stablehlo/integrations/python/StablehloModule.cpp#L541

Upstream MLIR PyBind11 libraries (declared using declare_mlir_python_extension) link to MLIR only through the C API interface. Based on my understanding, this is done in order to eliminate potential compatibility issues (the PyBind11 modules allowed to be built with a set of C++ compilation options which are divorced from the rest of the MLIR/LLVM project compilation settings). In addition, the CMake build system bundles all the underlying required MLIR and LLVM libraries into a monolithic .so file (generated with a name like libMLIRPythonCompilerCLibrary.so under the _mlir_libs directory). For each MLIR dialect that declares a python extension library using PyBind11, the build system creates a separate DSO which links to the monolithic library to resolve the MLIR C API calls.

In StableHLO's declaration of its PyBind11 extension library (https://github.com/openxla/stablehlo/blob/main/stablehlo/integrations/python/CMakeLists.txt#L94), a number of libraries are linked under PRIVATE_LINK_LIBS. These libraries have transitive dependence directly on MLIR C++ libraries like MLIRIR and the StablehloDialect targets, without going through the MLIR C API. This causes symbols to be included in the generated _stablehlo.[pyversion/platform].so library that probably shouldn't be there, which can be seen by dumping symbols with a tool like nm. It shouldn't have any C++ symbols under mlir:: namespace that aren't under ::mlir::python. I see lots of suspect symbols from MLIR namespace included in the text section of the binary. If this were OK, then why go through trouble of creating C API?

This issue was referenced under https://github.com/openxla/stablehlo/issues/2170, but it looks like in their case the library ended up functioning correctly.

However, in my downstream project, this difference broke the Debug build for my project in a way that was difficult to debug. The issue is masked in non-debug builds. When the CMake Debug profile is enabled, the compiler includes static initialization code for static llvm::cl::opt<...> variables from certain portions of LLVM into the StableHLO python extension library. This code seems to find its way into the StableHLO PyBind11 extension library transitively through the libraries linked in through PRIVATE_LINK_LIBS of StablehloPythonExtension mentioned earlier. Therefore, when the StableHLO python module is loaded, a crash will result since it will hit an assertion in LLVM complaining about registering multiple cl::opt options with the same name.

Steps to reproduce your issue

No response

Version information

No response

christopherbate commented 1 month ago

Another data point: StablehloModule.cpp includes MLIR header mlir/CAPI/IR.h here https://github.com/openxla/stablehlo/blob/main/stablehlo/integrations/python/StablehloModule.cpp#L18.

At the top of mlir/CAPI/IR.h, it says:

//===----------------------------------------------------------------------===//
//
// This file contains declarations of implementation details of the C API for
// core MLIR classes. This file should not be included from C++ code other than
// C API implementation nor from C code.
//
//===----------------------------------------------------------------------===//

It defines the wrap() and unwrap() functions for bridging MLIR C API objects with the MLIR C++ API within the C API implementation. It's being directly used in StablehloModule.cpp, defeating the purpose of having a C API in the first place.

The solution to all this is just that the interpreter API for stablehlo::evalModule, mlir::stablehlo::(serialize|deserialize)PortableArtifact , and the portable API needs to be wrapped in a C API that accepts MLIR C API objects.

Can someone on StableHLO team help out in near term? Otherwise I can take an attempt to solve this week.

GleasonK commented 1 month ago

This causes symbols to be included in the generated _stablehlo.[pyversion/platform].so library that probably shouldn't be there, which can be seen by dumping symbols with a tool like nm. It shouldn't have any C++ symbols under mlir:: namespace that aren't under ::mlir::python. I see lots of suspect symbols from MLIR namespace included in the text section of the binary. If this were OK, then why go through trouble of creating C API?

Appreciate the detailed writeup! I haven't worked much with CAPI so don't know much about the pitfalls, also and things have seemed to work as-is so we've definitely taken the "then why go through trouble of creating C API" approach mentioned.

If all that's needed is that CAPI bridge, we could likely take this on - but as of now don't have a reliable way to repro the issue to know if we've resolved it. If you have some time to write this up that would be great! Otherwise we can try and get some cycles to get this done in the coming weeks and followup to see if it works.

christopherbate commented 1 month ago

I think building in Debug CMake profile may produce bugs when running the Python tests, but I'll have to check.

If not, then at least running on Linux nm -C [stablehlo python extension DSO] | grep 't mlir::' and checking for mlir:: symbols not in mlir::python is one possible check.

openxla / stablehlo