Open christopherbate opened 1 month ago
Another data point: StablehloModule.cpp
includes MLIR header mlir/CAPI/IR.h
here https://github.com/openxla/stablehlo/blob/main/stablehlo/integrations/python/StablehloModule.cpp#L18.
At the top of mlir/CAPI/IR.h
, it says:
//===----------------------------------------------------------------------===//
//
// This file contains declarations of implementation details of the C API for
// core MLIR classes. This file should not be included from C++ code other than
// C API implementation nor from C code.
//
//===----------------------------------------------------------------------===//
It defines the wrap()
and unwrap()
functions for bridging MLIR C API objects with the MLIR C++ API within the C API implementation. It's being directly used in StablehloModule.cpp
, defeating the purpose of having a C API in the first place.
The solution to all this is just that the interpreter API for stablehlo::evalModule
, mlir::stablehlo::(serialize|deserialize)PortableArtifact
, and the portable API needs to be wrapped in a C API that accepts MLIR C API objects.
Can someone on StableHLO team help out in near term? Otherwise I can take an attempt to solve this week.
This causes symbols to be included in the generated _stablehlo.[pyversion/platform].so library that probably shouldn't be there, which can be seen by dumping symbols with a tool like nm. It shouldn't have any C++ symbols under mlir:: namespace that aren't under ::mlir::python. I see lots of suspect symbols from MLIR namespace included in the text section of the binary. If this were OK, then why go through trouble of creating C API?
Appreciate the detailed writeup! I haven't worked much with CAPI so don't know much about the pitfalls, also and things have seemed to work as-is so we've definitely taken the "then why go through trouble of creating C API" approach mentioned.
If all that's needed is that CAPI bridge, we could likely take this on - but as of now don't have a reliable way to repro the issue to know if we've resolved it. If you have some time to write this up that would be great! Otherwise we can try and get some cycles to get this done in the coming weeks and followup to see if it works.
I think building in Debug CMake profile may produce bugs when running the Python tests, but I'll have to check.
If not, then at least running on Linux nm -C [stablehlo python extension DSO] | grep 't mlir::'
and checking for mlir::
symbols not in mlir::python
is one possible check.
What happened?
The
StablehloPythonExtension
module is built differently than other MLIR PyBind libraries. This seems to open up opportunity for a class of bugs which can be a real headache to solve. Apologies for the winding explanation below; I've tried to describe the issue as I understand it (but could be wrong on various points, I'm not an expert on the MLIR PyBind11 bindings):No MLIR dialect PyBind11 library in upstream MLIR seems to reference
mlir::
namespace symbols directly (outside ofmlir::python
, which is special) like how the Stablehlo Python Python module is doing here: https://github.com/openxla/stablehlo/blob/main/stablehlo/integrations/python/StablehloModule.cpp#L541Upstream MLIR PyBind11 libraries (declared using
declare_mlir_python_extension
) link to MLIR only through the C API interface. Based on my understanding, this is done in order to eliminate potential compatibility issues (the PyBind11 modules allowed to be built with a set of C++ compilation options which are divorced from the rest of the MLIR/LLVM project compilation settings). In addition, the CMake build system bundles all the underlying required MLIR and LLVM libraries into a monolithic .so file (generated with a name likelibMLIRPythonCompilerCLibrary.so
under the_mlir_libs
directory). For each MLIR dialect that declares a python extension library using PyBind11, the build system creates a separate DSO which links to the monolithic library to resolve the MLIR C API calls.In StableHLO's declaration of its PyBind11 extension library (https://github.com/openxla/stablehlo/blob/main/stablehlo/integrations/python/CMakeLists.txt#L94), a number of libraries are linked under
PRIVATE_LINK_LIBS
. These libraries have transitive dependence directly on MLIR C++ libraries likeMLIRIR
and theStablehloDialect
targets, without going through the MLIR C API. This causes symbols to be included in the generated_stablehlo.[pyversion/platform].so
library that probably shouldn't be there, which can be seen by dumping symbols with a tool likenm
. It shouldn't have any C++ symbols undermlir::
namespace that aren't under::mlir::python
. I see lots of suspect symbols from MLIR namespace included in the text section of the binary. If this were OK, then why go through trouble of creating C API?This issue was referenced under https://github.com/openxla/stablehlo/issues/2170, but it looks like in their case the library ended up functioning correctly.
However, in my downstream project, this difference broke the Debug build for my project in a way that was difficult to debug. The issue is masked in non-debug builds. When the CMake Debug profile is enabled, the compiler includes static initialization code for
static llvm::cl::opt<...>
variables from certain portions of LLVM into the StableHLO python extension library. This code seems to find its way into the StableHLO PyBind11 extension library transitively through the libraries linked in throughPRIVATE_LINK_LIBS
ofStablehloPythonExtension
mentioned earlier. Therefore, when the StableHLO python module is loaded, a crash will result since it will hit an assertion in LLVM complaining about registering multiplecl::opt
options with the same name.Steps to reproduce your issue
No response
Version information
No response