rust-lang / rust-bindgen

Automatically generates Rust FFI bindings to C (and some C++) libraries.
https://rust-lang.github.io/rust-bindgen/
BSD 3-Clause "New" or "Revised" License
4.46k stars 694 forks source link

Add an option to override the representation of enums #1907

Open nbigaouette opened 4 years ago

nbigaouette commented 4 years ago

In https://github.com/nbigaouette/onnxruntime-rs I use bindgen in onnxruntime-sys/build.rs to generate bindings on different platforms (macos, linux, windows).

The original C header file is https://github.com/microsoft/onnxruntime/blob/v1.5.2/include/onnxruntime/core/session/onnxruntime_c_api.h (which is quite large).

This header contains a couple of enums, for example OrtLoggingLevel that gets wrapper as u32: https://docs.rs/onnxruntime-sys/0.0.9/onnxruntime_sys/type.OrtLoggingLevel.html

But on Windows, I get i32s for them (docs.rs doesn't contain windows values, yet).

Because of this I have trouble creating an API that can work both on Windows an other platforms.

Input C/C++ Header

wrapper.h

#include "onnxruntime_c_api.h"

onnxruntime_c_api.h

typedef enum OrtLoggingLevel {
  ORT_LOGGING_LEVEL_VERBOSE,
  ORT_LOGGING_LEVEL_INFO,
  ORT_LOGGING_LEVEL_WARNING,
  ORT_LOGGING_LEVEL_ERROR,
  ORT_LOGGING_LEVEL_FATAL,
} OrtLoggingLevel;

Bindgen Invocation

On macOS:

$ bindgen onnxruntime-sys/wrapper.h --whitelist-type OrtLoggingLevel -- -I target/debug/build/onnxruntime-sys-4dd6b75a91cb40e3/out/onnxruntime/onnxruntime-osx-x64-1.5.2/include

On Windows:

$ bindgen onnxruntime-sys/wrapper.h --whitelist-type OrtLoggingLevel -- -I C:\cargo_target\debug\b
uild\onnxruntime-sys-6a0af3699f92fd5c\out\onnxruntime\onnxruntime-win-x64-1.5.2\include

Actual Results

On macOS:

/* automatically generated by rust-bindgen 0.55.1 */

pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_VERBOSE: OrtLoggingLevel = 0;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_INFO: OrtLoggingLevel = 1;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_WARNING: OrtLoggingLevel = 2;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_ERROR: OrtLoggingLevel = 3;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_FATAL: OrtLoggingLevel = 4;
pub type OrtLoggingLevel = ::std::os::raw::c_uint;

On Windows:

/* automatically generated by rust-bindgen 0.55.1 */

pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_VERBOSE: OrtLoggingLevel = 0;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_INFO: OrtLoggingLevel = 1;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_WARNING: OrtLoggingLevel = 2;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_ERROR: OrtLoggingLevel = 3;
pub const OrtLoggingLevel_ORT_LOGGING_LEVEL_FATAL: OrtLoggingLevel = 4;
pub type OrtLoggingLevel = ::std::os::raw::c_int;

Using --rustified-enum="*" I get:

❯ bindgen onnxruntime-sys/wrapper.h --rustified-enum="*" --whitelist-type OrtLoggingLevel -- -I target/debug/build/onnxruntime-sys-4dd6b75a91cb40e3/out/onnxruntime/onnxruntime-osx-x64-1.5.2/include
/* automatically generated by rust-bindgen 0.55.1 */

#[repr(u32)]
#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub enum OrtLoggingLevel {
    ORT_LOGGING_LEVEL_VERBOSE = 0,
    ORT_LOGGING_LEVEL_INFO = 1,
    ORT_LOGGING_LEVEL_WARNING = 2,
    ORT_LOGGING_LEVEL_ERROR = 3,
    ORT_LOGGING_LEVEL_FATAL = 4,
}
PS Microsoft.PowerShell.Core\FileSystem::\\VBOXSVR\onnxruntime> bindgen onnxruntime-sys/wrapper.h --rustified-enum="*" --whitelist-type OrtLoggingLevel -- -I C:
\cargo_target\debug\build\onnxruntime-sys-6a0af3699f92fd5c\out\onnxruntime\onnxruntime-win-x64-1.5.2\include
/* automatically generated by rust-bindgen 0.55.1 */

#[repr(i32)]
#[derive(Debug, Copy, Clone, PartialEq, Eq, Hash)]
pub enum OrtLoggingLevel {
    ORT_LOGGING_LEVEL_VERBOSE = 0,
    ORT_LOGGING_LEVEL_INFO = 1,
    ORT_LOGGING_LEVEL_WARNING = 2,
    ORT_LOGGING_LEVEL_ERROR = 3,
    ORT_LOGGING_LEVEL_FATAL = 4,
}

Expected Results

Note how macOS gives me a std::os::raw::c_uint and Windows gives me a std::os::raw::c_int (or #[repr(u32)] vs #[repr(i32)]) for the enum representation.

Why does bindgen generates different code on macOS/Linux and Windows? Can I tell bindgen to use std::os::raw::c_int everywhere? Or i64, or whatever else?

Thank you!

emilio commented 4 years ago

Because that's what the underlying platform is using, see https://github.com/rust-lang/rust-bindgen/issues/1361. So that part of the report is invalid (MSVC is weird here :/).

A callback or such to override the enum repr may be worth it as an option, but note that as soon as you do that (if you make it i64 for example) then structs containing that enum can break and so on, so it's a bit of a footgun. Repurposing this to track that feature request.

ThatGeoGuy commented 3 years ago

Bindgen seems to be doing the wrong thing based on the C standard.

I understand that making things i64 everywhere could be problematic. enums in C aren't 64-bits unless int is 64 bit (more on this later), so converting enum ordinals as i64 would be a bad idea.

On the other hand, MSVC is not the weird one in this case. It is the Unix behaviour that seems wrong. First, lets look at the C standard for what type enums should be:

In both cases you're looking for Section 6.7.2.2, paragraph 3 - Semantics. There it writes (and I quote directly here):

The identifiers in an enumerator list are declared as constants that have type int and may appear wherever such are permitted.

They always have type int in C, which is a signed type. So in all cases on x86_64 (which I assume is what we're targeting here) we would want bindgen to do the correct thing and treat these as i32. On other platforms this may change as int may be i16 on some older platforms. If such a platform existed that int was i64, then perhaps we might consider i64 a correct representation for that platform.

Beyond that, Modern C provides similar advice:

Takeaway 1.5.6.6 Enumeration constants are of type signed int

I'm not sure why u32 is being purported as "correct" behaviour by bindgen here. I think given the above (and the fact that enums can contain signed ordinals), bindgen should reconsider and represent everything as i32 constants on x86_64. If we're being a bit more particular, it should instead look at what type signed int actually compares to, and use that.

tgross35 commented 2 years ago

Why is #[repr(C)] not used here? From the nomicon:

repr(C) is equivalent to one of repr(u*) (see the next section) for fieldless enums. The chosen size is the default enum size for the target platform's C application binary interface (ABI). Note that enum representation in C is implementation defined, so this is really a "best guess". In particular, this may be incorrect when the C code of interest is compiled with certain flags.

As noted, a C enum is always the size of int, so [repr(C)] seems like a better choice than [repr(i32)]. In fact, that would make it safe to make the default enum style be rust or rust_non_exhaustive, instead of const, which I would strongly vouch for (much cleaner and a better representation of the original code).

pvdrz commented 1 year ago

Also https://github.com/rust-lang/rust-bindgen/issues/1966 is somewhat related to this.

pvdrz commented 1 year ago

Why is #[repr(C)] not used here? From the nomicon:

repr(C) is equivalent to one of repr(u*) (see the next section) for fieldless enums. The chosen size is the default enum size for the target platform's C application binary interface (ABI). Note that enum representation in C is implementation defined, so this is really a "best guess". In particular, this may be incorrect when the C code of interest is compiled with certain flags.

As noted, a C enum is always the size of int, so [repr(C)] seems like a better choice than [repr(i32)]. In fact, that would make it safe to make the default enum style be rust or rust_non_exhaustive, instead of const, which I would strongly vouch for (much cleaner and a better representation of the original code).

we could add a flag to switch between #[repr(C)] and #[repr(whatever_int_is_for_this_target)]. Would that be a sensible option? @tgross35 @emilio ?

tgross35 commented 1 year ago

That would be great for my case, just another thing that works a little bit better cross platform without necessarily needing to regenerate bindings.

Is there any reason not to make this option the default for rust style enums?

pvdrz commented 1 year ago

I don't have strong arguments against enabling #[repr(C)] by default. The only thing I can see on the horizon would be that someone is relying on it being #[repr(i32)] for some odd reason.

pvdrz commented 1 year ago

One "open question" that I see here is if we can use #[repr(C)] for every C enum or not.

Apparently, C23 allows to specify the underlying type of an enum. So, I guess, in this particular case, we should use #[repr(the_int_that_c_said)] instead of #[repr(C)].

I'm not sure if there's another way this feature could break something or not.

Edit: An alternative would be to leave this reasoning to the user :trollface:. What I mean is we could use regular expressions so people can use .* if they want to use #[repr(C)] for every enum or filter accordingly.

tgross35 commented 1 year ago

Interesting caveat with the C23 syntax! I guess in those cases it's already determined.

The only thing I can see on the horizon would be that someone is relying on it being #[repr(i32)] for some odd reason.

I think this is pretty unlikely, and there are probably more cases the other way around where i32 is not the correct enum representation. Since the enum size is ABI defined (per the nomicon) and the size of c_int isn't fixed anyway (16 bits on avr and msp430, looking at the source), I don't think defaulting to #[repr(C)] would break anything that doesn't already have to use some sort of workarounds because i32 isn't right anyway

pvdrz commented 1 year ago

I'll start working on this but I'll wait for @emilio's opinion on which should be the default.