status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide
Other
531 stars 230 forks source link

[RFC] producing more robust wrappers for external C libraries #640

Closed stefantalpalaru closed 2 years ago

stefantalpalaru commented 4 years ago

intro

I've been looking into ncurses programming for the last few weeks and this must be the most preprocessor-macro-happy API I've seen so far. Crazy stuff like #define getmaxyx(win,y,x) (y = getmaxy(win), x = getmaxx(win)) that breaks the pattern of passing pointer parameters for the output of what looks like a regular function but is a macro.

Then there are whole blocks as macros: https://github.com/mirror/ncurses/blob/8890c8f28a1db5995ef17f52a7d8c0b9cf574210/include/curses.h.in#L1139

In the past, I manually fixed, coerced and patched the output of c2nim, looking for all kind of clever ways to preserve some flexibility, but that proved less resilient than I thought: https://github.com/status-im/nim-nat-traversal/pull/1

This brings me to:

requirements

the intermediate C wrapper proposal

Have a C wrapper that has full access to the original API, preprocessor tricks included, and exports them in a simplified syntax that c2nim is guaranteed to parse properly. Let's say this intermediate wrapper is in "internal_wrapper.c" and its newly introduced C functions declared in "internal_wrapper.h". Run c2nim on "internal_wrapper.h" to produce "internal_wrapper.nim" and include that in a user-facing "wrapper.nim" that adds any extra Nim types and procedures we may need.

Consider running c2nim as part of every Nim compilation, if it can be proven that some C constructs that end up copied in "internal_wrapper.nim" can be changed by upstream without changing the public API.

In "wrapper.nim" make sure that "internal_wrapper.c" is compiled and linked with the resulting binary. Be careful not to do this more than once per binary.

open questions

How to deal with public data types depending on preprocessor defines? e.g.: "PyObject_HEAD" - the macro that expands to the first struct member in most structs in the Python C API. Or this CPP orgy: http://liw.iki.fi/liw/texts/cpp-trick.html

mratsim commented 4 years ago

Did you try the super hidden compiletime FFI? https://github.com/nim-lang/Nim/pull/10150

zah commented 4 years ago

https://github.com/fragcolor-xyz/nimline comes to mind as a way to gradually consume C/C++ library without creating a complete wrapper for it. Otherwise, https://github.com/nimterop/nimterop is even closer to what you describe with its ability to parse headers files at compile-time and to turn them into Nim definitions.

arnetheduck commented 4 years ago

maintaining the internal wrapper seems equivalent to maintaining a nim version of that same code.

it's important to remember that C header files by and large consist of two things:

the ABI description is what we're looking to give to the nim compiler so that it understands what code to produce in order to interop with the compiled library (struct sizes, function arguments arguments etc etc) - traditionally we've done this half-half in nim and c: nim contained names used in nim and included the c headers also - this could lead to problems because there are two sources of information, for example about the size of a struct: one in nim and one in C.. as an example, consider an incomplete struct:

type X = object
  f: int
struct X {
  int f;
  int g;
};

this will seemingly work with the wrapper strategy of including header file, but the mismatch between the nim compiler and the c compiler looking at the same object will soon become apparent - alignment, memory size etc are all wrong leading to subtle miscompiles when more advanced cases are considered

libraries also come in flavours: those that were meant to be consumed from other languages and those that weren't - the former will have a simple, lowest-common-denominator based API (like glib or llvm-c and friends).. some libraries, like LLVM, actually offer what you're describing as part of their own repo: https://github.com/llvm/llvm-project/tree/master/llvm/include/llvm-c - a simplified view of the library in question, but maintained by upstream.

if they were not meant to be wrapped, they will use a lot of C-only features (like preprocessor and macros) - these cannot be wrapped without manual input. consider swig - it's a general purpose wrapper generator that contains a wealth of experience wrapping C code - and a whole language of itself to deal with the edge cases commonly encountered when wrapping, and even then there are things that need manual intervention.

if anything, nimterop comes close in nim by using a better C parser, or you'd need a full C compiler (like zig does it by including clang), to properly use and understand C header files

generalized wrapping between languages is only possible if the target language is a strict superset of the source language - in the case of nim it is not - there are things you can express in C that you cannot express in nim, thus it is not possible to build a fully general tool that requires no manual or additional information.

arnetheduck commented 4 years ago

in short - the nimterop or c2nim way of wrapping gives a lot of bang for the buck: you wrap 95% of everything automatically, but leave the rest to manual intervention simply because automating that last 5% is more work than doing it manually. it is in the nature of most useful libraries that have stood the test of time that they do not change - thus there is very little maintenance involved once the initial work has been done

stefantalpalaru commented 4 years ago

Did you try the super hidden compiletime FFI?

What for?

https://github.com/fragcolor-xyz/nimline comes to mind as a way to gradually consume C/C++ library without creating a complete wrapper for it.

How do you access preprocessor defines with that?

Otherwise, https://github.com/nimterop/nimterop is even closer to what you describe with its ability to parse headers files at compile-time and to turn them into Nim definitions.

It comes with a huge binary and last time I tried it it was worse at parsing C than c2nim. Had no idea what to do with anonymous unions as struct members, for example.

If I want a better c2nim, I'll reimplement it using libclang instead of relying on a C parsing library that is not used in any compiler.

generalized wrapping between languages is only possible if the target language is a strict superset of the source language - in the case of nim it is not

Nim gets transpiled to C. That's enough for generating wrappers without waiting for upstream to write them for you.

it is not possible to build a fully general tool that requires no manual or additional information

The intermediate C wrapper would be written manually.

in short - the nimterop or c2nim way of wrapping gives a lot of bang for the buck: you wrap 95% of everything automatically, but leave the rest to manual intervention

It's not enough to support multiple versions of the public C API. You manually produce an altered copy of a specific version of those C headers, then you're stuck with them. I want those headers to be included and used dynamically so errors can appear on API changes (like they do between incompatible ffmpeg versions).

it is in the nature of most useful libraries that have stood the test of time that they do not change - thus there is very little maintenance involved once the initial work has been done

If only that were true... Unfortunately, upstream thinks that major (and sometimes minor) version changes are a license to kill backwards compatibility. I want to see a compile time error in that case, not an obscure runtime stack overflow due to changed function parameters.

arnetheduck commented 4 years ago

Nim gets transpiled to C.

nim outputs C, but makes assumptions about what the C compiler sees, as mentioned: struct size, alignment, pointer size, sizeof(double) and a whole bunch of other things - as soon as you venture into the territory where nim and C have differing views of the same things, you end up with the very bugs that you want to avoid - when sizeof(nim-view) != sizeof(c-view), you're out of luck and the only way to work that out is to explicitly, in nim (not in the c code that nim generates), have an accurate representation, or in the case of inlined code a copy, of the original code. there is no way around this - unless you start parsing the full C language and give that understanding to the nim compiler/macro.. this is what nimterop aims to do, albeit unsuccessfully (parsing C is harder than it seems, specially when you take into account platform-specific and system-installed headers)

arnetheduck commented 4 years ago

Unfortunately, upstream thinks that major (and sometimes minor) version changes are a license to kill backwards compatibility.

yes, that is kind of the point of a major release: the way to think about ncurses 5 vs ncurses 6 is that they are two different, similarly named libraries that happen to have a few things in common - to handle that correctly, you need two different wrappers (that have a few things in common)

mratsim commented 4 years ago

Did you try the super hidden compiletime FFI?

What for?

Sorry I misunderstood the actual problem

stefantalpalaru commented 4 years ago

Here's what the current state-of-the-art looks like, limited to dlopen-ed symbols and manually translated preprocessor macros: https://github.com/rnowley/nim-ncurses/blob/c628c003f1f8b8e67849725d4e83ab5c5ac83064/ncurses.nim#L592

I think we can do better with a C wrapper, albeit with more initial work to write C functions that hide all those aspects we can't robo-translate into Nim.

And keep in mind that I'm after functionality, not mirroring the original API. I don't want the equivalent of int maxy, maxx; getmaxyx(stdscr, maxy, maxx) ending up in Nim code, for example, because it breaks established patterns.

So the question is if I get that by using both C and Nim to create my wrapper and if it will be flexible enough to work with future versions of ncurses-6.x and reliable enough to refuse to compile when the upstream headers change in a significant way.

I mean, look at how this NULL check was introduced in a bunch of macros: https://github.com/mirror/ncurses/commit/3eda6f30a84d53844d2ebceadb457e2e7e9cfbf3#diff-76c2c1047ac829d06e97e1ca5d8af0b2R1168

I want to use that change without having to manually copy it in a Nim template.

arnetheduck commented 4 years ago

well, if upstream wants to break things, they will break things and you'll have to adjust, as part of maintaining the wrapper - that's just the way of the world. is it really that common that it merits something more than a manual one-off? I don't think so - focus instead on all the value you're getting from partially being able to reuse ncurses and not having to do rewrite all of it.. it's not a binary choice.

once you start trying to maintain the same wrapper for multiple upstream versions, you add confusion and complexity - that lack of clarity will quickly come back to bite you when trying to figure out where some issue lies. this is why the world is moving in the direction of project-controlled versioning, because testing with multiple versions is simply not feasible at scale - when you can't rely on memcpy even to stay compatible between glibc versions, that should tell you something..

stefantalpalaru commented 4 years ago

you'll have to adjust, as part of maintaining the wrapper

[...]

once you start trying to maintain the same wrapper for multiple upstream versions, you add confusion and complexity

It's not too hard to keep copying upstream's headers but it's too hard to not copy them in order to support multiple versions? It does not compute.

stefantalpalaru commented 4 years ago

An example of my proposal in action, at a smaller scale: https://github.com/status-im/nim-libbacktrace

Most of the wrapper is in C (and a bit of C++). There was no friction in the leaky abstraction that is Nim's FFI, no ugly workarounds for dealing with a pointer to the undefined struct backtrace_state in the public header, no running through hoops to get the dynamically generated defines in "backtrace-supported.h" (created by ./configure), no worrying about C code calling Nim callbacks and messing with the GC, etc.

The downside is having to do manual memory management and deal with platform-specific shenanigans like printing size_t on Windows.

genotrance commented 4 years ago

Thanks @mratsim for sharing this thread. I read it a few times and had the following comments.

None of this wrapping does anything for the real Nim <=> C/C++ hygiene required when it comes to struct size, alignment, callbacks, and other caveats as pointed out already, though it is possible to improve the state of affairs by encoding as much of this knowledge into the tool as possible rather than expecting users to know about it. This is certainly an open area but all my learnings have been coded into nimterop. The more everyone contributes, it will only get better.

Meanwhile, here's a short script using nimterop that downloads, builds, links and wraps libbacktrace. Here's the full output generated during compilation including the generated wrappers. Now, I have no idea whether it does anything to reduce the friction mentioned earlier but I do value being able to create a wrapper in less than 5 minutes.

I felt I couldn't claim the same about ncurses but sure enough, 10 minutes later, I have this wrapper which again downloads, builds, static links and wraps libncurses. Here's the full build output including the wrapper. Note how I skipped some symbols since they are treated as equal to other symbols due to Nim's stylistic equivalence and two missing types which I stubbed out. Can be better but good enough for now. There are also some skipped declarations in the output as well which Nimterop doesn't recognize yet. But not bad for 10 minutes.

Overall, I am very interested in feedback to see how things can be made better in this space. As much as I understand the value of static wrappers that can be checked in and preserved, I know we are inching closer to reproducible builds with Nimterop and that will help reduce some of the general concerns I've heard in the past.

stefantalpalaru commented 4 years ago

Nimterop is a build time tool

That's perfectly fine.

Compilers aren't the only ones who need to understand code.

But they are the only ones that live or die based on their ability to parse it properly.

tree-sitter gives you a full AST representation of the code

What's good enough for syntax highlighting is not good enough for code generation.

I prefer 15MB of code that can be compiled anywhere compared to the monster that libclang would be to distribute or expect on developers' systems

If you were getting the same results, sure, but you're not. Zig got it right: https://ziglang.org/#Integration-with-C-libraries-without-FFIbindings

backtrace_state* {.bycopy.} = object

That should be a void pointer or, at most, a type alias for a void pointer. Is this empty object passed by copy some sort of ugly workaround to get that? More importantly, does it work or does it try to dereference a pointer to an undefined struct?

genotrance commented 4 years ago

What's good enough for syntax highlighting is not good enough for code generation. If you were getting the same results, sure, but you're not. Zig got it right: https://ziglang.org/#Integration-with-C-libraries-without-FFIbindings

Sure, Zig lives and dies by LLVM and has clang as a required dependency. Nim does not so tree-sitter was a better compromise. More importantly though, if you can tell me specifically what is not good enough, it will either improve the status quo or highlight something I cannot see.

That should be a void pointer or, at most, a type alias for a void pointer.

All you have is:

struct backtrace_state;

backtrace_create_state() returns a pointer to such an object and that's all we need to know. We then pass around the pointer. Here's some code:

proc error_callback(data: pointer, msg: cstring, errnum: cint) {.cdecl.} =
  echo msg, errnum

var
  bin = commandLineParams()[0]
  state = backtrace_create_state(bin.cstring,
                               BACKTRACE_SUPPORTS_THREADS.cint, error_callback, nil)
  fhandle = open("output.txt", fmWrite)

backtrace_print(state, 0, addr fhandle)
stefantalpalaru commented 4 years ago

All you have is:

struct backtrace_state;

In a C header, that means that I can only use pointers to that opaque struct. Anything else and I get an error from the C compiler that can't guess the struct's size. How did that translate through your leaky abstraction?

For the record, I think you're right about generating the wrappers dynamically. You're just using the wrong tools to do it and, even with the right tools, you still risk getting things lost in translation.

Now take a closer look at my mostly-C wrapper for libbacktrace. Nothing gets lost under the cracks of a mechanical translation. Nim users don't have to know anything about which pointers need to be manually freed and which don't, that the backtrace_state struct should be initialised just once per thread, etc.

It's obviously harder to write and maintain C code, but it's the devil we know and it's the best way to interface with a C API. It's also the most supported way, because that's what upstream uses to test the library. If a C library tries to hang on to a Nim-allocated string that gets garbage collected, you can't go complaining upstream. If something goes wrong with your automated code translation tool, you get to debug it yourself. Same for Nim's FFI. When is the last time you had to debug a C compiler?

That's why I think that a manual C wrapper is the most robust solution we have right now. Might also be the most maintainable in the long run.

genotrance commented 4 years ago

I can only use pointers to that opaque struct

That's exactly how libbacktrace expects you to use that opaque structure. Same thing in libarchive. Do you have another use case?

You're just using the wrong tools to do it and, even with the right tools, you still risk getting things lost in translation.

I'm receptive to the limitations as well as the perils that C brings with it. However, these are just general statements that don't add any value. I don't expect you to write a dissertation for my benefit but at least share some real examples of what is missing or wrong.

No doubt C is the best way to handle a C library, translation only goes so far. But I am not convinced that we need 100% perfection for something useful.

stefantalpalaru commented 4 years ago

That's exactly how libbacktrace expects you to use that opaque structure.

Then why are you defining backtrace_state as a generic object that can be freely used in Nim code?

these are just general statements

I thought that quoting the code where you present an undeclared struct as a Nim object was clear enough. Those are all specific statements. You lost information about backtrace_state. What did you thought I was talking about, all this time?

at least share some real examples of what is missing or wrong

That's what I've been doing from the start.

I am not convinced that we need 100% perfection for something useful

Ugly hacks are useful. "Write once, maintain never" code is useful. Fragile code that was only meant to be temporary is useful. This is not about minimal viability. This is about solid foundations upon which we can build our projects.

stefantalpalaru commented 4 years ago

"The Unix C library API can only be reliably used from C"

genotrance commented 4 years ago

This is a nice example, thanks for sharing. It is standard advice since bypassing a user API and accessing the underlying implementation is a bad idea in any abstraction.

Considering the following C:

extern int *__errno_location (void) __THROW __attribute_const__;

# define errno (*__errno_location ())

Even in C, you can directly call __errno_location(). It isn't recommended but not prevented either. Same story in Nim - you need to know what you are doing.

Traditionally, we would access errno this way:

var errno {.importc: "errno", header: "<errno.h>".}: cint

This is safe and not accessing implementation details that could change but there's no easy way to automate that, it would have to be manually wrapped.

If I instead add code in nimterop to generate templates for #define macros, we end up with:

proc errno_location(): ptr cint {.importc: "__errno_location", header: "<errno.h>".}

template errno(): untyped =
  errno_location()[];

No doubt the Nim side will contain implementation details but as I mentioned in my first comment, nimterop generates wrappers at build time. This means that if the upstream implementation does change, so does the wrapper. In some sense, even though we have the #define implementation converted to Nim, it stays in sync with upstream so changes propagate.

I'm curious about your concerns with the above. Thanks for engaging.

stefantalpalaru commented 4 years ago

var errno {.importc: "errno", header: "<errno.h>".}: cint

This reminds me of that time I named a Nim proc param "errno" and it caused a hard-to-debug segfault on Windows: https://github.com/status-im/nim-nat-traversal/commit/884511f59a5498bdd60378e3ab28e33c88bea202

Corresponding Nim bug report: https://github.com/nim-lang/Nim/issues/11153

The root cause was Nim trying very hard to shield the programmer from the underlying C API and being doomed to fail in some corner cases. This failure is unavoidable when you can't translate perfectly all the input domain into the output one.

My solution is to drastically reduce the input domain by writing most of the wrapper in C and exposing it through a much simpler API. Your solution is to partially replicate a C compiler's parsing rules using a syntax highlighting engine. Which one will fail more often?

mratsim commented 2 years ago

Now that we have been in production for over a year, we're pretty much set and mature in terms of our C libraries needs. I don't think there is tech debt to resolve for the libraries we need, maintenance cost for bumping them has been minimal and so pursuing this is likely not a priority compared to polishing user-facing features.

Closing.