Load struct definitions from C header.

jemc commented 7 years ago

On a recent sync call we discussed the idea of loading struct definitions from a C header, so that Pony code could depend on platform-dependent struct definitions.

This came up in discussion of https://github.com/ponylang/ponyc/issues/1513, in which an openssl dependency was added to the pony runtime in order to put ponyint functions that use the SSL_CTX type there. We discussed that it would be better if we could avoid that by writing those accessors in Pony. But to do that, we'd need Pony to be able to load the struct defs from a header when compiling.

This would probably require a libclang dependency for ponyc to be able to read C header files.

This idea needs more discussion to flesh out the details and any feasibility issues.

jemc commented 7 years ago

We discussed this on the sync call.

The best approach we could think of is to declare a normal struct with normal fields, but use AST annotations (#64) to link the Pony struct to a specific C struct, and to link each Pony field to a specific C field.

The struct would be treated as normal throughout the compiler and the type system, and the annotations would only matter when reaching the LLVM code generation pass, which would use libclang to read the appropriate C header and influence the memory layout of the Pony struct to match the C struct.
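As a rough illustration of what "match the C struct" involves, here is a minimal C sketch of the per-field layout data (offset and size) that code generation would need. It does not use libclang. The struct `demo_ctx_t` and the helpers `demo_ctx_layout` and `demo_ctx_print_layout` are hypothetical names invented for this example. The point is only that these numbers come from the C compiler and the platform ABI, so they can differ between targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical C struct standing in for a library type (not from a real header). */
typedef struct demo_ctx {
    char  flags;     /* 1 byte, usually followed by padding */
    long  timeout;   /* 4 or 8 bytes depending on the platform ABI */
    void *user_data; /* 4 or 8 bytes depending on pointer width */
} demo_ctx_t;

/* One row of the layout information a code generator would need per field. */
typedef struct field_layout {
    const char *name;
    size_t offset;
    size_t size;
} field_layout_t;

/* Fill `out` (capacity of at least 3) with the layout chosen by this
 * compiler for this target, and return the number of fields written. */
size_t demo_ctx_layout(field_layout_t *out) {
    assert(out != NULL);
    out[0] = (field_layout_t){"flags", offsetof(demo_ctx_t, flags), sizeof(char)};
    out[1] = (field_layout_t){"timeout", offsetof(demo_ctx_t, timeout), sizeof(long)};
    out[2] = (field_layout_t){"user_data", offsetof(demo_ctx_t, user_data), sizeof(void *)};
    return 3;
}

/* Print the layout; the numbers will vary between platforms. */
void demo_ctx_print_layout(void) {
    field_layout_t rows[3];
    size_t n = demo_ctx_layout(rows);
    for (size_t i = 0; i < n; i++)
        printf("%-10s offset=%zu size=%zu\n", rows[i].name, rows[i].offset, rows[i].size);
    printf("total size=%zu\n", sizeof(demo_ctx_t));
}
```

Calling `demo_ctx_print_layout()` on a 32-bit and a 64-bit target will typically print different offsets for `timeout` and `user_data`, which is why a hand-copied Pony struct can silently go out of sync with the C one.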

That's a great start to an RFC. We just need someone to draw up a plan that hashes out the details.

jemc commented 7 years ago

There was some discussion in https://github.com/ponylang/ponyc/issues/1552 about the prospect of fully-automating the process of FFI-wrapping C libraries.

Specifically, @agarman said:

Best possible FFI solution would be direct interop with C header files. There's a lot of boilerplate code written to wrap a C library. It's work that can and should be automated as doing it manually is error prone.

If I correctly understand what's being proposed, then I strongly disagree with this sentiment. We discussed these ideas in detail on the sync call from this date (you can listen to the audio here: https://pony.groups.io/g/dev/files/Pony%20Sync/January%2018,%202017), but I will try to summarize some of my main points below to facilitate easier discussion.

For background, I've wrapped a lot of different FFI libraries, in several different languages (Python, Ruby, Pony), and I've even led a team that successfully created an open-source solution that does fully-automate the FFI-wrapping process (see https://github.com/zeromq/zproject), and through those experiences, I have come to believe in the following claim:

In general (that is, for the set of useful C libraries in the world that people might want to use via FFI), from the code found in C headers alone, it is not possible to derive a correct and useful FFI wrapper in an object-oriented language. There is simply not enough information in a C header to tell us everything we need to know about the functions in the library, to map to the concepts that are required for correct and useful usage of the library. Much of this information is in documentation or implicit convention (or, for a bad library which uses neither, left up to guessing), none of which can be parsed by an automated process. Problems compound when the host language uses garbage collection, or has features that other languages don't normally express (like reference capabilities in Pony).

Examples of information that is missing from the header:

Information about object ownership / memory management
- is the return value from this function a "borrowed" reference, or is "ownership" transferred to the caller, so that the caller is responsible for de-allocating/destroying/decrementing the value when done?
- is the argument to this function a "borrowed" reference (that will still be owned by the caller after the function is done), or is "ownership" transferred to the library, so that the library is responsible for destroying the object at some later time? When/how should the object be destroyed?
- if an object needs to be de-allocated/destroyed/decremented by the FFI wrapper, how should this be done? Should the caller use free? Is there a special pool allocator free function associated with this library? Is there a special destroy or decrement function in this library associated with this particular type of object?
- are references linked to each other in such a way that de-allocating/destroying/decrementing a "root" object will invalidate all objects that were/contained pointers to memory within it? If so, what is the nature of this link and how does the caller need to handle or be aware of it?
Information about semantics
- When I see an argument of type my_struct_t*, is it meant to accept the address of a local struct value to be filled (like "another return value"), or is it meant to accept an already-filled struct to do something with (like a "true argument")? Something else?
- When I see an argument of type my_struct_t**, is it meant to accept the address of a pointer to a local struct, for the purposes of making it point to another struct pointer? Or maybe to set it to NULL, so that the reference is invalidated (the zproject libraries use this style)? Or maybe it's meant to accept a list of pointers to structs? If a list, how is it terminated, by a size argument, or does it need a NULL item added to the end, or maybe it is a fixed-size list?
- How are errors denoted? A non-zero return value? A zero return value? A NULL return value from a function that was supposed to return a struct pointer? Setting the error value in some global "context" object? Errno? Calling pony_throw?
Information about immutability/mutability
- Am I allowed to mutate this return value? Are other functions in the library allowed to mutate it? Are other threads in the library allowed to mutate it in the background?
- If I pass a value to the library, will the library mutate it? Am I allowed to mutate it? Am I allowed to pass it to another actor to mutate it?
- These questions are especially important in Pony, where these concepts are first-class language features (ref caps) that C doesn't even know how to express.
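To make the questions above concrete, here is a small C sketch with hypothetical names (`widget_t`, `widget_name`, `widget_name_dup`, `widget_init`, `widget_name_len` are invented for illustration and belong to no real library). The prototypes in each pair have identical types, so a header parser sees no difference between them. The ownership and in/out semantics live only in the comments.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct widget { char name[16]; } widget_t;

/* Same return and argument types, different ownership rules. */
char *widget_name(widget_t *w);      /* borrowed: points into *w, do NOT free */
char *widget_name_dup(widget_t *w);  /* owned: malloc'd copy, caller must free */

/* Same argument type, different direction of data flow. */
void   widget_init(widget_t *w);     /* out-parameter: *w is overwritten */
size_t widget_name_len(widget_t *w); /* true argument: *w is only read */

char *widget_name(widget_t *w) {
    assert(w != NULL);
    return w->name;
}

char *widget_name_dup(widget_t *w) {
    assert(w != NULL);
    size_t len = strlen(w->name) + 1;
    char *copy = malloc(len);
    if (copy != NULL)
        memcpy(copy, w->name, len);
    return copy; /* NULL on allocation failure */
}

void widget_init(widget_t *w) {
    assert(w != NULL);
    memset(w, 0, sizeof(*w));
    strcpy(w->name, "unnamed");
}

size_t widget_name_len(widget_t *w) {
    assert(w != NULL);
    return strlen(w->name);
}
```

A wrapper that frees the result of `widget_name`, or forgets to free the result of `widget_name_dup`, compiles without complaint in both cases. Only the documentation (here, the comments) says which one is correct.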

As an FFI-wrapper developer, you can come up with answers to these questions for the library/objects that you're working with. It's not always an easy process, but you're the only one who can find the answers.

I agree there is a lot of boilerplate in FFI-wrapping, but this boilerplate is not lifeless or universal - the type of boilerplate you use is encoding the answers to these questions. You can't make a universal boilerplate because the answers are not universal for the general case - you can only do it by holding to specific assumptions about the type of library being wrapped, which in turn limit which libraries your boilerplate works with.

If you're wondering about my earlier mention of a project where we successfully automated the entire process of wrapping a C library, this is exactly what we did - we made assumptions that all libraries being wrapped would use the CZMQ "CLASS" style, and moreover, we required that all libraries provide an XML API description, which we used to generate the FFI wrappers, and the C header itself. The XML API description was designed to contain answers to all the questions we needed to know about within the confines of the CZMQ "CLASS" style of library, and with that information we finally had all we needed to automate the FFI-wrapping process.
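For readers who haven't seen it, this is roughly the kind of convention I mean: a constructor that returns an owned pointer, and a destructor that takes a pointer-to-pointer and sets the caller's reference to NULL. The sketch below is a simplified illustration under my own assumptions, with a hypothetical `counter_t` type. It is not code from CZMQ or zproject.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical class-style type, for illustration only. */
typedef struct counter { int value; } counter_t;

/* Constructor: the caller owns the returned object (NULL on failure). */
counter_t *counter_new(void) {
    return calloc(1, sizeof(counter_t));
}

/* Destructor: frees the object and nullifies the caller's reference,
 * so calling it twice on the same variable is harmless. */
void counter_destroy(counter_t **self_p) {
    assert(self_p != NULL);
    if (*self_p != NULL) {
        free(*self_p);
        *self_p = NULL;
    }
}

/* Method: borrows `self`; ownership stays with the caller. */
int counter_increment(counter_t *self) {
    assert(self != NULL);
    return ++self->value;
}
```

When every type in a library follows one fixed pattern like this, a generator can safely assume what `counter_t **` means in a destructor. Without that shared convention, the same signature could mean any of the things listed earlier.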

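In short, fully automating FFI-wrapping from C to an object-oriented language simply doesn't work, especially for a garbage collected one, unless you constrain it to only handle libraries that follow specific conventions - the C header alone does not contain enough information.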

jemc commented 7 years ago

In light of that diagnosis, I suggest that we stick to methods of assisting with FFI-wrapping. "Partial automation" instead of "full automation" if you will.

An ideal FFI-wrapping system would give you ways to answer those questions about all objects/functions involved, and do the rest of the work for you. We can probably get partway to that ideal, at least. But we need to keep in mind the complexity and profound differences among the set of all useful C libraries we want to wrap - they are not all designed in the same way - not by a long shot.

agarman commented 7 years ago

@jemc by direct interop with C headers, I meant only that structs wouldn't have to be re-typed in pony & that pony would provide compiler checks for unambiguously incorrect @ invocations.

Any details & semantics associated with usage of a C library are left to the user of that library to get correct.

ponylang / rfcs

Load struct definitions from C header. #75