modularml / mojo

The Mojo Programming Language
https://docs.modular.com/mojo/manual/

[Feature Request] [stdlib] A Foreign Function Interfacing exclusive package (`ffi`) #3520

Open martinvuyk opened 1 month ago

martinvuyk commented 1 month ago

Review Mojo's priorities

What is your request?

Reorganize and polish sys/ffi.mojo and move it out into its own package, with its own set of ease-of-use capabilities for using Mojo as the glue between languages.

For each major language judged worth supporting, add basic type conversions and similar helpers. In the case of C, also add the basic POSIX functions (libc).

An excerpt from an example implementation:

# Adapted from https://github.com/crisadamo/mojo-Libc which doesn't currently
# (2024-07-22) have a licence, so I'll assume MIT licence.
# Huge thanks for the work done.

# /ffi/c/types.mojo

struct C:
    """C types. This assumes that the platform is 32 or 64 bit, and char is
    always 8 bit (POSIX standard).
    """

    alias char = Int8
    """Type: `char`. The signedness of `char` is platform specific. Most
    systems, including x86 GNU/Linux and Windows, use `signed char`, but those
    based on PowerPC and ARM processors typically use `unsigned char`."""
    alias s_char = Int8
    """Type: `signed char`."""
    alias u_char = UInt8
    """Type: `unsigned char`."""
    alias short = Int16
    """Type: `short`."""
    alias u_short = UInt16
    """Type: `unsigned short`."""
    alias int = Int32
    """Type: `int`."""
    alias u_int = UInt32
    """Type: `unsigned int`."""
    alias long = Int64
    """Type: `long`."""
    alias u_long = UInt64
    """Type: `unsigned long`."""
    alias long_long = Int64
    """Type: `long long`."""
    alias u_long_long = UInt64
    """Type: `unsigned long long`."""
    alias float = Float32
    """Type: `float`."""
    alias double = Float64
    """Type: `double`."""
    alias void = Int8
    """Type: `void`."""
    alias ptr_addr = Int
    """Type: A Pointer Address."""

alias NULL = UnsafePointer[C.void]()
"""Null pointer."""

# ===----------------------------------------------------------------------=== #
# Utils
# ===----------------------------------------------------------------------=== #
fn char_ptr_to_string(s: UnsafePointer[C.char]) -> String:
    ...
fn strlen(s: UnsafePointer[C.char]) -> C.u_int:
    ...
...

# /ffi/c/networking.mojo

fn socket(domain: C.int, type: C.int, protocol: C.int) -> C.int:
    """Libc POSIX `socket` function.

    Args:
        domain: Address family, see the AF_ aliases.
        type: Socket type, see the SOCK_ aliases.
        protocol: Protocol, see the IPPROTO_ aliases.

    Returns:
        A file descriptor for the socket.

    Notes:
        [Reference](https://man7.org/linux/man-pages/man3/socket.3p.html).
        Fn signature: `int socket(int domain, int type, int protocol)`.
    """
    return external_call["socket", C.int, C.int, C.int, C.int](
        domain, type, protocol
    )

# /ffi/c/logging.mojo
# TODO
fn errno() -> Int:
    """Get the `errno` global variable.

    Returns:
        The current value of the variable.
    """
    return 0

fn strerror(errnum: C.int) -> UnsafePointer[C.char]:
    """Libc POSIX `strerror` function.

    Args:
        errnum: The number of the error.

    Returns:
        A Pointer to the error message.

    Notes:
        [Reference](https://man7.org/linux/man-pages/man3/strerror.3.html).
        Fn signature: `char *strerror(int errnum)`.
    """
    # The C parameter is `int`, so pass `C.int` rather than the word-sized `Int`.
    return external_call["strerror", UnsafePointer[C.char], C.int](errnum)
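
For comparison only (not part of the proposed package), the same `strerror` binding can be sketched with Python's `ctypes`, which makes the C signature `char *strerror(int errnum)` explicit through `argtypes` and `restype`. The helper name `c_strerror` is hypothetical, and the sketch assumes a POSIX system where libc can be loaded:

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. If find_library returns None, CDLL(None) still
# exposes the libc symbols already loaded into the process.
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Mirror the C signature: char *strerror(int errnum)
libc.strerror.argtypes = [ctypes.c_int]
libc.strerror.restype = ctypes.c_char_p


def c_strerror(errnum: int) -> str:
    """Hypothetical helper: return the libc error message as a Python string."""
    msg = libc.strerror(errnum)  # bytes copied from the C string, or None
    return msg.decode(errors="replace") if msg is not None else ""
```

Declaring the argument as a C `int` matters for the same reason it does in the Mojo wrapper: the callee reads an `int`, not a pointer-sized integer.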

What is your motivation for this change?

Mojo might be the best language for heterogeneous compute, but most infrastructure projects are written in the JVM family of languages and C/C++. If we can offer a set of tools for easy interop, language adoption will be faster. We only need to look at the case of Zig, where its cross-compilation capabilities are arguably one of its biggest strengths and a common entry point for its adoption. It is perhaps needless to point out, but the flexibility to be the code logic layer is also the biggest use case for Python.

Any other details?

In my particular case I'd love to see Mojo kernels usable in databases, and to finally have a plug-and-play way to use GPUs for analytical workloads with engines/query planners like Spark (written in Scala), without falling into the fanatical mentality of rewriting everything in X language because of Y reasons...

Disclaimer: ABI compatibility/stability guarantees, dynamic and static linking, and many other things about language interop are way beyond my area of knowledge. That is why I'm posting this as a Feature Request and not a proposal: I'm not even sure what exactly the end result would look like, as I'm not a person who'd use this daily. I'm currently trying to implement a socket package, which is how I ran into this need.

JoeLoser commented 1 month ago

We talked about this recently at our weekly team design meeting, and we're +1 for it. I actually started work on this internally but had to set it down for some other urgent things. I'll probably get back to it in the coming days and wrap it up. FYI we also started on homing some libc things in sys/_libc.mojo; we can gradually move over the things that use external_call for libc functions to clean those up in the stdlib.

martinvuyk commented 1 month ago

That is great to know! I'd love it if we could organize it a bit like I did over here, since I've been polishing and working on it for quite some time and I think it's a bit better organized than just one huge file. I've updated all signatures to the latest nightly, fixed all docstrings to follow stdlib standards, added a lot of constants, and renamed values to what is standard. Just let me know and we can progressively open PRs. I'm not sure how we'd unit test these, though, especially without access to the errno global variable, which is my problem currently :( . Also, every time an error occurs while doing socket stuff Mojo crashes horribly (uncaught exception even when wrapped in try/except, because of some weird bug), so that is yet another challenge for testing.

martinvuyk commented 1 month ago

@JoeLoser FYI, finally got these to work :tada: . Python has some weird handling of this with their use_errno parameter. I think just executing this function to get the current value is better than setting up a variable that needs to be updated.

fn get_errno() -> C.int:
    """Get a copy of the current value of the `errno` global variable for the
    current thread.

    Returns:
        A copy of the current value of `errno` for the current thread.
    """

    @parameter
    if os_is_windows():
        var errno = stack_allocation[1, C.int]()
        _ = external_call["_get_errno", C.void, UnsafePointer[C.int]](errno)
        return errno[]
    else:
        return external_call["__errno_location", UnsafePointer[C.int]]()[]

fn set_errno(errnum: C.int):
    """Set the `errno` global variable for the current thread.

    Args:
        errnum: The value to set `errno` to.
    """

    @parameter
    if os_is_windows():
        _ = external_call["_set_errno", C.int, C.int](errnum)
    else:
        external_call["__errno_location", UnsafePointer[C.int]]()[0] = errnum
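
For context on the `use_errno` handling mentioned above, here is a minimal Python `ctypes` sketch. With `use_errno=True`, `ctypes` keeps a private per-thread copy of `errno` that it swaps with the real one around each foreign call, and `ctypes.get_errno()` / `ctypes.set_errno()` read and write that copy. The helper name `close_invalid_fd` is hypothetical, and a POSIX libc is assumed:

```python
import ctypes
import ctypes.util

# Assumption: POSIX system. use_errno=True asks ctypes to save errno into its
# private per-thread copy right after every call made through this handle.
libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def close_invalid_fd():
    """Hypothetical helper: provoke a libc error and return (rc, errno)."""
    ctypes.set_errno(0)       # reset the private copy before the call
    rc = libc.close(-1)       # closing an invalid descriptor fails
    return rc, ctypes.get_errno()
```

The Mojo functions above read the thread's `errno` location directly at the moment of the call, so there is no separate copy to keep in sync.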

owenhilyard commented 3 weeks ago

Is there a way we can make alias char = Int8 respect the system ABI? We also probably want to do the same for size_t, ssize_t, ptrdiff_t, and intptr_t.

martinvuyk commented 3 weeks ago

If you mean adjusting the datatypes to the platform, then yes, it is already being done in sys.ffi, but I've extended it to the other types here. You basically just do this:

fn _c_long_dtype() -> DType:
    # https://en.wikipedia.org/wiki/64-bit_computing#64-bit_data_models

    @parameter
    if is_64bit() and os_is_windows():
        return DType.int32  # LLP64
    elif is_64bit():
        return DType.int64  # LP64
    else:
        return DType.int32  # ILP32
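
The same data-model rule can be cross-checked against what the platform's C ABI actually reports, for example with Python's `ctypes`. This is only a sketch: `expected_c_long_bits` is a hypothetical helper that mirrors the branches above, and `ctypes.sizeof(ctypes.c_long)` supplies the real width of `long` on the running platform:

```python
import ctypes
import sys


def expected_c_long_bits(is_64bit: bool, is_windows: bool) -> int:
    """Hypothetical helper mirroring the LLP64 / LP64 / ILP32 branches."""
    if is_64bit and is_windows:
        return 32  # LLP64: long stays 32-bit on 64-bit Windows
    elif is_64bit:
        return 64  # LP64: long is 64-bit on 64-bit Unix-like systems
    else:
        return 32  # ILP32: long is 32-bit


# Width of `long` as reported by the platform's C ABI.
actual_bits = ctypes.sizeof(ctypes.c_long) * 8
predicted_bits = expected_c_long_bits(
    is_64bit=ctypes.sizeof(ctypes.c_void_p) == 8,
    is_windows=sys.platform == "win32",
)
```

On the common LP64, LLP64, and ILP32 platforms, `predicted_bits` and `actual_bits` should agree.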

soraros commented 3 weeks ago

Is there a way we can make alias char = Int8 respect the system ABI?

Could you clarify more concretely what you mean by "respect the system ABI"? As far as I know, at least size_t and ssize_t are defined "properly".