maciejhirsz / logos

Create ridiculously fast Lexers
https://logos.maciej.codes
Apache License 2.0

C export or C codegen? #260

Open Andersama opened 2 years ago

Andersama commented 2 years ago

My projects are normally in C++ or C, and I'm not familiar enough with Rust, but from what I've read it's possible to export functions that can be called from C. Looking at the codegen CLI's output:

impl<'s> ::logos::Logos<'s> for Token {
    type Extras = State;
    type Source = str;
    const ERROR: Self = Token::Unknown;
    fn lex(lex: &mut ::logos::Lexer<'s, Self>) {
        use logos::internal::{CallbackResult, LexerInternal};
        type Lexer<'s> = ::logos::Lexer<'s, Token>;
        fn _end<'s>(lex: &mut Lexer<'s>) {
            lex.end()
        }
        //etc
    }
}

I'm not familiar enough with Rust's syntax to know exactly what is happening, but I suspect it could be converted to C, or a wrapper around fn lex could be exported:

#[no_mangle]
pub extern "C" fn lex_string(lex_context: ?, str: *mut u8, sz: usize) {
    // call the Logos-generated `lex` here
}

derekdreery commented 1 year ago

I'm replying late because I think it is an interesting topic. Perhaps it's still useful? :smile_cat:

The following requires that you hard-code a particular token type. It builds, but there may be soundness issues, so please audit it before you use it!

//! The following assumes you have a token type called `token_t` that implements `logos::Logos` and is `#[repr(C)]`.
use std::{
    ffi::{c_int, c_void},
    slice, str,
};

// mock up the logos stuff we need
#[allow(non_camel_case_types)]
#[repr(C)]
pub struct token_t;
struct Lexer<T>(std::marker::PhantomData<T>);
impl<T> Lexer<T> {
    fn new(#[allow(unused_variables)] input: &str) -> Self {
        Lexer(std::marker::PhantomData)
    }
}
impl<T> Iterator for Lexer<T> {
    type Item = T;
    fn next(&mut self) -> Option<Self::Item> {
        todo!()
    }
}

/// Create a lexer from an input string.
///
/// # Safety
///   - The caller is responsible for keeping `input` alive and unmodified until `my_lexer_free` is called on the returned pointer.
///   - `input` must be a valid UTF-8 byte sequence of length `input_sz`.
///   - The returned lexer is not thread-safe.
#[no_mangle]
pub unsafe extern "C" fn my_lexer_create(input: *const u8, input_sz: usize) -> *mut c_void {
    let input: &'static [u8] = slice::from_raw_parts(input, input_sz);
    let input = str::from_utf8_unchecked(input);
    let lexer: Lexer<token_t> = Lexer::new(input);
    Box::into_raw(Box::new(lexer)) as *mut _
}

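/// Free a lexer created by `my_lexer_create`.
///
/// # Safety
///   - `lexer` must come from `my_lexer_create` and must not be used or freed again afterwards.
#[no_mangle]
pub unsafe extern "C" fn my_lexer_free(lexer: *mut c_void) {
    let lexer: Box<Lexer<token_t>> = Box::from_raw(lexer as *mut _);
    // This would happen anyway at the end of the scope
    drop(lexer)
}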

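/// Returns 1 if there was another element (written to `*token`), 0 otherwise.
///
/// There are other ways you could represent `Option<token_t>` in C if you prefer.
///
/// # Safety
///   - `lexer` must be a live pointer returned by `my_lexer_create`.
///   - `token` must be valid for writes of one `token_t`.
#[no_mangle]
pub unsafe extern "C" fn my_lexer_next(lexer: *mut c_void, token: *mut token_t) -> c_int {
    // No `'static` needed: the borrow only has to last for this call.
    let lexer: &mut Lexer<token_t> = &mut *(lexer as *mut _);
    match lexer.next() {
        Some(t) => {
            // `write` does not drop the (possibly uninitialized) old value.
            token.write(t);
            1
        }
        None => 0,
    }
}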

Link to playground
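
To see the create/next/free round trip actually run, here is a minimal sketch that swaps the mocked `Lexer` for a toy whitespace splitter. Everything in it is an assumption made for illustration and is not part of Logos: the `token_t` fields, the `kind` encoding, `ToyLexer`, the `toy_lexer_*` names, and the `lex_all` helper, which only plays the role of a C caller. Unlike the version above, it validates UTF-8 and returns a null pointer on failure.

```rust
use std::ffi::{c_int, c_void};
use std::mem::MaybeUninit;
use std::{slice, str};

/// Hypothetical C-compatible token: a kind tag plus the byte span it covers.
#[allow(non_camel_case_types)]
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct token_t {
    pub kind: u32, // made-up encoding: 0 = word, 1 = number
    pub start: usize,
    pub end: usize,
}

/// Toy stand-in for `logos::Lexer`: splits the input on ASCII whitespace.
struct ToyLexer {
    input: &'static str,
    pos: usize,
}

impl Iterator for ToyLexer {
    type Item = token_t;
    fn next(&mut self) -> Option<token_t> {
        let bytes = self.input.as_bytes();
        // Skip leading whitespace.
        while self.pos < bytes.len() && bytes[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        if self.pos == bytes.len() {
            return None;
        }
        let start = self.pos;
        while self.pos < bytes.len() && !bytes[self.pos].is_ascii_whitespace() {
            self.pos += 1;
        }
        let all_digits = bytes[start..self.pos].iter().all(|b| b.is_ascii_digit());
        Some(token_t { kind: if all_digits { 1 } else { 0 }, start, end: self.pos })
    }
}

/// Safety: `input` must point to `input_sz` readable bytes that outlive the lexer.
#[no_mangle]
pub unsafe extern "C" fn toy_lexer_create(input: *const u8, input_sz: usize) -> *mut c_void {
    let bytes: &'static [u8] = slice::from_raw_parts(input, input_sz);
    // Validate instead of trusting the caller; null signals invalid UTF-8.
    match str::from_utf8(bytes) {
        Ok(input) => Box::into_raw(Box::new(ToyLexer { input, pos: 0 })) as *mut c_void,
        Err(_) => std::ptr::null_mut(),
    }
}

/// Safety: `lexer` must come from `toy_lexer_create`; `token` must be writable.
#[no_mangle]
pub unsafe extern "C" fn toy_lexer_next(lexer: *mut c_void, token: *mut token_t) -> c_int {
    let lexer = &mut *(lexer as *mut ToyLexer);
    match lexer.next() {
        Some(t) => {
            token.write(t);
            1
        }
        None => 0,
    }
}

/// Safety: `lexer` must come from `toy_lexer_create` and not be used afterwards.
#[no_mangle]
pub unsafe extern "C" fn toy_lexer_free(lexer: *mut c_void) {
    if !lexer.is_null() {
        drop(Box::from_raw(lexer as *mut ToyLexer));
    }
}

/// Drives the three functions the way a C caller would and collects the tokens.
pub fn lex_all(input: &str) -> Vec<token_t> {
    let mut out = Vec::new();
    unsafe {
        let lexer = toy_lexer_create(input.as_ptr(), input.len());
        assert!(!lexer.is_null(), "input was not valid UTF-8");
        let mut tok = MaybeUninit::<token_t>::uninit();
        while toy_lexer_next(lexer, tok.as_mut_ptr()) == 1 {
            // Initialized by `toy_lexer_next` whenever it returns 1.
            out.push(tok.assume_init());
        }
        // Free before `input` goes out of scope, as the safety contract requires.
        toy_lexer_free(lexer);
    }
    out
}

fn main() {
    for tok in lex_all("let x 42") {
        println!("{:?}", tok);
    }
}
```

On the C side, the same loop would call `toy_lexer_create`, then `toy_lexer_next` until it returns 0, then `toy_lexer_free`, keeping the input buffer alive throughout.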

Andersama commented 1 year ago

It's been a while. I might revisit this library if I can make sense of the Rust, just because it's so nice. I think I tried a rough version of the language I was going to parse and found I was doing better with a handwritten lexer, but I can't remember for sure. In any case, for simple things I'd definitely love to use this over writing something by hand.

I take it that:

#[repr(C)]
pub struct token_t;

the #[repr(C)] forces a struct layout like we'd expect in C so that we can use it later in the wrappers?

derekdreery commented 1 year ago

So repr(C) means "lay this out the way C would lay out the equivalent struct". For example, the fields in

#[repr(C)]
pub struct MyType {
    id: u32,
    rest: *mut u8,
}

and

struct my_type_t {
    uint32_t id;
    uint8_t *rest;
};

will have the same size, alignment, and field offsets, with the same padding between them, so you can reinterpret one as the other. If you have

#[no_mangle]
pub extern "C" fn foo() -> MyType { .. }

in Rust, you can declare and call it from C like this:

struct my_type_t foo(void);
/* .. */
struct my_type_t bar = foo();
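
If you want to check the layout claim rather than take it on trust, a small sketch like the following measures where `rest` actually lands and compares it with what the C rules predict (round the end of `id` up to the pointer's alignment). The helper names `rest_offset` and `expected_rest_offset` are made up for this example, and the fields are made `pub` only so the snippet stands alone.

```rust
use std::mem::{align_of, size_of};

#[repr(C)]
pub struct MyType {
    pub id: u32,
    pub rest: *mut u8,
}

/// Byte offset of `rest` inside `MyType`, measured on a live value.
pub fn rest_offset() -> usize {
    let v = MyType { id: 0, rest: std::ptr::null_mut() };
    let base = &v as *const MyType as usize;
    let field = &v.rest as *const *mut u8 as usize;
    field - base
}

/// Offset predicted by C layout rules: the end of `id`, rounded up to the
/// alignment of the pointer field that follows it.
pub fn expected_rest_offset() -> usize {
    let align = align_of::<*mut u8>();
    (size_of::<u32>() + align - 1) / align * align
}

fn main() {
    println!("size = {}, align = {}", size_of::<MyType>(), align_of::<MyType>());
    println!("offset of rest = {} (expected {})", rest_offset(), expected_rest_offset());
}
```

Without `#[repr(C)]`, Rust is free to reorder fields, so the measured offset would no longer be something a C header could rely on.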

derekdreery commented 1 year ago

You can use cbindgen to generate the header files for you.

Andersama commented 1 year ago

I would've just written everything by hand, thanks for the tip.