tildeio / helix

Native Ruby extensions without fear
https://usehelix.com
ISC License

Try to avoid copying when using strings #8

Open flo-l opened 8 years ago

flo-l commented 8 years ago

I'm leaving this here more as a note to myself and also to maybe get some early feedback.

Currently Helix copies a Ruby string's contents in order to use it in Rust. I want to change that.

There is a pair of functions, namely rb_gc_register_address(VALUE *addr) and rb_gc_unregister_address(VALUE *addr), which can be used to prevent the GC from freeing some memory and allow it to be freed again later.

Maybe it could be used to improve perf considerably. I will open a PR as soon as I've something worth reviewing. Also this project is awesome.

chancancode commented 8 years ago

I think one thing to keep in mind is safety. Ruby doesn't have a way to track ownership natively. If we took a Ruby string without copying it and kept a borrow to it while calling back into Ruby code, its contents could be mutated, causing safety issues on the Rust side. One possible solution is to freeze the string while there is an outstanding (read-only) borrow to the Ruby string and then "unfreeze" it when we are done.

However, Ruby does not actually expose the functionality to "unfreeze" a frozen object. Of course, it's implemented as a marker bit under-the-hood, and since we are operating at the C level, we can just "unflip" the bit. But I'm not sure if we would feel comfortable messing with the internals like that, given Ruby might perform other optimizations and who knows what other semantics there might be in the future.
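The freeze-while-borrowed idea can be modelled in plain Rust as an RAII guard. This is only a sketch against a mock string: `MockRString`, `FrozenBorrow`, and `borrow_frozen` are hypothetical names, and since Ruby exposes no "unfreeze", a real version would have to touch the flag bit directly, with the risks described above.

```rust
use std::cell::Cell;

/// Mock stand-in for a Ruby String (hypothetical; a real binding wraps a VALUE).
pub struct MockRString {
    bytes: String,
    /// Whether Ruby code froze the string itself.
    frozen_by_ruby: bool,
    /// Number of outstanding read-only borrows held by Rust.
    rust_borrows: Cell<usize>,
}

/// Guard that keeps the string frozen for as long as it is alive.
pub struct FrozenBorrow<'a> {
    string: &'a MockRString,
}

impl MockRString {
    pub fn new(s: &str) -> MockRString {
        MockRString {
            bytes: s.to_owned(),
            frozen_by_ruby: false,
            rust_borrows: Cell::new(0),
        }
    }

    /// Frozen if Ruby froze it or if Rust still holds a borrow.
    pub fn is_frozen(&self) -> bool {
        self.frozen_by_ruby || self.rust_borrows.get() > 0
    }

    /// "Freeze" the string and hand out a read-only view.
    pub fn borrow_frozen(&self) -> FrozenBorrow<'_> {
        self.rust_borrows.set(self.rust_borrows.get() + 1);
        FrozenBorrow { string: self }
    }
}

impl<'a> FrozenBorrow<'a> {
    /// Zero-copy access to the contents.
    pub fn as_str(&self) -> &str {
        &self.string.bytes
    }
}

impl<'a> Drop for FrozenBorrow<'a> {
    /// "Unfreeze" once the last Rust borrow goes away.
    fn drop(&mut self) {
        let n = self.string.rust_borrows.get();
        self.string.rust_borrows.set(n - 1);
    }
}
```

Counting borrows rather than saving and restoring a single flag keeps the string frozen even when guards are dropped out of order.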

Another solution is to perhaps implement that as a global "call back into the VM" lock.

flo-l commented 8 years ago

That is indeed a good point I have not considered.

I don't know how the "call back into the VM" lock is best implemented. An option could be to wrap internal data of Ruby Objects (like the data pointer of String) in a custom type, which has a lifetime tied to the VALUE it belongs to, which in turn has a lifetime tied to some "VM token" that needs to be moved into each call into the Ruby VM. The construction of these VM tokens should be unsafe, so that users don't cheat the system by creating a new one for each call. So each call into Ruby would need a VM token and return a VM token, where the latter can be used to make the next call into Ruby. That would ensure that programmers can't use data obtained before a previous Ruby VM call. If we make the VM token a zero-sized type this wouldn't add any runtime overhead, just code verbosity.
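A minimal sketch of the VM token idea, assuming a mock string type in place of a real VALUE. All names here (`VmToken`, `RubyString`, `call_ruby`, `describe`) are hypothetical and not part of Helix. Borrowed data is tied to a shared borrow of the token, while every call into Ruby takes the token by value, so the borrow checker rejects any view that outlives a VM call.

```rust
/// Zero-sized proof that we are not in the middle of a call into Ruby.
pub struct VmToken {
    _private: (),
}

impl VmToken {
    /// # Safety
    /// Callers must create at most one token per entry from Ruby into Rust;
    /// minting a fresh token per call would defeat the scheme.
    pub unsafe fn new() -> VmToken {
        VmToken { _private: () }
    }
}

/// Mock stand-in for a Ruby String.
pub struct RubyString {
    bytes: String,
}

impl RubyString {
    pub fn new(s: &str) -> RubyString {
        RubyString { bytes: s.to_owned() }
    }

    /// Zero-copy view, valid only while `token` stays borrowed.
    pub fn as_str<'t>(&'t self, _token: &'t VmToken) -> &'t str {
        &self.bytes
    }
}

/// Mock for any call back into Ruby: consumes the token and returns it.
pub fn call_ruby(token: VmToken) -> VmToken {
    // A real implementation would invoke the VM here.
    token
}

pub fn describe(s: &RubyString, token: VmToken) -> (String, VmToken) {
    // The borrow of `token` ends with this statement.
    let len = s.as_str(&token).len();
    // Moving `token` is allowed because no view is alive. Keeping a `&str`
    // from before this line and using it afterwards would not compile.
    let token = call_ruby(token);
    let upper = s.as_str(&token).to_uppercase();
    (format!("{}:{}", len, upper), token)
}
```

Because `VmToken` has no fields besides a unit, passing it around compiles to nothing; the cost is purely in the signatures.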

Or maybe accessing a Ruby Object should only be possible via either an unsafe method that does not copy, or a safe method that does. The docs could describe the invariants one has to uphold for the code to be safe if one uses the unsafe fn.
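The safe/unsafe split could look roughly like this, again on a mock type. `MockRString`, `to_owned_string`, and `as_str_unchecked` are hypothetical names chosen for illustration; the safety comment is where the invariants would be documented.

```rust
/// Mock stand-in for a Ruby String (a real binding would wrap a VALUE).
pub struct MockRString {
    bytes: Vec<u8>,
}

impl MockRString {
    pub fn new(s: &str) -> MockRString {
        MockRString { bytes: s.as_bytes().to_vec() }
    }

    /// Safe accessor: copies the contents, so later mutation by Ruby
    /// cannot affect the returned value.
    pub fn to_owned_string(&self) -> String {
        String::from_utf8_lossy(&self.bytes).into_owned()
    }

    /// Zero-copy accessor.
    ///
    /// # Safety
    /// The caller must ensure that the bytes are valid UTF-8 and that no
    /// call into Ruby happens while the returned slice is alive.
    pub unsafe fn as_str_unchecked(&self) -> &str {
        unsafe { std::str::from_utf8_unchecked(&self.bytes) }
    }
}
```

The safe method returns a fresh allocation, while the unsafe one points straight into the string's own buffer.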

The two proposals could also be combined. So people either have to copy all Ruby data or avoid the copies and have to deal with VM tokens.

For me it seems the most common use case is implementing one Ruby method in pure Rust, with the goal of improving performance. So calling into Ruby from Rust seems orthogonal, as Ruby tends to be slow. But of course there are always reasons to do it anyway...

However, before attempting to tackle the problem I'm waiting for the repo owners to merge the master with the original branch, see #9. They have diverged massively, with original being way ahead in terms of features.

wagenet commented 7 years ago

@flo-l main development is back on master now. Is this something you're still interested in?

flo-l commented 7 years ago

I'd still be interested!

What do you think of the VM token idea I outlined above?

wagenet commented 7 years ago

@chancancode @wycats ^

flo-l commented 7 years ago

@chancancode @wycats ping :)

wycats commented 7 years ago

Direction

Right now, Helix classes have an extra field in their struct that points back at the Ruby object. The Ruby object is created when the object crosses into Ruby, which makes it possible to cheaply create Helix classes in Rust without allocating a Ruby object (useful for creating intermediate objects for internal computation).

That field is a helix::Metadata, which today is just a simple alias to VALUE.

Ultimately, I think we should enrich that field to include ownership information. Straw man:

enum Ownership {
    // The struct is owned by Rust and is not wrapped in any Ruby object.
    // This is the starting state for a new Helix class, and can also be used
    // to model Helix methods that take Helix objects by value.
    Rust,

    // When a Helix object crosses into a Helix method that takes it using `&`,
    // its state is changed to Shared.
    Shared,

    // When a Helix object crosses into a Helix method that takes it using `&mut`,
    // its state is changed to Unique.
    Unique,

    // Once a Helix object crosses into Ruby, its ownership state is "Ruby" until
    // it has crossed back into Rust.
    Ruby
}

// Note that if the state of a Helix object is already Unique, it cannot be passed into
// another Helix method. If the state of a Helix object is already Shared, it cannot
// be passed into another Helix method that takes it via `&mut`.

struct Metadata {
    // `value` is None until it crosses into Ruby for the first time
    value: Option<sys::VALUE>,
    ownership: Ownership
}

What this means is that when a Helix object crosses into Rust, we will either discover that the kind of ownership that the method requests is impossible or flip its ownership.

This is similar to the dynamic approach used by RefCell.
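To make the dynamic check concrete, here is one possible reading of the rules as a pure transition function. The `Request` enum, the `transition` function, and the error strings are assumptions for illustration; the straw man does not specify them, nor how states are reset when a method returns.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Ownership {
    Rust,
    Shared,
    Unique,
    Ruby,
}

/// How a Helix method asks for its argument (hypothetical helper type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Request {
    ByRef,   // `&`
    ByMut,   // `&mut`
    ByValue, // by value
}

/// Returns the new state, or an error if the request cannot be granted.
pub fn transition(current: Ownership, request: Request) -> Result<Ownership, &'static str> {
    match (current, request) {
        // A uniquely borrowed object cannot be passed anywhere else.
        (Ownership::Unique, _) => Err("object is already uniquely borrowed"),
        // Shared borrows may coexist with other shared borrows only.
        (Ownership::Shared, Request::ByRef) => Ok(Ownership::Shared),
        (Ownership::Shared, _) => Err("object is already shared"),
        // Otherwise the state flips to match the request.
        (_, Request::ByRef) => Ok(Ownership::Shared),
        (_, Request::ByMut) => Ok(Ownership::Unique),
        (_, Request::ByValue) => Ok(Ownership::Rust),
    }
}
```

As with RefCell, a full version would also need to count shared borrows so the state can return to "Ruby" once the last one ends.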

Note that this doesn't address being allowed to take &str from a Ruby String, since we can't put a Rust struct into an existing Ruby string. We could support taking a &str from a frozen string, and I think we should see whether that's sufficient for zero-copy use cases.

The original post here was also correct that we need to use rb_gc_register_address if we ever take ownership of a Ruby object and put it into a heap location (because the conservative GC will fail to mark it :scream:). I think we should have a RootedValue struct that is implemented thusly:

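struct RootedValue {
    // Boxed so that the address registered with the GC stays valid
    // even when the RootedValue itself is moved.
    inner: Box<sys::VALUE>
}

impl RootedValue {
    fn new(value: sys::VALUE) -> RootedValue {
        let mut inner = Box::new(value);
        // Register the heap address of the VALUE, not a stack temporary.
        unsafe { sys::rb_gc_register_address(&mut *inner) };
        RootedValue { inner }
    }
}

impl Drop for RootedValue {
    fn drop(&mut self) {
        // Unregister the same address that was registered in `new`.
        unsafe { sys::rb_gc_unregister_address(&mut *self.inner) };
    }
}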

This will allow us to easily root Ruby objects that we have ownership of and want to temporarily move onto the heap. We probably want to automatically root any values passed into Rust by-value (if they get Rust ownership) to avoid safety footguns here.
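The rooting behaviour can be exercised without a Ruby VM by swapping in a mock `sys` module. In this sketch the two `rb_gc_*` functions only record addresses in a thread-local set; `is_rooted` and `address` are hypothetical helpers, and the VALUE is assumed to be boxed so that the registered address survives moves of the wrapper.

```rust
/// Mock of the relevant part of Ruby's C API; real bindings would link to Ruby.
pub mod sys {
    use std::cell::RefCell;
    use std::collections::HashSet;

    #[allow(non_camel_case_types)]
    pub type VALUE = usize;

    thread_local! {
        static ROOTS: RefCell<HashSet<usize>> = RefCell::new(HashSet::new());
    }

    pub unsafe fn rb_gc_register_address(addr: *mut VALUE) {
        ROOTS.with(|r| {
            r.borrow_mut().insert(addr as usize);
        });
    }

    pub unsafe fn rb_gc_unregister_address(addr: *mut VALUE) {
        ROOTS.with(|r| {
            r.borrow_mut().remove(&(addr as usize));
        });
    }

    /// Test helper, not part of Ruby's API.
    pub fn is_rooted(addr: *const VALUE) -> bool {
        ROOTS.with(|r| r.borrow().contains(&(addr as usize)))
    }
}

pub struct RootedValue {
    inner: Box<sys::VALUE>,
}

impl RootedValue {
    pub fn new(value: sys::VALUE) -> RootedValue {
        let mut inner = Box::new(value);
        // Register the stable heap address of the VALUE.
        unsafe { sys::rb_gc_register_address(&mut *inner) };
        RootedValue { inner }
    }

    /// Address that was registered with the (mock) GC.
    pub fn address(&self) -> *const sys::VALUE {
        &*self.inner
    }
}

impl Drop for RootedValue {
    fn drop(&mut self) {
        unsafe { sys::rb_gc_unregister_address(&mut *self.inner) };
    }
}
```

Moving a `RootedValue` moves only the `Box` pointer, so the registered address stays the same until the value is dropped.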