rust-lang / rfcs

RFCs for changes to Rust
https://rust-lang.github.io/rfcs/
Apache License 2.0

RFC: RArrow Dereference for Pointer Ergonomics #3577

Open EEliisaa opened 7 months ago

EEliisaa commented 7 months ago

This RFC improves ergonomics for pointers in unsafe Rust. It adds the RArrow token as a single-dereference member access operator. x->field desugars to (*x).field, and x->method() desugars to (*x).method().

Before:

(*(*(*pointer.add(5)).some_field).method_returning_pointer()).other_method()

After:

pointer.add(5)->some_field->method_returning_pointer()->other_method()
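For reference, the "Before" shape can be exercised with today's syntax. The sketch below is only an illustration: `Node`, `Field`, `Inner`, and the data built in `demo` are hypothetical stand-ins for the names in the example, not types from the RFC.

```rust
// Hypothetical types standing in for the names used in the example above.
struct Inner {
    value: u32,
}

impl Inner {
    fn other_method(&self) -> u32 {
        self.value + 1
    }
}

struct Field {
    inner: Inner,
}

impl Field {
    fn method_returning_pointer(&self) -> *const Inner {
        &self.inner
    }
}

struct Node {
    some_field: *const Field,
}

/// Today's spelling of the chain. The proposed form would read
/// `pointer.add(5)->some_field->method_returning_pointer()->other_method()`.
///
/// # Safety
/// `pointer` must point to at least six valid `Node`s whose `some_field`
/// pointers are valid for reads.
unsafe fn chain(pointer: *const Node) -> u32 {
    unsafe { (*(*(*pointer.add(5)).some_field).method_returning_pointer()).other_method() }
}

fn demo() -> u32 {
    let fields: Vec<Field> = (0..6)
        .map(|i| Field { inner: Inner { value: i * 10 } })
        .collect();
    let nodes: Vec<Node> = fields
        .iter()
        .map(|f| Node { some_field: f as *const Field })
        .collect();
    // SAFETY: `nodes` has six elements and every `some_field` points into
    // `fields`, which is still alive here.
    unsafe { chain(nodes.as_ptr()) }
}

fn main() {
    println!("{}", demo());
}
```

Each closing parenthesis in `chain` pairs with a `(*` far to its left, which is the readability cost the RFC is aiming at.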

Rendered

ChayimFriedman2 commented 7 months ago

An alternative is to make a postfix dereference operator (v.*.field or something alike).

EEliisaa commented 7 months ago

An alternative is to make a postfix dereference operator (v.*.field or something alike).

This is a good alternative. With .* even more expressions become ergonomic than with RArrow, in particular long ones that end with a dereference without a field or method access. As far as I can see there are no grammar ambiguities that prevent this either. Most important is that some kind of non-prefix dereference operator exists. React with either:

  1. ❤️ - If you want .*
  2. 🎉 - If you want ->
  3. Both ❤️ and 🎉 if you want both.
Lokathor commented 7 months ago

Clarification Question: Are you suggesting p.* = 3; for postfix "whole thing" assignment and p.*.field for postfix field accessing?

EEliisaa commented 7 months ago

Clarification Question: Are you suggesting p.* = 3; for postfix "whole thing" assignment and p.*.field for postfix field accessing?

Yes, p.* would be equivalent to (*p). Meaning:

p.* = 3; is equivalent to (*p) = 3;
p.*.field is equivalent to (*p).field
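A small sketch of the two prefix forms as they compile today, with the proposed postfix spelling shown only in comments. `Point` and the local values are hypothetical names made up for the illustration.

```rust
// Hypothetical struct used only for this illustration.
struct Point {
    field: i32,
}

fn whole_value_write() -> i32 {
    let mut x = 0;
    let p: *mut i32 = &mut x;
    // SAFETY: `p` was just created from a live `&mut i32`.
    unsafe {
        // Proposed postfix spelling: `p.* = 3;`
        *p = 3;
    }
    x
}

fn field_read() -> i32 {
    let point = Point { field: 7 };
    let p: *const Point = &point;
    // SAFETY: `p` points to `point`, which is live and initialized.
    // Proposed postfix spelling: `p.*.field`
    unsafe { (*p).field }
}

fn main() {
    println!("{} {}", whole_value_write(), field_read());
}
```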

Lokathor commented 7 months ago

Then maybe p.*field (one less .) for field accessing, which is what I typoed while posting my question above. Easy to miss the little dot with all the rest of the punctuation.

EEliisaa commented 7 months ago

Then maybe p.*field (one less .) for field accessing, which is what I typoed while posting my question above. Easy to miss the little dot with all the rest of the punctuation.

With one less ., there would be an ambiguity between postfix .* and infix .*. The last dot is not a part of the operator.

Lokathor commented 7 months ago

When would you have a place followed by another place in an expression or statement?

EEliisaa commented 7 months ago

When would you have a place followed by another place in an expression or statement?

Never, it would not be a grammar ambiguity. It would be a readability ambiguity. It could also be confused with a.*b in C++.

Lokathor commented 7 months ago

Well, now we're into the realm of opinions I suppose.

p.*.field.*.field just seems like kinda too much.

Also I don't know C++ so I've got no idea what that a.*b thing would do in C++.

EEliisaa commented 7 months ago

Well, now we're into the realm of opinions I suppose.

p.*.field.*.field just seems like kinda too much.

Also I don't know C++ so I've got no idea what that a.*b thing would do in C++.

The RArrow operator doesn't have this problem. Perhaps this is a reason to have both postfix .* and infix ->?

kennytm commented 7 months ago

@Lokathor If you think there are too many dots I'd prefer a*.b rather than a.*b, the latter looks like it's dereferencing b.

(indeed as C++ is mentioned, a.*b is the pointer-to-member access operator where b is a pointer-to-member variable.)


Also, while I think it's not a concern in practice, the following is valid Rust today:

fn main() {
    dbg!(5.*-6.0);
}
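To spell out why that compiles: today's lexer reads `5.` as a float literal, so the expression is the product of `5.` and `-6.0`. A minimal sketch (the function name is made up for the illustration):

```rust
fn product_today() -> f64 {
    // Lexed as three pieces: `5.` (float literal), `*` (multiplication), `-6.0`.
    5.*-6.0
}

fn main() {
    println!("{}", product_today());
}
```

A postfix `.*` would therefore have to coexist with this existing reading of `.` followed by `*` after an integer literal.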
Lokathor commented 7 months ago

That wouldn't actually break, since 5 isn't an identifier

kennytm commented 7 months ago

in a.* the expression a doesn't need to be an identifier, a[i].*, a(x,y).*, a.5.* are all valid.

it certainly can't break in practice, just that I think the parser needs more special rules to distinguish a.5.* from 5.*, not really a big deal.

clarfonthey commented 7 months ago

p.*.field.*.field just seems like kinda too much.

I thought about this a lot earlier today and honestly, I disagree. For a while I was almost very convinced of this as a reason why we should adopt the -> operator, but ultimately, the issue is actually not that -> would be nice, but that it's not enough.

If you have a long expression you want to dereference, it seems counter-intuitive that the dereferencing happens from left-to-right except the last one, which is placed at the very beginning. So, you'd probably want to add a .* at the very end to accomplish that. Alternatively, you could end with a postfix arrow, which just feels wrong.

Ultimately, .*. isn't that bad of an operator; you can type it pretty easily by doing periods with your right hand and shift+8 with your left hand, or by using a numpad. It looks a bit weird, but it makes the dereferencing abundantly clear in the middle of the expression (which is where the unsafety happens), whereas with arrows your brain kind of tends to gloss over them. (At least, mine does.)

So, I'm more in favour of postfix dereference than right-arrows, but I do think that it's important to explore why. It makes a lot of sense why C had them and still does, but I don't think that Rust should, especially with its focus on memory safety, since we want the dereferences to stick out in the middle of the code as places where bad things can happen.

Lokathor commented 7 months ago

If you have a long expression you want to dereference, it seems counter-intuitive that the dereferencing happens from left-to-right except the last one, which is placed at the very beginning.

If a->b is (*a).b then it's already in "dereferenced form".

clarfonthey commented 7 months ago

If a->b is (*a).b then it's already in "dereferenced form".

The point here is that (*a).b is possible with arrows, but not *(*a).b. In other words, you can do a.*.b.*.c.*.d.* but only *a->b->c->d.

kennytm commented 7 months ago

For the original RArrow proposal, are these supported or not?

let a: *const [u8; 256];
(*a)[3];
// a.*[3];
// a->[3]; //?

let f: *const fn(u32) -> u32;
(*f)(5);
// f.*(5);
// f->(5); // ?

let o: *const Option<NonNull<u64>>;
(*o)?.as_ref().checked_add(7)?;
// o.*?.as_ref().checked_add(7)?;
// o->?->checked_add(7)?;
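The three prefix forms in the snippet above do compile today. A minimal sketch, assuming made-up local values (`array`, `add_one`, `value`) so that each pointer has something valid to point at; the postfix spellings appear only in comments because they are proposed syntax.

```rust
use std::ptr::NonNull;

fn index_through_pointer() -> u8 {
    let array = [9u8; 256];
    let a: *const [u8; 256] = &array;
    // SAFETY: `a` points to `array`, which is live and initialized.
    // Proposed postfix spelling: `a.*[3]`
    unsafe { (*a)[3] }
}

fn call_through_pointer() -> u32 {
    fn add_one(x: u32) -> u32 {
        x + 1
    }
    let func: fn(u32) -> u32 = add_one;
    let f: *const fn(u32) -> u32 = &func;
    // SAFETY: `f` points to `func`, a live function pointer.
    // Proposed postfix spelling: `f.*(5)`
    unsafe { (*f)(5) }
}

fn try_through_pointer() -> Option<u64> {
    let mut value = 35u64;
    let slot: Option<NonNull<u64>> = Some(NonNull::from(&mut value));
    let o: *const Option<NonNull<u64>> = &slot;
    // SAFETY: `o` points to `slot`, and the `NonNull` inside points to `value`.
    // Proposed postfix spelling: `o.*?.as_ref().checked_add(7)`
    unsafe { (*o)?.as_ref().checked_add(7) }
}

fn main() {
    println!("{}", index_through_pointer());
    println!("{}", call_through_pointer());
    println!("{:?}", try_through_pointer());
}
```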
CraftSpider commented 7 months ago

I really like the idea of postfix dereference via .*, especially with the examples given by @kennytm - while trailing arrows could be allowed, at least to me the postfix star syntax feels cleaner, especially as the last operator in a sequence. a-> = 1 feels very odd, while a.* = 1 looks better. I'll also note I'm not generally in favor of adding new operators or syntax without good reason, but .* feels much more like just allowing an existing operator in a new way (think postfix match or similar - it's really just allowing *a to be written postfix)

VitWW commented 7 months ago

If we like .* but we have some ambiguity in use, we could use "mix" ->* as alternative, like p->*.field->*.field. And since ->* is longer than *, the * would be used in most cases.

let a: *const [u8; 256];
(*a)[3];
// a->*[3];

let f: *const fn(u32) -> u32;
(*f)(5);
// f->*(5);
RalfJung commented 7 months ago

FWIW there is a parallel thread on IRLO.

@EEliisaa it's not a good idea to open two threads about the same thing at the same time. Then discussion will be split among the two places and the same arguments will have to be repeated everywhere.

steffahn commented 7 months ago

The proposal makes unsafe code, that which is the most safety-critical code, easier to read, understand, and maintain.

It also prevents preferring references over raw pointers. This prevents common mistakes that create UB by simultaneous mutable references.

As discussed in the article Rust's Unsafe Pointer Types Need An Overhaul, the Tilde token could be used for walking field pointers of different types without changing the level of indirection. The proposed arrow operator is different. The arrow dereferences and yields a place expression. This is important because it is the only way to completely eliminate excess parentheses.

This seems like a weird argument / section to me. The argument being laid out is to make “unsafe code easier to read, understand, and maintain” with a focus of preventing “common mistakes that create UB by simultaneous mutable references”.

Then it quotes Gankra’s tilde token alternative, a proposal for “walking field pointers of different types without changing the level of indirection”. I would probably describe this tilde operator more as something that prevents not just “changing the level of indirection” (i.e. reading from the pointer) but also implicitly creating references. I know that “doesn’t change levels of indirection” is a quote from the article, but so is:

  • You never have to worry about accidentally tripping over autoderef or any other thing that is nice for safe code but a huge hazard for unsafe code.

So it prevents hazards[^1], apparently. Hazards from autoderef and other things, things that are language features which prefer references over raw pointers, implicitly created references even, which IMO can make the code hard to understand and maintain, too.

Yet you go on and dismiss the tilde operator not based on any of the goals you listed before mentioning it, but only in order to further “eliminate excess parentheses”.

I believe that introducing the -> operator should only be considered if it’s already the case that (*pointer).field and (*pointer).method() expressions were easy to understand, explicit in their behavior, aiding as much as possible in allowing users to avoid UB from introducing unwanted references, i.e. perfect in all regards except the additional parentheses.

If that's the case, I’m open to arguments as to why, otherwise – if the (*pointer).field/(*pointer).method() isn’t optimal – I think the current verbosity leaves a great opportunity to (at least try to) come up with something better that differs not just in syntax but also in behavior and/or associated restrictions, improving ease of understanding and reducing hazards. As Gankra’s article also called out:

By getting rid of the (*ptr). “syntactic salt”, programmers are motivated to move to the nicer and more robust new syntax. Yes I really think this syntax is nice! It’s certainly better than -> in C!

[^1]: To demonstrate these hazards in principle:

If you have `p->field` in `C`, that gives you access to the field `field`, and nothing more. If you write `p->field` in the proposed Rust extension (or `(*p).field` currently), a lot more can already be happening. That is, `p`’s target type could implement `Deref`/`DerefMut`. Suddenly, you are implicitly creating a reference that accesses the whole of `*p`, not just the field `field`, passing that reference to a `deref`[`_mut`] function, and dereferencing the result. I wouldn’t call this “feature parity”. (On that note, a comparison to `C` should also note that `C` doesn’t have method chains, so the long method chain above the statement “This is identical to C and C++” isn’t the best example IMO.)

Comparing to `C++`, while `->` is overloadable, as far as I understand that axis is only relevant for custom pointer types. For normal pointers `p->field` should be as predictable as in `C`.

And looking at method calls, `p->method()` in Rust could be calling a method that _semantically_ very clearly only accesses part of the value `p` points to. (For example, `p->get(i)` with `p: *const [u8]`.) However, if the method is taking a reference to the whole of `*p`, that has semantic meaning, and can work to create UB by invalidating references to other parts of `*p`. A method like `get` is designed for “ordinary references land” where you could never _have_ a pointer to the whole thing without being allowed to access the whole thing.

On the other hand, in `C++`, methods operate on raw `this` pointers, so something like a `p->get(i)` method would generally *not* come with any UB hazards from interactions with access to elements *different* from the one at index `i`. I would be cautious with using the term of feature “parity” with C++ here, when the syntax only _looks_ the same but is actually more hazardous than C++.
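A sketch of the method-call point from the footnote, using a hypothetical `Pair` type and a hypothetical `first` method (neither is from the RFC). Because `first` takes `&self`, calling it through `(*p)` auto-references the whole of `*p`, which is the same as writing `&*p` by hand, even though the method body only reads one field.

```rust
// Hypothetical type for illustration only.
struct Pair {
    a: u8,
    b: u8,
}

impl Pair {
    // Takes `&self`: a reference to the whole `Pair`, although only `a` is read.
    fn first(&self) -> u8 {
        self.a
    }
}

fn demo() -> (u8, u8, u8) {
    let pair = Pair { a: 1, b: 2 };
    let p: *const Pair = &pair;
    // SAFETY: `p` points to `pair`, which is live, initialized, and not
    // mutably borrowed anywhere, so creating `&Pair` is fine in this sketch.
    unsafe {
        // Field access: a place projection, only the bytes of `a` are read.
        let field_only = (*p).a;
        // Method call: implicitly creates `&Pair` covering both `a` and `b`.
        let via_method = (*p).first();
        // The same call with the reference written out.
        let via_explicit_ref = Pair::first(&*p);
        (field_only, via_method, via_explicit_ref)
    }
}

fn main() {
    println!("{:?}", demo());
    // `b` is never read by `first`, yet `&*p` still covers it.
    let _ = Pair { a: 0, b: 0 }.b;
}
```

All three results are equal here; the difference is in what the code asserts about `*p`, which matters once other pointers or references to parts of the same value exist.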
shepmaster commented 7 months ago

My naïve expectation is that a macro would get a lot of the way there: *pm!(o->middle->inner->a) = 1;. Since this hasn't been mentioned in the RFC or comments yet, there must be something that I'm missing. It'd be good to explicitly call out what the limitations of such an approach would be in the RFC itself.

Full example

```rust
use core::mem::MaybeUninit;

macro_rules! pm {
    ($h:ident $(-> $t:ident)*) => {
        ::core::ptr::addr_of_mut!((*$h) $(. $t)*)
    };
}

#[derive(Debug)]
#[repr(C)]
struct Outer {
    middle: Middle,
}

#[derive(Debug)]
#[repr(C)]
struct Middle {
    inner: Inner,
}

#[derive(Debug)]
#[repr(C)]
struct Inner {
    a: u8,
    b: u8,
}

fn usage(mut outer: MaybeUninit<Outer>) -> Outer {
    unsafe {
        let o = outer.as_mut_ptr();
        *pm!(o->middle->inner->a) = 1;
        *pm!(o->middle->inner->b) = 2;
        outer.assume_init()
    }
}

fn main() {
    let o = MaybeUninit::uninit();
    let o = usage(o);
    dbg!(o);
}
```
WaffleLapkin commented 7 months ago

I've previously wanted to add support for .*, .&, .&mut so I'll chime in for a second (and then chime out, since I don't have energy for a prolonged discussion).


I've made an implementation of these operators a while back, you can see it in this branch. There are conflicts, but this might give you an idea about approximate amount of work to implement them (not much, if this ever gets to the implementation stage feel free to ping me).

Although do note that the main problem here is agreeing on details and convincing lang team/community that this is a good idea (I thought that I'm unlikely to move this forward enough, so didn't bother writing an RFC).


A while back I implemented a feature in rust-analyzer which makes "reborrow inline hints" render as postfix de/refs.

There are three settings to configure this

I would recommend people in this thread try this config option, to see how you feel about this syntax. Here is an example from random rustc file I have open (I use mode: "prefer_postfix"):

[Screenshot 2024-02-24_17-45: a rustc source file with reborrow hints rendered as postfix de/refs]


My personal opinions:

Either way I hope that this thread will be constructive and we'll be able to do something nice 💚

Lokathor commented 7 months ago

-> does not really make sense in rust, esp given that postfix deref syntax does not help with pointers

This point I would dispute. Particularly since that's what this RFC actually started off with arguing for.

Not to repeat too much of the RFC itself... Given a pointer to a struct, working with the fields of the struct is much easier with an arrow operator any time the last part of the chain is a non-pointer:

// harder to read
(*p).field
(*(*p).field1).field2

// easier to read
p->field
p->field1->field2

The arrow is only "not enough" if the last part of the chain is itself a pointer, in which case you may need the extra deref

p->ptr_field // the pointer field
*p->ptr_field // the _target_ of the pointer.

So I think just the -> operator would be a strong improvement to Rust, and it's a reasonable thing to consider if "minimal churn" is held as a strong value.

RalfJung commented 7 months ago

@Lokathor Postfix .* seems superior over -> though, in the sense of being (a) more consistent with the prefix syntax, (b) also covering the case where one does not access a field/method after the deref, (c) focusing on a single operation, rather than tying together two unrelated operations (deref and place projection / method call).

As far as I can tell, the only thing -> has going for it is that people are familiar with it from C/C++. It's not even easier to type, at least on a US keyboard. (Not sure about other layouts, there are too many to make a general statement.^^) Maybe it looks nicer but I'd argue that mostly is down to familiarity as well.

Lokathor commented 7 months ago

I'm actually the most swayed by "it's not easy to type on a non-US keyboard", because rust should be easy to type.

RalfJung commented 7 months ago

Specifically which non-US-keyboard layout are you referring to?

On a German keyboard they also seem pretty similar. In both cases it's one unmodified key and one shift-modified key. If anything, .* wins since these keys are much closer to each other than the ones for ->.

Lokathor commented 7 months ago

Oh you said US, I misread it as non-US. Sorry for the confusion.

RalfJung commented 7 months ago

I had a bunch of negations in there, it was probably unnecessarily confusing.

Lokathor commented 7 months ago

I still find p.*.field to be completely weird to read, and I wish there was some way to not have three punctuation in a row for such a core operation, but I could probably get over it if that's what people can type easiest.

RalfJung commented 7 months ago

Oh right it's .*. vs ->, not just .*. So it's one character more. That does make it a bit more annoying to type.

Lokathor commented 7 months ago

So, things that might show up in otherwise normal math code:

let a = p.*.ptr_field.* + 7;

let b = p.*.field1.*.field2.* * 4;

let c = p.*.0.* + 2.2;
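For comparison, here are the same three expressions in today's prefix syntax. This is a sketch with hypothetical types (`Outer`, `Inner`) and made-up values chosen only so the pointers are valid; the field names follow the lines above.

```rust
// Hypothetical types matching the field names used above.
struct Inner {
    field2: *const i32,
}

struct Outer {
    ptr_field: *const i32,
    field1: *const Inner,
}

fn demo() -> (i32, i32, f64) {
    let x = 35;
    let y = 1.0_f64;
    let inner = Inner { field2: &x };
    let outer = Outer { ptr_field: &x, field1: &inner };
    let tuple = (&y as *const f64,);
    let p: *const Outer = &outer;
    let t: *const (*const f64,) = &tuple;
    // SAFETY: every pointer above was created from a live local value.
    unsafe {
        // p.*.ptr_field.* + 7
        let a = *(*p).ptr_field + 7;
        // p.*.field1.*.field2.* * 4
        let b = *(*(*p).field1).field2 * 4;
        // p.*.0.* + 2.2 (here with `t` as the pointer to a tuple)
        let c = *(*t).0 + 2.2;
        (a, b, c)
    }
}

fn main() {
    println!("{:?}", demo());
}
```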
EEliisaa commented 7 months ago

@kennytm No, they were not. Another reason to prefer the postfix operator (.*).

The poll says that .* is a clear winner. Summary:

  • .* can be thought of as a postfix application of the already existing * operator.
  • .* covers additional important cases.
  • .* is unifunctional. It is not both pointer projection and dereference at the same time.

@Lokathor With some imagination, I think .* in the sequence .*. is clear. I think there is a bias that will be overcome as soon as .* is adopted.

@WaffleLapkin VERY neat!

Should we open a new RFC with title Postfix Dereference?

Lokathor commented 7 months ago

It is sure technically clear, it's clear in meaning, I understand what the code intends, what the programmer who wrote it wanted... However you want to describe that part of things. But also: that's never been a problem I've had to begin with. I've never been unable to understand "when a dereference happens".

What I mean is that it's still visually noisy. It's punctuation soup. My eyes do not parse what's written quickly, and I have to slow way down to make sense of what I'm looking at. To get the token tree off the page and into my brain.

EEliisaa commented 7 months ago

Understandable. Either way, .* is still the only solution for the additional cases. The additional cases do not have any other alternative solutions. Ideally there would be both -> and .*. Unfortunately this is unlikely to ever happen, since .* covers all cases where -> is used, and since the poll says what it says. I think starting with .* is a good way forward.

tgross35 commented 7 months ago

Existing similar things: with std::ops::Deref in scope, foo.deref() is an existing postfix equivalent to *foo for safe code. ptr.read() is postfix but copies the value. ptr.as_ref().unwrap_unchecked() is &*ptr. Of course a library solution can't provide place-ness.
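A short sketch of those three existing postfix forms, with made-up local values. Each one yields a reference or a copy rather than a place, which is the limitation mentioned above.

```rust
use std::ops::Deref;

fn demo() -> (i32, i32, i32) {
    let boxed = Box::new(1);
    // Safe postfix equivalent of `&*boxed`; yields `&i32`.
    let via_deref: &i32 = boxed.deref();

    let value = 2;
    let ptr: *const i32 = &value;
    // SAFETY: `ptr` points to `value`, which is live and initialized.
    // `read` copies the pointee out instead of naming it as a place.
    let via_read: i32 = unsafe { ptr.read() };
    // SAFETY: as above, and `ptr` is non-null, so `as_ref` returns `Some`.
    // This is the postfix spelling of `&*ptr`.
    let via_as_ref: &i32 = unsafe { ptr.as_ref().unwrap_unchecked() };

    (*via_deref, via_read, *via_as_ref)
}

fn main() {
    println!("{:?}", demo());
}
```

None of these can appear on the left of an assignment the way `*ptr = ...` can.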

A keyword like ptr.deref.foo looks nicer than ptr.*.foo IMO, but that is a new bag of worms (and more characters).

joshtriplett commented 7 months ago

:+1: for the idea of postfix dereference.

.* is a reasonable choice:

Another I've seen proposed for postfix dereference is ^: ptr^.method().

Lokathor commented 7 months ago

I would strongly favor postfix ^

It's unfamiliar in the instant you first see it, but it feels like you learn it once and then you don't forget it.

RalfJung commented 7 months ago

It does look a lot less noisy, yes.

A point worth considering: we don't have many ASCII characters left, is this a good enough use case to burn one of them? It might well be.

Are there parsing issues? ^ is also XOR. So ptr ^ .5 could be mistaken as XOR of ptr and a float value. Now what if ptr is a custom type that implements both Deref and BitXor<f32>? That seems nonsensical but then both parsings would even yield well-typed results I think?

Lokathor commented 7 months ago

Case 1: The compiler will tell you that "float literals must have an integer part". You currently have to write it as ptr ^ 0.5 if you wanted to "xor with an f32", which seems a lot more difficult to misread (though still possible).

Case 2: Just playing around with it a bit, ops used with punctuation (eg: a^b instead of a.bitxor(b)) don't seem to trigger "deref and try again" logic when the impl is missing. You just immediately get the error.

kennytm commented 7 months ago

^ was used in Pascal and its derivatives because they do use this character to indicate pointer type (var p : ^Integer ; p := @v; p^ := 123). This is not the case for Rust though, which IMO would be quite confusing if used.

and again because ^ is already bitxor you have the same https://github.com/rust-lang/rfcs/pull/3577#issuecomment-1957309583 issue around prefix vs binary -.

fn main() {
    let p = &10;
    dbg!(p^ - 5);
}
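To make the two readings concrete: today the expression is a binary XOR with `-5`, while a postfix `^` would turn it into a subtraction from the pointee. A sketch, with the hypothetical postfix reading spelled using today's prefix `*` (function names are made up for the illustration):

```rust
fn parsed_today() -> i32 {
    let p: &i32 = &10;
    // Today: `p ^ (-5)`, a bitwise XOR with a negated literal.
    p ^ -5_i32
}

fn postfix_reading() -> i32 {
    let p: &i32 = &10;
    // Hypothetical postfix reading `(p^) - 5`, written as `*p - 5`.
    *p - 5
}

fn main() {
    println!("{} {}", parsed_today(), postfix_reading());
}
```

Both readings type-check for `&i32`, so the choice would be made by the parser alone.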
matthieu-m commented 7 months ago

I feel like one extremely important point that is not being discussed here is the very desugaring.

Is desugaring x->field to (*x).field really a good idea in the first place?

The problem of *x is that it creates a reference to x, with all that entails:

I can only speak from my own experience, but in general, if I could have a reference instead of a pointer, I would have a reference instead of a pointer. Instead, if I've got a pointer in my hands, it's because there's something special about it, and borrowing is quite often what's special.

Accidentally borrowing is terrible: it introduces UB. This goes against the very goals of this RFC: there's nothing ergonomic about introducing UB.

Which, at this point, makes me question the very motivating example:

pointer.add(5)->some_field->method_returning_pointer()->other_method()

Where is the // SAFETY comment here?

And since you need to justify each and every step -- yes, really, that's the burden you took on when you decided to write unsafe code -- then you may as well break them down so it's clearer which justification refers to which step:

//  SAFETY:
//  - `pointer` points to a sequence of at least 6 elements since <...>.
let element = pointer.add(5);

//  SAFETY:
//  - `element` is not null and well aligned since `pointer` was.
//  - `element` points to a sufficiently sized memory block since `pointer` pointed to a sufficiently sized sequence.
//  - `element` points to a live value since <...>.
//  - `element` can be borrowed immutably since <...>.
let element = &*element;

//  SAFETY:
//  - `element.some_field` is not null and well aligned since <...>.
//  - `element.some_field` points to a sufficiently sized memory block since <...>.
//  - `element.some_field` points to a live value since <...>.
//  - `element.some_field` can be borrowed immutably since <...>.
let some_field = &*element.some_field;

let pointer = some_field.method_returning_pointer();

//  SAFETY:
//  - `pointer` is not null and well aligned since <...>.
//  - `pointer` points to a sufficiently sized memory block since <...>.
//  - `pointer` points to a live value since <...>.
//  - `pointer` can be borrowed immutably since <...>.
let thing = &*pointer;

thing.other_method()

And I think we can argue that once due diligence is made, &* vs -> is the least of our worries.


I note that there's value in projection because it enables navigating the fields without forming intermediate references which could potentially blow up in our faces.

RalfJung commented 7 months ago

The problem of *x is that it creates a reference to x, with all that entails:

I am not sure what you mean, but it doesn't create a reference. It creates a place. The requirements you state only apply if the place is later turned into a reference, but that may or may not happen.
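A minimal sketch of the distinction, assuming a hypothetical `Pair` struct. Inside `addr_of_mut!`, `(*p).a` only names a place, and the macro yields a raw pointer to that field, so no `&mut Pair` to the uninitialized memory is ever created.

```rust
use std::mem::MaybeUninit;
use std::ptr::addr_of_mut;

// Hypothetical type for illustration only.
struct Pair {
    a: u8,
    b: u8,
}

fn init_via_places() -> Pair {
    let mut slot = MaybeUninit::<Pair>::uninit();
    let p: *mut Pair = slot.as_mut_ptr();
    // SAFETY: `p` is non-null, aligned, and points to storage large enough
    // for a `Pair`. Only raw pointers to the fields are formed, never a
    // reference to the still-uninitialized `Pair`.
    unsafe {
        addr_of_mut!((*p).a).write(1);
        addr_of_mut!((*p).b).write(2);
        // Both fields are now initialized.
        slot.assume_init()
    }
}

fn main() {
    let pair = init_via_places();
    println!("{} {}", pair.a, pair.b);
}
```

Writing `&mut *p` before both fields are written would be the step that turns the place into a reference and brings the reference validity requirements with it.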

I note that there's value in projection because it enables navigating the fields without forming intermediate references which could potentially blow up in our faces.

Again, this should be "intermediate places". Other than that I think this is basically rephrasing this earlier argument. It hasn't been picked up in follow-on discussion much.

I agree that the ~ operator is valuable even if this RFC gets accepted, but postfix deref seems valuable and aligned with modern Rust even if ~ is a thing. (Note that the discussion moved away from -> and towards postfix deref.)

Lokathor commented 7 months ago

Also, I don't believe that anyone is suggesting that p-> or p.* or p^ or any other syntax would be a safe operation. So, you'd still have it within an unsafe block and you can still put every single safety comment you want on that block or within that block or wherever you like.

Personally, I think you're overdoing it quite a bit with a list of comments on every single access.

EEliisaa commented 7 months ago

The problem of *x is that it creates a reference to x

No, it does not. It creates a place.

If I could have a reference instead of a pointer, I would have a reference instead of a pointer

Hence the term irreducible encapsulation.

Accidentally borrowing is terrible: it introduces UB. This goes against the very goals of this RFC: there's nothing ergonomic about introducing UB.

You got it backwards. Since it does not create a reference, this RFC reduces UB.

steffahn commented 7 months ago

I agree that the ~ operator is valuable even if this RFC gets accepted, but postfix deref seems valuable and aligned with modern Rust even if ~ is a thing. (Note that the discussion moved away from -> and towards postfix deref.)

Interesting idea! Combining the two, one could go as far as to lint against any use-case of deref on pointers that does not claim access to the whole pointed-to value. Assuming all those cases could then use ~ instead.

That way .* on a raw pointer always means about as much as taking a reference to the whole pointed-to value.[^1] The only remaining implicitness then would be whether that by-reference access is immutable or mutable.

[^1]: Making a copy (powered by Copy trait) of the value falls under access-by-immutable reference; AFAICT the safety conditions should be the same. Similarly, assigning to the value falls under access-by-mutable reference. Anything else you could do to a place?

matthieu-m commented 7 months ago

The problem of *x is that it creates a reference to x, with all that entails:

I am not sure what you mean, but it doesn't create a reference. It creates a place. The requirements you state only apply if the place is later turned into a reference, but that may or may not happen.

Thanks for the correction. I knew of places but I typically just immediately turn them into references so didn't think of the distinction.

I tried searching, but could not find, the safety requirements for turning a pointer into a place. Are those the requirements of dereferencing a pointer? (So everything I listed but borrowing)

You got it backwards. Since it does not create a reference, this RFC reduces UB.

Unless, of course, -> (or whatever) is used to call a method, right?

Not creating a reference is nice. Though I do note there's likely still quite a laundry list of pre-conditions which need to be validated, regardless.

Lokathor commented 7 months ago

A place isn't quite an operation of its own. Making a place is one step in reading or writing, in which case either the reading or writing rules apply, for example.

EDIT: also, yes, calling a method can create a reference depending on the method used. However, even using self methods on a value behind a pointer would need to read the pointer to get the self value so there's not a way for methods to fully safely be used with pointers or anything like that.

matthieu-m commented 7 months ago

Is there any way to apply -> (or .* or whatever) to a user-defined type?

I tend not to use raw pointers a lot, because I like to leverage types to enforce invariants. At the very least, this means using NonNull<T>, and signalling potential nullity via Option<NonNull<T>>.

I would expect the ability to define -> (or .*, ...) on such user-defined types.

Is there a way to represent places in the type system so that writing the function is possible?


Otherwise, as mentioned by @steffahn, we may be better off having two operators:

This way, custom types can benefit from the syntax sugar instead of being second-class, and it's clear to the reader whether a reference is formed, or not.

RalfJung commented 7 months ago

Is there any way to apply -> (or .* or whatever) to a user-defined type?

.* is exactly the same as prefix *. So, it calls Deref/DerefMut as usual.
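A sketch of what "as usual" means for a user-defined type today. `Counted` is a hypothetical wrapper that counts calls to `Deref::deref`, so the call made by prefix `*` becomes observable; a postfix `.*` would, under the reading above, make the same call.

```rust
use std::cell::Cell;
use std::ops::Deref;

// Hypothetical wrapper used only to observe `deref` calls.
struct Counted {
    value: u32,
    deref_calls: Cell<u32>,
}

impl Deref for Counted {
    type Target = u32;

    fn deref(&self) -> &u32 {
        self.deref_calls.set(self.deref_calls.get() + 1);
        &self.value
    }
}

fn demo() -> (u32, u32) {
    let c = Counted { value: 5, deref_calls: Cell::new(0) };
    // Prefix `*` on a user-defined type desugars to `*Deref::deref(&c)`.
    let v = *c;
    (v, c.deref_calls.get())
}

fn main() {
    println!("{:?}", demo());
}
```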

Something like DerefRaw would be a completely separate RFC, that has basically nothing to do with this RFC.