rust-lang / rust

Empowering everyone to build reliable and efficient software.
https://www.rust-lang.org
Other
98.67k stars 12.75k forks source link

#[repr(transparent)] for str and String #70473

Open RReverser opened 4 years ago

RReverser commented 4 years ago

Similarly to https://github.com/rust-lang/rust/issues/52976, I'd suggest that str and String should have a guaranteed transparent representation of [u8] and Vec<u8> correspondingly.

They are a newtype wrappers of these types under the hood anyway, and I don't see any reasons why they would need to behave differently in the future, and making them different probably would become a breaking change anyway as it would break things like String::from_raw_parts constructed from raw parts of the Vec.

While String::from_raw_parts already covers most common cases, there are still other cases where it would be desired to have guaranteed representation.

One such example is constructing Rc<String> from a Rc<Vec<u8>> where a library might want to check UTF-8 validity before construction and then Rc::from_raw + Rc::into_raw one type into another without actually copying the data.

Currently this is not possible to do safely because it's not guaranteed that the *const Vec<u8> returned from the first into_raw is compatible with *const String accepted by the second.

While this particular case could be alleviated by adding a special method, there are lots of other containers where such compatibility between these types might be desirable, and I don't see any reasons why it couldn't be guaranteed at a lower level.

mversic commented 2 years ago

From the looks of it, it seems that str is already guaranteed to have the same repr as [u8]. Actually, it's more than that. I suspect that &str has a guaranteed representation as &[u8] even though references/pointers to slices don't have a defined representation. This is used all over the core::str to transmute string slices into byte slices. Here is one example, core::str::from_utf8 is another

ChrisDenton commented 2 years ago

To be clear, the standard library is allowed to depend on implementation details that aren't necessarily stable. That core does something internally is not a guarantee that it'll work in future (or past) versions of Rust. The std is distributed with the compiler so can change as and when needed.

dtolnay commented 2 years ago

This is what we should have at the top of every single file instead of license headers.

mversic commented 2 years ago

To be clear, the standard library is allowed to depend on implementation details that aren't necessarily stable. That core does something internally is not a guarantee that it'll work in future (or past) versions of Rust. The std is distributed with the compiler so can change as and when needed.

Ofc, I am aware. Poor choice of words on my part. I just thought it's worth pointing out. Though in this particular case I don't know how they would do it any other way.

RReverser commented 2 years ago

To be clear, the standard library is allowed to depend on implementation details that aren't necessarily stable. That core does something internally is not a guarantee that it'll work in future (or past) versions of Rust. The std is distributed with the compiler so can change as and when needed.

Yeah, that's why this issue, like the referenced issue about Box, is more about documenting and settings things in stone rather than changing behaviour.

orowith2os commented 1 year ago

Would it be illegal for String to ever add more fields, on top of the current pointer + capacity + length? The docs say this:

A String is made up of three components: a pointer to some bytes, a length, and a capacity.

But I think it's a bit too vague to answer my question.

If it's allowed to add more fields, then the only way you're guaranteed to access the internal Vec of a String is by using as_mut_vec. Whereas, right now, you can transmute, pointer cast, and so on, on top of as_mut_vec.

Noting that you can also pointer cast from a larger String to a Vec if the layout was guaranteed such that a Vec is the very first field of a String (repr(C), like how gobject ensures child classes can be casted up to parents)

RalfJung commented 1 year ago

Whereas, right now, you can transmute, pointer cast, and so on, on top of as_mut_vec.

No, you cannot. The docs don't say how these three components are encoded, at which type, and in which order.

This is just like for Vec, which similarly may only be "deconstructed" via into_raw_parts (or the equivalent stable APIs).

orowith2os commented 1 year ago

No, you cannot.

You can. It's unsafe as hell, so it's not guaranteed to work, but the bare minimum requirement for it to compile successfully is met: a String is the same size as a Vec<u8>. If a String were to be repr(transparent), then it's guaranteed to work. Or, that's my thought process. I'm open to discussing just how it works, or doesn't, elsewhere.

But back to my original point, would it be illegal for a String to change its internals, at all? The current wording seems a bit vague. And every bit of logic I'm seeing involving the String API says it's guaranteed to be a Vec<u8> so long as the sizes are the same and the API doesn't have a breaking change.

RReverser commented 1 year ago

It's unsafe as hell, so it's not guaranteed to work, but the bare minimum requirement for it to compile successfully is met: a String is the same size as a Vec<u8>.

Even unsafe code has rules on what you can or cannot do. If you're just transforming between unrelated types, you're basically begging the compiler to introduce UB. That's not what unsafe is for - it's for writing code that is still UB-free, just not memory safe in ways that compiler can prove.

Your code is not even guaranteed to work on all existing platforms, because w/o repr(transparent) the ABI representation of some struct A(B); can be different than representation of the nested field B - speaking as someone who ran into this in the past, although I don't remember which platform it was. So String is not guaranteed to have same ABI as Vec<u8> even though one is a wrapper for another.

would it be illegal for a String to change its internals, at all

This is what this issue is about. It's currently perfectly legal for String to change internal layout and break any such code that relies on such transmute.

It would be safe to use transmute only if it was specified as repr(transparent).

RalfJung commented 1 year ago

You can.

Typically, when people ask "can you do X?", there is an implied requirement that you can do it without causing UB. If you want to use a different definition of "can", you should say so.

In this case, transmuting String to Vec causes library UB, since the layout of String is not guaranteed. Therefore, you cannot. It may happen to compile and not even cause language UB right now, but that's a happy accident, nothing more.

wyfo commented 2 months ago

One such example is constructing Rc from a Rc<Vec> where a library might want to check UTF-8 validity before construction and then Rc::from_raw + Rc::into_raw one type into another without actually copying the data.

I ran in the same kind of issue today. Is there any cons about this change that are blocking it?