Closed asajeffrey closed 5 years ago
This bogged down the discussion at https://github.com/rust-rfcs/unsafe-code-guidelines/issues/16
I'll add this to the introduction i'm working on 🥇
@RalfJung's blog post on the subject... https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html
I think #16 got bogged down because there was so little to talk about in terms of representation so folks went on chatting about validity instead.^^
Part of the process here is even figuring out what some of these terms mean. IMO https://github.com/rust-rfcs/unsafe-code-guidelines/blob/master/active_discussion/representation.md separates representation quite clearly from any discussion of "invariants".
One thing I'm not sure about is that because of niche optimization, representation can depend on validity, which is a bit odd because it depends on the state of memory. Should we have something like "bitstring validity" which doesn't depend on memory state? Then repr and "bitstring validity" would be mutually dependent, so repr wouldn't depend on memory.
I guess you can think of it that way, but a compiler can't know the full set of addresses that are allocated or not allocated anyway, so it's not like type layout could actually depend on memory.
(I guess hypothetically a JIT compiler could know that some address range is not available for Rust data and use that for layout optimizations, but would that be actually something we want to rule out?)
Layout computation just has to do stuff that works with validity for any memory. I do not see a problem.
@rkruppe: This is an issue about making the spec human-readable, not really about the technical impact.
Oh, and where do discussions of provenance end up? Are they part of validity, or part of safety?
Provenance are part of "messy stuff that's barely understood, do we really want to talk about it". ;)
More seriously though, I'd say that's the "aliasing model" discussion, with Stacked Borrows and so on. It's part of "specifying the behavior of Rust programs", but I'd keep it separate from validity.
I'd be quite happy if we end up without provenance. An example which is UB in C due to provenance:
fn foo() {
let x = false;
let y = true;
if (&x as usize + 1) == (&y as usize) {
let z: &[bool; 2] = unsafe { &*(&x as *const bool as *const [bool;2]); }
bar(z);
}
}
Should we consider foo()
to be UB? What if bar
is a C function?
@asajeffrey if that isn't UB, that would be crazy to me.
Can we please not discuss provenance in this issue? Open a new one or better a topic on Zulip if you really want to go there at this time already, but this issue seems like the wrong place.
@RalfJung good point, we're off-topic. I filed #52.
So... is it worth introducing a concept of "bitstring validity" (name to be bikeshedded), which does not depend on the state of memory? The idea being that representation and bitstring validity are mutually dependent, and that validity depends on both of them?
Examples:
bool
is 0x0
or 0x1
,&T
is non-zero and a multiple of T
s alignment,Pro: bitstring validity is simpler than validity (especially if that ends up bringing in shadow state like lifetimes or provenance) and makes it clear what representation can depend on.
Con: yet another term to define.
@asajeffrey I have now several times defined "bitstring validity" in terms of validity and you have not yet commented on that. :)
So I am wondering if, from your perspective, there is anything wrong with just saying "bitstring validity means validity for some (existentially quantified) memory"?
Slightly more formally:
bitsring_valid(type, value) := exists memory, valid(type, value, memory)
@RalfJung the problem with that defn is that it means that representation still depends on all of validity. Imagine the poor reader trying to workout the bitstring representation of a value, and discovering that they have to read about lifetimes, stacked borrows, etc.
I agree that it should be the case that bitstring-validity is validity in some memory, but the question is should that be a defn or a derived property?
I think it would be silly to define a property when it can be exactly derived. That will just lead to inconsistencies.
Also, since bitstring validity is literally only relevant for enum layout optimizations, most users should not ever get in contact with that notion. When you are writing unsafe code, if you are wondering about bistring validity, you are doing something wrong.
@RalfJung "silly" unless you take into account the poor human reader.
Bitstring validity is needed in more cases than just enum layout, e.g. if I'm writing a (de)serializer like https://github.com/frankmcsherry/abomonation, I might want to know what the valid bitstrings are.
As discussed on https://rust-lang.zulipchat.com/#narrow/stream/136281-wg-unsafe-code-guidelines/topic/meeting-2018-12-13, we could do with getting some definitions in place, to avoid us getting bogged down again.
AFAIK the terms that are uncontroversial are "size", "alignment", "validity" and "safety". We are discussing whether we also need "bitstring validity", and whether "representation" means "size"+"alignment" or if it also includes "bitstring validity".
So this came on the meeting today. There was interest in doing a blog post about the work in this WG, to advertise it a bit more, try to involve more people, etc.
I think that without an informal definition of what we mean by "representation, validity, and safety invariants of a type", new people will have trouble getting up to speed, separating the different discussions or figuring out what they are about, etc. So it might be worth it to at least write some informal explanations of what these terms mean to help people get started and follow along. We should obviously be clear that these definitions aren't necessarily "precise".
"silly" unless you take into account the poor human reader.
We can spell out what this derived property looks like. That doesn't make it a definition though.
if I'm writing a (de)serializer like frankmcsherry/abomonation, I might want to know what the valid bitstrings are.
Why would you? Whenever you construct values, they must be valid, not just valid for some other memory.
As long as there's some way for the reader to understand that a valid bool
is either true
or false
, without having to read anything about the semantics of memory, then I don't really mind. I'm just worried that a reader is going to get lost in the definitions if we don't have a clear way of saying "you only need to read these defns if you're worried about refererence types".
Oh, and note to self, if we define "V
is bitstring valid for type T
" as "there exists a memory M
s.t. V
is valid for type T
in memory M
" then this isn't a property we can define by induction on the structure of T
. For example with this defn, any non-zero word-aligned address p
is valid for type &&usize
and for type &bool
but (p,p)
is not valid for type (&&usize, &bool)
.
this isn't a property we can define by induction on the structure of T
It is not a defined but a derived property, but it satisfies the structural principles.
Concerning "representation", why don't we just replace that term by "layout"? That's how it is called in the Rust compiler. Layout consists of: Size, alignment, ABI details (for passing things by-value) and field offsets.
this isn't a property we can define by induction on the structure of T
It is not a defined but a derived property, but it satisfies the structural principles.
Er, what do you mean by "the structural principles"? There's no inductive definition equivalent to "V is bitstring valid for T", c.f. example above. "V is valid for T in memory M" is inductive in T.
Concerning "representation", why don't we just replace that term by "layout"? That's how it is called in the Rust compiler. Layout consists of: Size, alignment, ABI details (for passing things by-value) and field offsets.
Fine, though we've used the word "representation" a lot, replacing it by "layout" would be quite a bit of work.
It seems weird to call it layout - layout sounds to me like a property of subobjects in structures, not of distinct objects, and only in memory on computers, not in the value system of Rust - distinguishing between value representation and object representation makes sense, while value layout doesn't make sense at all.
@ubsan You'd reserve the word "layout" just for objects in memory, e.g. on the heap or pointed to on the stack? So we'd reserve the right for the compiler to do what it likes with locals that aren't addressed using &x
?
@asajeffrey under as-if, the compiler can do whatever. I just think layout is not a good name for "representation".
layout sounds to me like a property of subobjects in structures, not of distinct objects,
Of course it is a property of both whole objects and their subobjects -- as is the representation we discussed here: We talked e.g. about how repr(C)
unions have offset 0 for their fields, and then of course the representation of these fields apply.
Nothing in our discussion was exclusively about distinct objects. When we talk about single-field structs having the same representation/layout as their only field, of course this is still true when you put them into e.g. an array and want to use the AFAIK not-yet-decided rule that arrays are represented just like homologous tuples.
The discussions we have had in the representations area were all about size, alignment, field offset and ABI concerns. We didn't talk about an abstract set of values vs a concrete set of bits and how they map to each other (which is what "value vs object representation" means to me, but maybe I am misinterpreting?).
Fine, though we've used the word "representation" a lot, replacing it by "layout" would be quite a bit of work.
I was basically just confused when I wanted to write the definition and found no way to distinguish "layout" from what we discussed as "representation".
Er, what do you mean by "the structural principles"? There's no inductive definition equivalent to "V is bitstring valid for T", c.f. example above. "V is valid for T in memory M" is inductive in T.
I mean the fact that bitstring validity of a struct is equivalent to bitstring validity of every field. That's a property inductive definitions have by construction, but it can still also hold for derived definitions such as "valid for some memory".
But, anyway, I propose we continue this discussion as part of the validity invariant for references. That's the only type where the distinction between validity and bitstring validity matters, right?
Er, what do you mean by "the structural principles"? There's no inductive definition equivalent to "V is bitstring valid for T", c.f. example above. "V is valid for T in memory M" is inductive in T.
I mean the fact that bitstring validity of a struct is equivalent to bitstring validity of every field. That's a property inductive definitions have by construction, but it can still also hold for derived definitions such as "valid for some memory".
Yes, the problem is that it doesn't hold for "valid for some memory", e.g. any non-zero word-aligned address p
is "valid for some memory" at type &&usize
and at type &bool
but (p,p)
is not "valid for some memory" at type (&&usize, &bool)
.
But, anyway, I propose we continue this discussion as part of the validity invariant for references. That's the only type where the distinction between validity and bitstring validity matters, right?
Fine by me, especially if we can confine all discussions of memory and related shadow state to the defn of validity of references.
(p,p) is not "valid for some memory" at type (&&usize, &bool).
What makes you say that? I think it is. In fact it could even be safe. We don't have TBAA.
For (&bool, &&usize)
things should be OK, but for example given
#[repr(u8)] enum Foo { A = 1, B = 2 }
#[repr(u8)] enum Bar { C = 3, D = 4 }
then any non-zero address p
is valid for &Foo
and &Bar
individually in some memory, but (p, p): (&Foo, &Bar)
is not valid in any memory edit: in case valid references have to point to valid memory (which AIUI is still a possibility at this point).
@rkruppe You are assuming that validity for Foo
requires the pointer to point to valid data. Some people have been suggesting this, but I think that's actually a bad idea (and I'll spell out my reasoning in the issue for reference validity once it is opened).
Yeah I realized as soon as I sent the comment and tried to sneak it in as an edit but I was too slow 😅
(p,p) is not "valid for some memory" at type (&&usize, &bool). What makes you say that? I think it is. In fact it could even be safe. We don't have TBAA.
A memory that makes p
valid at type &&usize
is:
p -> q
q -> 0x1234
A memory that makes p
valid at type &bool
is:
p -> 0x1
I don't think there's any memory that makes both of them valid (since refs have to be non-zero and aligned, so bool
and &usize
have no overlap) so there's no memory which makes (p,p)
valid at type (&&usize, &bool)
.
Ah, this may be because your defn of validity doesn't require p to point to valid data.
So, we have some definitions in the glossary now. @asajeffrey are there more terms that you think should be added? If not, can we close this?
Sure, fine by me to close this.
We should add basic definitions of terms like "representation", "validity" and friends.