rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines
Apache License 2.0
659 stars 57 forks source link

Definitions of "representation", "validity" etc. #50

Closed asajeffrey closed 5 years ago

asajeffrey commented 5 years ago

We should add basic definitions of terms like "representation", "validity" and friends.

asajeffrey commented 5 years ago

This bogged down the discussion at https://github.com/rust-rfcs/unsafe-code-guidelines/issues/16

avadacatavra commented 5 years ago

I'll add this to the introduction i'm working on 🥇

asajeffrey commented 5 years ago

@RalfJung's blog post on the subject... https://www.ralfj.de/blog/2018/08/22/two-kinds-of-invariants.html

RalfJung commented 5 years ago

I think #16 got bogged down because there was so little to talk about in terms of representation so folks went on chatting about validity instead.^^

Part of the process here is even figuring out what some of these terms mean. IMO https://github.com/rust-rfcs/unsafe-code-guidelines/blob/master/active_discussion/representation.md separates representation quite clearly from any discussion of "invariants".

asajeffrey commented 5 years ago

One thing I'm not sure about is that because of niche optimization, representation can depend on validity, which is a bit odd because it depends on the state of memory. Should we have something like "bitstring validity" which doesn't depend on memory state? Then repr and "bitstring validity" would be mutually dependent, so repr wouldn't depend on memory.

hanna-kruppe commented 5 years ago

I guess you can think of it that way, but a compiler can't know the full set of addresses that are allocated or not allocated anyway, so it's not like type layout could actually depend on memory.

(I guess hypothetically a JIT compiler could know that some address range is not available for Rust data and use that for layout optimizations, but would that be actually something we want to rule out?)

RalfJung commented 5 years ago

Layout computation just has to do stuff that works with validity for any memory. I do not see a problem.

asajeffrey commented 5 years ago

@rkruppe: This is an issue about making the spec human-readable, not really about the technical impact.

asajeffrey commented 5 years ago

Oh, and where do discussions of provenance end up? Are they part of validity, or part of safety?

RalfJung commented 5 years ago

Provenance are part of "messy stuff that's barely understood, do we really want to talk about it". ;)

More seriously though, I'd say that's the "aliasing model" discussion, with Stacked Borrows and so on. It's part of "specifying the behavior of Rust programs", but I'd keep it separate from validity.

asajeffrey commented 5 years ago

I'd be quite happy if we end up without provenance. An example which is UB in C due to provenance:

  fn foo() {
    let x = false;
    let y = true;
    if (&x as usize + 1) == (&y as usize) {
      let z: &[bool; 2] = unsafe { &*(&x as *const bool as *const [bool;2]); }
      bar(z);
    }
  }

Should we consider foo() to be UB? What if bar is a C function?

strega-nil commented 5 years ago

@asajeffrey if that isn't UB, that would be crazy to me.

RalfJung commented 5 years ago

Can we please not discuss provenance in this issue? Open a new one or better a topic on Zulip if you really want to go there at this time already, but this issue seems like the wrong place.

asajeffrey commented 5 years ago

@RalfJung good point, we're off-topic. I filed #52.

asajeffrey commented 5 years ago

So... is it worth introducing a concept of "bitstring validity" (name to be bikeshedded), which does not depend on the state of memory? The idea being that representation and bitstring validity are mutually dependent, and that validity depends on both of them?

Examples:

Pro: bitstring validity is simpler than validity (especially if that ends up bringing in shadow state like lifetimes or provenance) and makes it clear what representation can depend on.

Con: yet another term to define.

RalfJung commented 5 years ago

@asajeffrey I have now several times defined "bitstring validity" in terms of validity and you have not yet commented on that. :)

So I am wondering if, from your perspective, there is anything wrong with just saying "bitstring validity means validity for some (existentially quantified) memory"?

Slightly more formally:

bitsring_valid(type, value) := exists memory, valid(type, value, memory)
asajeffrey commented 5 years ago

@RalfJung the problem with that defn is that it means that representation still depends on all of validity. Imagine the poor reader trying to workout the bitstring representation of a value, and discovering that they have to read about lifetimes, stacked borrows, etc.

I agree that it should be the case that bitstring-validity is validity in some memory, but the question is should that be a defn or a derived property?

RalfJung commented 5 years ago

I think it would be silly to define a property when it can be exactly derived. That will just lead to inconsistencies.

Also, since bitstring validity is literally only relevant for enum layout optimizations, most users should not ever get in contact with that notion. When you are writing unsafe code, if you are wondering about bistring validity, you are doing something wrong.

asajeffrey commented 5 years ago

@RalfJung "silly" unless you take into account the poor human reader.

Bitstring validity is needed in more cases than just enum layout, e.g. if I'm writing a (de)serializer like https://github.com/frankmcsherry/abomonation, I might want to know what the valid bitstrings are.

asajeffrey commented 5 years ago

As discussed on https://rust-lang.zulipchat.com/#narrow/stream/136281-wg-unsafe-code-guidelines/topic/meeting-2018-12-13, we could do with getting some definitions in place, to avoid us getting bogged down again.

AFAIK the terms that are uncontroversial are "size", "alignment", "validity" and "safety". We are discussing whether we also need "bitstring validity", and whether "representation" means "size"+"alignment" or if it also includes "bitstring validity".

gnzlbg commented 5 years ago

So this came on the meeting today. There was interest in doing a blog post about the work in this WG, to advertise it a bit more, try to involve more people, etc.

I think that without an informal definition of what we mean by "representation, validity, and safety invariants of a type", new people will have trouble getting up to speed, separating the different discussions or figuring out what they are about, etc. So it might be worth it to at least write some informal explanations of what these terms mean to help people get started and follow along. We should obviously be clear that these definitions aren't necessarily "precise".

RalfJung commented 5 years ago

"silly" unless you take into account the poor human reader.

We can spell out what this derived property looks like. That doesn't make it a definition though.

if I'm writing a (de)serializer like frankmcsherry/abomonation, I might want to know what the valid bitstrings are.

Why would you? Whenever you construct values, they must be valid, not just valid for some other memory.

asajeffrey commented 5 years ago

As long as there's some way for the reader to understand that a valid bool is either true or false, without having to read anything about the semantics of memory, then I don't really mind. I'm just worried that a reader is going to get lost in the definitions if we don't have a clear way of saying "you only need to read these defns if you're worried about refererence types".

asajeffrey commented 5 years ago

Oh, and note to self, if we define "V is bitstring valid for type T" as "there exists a memory M s.t. V is valid for type T in memory M" then this isn't a property we can define by induction on the structure of T. For example with this defn, any non-zero word-aligned address p is valid for type &&usize and for type &bool but (p,p) is not valid for type (&&usize, &bool).

RalfJung commented 5 years ago

this isn't a property we can define by induction on the structure of T

It is not a defined but a derived property, but it satisfies the structural principles.

RalfJung commented 5 years ago

Concerning "representation", why don't we just replace that term by "layout"? That's how it is called in the Rust compiler. Layout consists of: Size, alignment, ABI details (for passing things by-value) and field offsets.

asajeffrey commented 5 years ago

this isn't a property we can define by induction on the structure of T

It is not a defined but a derived property, but it satisfies the structural principles.

Er, what do you mean by "the structural principles"? There's no inductive definition equivalent to "V is bitstring valid for T", c.f. example above. "V is valid for T in memory M" is inductive in T.

Concerning "representation", why don't we just replace that term by "layout"? That's how it is called in the Rust compiler. Layout consists of: Size, alignment, ABI details (for passing things by-value) and field offsets.

Fine, though we've used the word "representation" a lot, replacing it by "layout" would be quite a bit of work.

strega-nil commented 5 years ago

It seems weird to call it layout - layout sounds to me like a property of subobjects in structures, not of distinct objects, and only in memory on computers, not in the value system of Rust - distinguishing between value representation and object representation makes sense, while value layout doesn't make sense at all.

asajeffrey commented 5 years ago

@ubsan You'd reserve the word "layout" just for objects in memory, e.g. on the heap or pointed to on the stack? So we'd reserve the right for the compiler to do what it likes with locals that aren't addressed using &x?

strega-nil commented 5 years ago

@asajeffrey under as-if, the compiler can do whatever. I just think layout is not a good name for "representation".

RalfJung commented 5 years ago

layout sounds to me like a property of subobjects in structures, not of distinct objects,

Of course it is a property of both whole objects and their subobjects -- as is the representation we discussed here: We talked e.g. about how repr(C) unions have offset 0 for their fields, and then of course the representation of these fields apply.

Nothing in our discussion was exclusively about distinct objects. When we talk about single-field structs having the same representation/layout as their only field, of course this is still true when you put them into e.g. an array and want to use the AFAIK not-yet-decided rule that arrays are represented just like homologous tuples.

The discussions we have had in the representations area were all about size, alignment, field offset and ABI concerns. We didn't talk about an abstract set of values vs a concrete set of bits and how they map to each other (which is what "value vs object representation" means to me, but maybe I am misinterpreting?).

RalfJung commented 5 years ago

Fine, though we've used the word "representation" a lot, replacing it by "layout" would be quite a bit of work.

I was basically just confused when I wanted to write the definition and found no way to distinguish "layout" from what we discussed as "representation".

RalfJung commented 5 years ago

Er, what do you mean by "the structural principles"? There's no inductive definition equivalent to "V is bitstring valid for T", c.f. example above. "V is valid for T in memory M" is inductive in T.

I mean the fact that bitstring validity of a struct is equivalent to bitstring validity of every field. That's a property inductive definitions have by construction, but it can still also hold for derived definitions such as "valid for some memory".

But, anyway, I propose we continue this discussion as part of the validity invariant for references. That's the only type where the distinction between validity and bitstring validity matters, right?

asajeffrey commented 5 years ago

Er, what do you mean by "the structural principles"? There's no inductive definition equivalent to "V is bitstring valid for T", c.f. example above. "V is valid for T in memory M" is inductive in T.

I mean the fact that bitstring validity of a struct is equivalent to bitstring validity of every field. That's a property inductive definitions have by construction, but it can still also hold for derived definitions such as "valid for some memory".

Yes, the problem is that it doesn't hold for "valid for some memory", e.g. any non-zero word-aligned address p is "valid for some memory" at type &&usize and at type &bool but (p,p) is not "valid for some memory" at type (&&usize, &bool).

But, anyway, I propose we continue this discussion as part of the validity invariant for references. That's the only type where the distinction between validity and bitstring validity matters, right?

Fine by me, especially if we can confine all discussions of memory and related shadow state to the defn of validity of references.

RalfJung commented 5 years ago

(p,p) is not "valid for some memory" at type (&&usize, &bool).

What makes you say that? I think it is. In fact it could even be safe. We don't have TBAA.

hanna-kruppe commented 5 years ago

For (&bool, &&usize) things should be OK, but for example given

#[repr(u8)] enum Foo { A = 1, B = 2 }
#[repr(u8)] enum Bar { C = 3, D = 4 }

then any non-zero address p is valid for &Foo and &Bar individually in some memory, but (p, p): (&Foo, &Bar) is not valid in any memory edit: in case valid references have to point to valid memory (which AIUI is still a possibility at this point).

RalfJung commented 5 years ago

@rkruppe You are assuming that validity for Foo requires the pointer to point to valid data. Some people have been suggesting this, but I think that's actually a bad idea (and I'll spell out my reasoning in the issue for reference validity once it is opened).

hanna-kruppe commented 5 years ago

Yeah I realized as soon as I sent the comment and tried to sneak it in as an edit but I was too slow 😅

asajeffrey commented 5 years ago

(p,p) is not "valid for some memory" at type (&&usize, &bool). What makes you say that? I think it is. In fact it could even be safe. We don't have TBAA.

A memory that makes p valid at type &&usize is:

  p -> q
  q -> 0x1234

A memory that makes p valid at type &bool is:

  p -> 0x1

I don't think there's any memory that makes both of them valid (since refs have to be non-zero and aligned, so bool and &usize have no overlap) so there's no memory which makes (p,p) valid at type (&&usize, &bool).

asajeffrey commented 5 years ago

Ah, this may be because your defn of validity doesn't require p to point to valid data.

RalfJung commented 5 years ago

So, we have some definitions in the glossary now. @asajeffrey are there more terms that you think should be added? If not, can we close this?

asajeffrey commented 5 years ago

Sure, fine by me to close this.