p4lang / p4-spec

Apache License 2.0
177 stars 80 forks source link

Clarify uninitialized values with examples #988

Closed MollyDream closed 1 year ago

MollyDream commented 2 years ago

The spec talks about uninitialized values in section 8.22, but given the amount of information, this section is very hard to follow and easy to misinterpret. I think adding more examples will clarify "a consistent value [does not have to be] returned across multiple such [reads]" of any uninitialized value, and will clarify what a "read" actually means. Example 1:

bit<8> x;
bit<8> y = x+1;
bool r = (y==y);

The “read” of the uninitialized variable x occurs inside the expression evaluation x+1, taking an unspecified 8-bit value for the subexpression x. The value of x+1 initializes y deterministically, so the following two reads of y gives the same value, leading to r = true. Example 2:

bit<8> x;
bool r = (x==x);

The “read” of the uninitialized variable x occurs twice inside the expression evaluation x == x, each possibly resulting in a different value, sor can be either true or false. Example 3:

void f(in bit<8> x, out bool y) {
   y = (x==x);
}
void g(...) {
  bit<8> a; 
  bool b;
  f(a,b);
}

Copy-in of an uninitialized variable to an in parameter of a function counts as a "read", resulting in a deterministic unspecified value of x in the function f. Therefore, y = true. Example 4:

void f(out bit<8> x) {
  bool r1 = (x == x);
}
void g(...) {
  bit<8> a = 1; 
  f(a);
  bool r2 = (a==a);
}

Since x is an uninitialized out parameter inside the function f, r1 can be either true or false. After executing f, copy-out of uninitialized variables **does not*** count as a read, so a becomes uninitialized, and r2 can be either true or false.

* This is based on our understanding of the spec. We think this design choice of "copy-in is a read but copy-out not" makes sense since it makes the function inlining smooth. Specifically, when the compiler inlines a function call, all the in parameters just have to be replaced by the arguments, and all the out arguments be re-declared; this pertains the behavior of uninitialized parameters in function inlining.

jafingerhut commented 2 years ago

@MollyDream I am all for examples clarifying some of these subtleties, so thanks for your write-up.

I agree that when I wrote the original text of this, I was not explicitly addressing what constitutes a "read" of a variable, and copy-in and copy-out were not explicitly addressed, and it is worth considering whether either or both of those are a "read" or not.

I was curious if you found anything in the current spec that would cause different behavior relative to uninitialized values for copy-in vs. copy-out? Or perhaps your choice above is based upon consideration of the kinds of transformations that a correct P4 compiler can make to the original program?

QinshiWang commented 2 years ago

In general, I don't strongly support this, but just for information. The reason in favor of treating copy-out (and maybe also copy-in sometimes) as non-read is that it makes function inlining more efficient. For example,

void f(out bit<8> x, out bit<8> y) {
  y = 0;
}

{
  bit<8> x, y; // x is uninitilized
  f(x, y);
}

If copy-out is not read, we can safely transform it into

{
  bit<8> x, y; // x is uninitilized
  y = 0;
}

If copy-out is read, it's not safe to do it because

{
  bit<8> x, y; // x is uninitilized
  y = 0; // x is still uninitilized
}
jafingerhut commented 2 years ago

During 2022-Jan-03 LDWG meeting, Mihai & Andy came to the realization that perhaps p4c's LocalCopyPropagation pass can result in output programs that are more nondeterministic than the input program, unless the compiler can somehow restrict the use of this transformation to cases where the variables/fields being accessed in the expression are "at least as deterministic as" the original program, e.g. all of them have initialized values.

AI Andy: Try to find a small P4_16 program where the current p4c implementation's LocalCopyPropagation pass produces a program that is strictly more non-deterministic than the input program, and add it here. My guess is that finding such a program will be straightforward, e.g. one with an expression like (a-a) or (a^a) where a is initialized in the user's P4 program and thus these expressions are defined to deterministically evaluate to 0 in the input program, but replaced with what look like equal expressions but containing uninitialized variables only, so they do not deterministically evaluate to 0 in the compiler-transformed program. The harder question is whether there are practical programs that have desired behavior in their user-written form, but buggy behavior in their compiler-transformed form.

In general, I am fine with any definition of the P4 language spec here for which we know which compiler transformations are correct (i.e. result is at least as determinstic as the input program), and which are incorrect, and the incorrect ones are optimization passes we are willing to live without.

jfingerh commented 2 years ago

For anyone curious, there is a very short P4 program for which today's p4c can increase the number of possible behaviors of an input P4 program, because of the LocalCopyPropagation pass, when at least some of the variables in the resulting expressions are uninitialized: https://github.com/jafingerhut/p4-guide/tree/master/nondeterminism#demonstration-of-localcopypropagation-increasing-nondeterminism-of-input-program

Filed this issue for p4c with a potential idea for preventing this: https://github.com/p4lang/p4c/issues/3022

jafingerhut commented 2 years ago

It seems to me that the core of the questions here are not when reads occur, but when a variable is considered to be initialized, vs. uninitialized.

It seems to make intuitive sense that assigning a value to a variable X should make X become initialized.

So open questions raise by this issue are:

(a) should a direction 'in' parameter always become initialized when its action/function/control/parser is invoked? Or should there be some cases where it can remain uninitialized?

The text of Example 3 in the original issue suggests that the answer should be that a direction in parameter always becomes initialized. I do not know if that is the best/right answer, but wanted to explicitly state the assumption being made in that example. To my knowledge, the current spec does not answer this question explicitly.

(b) when an action/function/control/parser is invoked as foo(a), and the parameter corresponding to a is declared with direction out, should a always become initialized by the call foo(a)? Or should there be some cases where it can remain uninitialized?

The text of Example 4 in the original issue suggests that the answer is that a direction out parameter a at the call site does NOT necessarily become initialized after the call foo(a). Again, I do not know if that is the best/right answer, and again, I do not think that the current spec answers this question explicitly.

(c) Both of the questions above arise for direction inout parameters.

As I mentioned in the LDWG meeting on 2021-Jan-04, I suspect that different choices to the answers to question (a), (b), and (c) may affect which compiler optimizations are correct (i.e. never make the result after the transformation more nondeterministic), and which are incorrect (i.e. can make the result more nondeterminstic). I don't know which transformations are affected by the answers to (a), (b), and (c) yet, let alone how these answers affect them.

mihaibudiu commented 2 years ago

Here is an article about uninitialized values from someone who wrote a Ph.D. thesis on semantics: https://www.ralfj.de/blog/2019/07/14/uninit.html

jfingerh commented 2 years ago

Thanks for that link. It looks highly relevant to P4's current definition of the behavior of uninitialized variables, from what I can tell. I haven't read all the linked discussions yet, as I only spent a couple of minutes on the directly linked article. From that alone, I get the impression that the author has thought about these issues, but I do not yet know if they have recommendations that help us answer the question posed by this issue.

mihaibudiu commented 2 years ago

The main point of his text is that we (the LDWG) should define what "undefined" means, and not specific implementations, like Tofino or BMv2. The abstract machine for the language definition, aka the spec, has choices to make, and we make them. We can choose whether a read of an undefined variable "freezes" the value or does not. Both are legitimate choices.

jnfoster commented 2 years ago

Ralf was almost a P4 h4x0r! (I was this close to hiring him as a postdoc.) Sadly, he decided to go to some place called MIT.

jfingerh commented 2 years ago

"The main point of his text is that we (the LDWG) should define what "undefined" means, and not specific implementations, like Tofino or BMv2." I agree with this statement, and believed it before I read the linked article. I still do not know the best way to answer the questions raised by this issue, though :-(

mihaibudiu commented 2 years ago

There may be no "best" answer. This is why language design is hard. As you see, other communities struggle with the same problems.

QinshiWang commented 2 years ago

Just recall the reason that we need partially initialized value by bit. Consider we assigning x by parts:

void f(in bit<4> y, in bit<4> z) {
   bit<8> x;
   x[3:0] = y;
   x[7:4] = z;
   ...
}

The only intuitive interpretation of x after two partial assignments is x = y ++ z. We must have bitwise value to support it.

jnfoster commented 1 year ago

In the interest of tidying up the set of active issues on the P4 specification repository, I'm marking this as "stalled" and closing it. Of course, we can always re-open it in the future if there is interest in resurrecting it.