p4lang / p4-spec


Change parser exception model and provide better controls for exceptional situation handling #880

Closed vgurevich closed 1 year ago

vgurevich commented 4 years ago


Introduction

The current spec provides a very straightforward definition of parsing semantics, including a detailed definition of the behaviors expected from the methods of the packet_in extern.

It turns out, however, that the specified behavior is overly conservative and prevents most optimizations that can be performed on high-speed hardware. In fact, a detailed analysis of one popular implementation shows that it follows a different set of rules, and making it spec-compliant "to the letter" would make the code a lot slower and more expensive (in terms of the required resources). Moreover, the current semantics makes it impractical to implement the "parse-payload-as-a-header" approach, which is often used to implement a feature called User-Defined Fields (UDF) -- a standard in any switch/router today.

In the normal course of events, the differences are imperceptible. The problem arises when an exceptional situation (such as error.PacketTooShort) occurs -- that is where the theoretical model described in the spec and the practice start to diverge, and that is the problem the current proposal aims to rectify.

Problem Description

As per the current (v1.2.1) spec, the way we can reason about a parser program is based on two assumptions:

  1. All individual operations (and, specifically, packet_in methods) employ "all-or-nothing" semantics. All the checks (e.g. whether there are enough bytes in the packet) are done beforehand, and only after all checks have been performed does an extract(), lookahead() or advance() take place.
  2. As everywhere else in P4, sequential execution is assumed -- the effects of executing the sequence of statements within a parser state have to be the same as if these statements were executed strictly sequentially.

It turns out that both assumptions place a very heavy burden on many high-speed implementations. We will start with the second one, since it is the key here.

One can notice that the sequential execution requirement works very well for actions and controls, i.e. highly parallel, high-speed targets have no problem parallelizing seemingly sequential code without altering program semantics. Unfortunately, parsers present a unique challenge because they can have exceptions, such as the previously mentioned error.PacketTooShort, and the main problem is that these exceptions are currently defined as precise exceptions. In other words, as per today's spec:

  1. Each exception can be clearly and unambiguously associated with a particular statement in a given parser state
    • Note that not all statements can produce exceptions (e.g. simple assignment statements should not)
  2. One can safely assume that the code in the statements prior to the one where the exception occurred was fully executed
  3. One can safely assume that the code in the statement where the exception occurred was not executed. This is the consequence of the atomicity ("all-or-nothing") property generally ascribed to each statement.
  4. One can also assume that the code in the subsequent statements was not executed

These assumptions make most optimizations extremely difficult or outright impossible, because parsers in high-speed hardware also employ a lot of parallel techniques, and when you parallelize things it is impossible to have precise exceptions. In fact, the opposite is true -- one almost certainly cannot tell which statements have been executed and which ones have not. If one wants to implement precise exception semantics, then they should (instruct the compiler to) replace parallel execution with sequential execution, and that is prohibitively expensive.

Some Background on Hardware Parsers

There are several different hardware parsing techniques in existence, but they have many things in common:

  1. Parsing is a relatively slow operation, since it needs to be done in multiple steps. The reasons for that are:
     a. Bytes arrive at a finite speed. E.g. if the device runs at 1GHz and the port speed is 100Gbps, then 100 bits are received at each clock tick.
     b. Transitions in the state machine do take time. One needs to extract the Ethernet header (or, at least, the Ethertype) in order to be able to decide what to do next. It is possible to shorten the required parsing time by doing deeper lookaheads and having more transitions, but in practice all parsers support a limited number of transitions and do cap the amount of time that a packet can spend in the parser
  2. Hardware parsers are typically capable of executing many operations (expressed as P4 statements) in parallel.
  3. Hardware parsers have some very specific limits that need to be taken into account:
     a. The amount of data that can be extracted per clock is usually limited
     b. The number of bytes the pointer can be advanced by is usually limited
     c. The depth of lookahead is usually limited
     d. Packet length is usually not available in advance
     e. The overall number of operations that can be executed in one state (clock), however, is limited

As a result, the compilers for high-speed targets have to perform the following transformations of the original parser graph to ensure the highest degree of parallelization, while at the same time satisfying the target-specific constraints. For example:
  a. Sequences of statements that do not have data dependencies between them are parallelized and effectively reordered (there is no order when things are done in parallel anyway)
  b. "Long" states, that require more operations than can be performed in parallel, might be split into sequences of shorter states to comply with target-specific constraints
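
For illustration, here is a hypothetical sketch of transformation (b): a "long" state that exceeds a (made-up) per-state extraction budget is split by the backend into two shorter states. The header and state names are invented for the example.

    /* Original program text */
    state parse_tunnel {
        pkt.extract(hdr.outer_udp);
        pkt.extract(hdr.vxlan);
        pkt.extract(hdr.inner_ethernet);
        transition select(hdr.inner_ethernet.etherType) {
            0x0800: parse_inner_ipv4;
            default: accept;
        }
    }

    /* After a purely illustrative target-specific split */
    state parse_tunnel {
        pkt.extract(hdr.outer_udp);
        pkt.extract(hdr.vxlan);
        transition parse_tunnel_1;
    }

    state parse_tunnel_1 {
        pkt.extract(hdr.inner_ethernet);
        transition select(hdr.inner_ethernet.etherType) {
            0x0800: parse_inner_ipv4;
            default: accept;
        }
    }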

It is easy to see that these vital optimizations are not quite compatible with either the "all-or-nothing" (atomic) statement semantics or the precise nature of exceptions as currently defined.

Analysis

In reality, the vast majority of data plane programs rarely require this strict semantics. Instead, most programs use a mix of the following three "modes" that I'll describe below under the following names, in the order I see them being used in practice:

  1. "No assumptions" mode
  2. "As-much-as-you-can" mode
  3. "All-or-nothing" mode.

"No Assumptions" Mode

Aside from the fact that most programs do not bother processing any parser exceptions in the first place :), the ones that do actually assume nothing about packets that were received with a parser exception. To process such packets they use the following strategies:

  1. The packets are dropped
  2. The packets are reparsed and processed differently
     a. The most common case is that the packets are mirrored to the control plane and thus are re-parsed by the egress pipeline. Typically "re-parsed" means that the whole packet is treated as payload
     b. The less common case is that the packets are resubmitted and parsed again by the ingress parser, usually not as deep or using a different algorithm (based on the knowledge that the parser exception took place on the first pass)

This mode is used in probably 99% of all the code. Obviously, it allows the compiler to do a lot of optimizations of the code, using the strategies described above.
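
For concreteness, here is a minimal sketch of strategy 1 (drop), assuming a v1model-style architecture that exposes the parser error to the ingress control via standard metadata; the header/metadata type names are placeholders.

    control ingress(inout headers_t hdr,
                    inout meta_t meta,
                    inout standard_metadata_t standard_metadata)
    {
        apply {
            if (standard_metadata.parser_error != error.NoError) {
                /* "No assumptions": do not rely on any header or metadata contents here;
                   just drop the packet (or, alternatively, send it to the control plane) */
                mark_to_drop(standard_metadata);
                return;
            }
            /* ... normal processing ... */
        }
    }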

"As-much-as-you-can" Mode

This mode can typically be found in a couple of specific places inside a typical "industry-standard" parser and is used to implement the so-called User-Defined Field (UDF) functionality, where the data plane allows the user to match on the contents of an arbitrary header that is not necessarily recognized by the parser (and the P4 program).

The typical code looks like this:

    typedef bit<32> l4_udf_t;
    typedef bit<64> l7_udf_t;
    l4_udf_t  l4_udf;
    l7_udf_t  l7_udf;

    state parse_ipv4 {
        pkt.extract(hdr.ipv4);
        transition select(hdr.ipv4.ihl) {
            5: parse_ipv4_no_options;
            6..15: parse_ipv4_options;
            default: bad_ipv4_ihl;
        }
    }

    state parse_ipv4_options {
        pkt.extract(hdr.ipv4_options, ((bit<32>)hdr.ipv4.ihl - 5) * 4 * 8 /* size in bits */);
        transition parse_ipv4_no_options;
    }

    /* Process IPv4 (L4) UDF */
    state parse_ipv4_no_options {
        l4_udf = pkt.lookahead<l4_udf_t>();
        transition select(hdr.ipv4.protocol, hdr.ipv4.frag_offset) {
                ....
        }
    }

    /* Process IPv6 (L3) UDF */
    state parse_ipv6 {
        pkt.extract(hdr.ipv6);
        l4_udf = pkt.lookahead<l4_udf_t>();
        transition select(hdr.ipv6.next_hdr) {
             ...
        }
    }

    /* Process L7 UDFs */
    state parse_tcp {
        pkt.extract(hdr.tcp);
        l7_udf = pkt.lookahead<l7_udf_t>(); /* Sometimes you might see an extract of a similar header as well */
        transition accept;
    }

In this example, the desired behavior in case the packet is too short for the whole UDF is different -- the users would prefer that the parser extract (or that lookahead return) the first N bytes that are present and then fill the rest with 0s.

This is something that is difficult to achieve using the current specification. One way to do it might be to write the code like this:

    state parse_ipv4_no_options {
        /* Each lookahead reads one byte deeper into the packet; the last byte
           of each result is the next byte of the UDF */
        l4_udf[31:24] = pkt.lookahead<bit<8>>()[7:0];
        l4_udf[23:16] = pkt.lookahead<bit<16>>()[7:0];
        l4_udf[15:8]  = pkt.lookahead<bit<24>>()[7:0];
        l4_udf[7:0]   = pkt.lookahead<bit<32>>()[7:0];
        ...
    }

Obviously, this is difficult to both write and read.
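
With the lookahead_greedy() method proposed later in this issue, the same intent could be expressed directly. This is a hypothetical sketch, not existing syntax:

    state parse_ipv4_no_options {
        /* Returns as many bytes as are available, zero-fills the rest,
           and still records error.PacketTooShort */
        l4_udf = pkt.lookahead_greedy<l4_udf_t>();
        transition select(hdr.ipv4.protocol, hdr.ipv4.frag_offset) {
                ....
        }
    }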

"All-or-nothing" Mode

Note that a truly robust data plane should be able to distinguish between PacketTooShort exceptions that occur while extracting UDF data and those that take place while parsing the real headers (since the latter typically indicate that the packet is, indeed, malformed). This is where "All-or-nothing" mode can help.

In the UDF example above, suppose, that the PacketTooShort exception did occur. In that case, if the program can use the validity of previously extracted headers as a guarantee that they were, indeed, fully extracted, it might still be able to process the packet correctly.

Unfortunately, as described above, such a mode is pretty difficult to implement in practice. For example, some headers are truly long, and by the time the parser notices that the packet is too short, some portions of these headers have already been extracted, the pointer advanced, etc., which typically leads to problems deparsing the packet. That's why, more often than not, programmers have to resort to "No Assumptions" mode.

Mixing modes

A better approach might be to allow the programmer to switch between different modes, depending on what is actually needed. For example, in Ethernet-based systems it is quite safe to use "No Assumptions" mode at the very beginning of the packet, since the minimum frame size is guaranteed to be 64 bytes (60 bytes without the FCS). "As-much-as-you-can" mode is only needed in special cases (such as extracting UDFs) and makes little sense for a typical lookahead or header extract. "All-or-nothing" mode is sometimes needed for some "critical" headers, but more often than not "No Assumptions" mode works quite well too.

The "critical" headers often occur in the programs that need to parse tunneled protocols (e.g. VXLAN). Since parsing is done before other processing takes place, it is not always possible to know whether one needs to parse the inner headers or not. This is where PacketTooShort situations often occur and that's where "all-or-nothing" mode really helps (provided that other limitations can be overcome).

Proposals

Make "No Assumptions" mode the default

This will dictate that exceptions are imprecise and in the case of a parser error, one should not rely on the contents of headers and variables set in the parser.

Pros:

  1. This follows the current practice
  2. This will lead to the most efficient code, generated by default without the need to change the programs
  3. The change is relatively safe given that most programs do not seem to rely on the current spec too much
  4. It is better to do it now and tighten later than do otherwise

Cons:

  1. Formally speaking, this is an incompatible change
  2. It takes away some safety net
  3. It makes it more difficult to reason about the programs

Rebuttals to the cons:

  1. The probability of breaking programs that exist today is quite low, since most have been written using "No Assumptions" mode anyway. Moreover, the potential breakage can occur on the exceptional path only, and the few people that process these exceptions will have no problem adjusting the code
  2. It is not clear whether this "safety net" exists in the current practical implementations or whether it has actually saved anyone
  3. It makes it more difficult to reason about programs in exceptional situations only

Provide additional packet_in methods that will allow P4 programmers to express their intentions clearly and unambiguously

We can extend the list of packet_in methods as follows (naming is open for discussion):

  1. Current methods will use "No assumptions" mode: extract(), lookahead(), advance()
  2. Optional methods that will use "All-or-nothing" mode, e.g. extract_atomic(), lookahead_atomic(), advance_atomic()
  3. Optional methods that will use "As-much-as-you can" mode, e.g. extract_greedy(), lookahead_greedy(), advance_greedy()
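
For concreteness, here is a sketch of how the extended extern might be declared; the existing signatures are the ones from core.p4, while the new method names come from this proposal and their exact signatures are illustrative only.

    extern packet_in {
        /* Existing methods, redefined to use "No assumptions" mode */
        void extract<T>(out T hdr);
        void extract<T>(out T variableSizeHeader, in bit<32> variableFieldSizeInBits);
        T lookahead<T>();
        void advance(in bit<32> sizeInBits);
        bit<32> length();

        /* Proposed "All-or-nothing" variants (today's spec semantics) */
        void extract_atomic<T>(out T hdr);
        T lookahead_atomic<T>();
        void advance_atomic(in bit<32> sizeInBits);

        /* Proposed "As-much-as-you-can" variants */
        void extract_greedy<T>(out T hdr);
        T lookahead_greedy<T>();
        void advance_greedy(in bit<32> sizeInBits);
    }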

Pros:

  1. These methods will allow the programmers to clearly express their intentions.
  2. It is a lot better to have clear methods, rather than semantic-altering annotations
  3. Having the "atomic()" group will provide clear compatibility with the previous spec

Cons:

  1. "Too many methods"
  2. Not every architecture can implement everything.
  3. On some architectures it might or might not be possible to use certain methods, depending on the particulars of a given header or type. For example, if the maximum number of bytes a given architecture can extract per clock is 20, then atomic extraction of a 40-byte-long IPv6 header might not be possible.

Rebuttals to the cons:

  1. Nobody has to use everything if they do not need to
  2. True. Unfortunately, packet_in is a standard extern and it seems to be difficult for individual architectures to define additional methods for this extern. However, this can be considered
  3. As usual, a target-specific compiler can clearly error out in case it is being asked to do something impossible

Add barriers

In parser code, calls to various packet_in methods are often mixed with other operations, e.g. metadata assignments. Barriers provide an explicit way for the programmer to ensure the completion of a certain statement sequence, which in turn facilitates safer and more efficient parallelization.

Let us look at this simple sequence:

     state s1 {
        pkt.extract_atomic(hdr.h1);
        var1 = 1;
        pkt.extract_atomic(hdr.h2);
        var2 = 1;
        transition s2;
    }

Let us assume that the default will be "No assumptions" mode (as proposed above). In the case of a PacketTooShort exception happening during the extraction of hdr.h2, we will know that the header h2 will not be valid. But what about the values of variables var1 and var2? Given that these variables are completely separate from the headers and that there are no data dependencies, an optimizing compiler has the option to group these assignments, to place them at the beginning of the state, in the middle or at the end, or to combine them with the extractions (depending on the HW specifics), and this is something we do not want to prevent by default. However, in some cases we need to be able to reason about their values.

This problem can be solved by inserting barriers that ensure that the operations performed before them are actually completed. In the case of the optimizing compilers described above, barriers will place limits on what can or cannot be rearranged, but only in the places where it is important.

We can define "barrier()" (naming is subject to discussion) as a standard extern function or as an additional method of packet_in and use it like this:

     state s1 {
        pkt.extract_atomic(hdr.h1);
        var1 = 1;
        barrier();
        pkt.extract_atomic(hdr.h2);
        var2 = 1;
        transition s2;
    }

Conclusion

The proposal above provides clear mechanisms for programmers to express their intentions and more options to handle parser exceptions, while at the same time making legacy code more amenable to deep optimization.

mihaibudiu commented 4 years ago

I am not sure that P4-16 exceptions are "precise"; the architecture is allowed to do anything in the reject state or before calling the next block (e.g., ingress). Normally it cannot modify headers, since it does not know about them, but there is nothing that forbids it from writing arbitrary bit patterns in the headers if it wants. In fact, it can do this even when you transition to the accept state.

vgurevich commented 4 years ago

@mbudiu-vmw -- my statement about the precise nature of the exceptions was, indeed, based on the pseudocode definitions that we have in the spec and the assumption that verify() will immediately and instantaneously transition the parser to the reject state. If you believe that this assumption is not correct, we should probably amend the spec to say so, and this would probably imply that the "no assumptions" model is already there. Again, we should then state it explicitly.

With the first issue hopefully out of the way, let's then look at the remaining proposals that should allow us to make some assumptions when we need them, perhaps at some extra cost.

mihaibudiu commented 4 years ago

Yes, it does transition to reject immediately. We don't say what happens in reject, though. The architecture could do many things, but probably not everything (like continue parsing where it left off). Changing the packet_in API is probably the cleanest way.

vgurevich commented 4 years ago

@mbudiu-vmw, note also, that while the architecture is certainly free to do anything inside the reject state, assuming that it can "undo" certain things (e.g. some assignments) is probably pushing it. So I'd think we should add some words that will more easily allow for the imprecise exceptions.

jafingerhut commented 4 years ago

Is it accurate to say that your proposed methods ending with _atomic() are exactly the same as the ones in the P4_16 language spec today, without the _atomic() suffix?

If so, an alternate proposal is to add the _greedy() ones, keep the spec'd ones as they are currently, and add new methods with _noassumptions() (or some better name). Then there is no backwards compatibility issue.

I think I understand how to specify the _greedy() ones you propose. Would it be reasonable to say that lookahead_greedy() is definitely useful, but that it is more questionable whether extract_greedy() and advance_greedy() have good use cases? Or do you have use cases in mind for those, too?

I have much less idea how you propose to define the _noassumptions() methods. The final state of the parser can be anything if the packet runs out of data? If nothing is guaranteed, then it seems the only possible thing you could safely do if the packet ran out of data on one of those operations is to resubmit or clone the packet as received before parsing, because the packet after parsing could be mangled beyond recognition, with data arbitrarily missing from anywhere inside of it.

jafingerhut commented 4 years ago

One could even imagine having two errors, the existing PacketTooShort that is recorded by the precise, _atomic() methods, and a new PacketTooShortNoGuarantees error if the parser ended in a state where the only safe thing to do if you want to keep the packet is resubmit or clone the as-received-before-parsing-began packet.

vgurevich commented 4 years ago

@jafingerhut, you are correct, the proposed methods with the _atomic() suffix are exactly the same as the ones we have in the spec without the suffix.

With regards to the alternate proposal, I think the question is what the goal is: to make the spec backwards compatible, or to make the programs and compilers backwards compatible? Keeping the "no-suffix" methods as they are today will keep the spec backwards compatible, but will essentially require at least some compiler implementors to immediately rewrite their compilers and the data plane programmers to rewrite their data plane programs. My proposal, on the other hand, will certainly change the spec, but would allow pretty much everyone to keep the code as it is today.

Another reason why I think that the "no assumptions" model should be the default is the general principle that the default constructs should ideally be the cheapest. If someone wants to pay for predictability, they need to ask for it explicitly. I do believe that this is an important principle that we should follow for the next decade or two. Once packet processing hardware becomes as cheap and abundant as general-purpose CPUs and we move from counting bits to counting gigabytes, we'll switch to another model (i.e. optimizations should be explicitly requested), but for now the "indicate your willingness to pay extra" model is more practical. If we do not follow that model, then practical P4 implementations and P4 programs will use other methods, which are most efficient for those platforms, and then you will have even less compatibility than you have today. Or (an even more probable alternative), compiler designers for these platforms will simply choose to ignore the spec rather than require their customers to rewrite all their code. I think I summarized these thoughts in my proposal above, but if you agree, I can edit it to add this detail.

The definitions of the _greedy() methods: if you want to extract_greedy() N bytes, but only M < N are available, then the first M bytes of the header will be taken from the bytestream (packet_in), the rest will be set to 0, the header will be marked as valid, and the pointer will be advanced by M bytes (to the end of the packet). This also means that if you deparse such a header, the egress packet might be N-M bytes longer. advance_greedy(): if you want to advance by N bytes, but only M < N are available, then you advance by M bytes (instead of not advancing at all, or advancing by some other number). In both cases error.PacketTooShort will still be present.
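
For illustration, a hypothetical sketch with concrete numbers (the 40-byte hdr.ipv6 header and the 25 remaining bytes are just example values):

    /* Assume hdr.ipv6 is a 40-byte header, but only 25 bytes remain in the packet */
    state parse_ipv6 {
        pkt.extract_greedy(hdr.ipv6);
        /* Result: bytes 0..24 of hdr.ipv6 come from the packet, bytes 25..39 are 0,
           hdr.ipv6 is marked valid, the pointer is advanced by 25 bytes (to the end
           of the packet), and error.PacketTooShort is still recorded. If hdr.ipv6 is
           later deparsed, the egress packet may be up to 15 bytes longer. */
        transition accept;
    }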

I think that you got the definition of the _no_assumptions() methods (as well as the possible approaches to handling them) quite right already :)

jafingerhut commented 4 years ago

What happens if a parser is written with a mix of extract_atomic() and extract_no_assumption() calls? Does that make sense?

Using A for extract_atomic() and N for extract_no_assumptions(), then for example, suppose that a path through the parser goes through a sequence of calls: 1. A. 2. N. 3. N. 4. A. If the packet ran out of data on the 3rd operation, which is an N, can the parser continue executing past that and start but also fail the 4th operation, which is an A? What should the final state of the parser be allowed to be?

Should extract_atomic() operations behave as barriers always, so that either they complete and get all their data, or fail cleanly and extract nothing and go to an error state?

It seems that having a mix of different kinds of operations leaves open questions about what the allowed behaviors and final states ought to be, unless perhaps you already have a proposed answer for such questions.

jafingerhut commented 4 years ago

You say: "Keeping the "no-suffix" methods as they are today will keep the spec backwards compatible, but will essentially require the at least some compiler implementors to immediately re-write the compilers and the data plane programmers to rewrite their data plane programs."

I understand that maintaining the spec as backwards compatible would require data plane programmers to change their programs if they want to take advantage of the _no_assumptions() version of calls, but why would it require compiler implementers to re-write their compilers?

vgurevich commented 4 years ago

@jafingerhut,

On mixing modes: as I explained, this requires barriers, and I think they are needed even when you use _atomic() methods, at least the way I envision it today. This means that when you have a state such as:

state s {
    a = 2;
    b = 3; 
    pkt.extract_atomic(hdr.h1);
    pkt.extract_greedy(hdr.h2);
}

it would probably be prudent to place a barrier between the two extracts so as to leave no doubt that the assignments happen, as sketched below. We can make the _atomic() methods have built-in barriers as well, but I'd rather have separate methods for separate functionality.
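
For illustration, the same state with a barrier inserted between the two extracts (a sketch using the constructs proposed above):

state s {
    a = 2;
    b = 3;
    pkt.extract_atomic(hdr.h1);
    barrier();   /* guarantees that the assignments to a and b, as well as the
                    extraction of hdr.h1, have completed before anything after this point */
    pkt.extract_greedy(hdr.h2);
}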

Why do I say that compiler implementers will need to rewrite the compilers? It is because I do believe that if you carefully check how most compilers implement these methods, they are following "no assumptions" model instead.

jafingerhut commented 4 years ago

If _atomic() methods do not have built-in barriers, in what way do they behave like the current spec says they should?

vgurevich commented 4 years ago

@jafingerhut -- the atomic() is responsible only for what happens inside the primitive :) barrier() is responsible for everything that happens before it.

jafingerhut commented 4 years ago

It seems to me like some kind of precise description is needed that says, for all possible P4_16 parsers that might be written, with whatever mix of the existing and new method calls being proposed here, and for all possible packets that might be given to such a parser (i.e. any sequence of bytes, with arbitrary length), what the allowed set of final states is that the parser can end with.

The current P4_16 spec I think does that fairly well, in that it typically only allows one final state.

I think this proposal would lead to many combinations of programs and input packets where the final state would be "undefined, i.e. later ingress/egress control code should not rely on the value of anything except that a parsing error occurred". It isn't clear to me yet which combinations of programs and packets would lead to this state, and which would not.

vgurevich commented 4 years ago

@jafingerhut -- if by "the state" you mean not only the name of the final state, but also the contents of the headers/metadata, yes, the proposal is to expand the list of possible outcomes. If "the state" is just the name of the state, they stay the same, i.e. accept and reject. It is already there, so I'd rather codify the situation that exists on the ground.

Again, please note that all these happen only in exceptional cases. Normal processing is not affected to begin with. I think it makes sense not to dictate what exactly happens in these cases, because they are, er..., exceptional. Otherwise, this part of the spec will simply be ignored (as it pretty much currently is).

AashleshaPatil-BFN commented 4 years ago

I am trying to understand how each function mentioned would be used. How would advance_atomic and advance_greedy work and how would they be used?

vgurevich commented 4 years ago

@AashleshaPatil-BFN -- the methods will work as follows:

Out of the 9 methods that are defined to be completely symmetric and orthogonal, these two are probably a little less useful than the others, but they can still be used, e.g. to safely skip/discard some headers.
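
For illustration, here is a hypothetical sketch of using the two advance variants to skip data without extracting it; the state and header names are made up, and (as in the current spec) the advance size is given in bits.

    state skip_ipv4_options {
        /* All-or-nothing: either the whole options field is skipped, or the parser
           transitions to reject with error.PacketTooShort */
        pkt.advance_atomic(((bit<32>)hdr.ipv4.ihl - 5) * 32);
        transition parse_ipv4_no_options;
    }

    state skip_64_bytes_of_payload {
        /* Best-effort: skip up to 64 bytes; if fewer remain, advance to the end of
           the packet and record error.PacketTooShort */
        pkt.advance_greedy(64 * 8);
        transition accept;
    }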

jfingerh commented 4 years ago

This issue was discussed at length during the 2020-Aug-10 language design work group meeting. One thing I called out to be more explicit about is that it is not only the side effects of the extract calls that this proposal is intended to modify, but also those of all assignment statements in parser states. They will be 'pending' until 'committed' by a barrier, a successful all-or-nothing extract call, or reaching an accept state, just as the headers that extract calls write into will be.

Regarding the proposed barrier() call, note that since it has no arguments, an optimizing P4 compiler is free to reorder such a call with assignment statements. At least, I see nothing in the P4_16 language spec, neither now nor with these proposed changes, that would make it incorrect for a P4 compiler to transform this parser state code:

    state state1 {
        meta.f1 = hdr.ethernet.etherType;
        barrier();
        meta.f2 = hdr.ethernet.dstAddr;
        pkt.extract_atomic(hdr.ipv4);
        transition accept;
    }

into this:

    state state1 {
        barrier();
        meta.f1 = hdr.ethernet.etherType;
        meta.f2 = hdr.ethernet.dstAddr;
        pkt.extract_atomic(hdr.ipv4);
        transition accept;
    }

or in general with the barrier call reordered anywhere before the call to pkt.extract_atomic().

If a compiler does this, it seems to me that it changes the possible final states of the parser to 'more defined' or 'less defined' in the case of a parser exception, and this seems to go against the desired behavior of the barrier() call.

I do not understand how one can make a barrier that is an extern function call that does not have this issue. I suppose one could say that the barrier should be a new statement of the P4_16 parser sublanguage itself, and then a P4 compiler would know that it is not allowed to reorder it with assignment statements.

Note that if you suggest that barrier be an extern function, and that it should be passed arbitrary lists of variables or assignments that it should not be reordered relative to, then it is not clear to me whether it is sufficient to pass it only values like meta and hdr, i.e. the out and inout parameters of the containing parser. It might be, but I do not see how to convince myself that reordering with assignments to local variables declared within the parser could not cause a change in behavior.

jfingerh commented 4 years ago

In my previous comment, a better example code snippet would I think use pkt.extract_no_assumptions(hdr.ipv4), and have a transition select (hdr.ipv4.protocol) { ... } statement afterwards that in some cases could cause an exception and go to reject.

vgurevich commented 4 years ago

@jafingerhut -- the same way similar calls are implemented, say, in the Linux kernel: the designers of such primitives will have to use target-specific compiler facilities to disable reordering, since the whole point of a barrier is to ensure that the operations that appear before it are executed before it and the operations that appear after it are executed after it.

Here is an example of a barrier implementation in Linux kernel (look in arch/x86/include/asm/barrier.h):

#define mb()    asm volatile("mfence":::"memory")
#define rmb()   asm volatile("lfence":::"memory")
#define wmb()   asm volatile("sfence" ::: "memory")

As you can imagine, this is highly target-specific: the spec does not define the semantics of the keyword asm, and it doesn't say much about what volatile means in this context either. However, all mainstream implementations seem to agree on this meaning without the spec, simply by following established GCC practices.

The fact that the spec allows reordering does not mean that the compiler can't turn it off when it sees barrier()

jafingerhut commented 4 years ago

I agree that the P4_16 LDWG can create new statements that prevent reordering around them, no question about it.

I am simply pointing out that if barrier() is an extern function, then nothing stops the compiler from reordering extern calls with assignment statements, where there are no data dependencies, and an extern function call with no parameters has no data dependencies with any P4_16 language variables at all, ever.

If we make a new barrier statement in the language, sure, that is fine. We can define that to be non-reorderable with any and all assignment statements, unless someone besides you and I sees a problem with creating such a new language statement.

Do the examples of C statements you give use annotations, or pragmas, or something similar? If so, that direction for barrier calls seems to go against your misgivings about P4_16 annotations changing the meaning of the program.

vgurevich commented 4 years ago

@jafingerhut,

  1. If the reordering can be done by the frontend, then we need to introduce some mechanisms (e.g. keywords) in the language. If it is done only by the backend passes, then the target-specific mechanisms are OK. If the reordering can be done by the common midend passes, we should probably define an API that will allow the backend to control them better.

  2. On annotations vs. "first-class" language mechanisms (statements, keywords, attributes, etc.). Here are my thoughts:
     a. Generally, annotations should not be used to indicate programmer's intent. That should be done using the language
     b. Some new, experimental features can be expressed as annotations at first, with the intent of either being promoted to a first-class language construct (which is precisely what happened in the GCC example -- even not so long ago we'd write __asm__ __volatile__("...")) or being deprecated
     c. The rules are a lot more lax for the architecture definitions, which are internal files, written by the vendor only once (and often not even read by anyone, except the compiler itself), compared to the actual P4 code written (and read) by tons of people.

vgurevich commented 4 years ago

Updating the issue as per last discussion in LDWG...

Summary

  1. It is agreed that the state of the output (out and inout) parameters of a parser cannot be defined, unless the parser reaches the accept state. If, instead the parser reaches the reject state, it is up to the specific architecture to define (or not to define) the state of the output parameters of the parser.
    • A significant portion of this writing will be dedicated to explaining how to define that state
  2. It is agreed that it would make sense to add variants of the current packet methods.
    • It is still to be decided whether to have more names (e.g. extract_atomic()) or to have additional flags. The latter option is probably nicer, but is predicated on the final outcome of #884
    • I believe there is a consensus that the default behavior should be "no assumptions", as it is the cheapest to implement, the most amenable to optimization, and the one most typically used by the vast majority of programs
  3. A number of questions were raised about the semantics of the proposed barrier() extern function. We will try to provide reasonable explanations of its effects that do not depend on any particular target.

What is "sequential execution"?

In several places P4_16 specification refers to "sequential execution". For example, in section "11.3 Block Statement" it states:

A block statement ... contains a sequence of statements and declarations, which are executed sequentially.

Similarly, in section "12.1 Parser states" it states:

In terms of the ParserModel, the sequence of statements in a state are executed sequentially.

For the compiler implementors this needs to be clarified, since they often need to translate P4 to highly parallel targets, where naïve translations to strictly sequential execution will be prohibitively expensive.

In reality, the actual requirement is that only the final result of the P4 program execution needs to be the same as the result produced by a sequential execution of the corresponding statements. All the intermediate results can differ as long as the final result is correct -- this is precisely the basis for all possible optimizations.

Let us try to be more specific and see if we can actually define what that "final result" is and what can be left to the discretion of the implementer.

We'll start with a trivial parser example:

parser ParserImpl(packet_in packet,
                  out headers_t hdr,
                  inout meta_t meta,
                  inout standard_metadata_t standard_metadata)
{
    state start {
        transition parse_ethernet;
    }
    state parse_ethernet {
        packet.extract(hdr.ethernet);
        transition select(hdr.ethernet.etherType) {
            0x0800: parse_ipv4;
            default: accept;
        }
    }
    state parse_ipv4 {
        packet.extract(hdr.ipv4);
        meta.ip_extracted = 1;
        transition accept;
    }
}

Here are some things we can say about the parser:

  1. If someone sends a packet Ether(...)/..., once the parser reaches the accept state, the header hdr.ethernet will be valid and will contain the first 14 bytes of the corresponding packet, while the header hdr.ipv4 will be invalid
  2. If someone sends a packet Ether(...)/IP(...)/..., once the parser reaches the accept state, both hdr.ethernet and hdr.ipv4 will be valid and will contain the first 34 bytes of the corresponding packet
  3. One can also reason that at the point where select() is performed, the field hdr.ethernet.etherType will contain the ethertype (2 bytes at a 12-byte offset from the start of the packet) -- this is required for select() to work.
  4. However, it is impossible to say whether, at that same point in the code (the select() statement), the first 12 bytes of the packet buffer have actually made it into hdr.ethernet.{dst,src}Addr. They might have been copied or they might not have been -- it will not affect the program behavior, as long as they "make it" before the parser reaches the accept state. In fact, we might not even be able to conclude whether the valid bit for the header has been set -- it might be set later, and this will not affect program semantics, as long as it is done by the time the parser reaches the accept state.
  5. Similarly, it is impossible to say whether hdr.ipv4 has been fully populated with data before meta.ip_extracted was set to 1 -- it does not matter, as long as it is populated by the time the parser reaches the accept state.
  6. Nor is it necessary to guarantee that meta.ip_extracted is set to 1 before hdr.ipv4 is actually extracted -- the effect will be the same once we reach the accept state

In other words, sequential semantics must be followed only insofar as its effects can be observed, and the number of "observation points" in P4 is limited. In P4 parsers, these are:

  1. select() statements. They can only "see" fields/variables they select on
  2. if() statements. They can only "see" fields/variables that are referenced in the conditional expression
  3. Expressions. They can only "see" fields/variables that are referenced in the right-hand-side.
  4. Extern method invocations. They can only "see" their input parameters
  5. accept state is the only place where the results have to be finalized

As per our latest discussion reject state is not an "official" observation point -- if the parser reaches reject state, the state of headers and metadata is defined by the architecture.

Note that this thinking can also be extended to controls. Again, the final "observation point" is the end of the control. Any place where values are read (table keys, if() statement conditions, input parameters to extern methods, values referenced in the right-hand side of expressions) can be counted as a mini-"observation point" that only "sees" the corresponding fields/variables, but nothing else.

Moreover, for many architectures that have a very "linear" (Parser-Control-Deparser) structure, even the end of the parser/control might not necessarily be a "full" observation point. Imagine, for example, that the previously described parser is complemented with this control:

control ingress(inout headers_t hdr,
                inout meta_t meta,
                inout standard_metadata_t standard_metadata)
{
    apply {
       standard_metadata.egress_spec = standard_metadata.ingress_port;
       hdr.ethernet.dst_addr = 0xFFFFFFFFFFFF;
    }
}

In such a program, it does not even make sense to insist that once the parser reaches the accept state, hdr.ethernet.dst_addr must contain the first 6 bytes of the ingress packet: it might or might not, and it will not change the result of the program execution -- it will still be fully compliant with sequential semantics. Nor can we reason about the metadata meta.ip_extracted -- in fact, the compiler has the full right to eliminate it and the semantics will still be preserved.

Similarly, if that control is followed by this deparser:

control DeparserImpl(packet_out packet, in headers_t hdr) {
    apply {
        packet.emit(hdr.ipv4);
    }
}

we cannot reason about the contents of the header hdr.ethernet at all. We can only say that by the time the parser reaches the accept state, the pointer separating the headers and the payload will have been moved to an offset of 14 or 34 bytes.

In other words, the compiler has a lot of opportunities to optimize the code by reordering it, parallelizing it and eliminating parts of it, while still preserving sequential execution semantics, because the number and the scope of the "observation points" is very limited: the only guaranteed full observation point is the very end of the processing pipeline. The accept state of the parser or the end of a control might be a "good" observation point, but does not have to be a full one.

This is a very important property of the language that we need to preserve in order to enable deep optimization of P4 programs.

How can barrier() be defined?

Essentially, barrier() becomes an artificial observation point, such that all writes that happen in the text of the program before it and that can be observed somewhere after it need to actually take place.

The second clause is quite important -- for example, if some code can be safely eliminated, it can still be eliminated even with the barrier. However, in most programs, where the results of the parsing are actually used downstream, this pretty much means "everything before barrier()", which is precisely what's needed.

Here is an example, based on the previous parser code:

parser ParserImpl(packet_in packet,
                  out headers_t hdr,
                  inout meta_t meta,
                  inout standard_metadata_t standard_metadata)
{
    state start {
        transition parse_ethernet;
    }
    state parse_ethernet {
        packet.extract(hdr.ethernet);
        barrier();            // <------------ BARRIER 1 ---------
        transition select(hdr.ethernet.etherType) {
            0x0800: parse_ipv4;
            default: accept;
        }
    }
    state parse_ipv4 {
        packet.extract(hdr.ipv4);
        barrier();            // <------------ BARRIER 2 ---------
        meta.ip_extracted = 1;
        transition accept;
    }
}

In this case, we should be able to reason that both hdr.ipv4 and hdr.ethernet are extracted before meta.ip_extracted is set to 1. This can be approximated by the following (purely theoretical) transformation:

    state parse_ipv4 {
        packet.extract(hdr.ipv4);
        transition select(hdr.ethernet, hdr.ipv4) { // i.e all fields from those headers 
          default: parse_ipv4_1;
       }
    }

   state parse_ipv4_1 {
        meta.ip_extracted = 1;
        transition accept;
    }

Why is this important?

This is important for the targets and architectures that implement a high degree of parallelization but want to provide users with the ability to reason about the state of the parser output parameters in case the final state is reject and not accept.

By inserting barrier(), the programmer writing a program for such a target can ensure that certain headers or metadata that are critical to their algorithm are guaranteed to have predictable values even in a case of a parser exception.

Theoretically, barrier() can also be used in controls, but since in P4_16 no exceptions can occur in controls (they are always executed to completion), it is not going to be very useful there.

mihaibudiu commented 3 years ago

How about this approach:

We can refine this scheme to have exception_occurred() return an error, and replace the barrier with: error e = exception_occurred(); verify(e == error.ok, e);
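
For illustration, a sketch of how that might look in a parser state (exception_occurred() and error.ok are hypothetical names from this suggestion, not existing core.p4 constructs):

    state parse_ipv4 {
        packet.extract(hdr.ipv4);
        error e = exception_occurred();
        verify(e == error.ok, e);    /* acts as the unmovable "barrier": transition to
                                        reject if any exception occurred above */
        meta.ip_extracted = 1;
        transition accept;
    }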

Would this scheme provide you all the tools that you need to implement the parallel parsers?

vgurevich commented 3 years ago

@mbudiu-vmw -- First of all, my apologies -- somehow I didn't see this message before, so thanks to @jnfoster who let me know about it.

Let me process this -- for one we do not currently use verify() in TNA at all and neither do we use the error type, so I need to translate this into what we are doing somehow, but it looks to me that an unmovable verify() might be able to serve as a barrier.

I am still trying to understand how we plan to distinguish between the "best effort" ("do not care") extract() and the strictly sequential one. In the proposal above we'd use separate method names, and I am specifically suggesting changing the standard extract() method to become "best effort". We need to do it before it is too late.

mihaibudiu commented 3 years ago

yes, you would need either a new method extract_noexceptions, or, alternatively, if you never expect to mix the two kinds of extract, we could use an annotation on the parser @noexceptions.

You would still have to say what happens to the headers and packet when extract_noexceptions returns an error. These could be left to the architecture.

vgurevich commented 2 years ago

Attaching a presentation that explains the issue better. Parser Exception Handling.pdf

@mbudiu-vmw -- This is the presentation that should explain why we need this. Take a look and, perhaps we can discuss in the next meeting. I am also happy to elaborate more there.

mihaibudiu commented 2 years ago

You can simplify some of your expressions significantly. For example, instead of hdr.ipv6.traffic_class[3:0] = pkt.lookahead<ipv6_t>().traffic_class[3:0]; you should write hdr.ipv6.traffic_class[3:0] = pkt.lookahead<bit<4>>();. We could implement this directly in the compiler as a mid-end pass. The real question is: what are the primitive operations of the hardware parser? What do these do when there is an exception?

vgurevich commented 2 years ago

@mbudiu-vmw ,

It is possible to simplify, but it will look different:

hdr.ipv6.version[3:0] = pkt.lookahead<bit<16>>()[7:4];

and that's pretty much what is happening once the program gets lowered more and more.

In terms of the primitive operations, what they can do depends on the targets, but basically there are three operations:

  1. Pulling the data from the bytestream into the header fields or metadata. More often than not there might be restrictions on how much an individual operation can pull in, the size and the alignment of the data, etc.
  2. Advancing the pointer. Again, there are HW-specific limitations (the minimal advancement unit (bit/byte, etc.), the maximum amount, etc.)
  3. Other operations (assigning constants, arithmetic, etc.)

The important thing is that many are grouped together, parallelized and reordered. This is described in the previous writings here.

mihaibudiu commented 2 years ago

What happens when the hardware cannot pull data?

vgurevich commented 2 years ago

What happens depends on the specifics of the hardware, but given that you have multiple operations that are being executed in parallel, the typical situations are:

  1. Some operations (e.g. the ones that can succeed, because there is data or because they operate on something else) will succeed, and others will not (i.e. best effort/greedy)
  2. All operations will be aborted and none will succeed. This might mean that even an assignment of a constant might fail

If individual operations are grouped together in a "natural" way (you extract byte after byte as they appear on the wire), then (1) is often a good option for things like user-defined fields, etc. If operations are re-arranged, then you get the "don't care" case. And (2) is self-explanatory.

Barriers will basically tell the compiler how to group or not to group the operations, whether to re-arrange them or not, etc.

mihaibudiu commented 2 years ago

How does the program know that an operation has not completed? In your translation there are no checks for success or status codes.

jnfoster commented 1 year ago

In the interest of tidying up the set of active issues on the P4 specification repository, I'm marking this as "stalled" and closing it. Of course, we can always re-open it in the future if there is interest in resurrecting it.