[PSA] LAG resolution in PRE

samar-abdi commented 6 years ago

I might be missing something, but I believe we have not really talked about LAGs in PSA. AFAIK, the ingress output metadata egress_port should be set to a singleton port value. So, for LAG resolution, a P4 table mapping a LAG port number to a singleton port is needed in the ingress. However, this means that when programming multicast groups to the PRE (at least as currently defined), the egress port in the multicast group entry should be a singleton, since the egress port value cannot be modified in the egress pipeline.

I imagine that the clone port value cannot be a LAG number for the same reason, i.e. the clone port values should also be LAG-resolved in the ingress pipeline.

Do we want the LAG resolution table to become part of the PRE? The advantages are that the P4 programs will become simpler and sending multicast replicas to LAGs would become possible. The disadvantage is that PRE definition will get more complex: we will need to define hash parameters for it, and a more complex API. Thoughts?

samar-abdi commented 6 years ago

One option is to configure the hashing for LAG resolution in the PRE through the switch config. The P4 API can just have a function to add a LAG port and its mapping to a set of singletons, with the resolution logic being applied as defined in the switch config.

jafingerhut commented 6 years ago

This did come up once briefly before in a Github comment thread on an issue, where I raised a question about what to do if someone wanted to do something different than hash-based selection of LAG members.

One possibility would be to do as you suggest above to handle what is likely the common case, i.e. hash-based selection of LAG members.

If someone wanted to do anything besides hash-based selection of LAG members, then they must write explicit P4 code for it, and always have the ingress code choose the singleton ports they want packets to go to (of course they could mix and match port selection methods for different LAGs if they wished, too).

I can't immediately think of any other complexities than the one you mention, i.e. the control plane API gets LAG-specific stuff added to it. Hardware implementations would want to know about this well in advance, of course (although some could implement it by having extra P4 logic invisibly appended to the end of your P4 ingress code, to implement the LAG selection behavior, at least for unicast).

If you wanted LAGs to be members of multicast replication lists in PRE, that is not something that the extra invisible P4 code approach would necessarily work for, unless someone also had variable-length loops in their implementation. Hardware-based targets would want to know about that even more so than for unicast traffic, I would guess.

I can't think of any fundamental reason not to allow this in PSA v1.0, but it would be good to discuss it at the next meeting to get thoughts from other working group members, too.

jafingerhut commented 6 years ago

A minor point -- even without special LAG support in PSA, a common technique I have seen is that the control plane code hashes and selects a LAG member for each multicast group with a LAG in its replication list. Then of course it is up to the control plane to update all such multicast groups if a LAG member fails, so I understand and agree that is going to be slower for control plane code to react to, vs. having the extra level of indirection existing in the fast path.
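A minimal sketch of that control-plane technique, in Python. The names (`pick_member`, `resolve_groups`), the CRC32 hash, and the port numbers are all assumptions made for illustration; none of this is a PSA or P4Runtime API. Each multicast group that lists a LAG gets one live member chosen by hashing the group id, and the whole mapping is recomputed when a member fails.

```python
import zlib

def pick_member(live_members, group_id):
    """Pick one singleton port of a LAG for a given multicast group (hash-based)."""
    members = sorted(live_members)
    if not members:
        raise ValueError("LAG has no live members")
    # CRC32 over the group id is an arbitrary choice for this sketch.
    return members[zlib.crc32(group_id.to_bytes(4, "big")) % len(members)]

def resolve_groups(groups, lags):
    """Map {group_id: [port or LAG name]} to {group_id: [singleton ports]}."""
    return {
        gid: [pick_member(lags[m], gid) if m in lags else m for m in members]
        for gid, members in groups.items()
    }

# Hypothetical configuration: one LAG with three member ports.
lags = {"lag1": {4, 5, 6}}
groups = {10: [1, "lag1"], 11: [2, "lag1"]}

before = resolve_groups(groups, lags)

# Simulate failure of the member that group 10 was using, then recompute.
failed = before[10][1]
lags["lag1"].discard(failed)
after = resolve_groups(groups, lags)
```

The recomputation in the last step is the part that runs in the control plane, which is why reaction to a member failure is slower than with an extra level of indirection in the fast path.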

vgurevich commented 6 years ago

@samar-abdi ,

I do not think that LAG resolution per se should be a part of PSA. There is nothing special about it and it can be easily coded in a variety of ways using the existing primitives. This also allows one to have a lot of flexibility in terms of how it is performed. Even the most "LAG-related" mechanism, i.e. selector match kind is more of a convenience, rather than a necessity -- it can be implemented using regular tables and hash calculation.

What I do agree with you on is that the functionality of PRE needs to be specified more formally. We can debate whether or not to add a notion of LAG or ECMP group to a standard PRE, but whatever we decide should be made very clear, because otherwise people will not be able to write non-trivial programs that involve multicast in a portable fashion.

For everything, other than multicast, doing resolution in the ingress pipeline sounds like a reasonable approach to me.
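As a rough illustration of "regular tables and hash calculation" for the unicast case, here is a minimal Python model of an ingress LAG resolution step. The table contents, the flow fields, and the CRC32 hash are assumptions for the sketch, not anything PSA defines.

```python
import zlib

# Hypothetical LAG table: logical port number -> ordered singleton member ports.
LAG_TABLE = {100: [4, 5, 6, 7]}

def flow_hash(src_ip, dst_ip, src_port, dst_port, proto):
    """Hash the flow fields so that all packets of one flow pick the same member."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return zlib.crc32(key)

def resolve_egress_port(logical_port, flow):
    """Return a singleton port: a LAG member if logical_port is a LAG, else the port itself."""
    members = LAG_TABLE.get(logical_port)
    if members is None:
        return logical_port  # already a singleton port
    return members[flow_hash(*flow) % len(members)]

flow = ("10.0.0.1", "10.0.0.2", 1234, 80, 6)
port = resolve_egress_port(100, flow)
```

In a P4 program the same step would be a table (or a selector) applied at the end of ingress, before egress_port is handed to the PRE.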

Thanks, Vladimir

jafingerhut commented 6 years ago

@vgurevich I believe that perhaps the functionality of PRE already does have a specification written down in the latest draft PSA, although perhaps not as formal as you might wish.

The part relevant for multicast replication functionality is in Section 6.3.2 "Multicast operation". No control plane API is specified in the PSA for modifying the PRE multicast replication behavior, but I am guessing that will be covered by the P4 Runtime API working group?

Another thing hinted at, but not required or specified, is that there might be queues and a buffer between ingress and egress (of course any hardware implementation pretty much has to, if it doesn't want to have a very high packet drop rate due to even very short-term contention for the same output port). The PSA does suggest that normally packets should be processed in the egress control block in the same order as they were processed by the ingress control block for unicast packets with the same (input port, class_of_service, output port) combination, and similarly for multicast packets. This is in Section 6.2.

Another thing already specified is that the PRE can drop packets that ingress sends to it, or that egress sends to it via clone operations, e.g. due to congestion. The PSA says nothing about mechanisms by which a P4 program can observe such drops. It only recommends that a PSA implementation should provide counters for such drops. Section 6.2 has a paragraph about that after the pseudocode, and the end of Section 6.5 also does.

Is there anything above that you don't think should be specified about PRE behavior in the PSA spec?

Is there anything you think should be specified about PRE behavior that isn't listed above?

samar-abdi commented 6 years ago

@vgurevich

Thanks! There are multicast use-cases today that send replicas to LAGs, with the LAG resolution happening after the replication. With current PSA, adding a P4 table for LAG resolution after PRE is not possible since the egress_port is immutable. One could theoretically have a twisted implementation with resubmit, but it would be nice to avoid that and reflect the actual packet processing behavior in the P4 program.

samar-abdi commented 6 years ago

@jafingerhut Yes, we are proposing a multicast group entry in P4Runtime and I imagine we will have a richer PRE API. Anyway, I would request that we set aside substantial time at the next architecture WG meeting to define PRE semantics and API requirements. Thanks!

vgurevich commented 6 years ago

@samar-abdi ,

I do not think you can do much in the P4 program, since P4 does not know anything about multicast. It is all implemented in PRE and while P4 is perfectly OK describing the interfaces to and from PRE, you can't really use it to describe PRE behavior.

This is also true about any other fixed component that P4 interacts with and we should probably accept the fact that P4 does not describe the full (serdes-to-serdes) data plane processing algorithm, but only the key parts of it as allowed by the given architecture.

So, for your practical problem, I think that the solution is in the detailed PRE specification and I see 3 options:

The simplest and the most portable solution is that PRE knows nothing about LAGs/ECMP groups. When the software is being asked to add a LAG or ECMP group to a multicast group, it should then pick and add only one port (one path) of a LAG or ECMP group to that multicast group. It is not ideal, but very portable and is not that bad if you have many multicast groups, since they can be carried over different LAG/ECMP members.

Another, very simple solution is to add all ports/members of LAG/ECMP to the multicast group and let PRE produce more packets than needed and then drop all but one of them in the egress control (while having the full power of P4 available to do that).

The more difficult and less portable solution involves making PRE LAG/ECMP-aware. If we decide to go that route (e.g. the PRE in simple_switch is LAG-aware) then you will need to decide how PRE should do LAG resolution, whether additional metadata will be required to control it and how it will be compatible/incompatible with unicast LAG/ECMP resolution that will still be fully defined in P4 code.
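The second option above (replicate to every LAG member, then drop all but one copy in egress) can be modeled with a short Python sketch. Everything here is hypothetical: the LAG membership, the CRC32 hash over the packet bytes, and the function names are assumptions for illustration only.

```python
import zlib

# Hypothetical LAG membership, known to the egress logic.
LAG_MEMBERS = {"lag1": [4, 5, 6]}
PORT_TO_LAG = {p: lag for lag, ports in LAG_MEMBERS.items() for p in ports}

def pre_replicate(group_ports, packet):
    """Model of a LAG-unaware PRE: one copy per singleton port in the group."""
    return [(port, packet) for port in group_ports]

def egress_keep(egress_port, packet):
    """Egress filter: keep a copy only on the LAG member selected by the hash."""
    lag = PORT_TO_LAG.get(egress_port)
    if lag is None:
        return True  # not a LAG member, always forward
    members = LAG_MEMBERS[lag]
    chosen = members[zlib.crc32(packet) % len(members)]
    return egress_port == chosen

# The multicast group lists port 1 plus every member of lag1.
group = [1, 4, 5, 6]
packet = b"flow-A"
sent = [port for port, pkt in pre_replicate(group, packet) if egress_keep(port, pkt)]
```

Note that egress_port is only read here, never rewritten, which is what keeps this approach compatible with an immutable egress_port. The cost is the extra copies that the PRE produces and egress discards.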

samar-abdi commented 6 years ago

@vgurevich

Thanks. These are very good points.

With the last option, can't we just remove the LAG resolution table from the P4 program and let PRE do LAG resolution for all egress packet instances, whether they are the result of unicast, clone or multicast? Is there value in doing different types of LAG resolution for the different packet instance types?

I agree that configuring the PRE with the LAG table data is going to be challenging, because we need to provide the selector input to PRE in the P4 program and the group/flow programming through the API.

vgurevich commented 6 years ago

@samar-abdi ,

You can certainly define PRE functionality so that it resolves LAGs, but I would advise against that for several reasons.

On the philosophical/ideological grounds: the more functionality you move into the fixed function components, the less programmable and more fixed-function the device becomes. Also, as a result, the whole architecture will become less and less portable as fewer and fewer devices will be able to implement it correctly (including, by the way, some of the devices you are familiar with).

On the flexibility front: by moving LAG resolution into PRE you:

a. Define a very specific algorithm for that.
b. Force it to always be done at the very end, right between ingress and egress processing. The questions then become: what to do if you need to resolve LAGs multiple times (e.g. you want a port-based ACL that can redirect a packet to another LAG)? What do you do with ECMP (especially over LAGs), what do you do with Virtual Port LAGs, etc, etc, etc? What if you want, say, stateful LAG resolution?
c. Still need to reconcile ingress port->LAG mapping with the egress one (and an even more difficult thing happens when you think about ECMP, VP-LAGs and such).

My recommendation is to keep all this as simple and generic as possible; otherwise, PSA will not be portable or flexible enough to do non-standard things.

samar-abdi commented 6 years ago

@vgurevich

I would much prefer to do all port resolution with a P4 table. It is just the placement of that table that I am concerned about. Right now, we are forced to do LAG resolution before replication because that table must reside in ingress. PSA would be more flexible if it provided an option to do LAG resolution after replication. Another P4 programmable control block after replication would be nice, but I can see the downsides to that.

antoninbas commented 6 years ago

@vgurevich's first solution (create a multicast group per port in the LAG, compute the hash in the ingress pipeline and pick the multicast group from a set based on the hash result) sounds good to me. @samar-abdi I'm not sure why you want to do LAG resolution after replication (although that does sound like @vgurevich's third solution); you would just be creating copies for nothing.

jafingerhut commented 6 years ago

@antoninbas Probably a use case that Samar is thinking about is L3 multicast routing. Each of the output interfaces could be a single physical port, or a LAG (among other possibilities, depending on what features you support, e.g. IP tunnels). You either need to pick a LAG member for the multicast group when you configure the multicast group, or you need a system that can do LAG member selection after multicast replication, to implement L3 multicast to LAG interfaces.

antoninbas commented 6 years ago

"You either need to pick a LAG member for the multicast group when you configure the multicast group". That's exactly @vgurevich's first solution, right? Even though you do end up with an exponential number of multicast groups if you want to have the complete distribution.

samar-abdi commented 6 years ago

@jafingerhut Thanks, yes this is a good example and my use case is similar. We are not creating copies for nothing, it's just that the egress_port of a copy is a trunk, not a singleton. The LAG resolution table should resolve the singleton value and overwrite the egress_port, so that the copy in the egress pipeline has a singleton value of egress_port.

As for @vgurevich's first solution, I imagine we are defining multicast groups only with singletons. The controller sets a logical multicast group, which is then resolved in a P4 table to a multicast group programmed in the PRE? In this case, the multicast group resolution table would need to watch all ports of the resolved multicast group. Again, this seems like the P4 program not matching the pipeline's reality.

jafingerhut commented 6 years ago

@antoninbas It is completely impractical to create an exponential number of groups for a complete distribution. If you support hundreds or more of L3 multicast groups, you can just make a single choice for each multicast replication group, and trust/hope that your hashing choices are good enough (or the control plane can even potentially make choices based on criteria other than hashing, too). There is unlikely to be a reason to emulate exactly the behavior of a "LAG choice after multicast replication" implementation by means of an implementation that requires one to pick a physical port in the PRE replication lists.
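To put a number on "exponential": emulating per-packet LAG choice with singleton-only groups needs one PRE group per combination of member choices, i.e. the product of the LAG sizes in the replication list. A small Python calculation, with made-up LAG sizes:

```python
from math import prod

def groups_for_complete_distribution(lag_sizes):
    """One PRE group per combination of member choices across all LAGs in the list."""
    return prod(lag_sizes)

# Hypothetical L3 multicast route with three 4-member LAGs as output interfaces.
three_lags = groups_for_complete_distribution([4, 4, 4])

# Four 8-member LAGs in one replication list.
four_lags = groups_for_complete_distribution([8, 8, 8, 8])
```

With these assumed sizes, a single route needs 64 and 4096 groups respectively, and that is before multiplying by the number of multicast routes.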

antoninbas commented 6 years ago

If the only requirement is a good enough distribution of flows, and not predictability of the path for a given flow independently of the P4 hardware being used, then it's fine by me. Otherwise, does it mean that we have to standardize the heuristic used by the switch software to pick singleton ports from the LAGs when creating the multicast groups?

jafingerhut commented 6 years ago

@antoninbas The simplest thing for the P4Runtime API to do is support only physical ports as members of multicast group replication lists, of course. If you don't provide the option to specify LAG interfaces at all, it is clearly up to the caller to update failed member ports with different member ports, if the caller wishes.

If you think it would provide enough benefit to have an option that picks a LAG member for callers that want to provide LAG interfaces in the P4Runtime API, you could consider that, but it will of course raise questions like:

Does that mean that P4Runtime API also promises to replace failed LAG member ports with good ones automatically? If so, how quickly?
I need an API to determine which member port the implementation picked for me for each LAG member of a replication list.
Can I register a callback that will be called to inform me whenever a member port selection changes?
What knobs do I have to control the member selection?

jafingerhut commented 6 years ago

Additional note on my previous comment:

The direction for fast re-route of unicast packets to LAG interfaces seems to be to have a separate action selector in the P4 program for that purpose, and then use the multi-controller option recently discussed in the P4Runtime API working group to choose to control that action selector (and maybe other externs and/or tables) from a "close by" controller with low latency to the device, vs. in a "far away" controller.

The same technique could be considered for PRE multicast replication groups, as a mitigating factor for reaction time, if the only option provided in the API was to add physical ports to the replication lists.

antoninbas commented 6 years ago

The problem with offloading this to the P4 Runtime caller is the same as for action profile member fast failover (https://github.com/p4lang/p4-spec/issues/457): in the case of a SDN deployment, the latency between a LAG member going down and the dataplane being updated is not acceptable. If P4 Runtime is meant to be a unique API for both SDN and local OS, then we have to have some kind of support for this. In the end I imagine that this thread and the decision of the PSA WG to describe that use case in the description of the PRE extern in the PSA spec don't really need to have an impact on the design of the API. It seems we have pretty much already reached a decision to support LAGs for action profile member fast-failover, and it seems inconsistent to me not to propagate this to the PRE API.

vgurevich commented 6 years ago

@samar-abdi ,

We can (and already do) use regular P4 tables for LAG resolution in the ingress pipeline, but we can use them only for unicast packets.

The reason we can't use P4 tables for multicast packets (in the ingress) is that P4 controls can only transform packets. They cannot create multiple packets on their own (and then transform them individually) -- that's the job of the PRE, which the P4 ingress control asks to create multiple copies (e.g. by setting the multicast group id).

The reason we do not resolve LAGs in the egress pipeline is because that's not how the modern high-speed hardware works :( I would love to be able to set/change the egress port in the egress pipeline and we can certainly define such an architecture, but I would not be able to name a device on which we can implement it, at least not in the terabits-per-second family (or even hundreds of gigabits per second).

I totally agree that PSA solution is complex/complicated, but it does follow the standard industry practices (as do at least some of the solutions that I proposed), so if we want to be portable, we need to restrain ourselves.

vgurevich commented 6 years ago

@jafingerhut ,

In terms of multicast operation, I think we do need to add more meat. For example, what are the contents of a multicast group? Is it just a set of ports, or is it a set of (port, rid) pairs, or is it a set of rids with sets of ports associated with each? Is it LAG-aware (i.e. are these sets of ports or LAGs)? If yes, what does it mean? etc.

P4Runtime is a good proxy for the answers to many of these questions, but it is not the answer. I am more and more convinced that control plane APIs cannot be used to define data plane functionality and vice versa (data plane interfaces cannot be used to define control plane APIs) -- they are orthogonal. So, regardless of P4Runtime, we need to define PRE functionality in great detail.

vgurevich commented 6 years ago

@samar-abdi , @antoninbas , @jafingerhut

I think we can have a great discussion on this topic tomorrow. Sounds like face-to-face is better.

jafingerhut commented 6 years ago

Glad to add clarifying details to the spec for PRE operation and restrictions. I will add a couple more questions to the list:

Should PORT_CPU be allowed in a PRE multicast group replication list?
Similarly, should PORT_RECIRCULATE be allowed?

My thought was that a PRE multicast replication list would be a list of (port, replication_id) pairs, and the control plane API would allow adding such pairs to, or removing such pairs from, a given multicast group. Restricting them to not contain any duplicate (port, replication_id) pairs for the same multicast_group value seems reasonable (could easily be enforced in software, with an error status if you try to violate it).
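A minimal Python sketch of that replication-list model, only to make the proposed restriction concrete. The class and method names are hypothetical and do not correspond to any P4Runtime message or PSA extern.

```python
class MulticastGroups:
    """Software model: multicast_group -> set of (port, replication_id) pairs."""

    def __init__(self):
        self._groups = {}

    def add_member(self, group, port, replication_id):
        members = self._groups.setdefault(group, set())
        pair = (port, replication_id)
        if pair in members:
            # Duplicate pairs within one group are rejected with an error status.
            raise ValueError(f"duplicate {pair} in multicast group {group}")
        members.add(pair)

    def remove_member(self, group, port, replication_id):
        pair = (port, replication_id)
        members = self._groups.get(group, set())
        if pair not in members:
            raise KeyError(f"{pair} not in multicast group {group}")
        members.remove(pair)

    def replicas(self, group):
        """Pairs the PRE would create copies for, in a stable order."""
        return sorted(self._groups.get(group, ()))

pre = MulticastGroups()
pre.add_member(1, port=4, replication_id=0)
pre.add_member(1, port=4, replication_id=1)  # same port, different replication_id is fine
pre.add_member(1, port=5, replication_id=0)
```

Calling `pre.add_member(1, port=4, replication_id=0)` a second time raises `ValueError`, which is the software-enforced check described above.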

I wouldn't call data plane and control plane issues 'orthogonal', quite. Each affects the other in significant ways. But none of that discussion needs to be in the PSA spec :-)

samar-abdi commented 6 years ago

@jafingerhut Yes, I will have a P4Runtime pull request shortly with the multicast replication list programming API.

vgurevich commented 6 years ago

@jafingerhut ,

On which ports should be allowed in multicast group...

I would think that as long as a port can appear as egress_port in psa_ingress_output_metadata_t, it's a fair game.

So, PORT_CPU for sure. In case of recirculation, if we do it via a separate flag, then the answer is "no"; if we do it by simply sending a packet to a special port number, then the answer is "yes".

jafingerhut commented 6 years ago

Discussed during the 2017-Dec-06 P4 Arch WG meeting. The decision was made that PSA does not require an implementation to support LAG interfaces in PRE multicast groups. A note about that lack of a requirement is proposed for the PSA spec as part of this PR: https://github.com/p4lang/p4-spec/pull/513

It mentions that the egress_port part of a multicast group member can be PORT_CPU or PORT_RECIRCULATE, and those should work for those copies the same as if a unicast packet was sent to that egress_port.

It also documents the restrictions on (egress_port, instance) values.

jafingerhut commented 6 years ago

@samar-abdi I haven't read through every comment in this discussion today, but wanted to ask whether you are aware of anything left unanswered related to this issue. The PR #513 mentioned in my previous comment is now merged into the PSA spec.

If everything has been resolved, I would like to close this issue.

samar-abdi commented 6 years ago

The plan on the API side is to stay with singleton ports in PRE programming for now. We will revisit LAG handling in the API separately, but as far as PSA is concerned, we can close this.

jafingerhut commented 6 years ago

Great. Closing this one.

p4lang / p4-spec

[PSA] LAG resolution in PRE #508