p4lang / p4-spec


[PSA] Make queue lengths readable, at least from ingress control block #335

Closed: jafingerhut closed this issue 7 years ago

jafingerhut commented 7 years ago

Motivation: INT (In-band Network Telemetry) can make use of this by reading the length of the PRE target queue for a packet before it is enqueued, and putting that length into a field in the packet header, for use by later data collection.

Open questions: What should the units be? Packets only? Bytes only? Both? Or units of 'cells' (an implementation detail of some packet buffers, where the packet length is quantized to a multiple of the buffer's cell size, e.g. rounded up to the next multiple of 256 bytes)?

The highest-performing implementations would probably limit this to reading at most one queue length per packet, but that would be an implementation-specific performance issue.
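For concreteness, here is a minimal sketch of what this could look like in a PSA ingress control, assuming a hypothetical `query_queue_depth()` extern and a hypothetical INT report header; neither exists in PSA today, and the units question above is left open:

```p4
#include <core.p4>
#include <psa.p4>

// Hypothetical INT report header carrying the queue depth observed
// before enqueue; not part of any standard.
header int_queue_report_t {
    bit<8>  kind;
    bit<24> queue_depth;   // units TBD: packets, bytes, or cells
}

struct headers_t  { int_queue_report_t int_report; }
struct metadata_t {}

// Hypothetical extern: returns the current depth of the PRE queue
// feeding the given egress port. NOT part of PSA.
extern bit<24> query_queue_depth(in PortId_t egress_port);

control MyIngress(inout headers_t  hdr,
                  inout metadata_t meta,
                  in    psa_ingress_input_metadata_t  istd,
                  inout psa_ingress_output_metadata_t ostd) {
    apply {
        // After forwarding has chosen ostd.egress_port, sample the
        // length of the queue this packet is about to join.
        hdr.int_report.setValid();
        hdr.int_report.kind = 8w1;
        hdr.int_report.queue_depth = query_queue_depth(ostd.egress_port);
    }
}
```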

jnfoster commented 7 years ago

@cc10512 and @vgurevich should correct me if I'm wrong, but one choice made in early design discussions for PSA was to expose only an abstract notion of a port. Of course, in reality these "ports" might be mapped by the target onto a physical port, a LAG, a multicast group, etc., and that mapping lies outside the scope and control of PSA.

It's worth considering whether we should go the other way and come up with a design that standardizes all of this functionality in as portable a way as possible. I believe Skip Booth made a similar point on Monday's call.

And what I think you're observing is that queues are another facet that's been abstracted away. And maybe we need to expose them -- e.g., if we want INT to be portable.

jafingerhut commented 7 years ago

Maybe I'm not thinking big enough here, but it sounds challenging to try to 'hide' the existence of LAG from a P4 program on a switch. I had (perhaps naively) assumed that every P4 program that wanted to support LAG would need an explicit table or two to handle that feature, and would select a physical output port explicitly for every packet.

Maybe this example is too focused on existing implementations, but it is common on many switches to have configuration knobs that choose whether a LAG group member is selected based on a hash of only L2 fields, L2+L3 fields, or L2+L3+L4 fields. Some also offer 'symmetric' vs. non-symmetric hashes (a symmetric hash 'sorts' the source and destination addresses before calculating the hash, so the two opposite directions of an application flow get the same hash value).

Now, all of those options could be hidden behind some LAG implementation provided under the covers by an architecture, but then what if you want to change the options available for selecting a LAG group member? If it is baked into the architecture, you can't, except via implementation-specific methods outside the control of a P4 program.
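To make the "explicit table or two" concrete, here is a sketch in PSA terms using the PSA ActionSelector extern to hash packets onto LAG members; the table, action, header, and metadata names are all invented for illustration:

```p4
#include <core.p4>
#include <psa.p4>

header ipv4_t {
    bit<32> src_addr;
    bit<32> dst_addr;
}

struct headers_t  { ipv4_t ipv4; }
struct metadata_t { bit<16> lag_group; }  // chosen by an earlier table

control MyIngress(inout headers_t  hdr,
                  inout metadata_t meta,
                  in    psa_ingress_input_metadata_t  istd,
                  inout psa_ingress_output_metadata_t ostd) {

    // One selector shared by the table below; the control plane
    // creates groups and fills each with physical-port members.
    ActionSelector(PSA_HashAlgorithm_t.CRC32, 32w1024, 32w16) lag_sel;

    action set_egress_port(PortId_t port) {
        ostd.egress_port = port;
    }

    table lag_members {
        key = {
            meta.lag_group    : exact;
            // The 'selector' fields feed the member hash; choosing
            // L2-only vs. L2+L3 vs. L2+L3+L4 fields here is exactly
            // the configuration knob discussed above.
            hdr.ipv4.src_addr : selector;
            hdr.ipv4.dst_addr : selector;
        }
        actions = { set_egress_port; NoAction; }
        psa_implementation = lag_sel;
    }

    apply {
        lag_members.apply();
    }
}
```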

vgurevich commented 7 years ago

@jafingerhut ,

You are correct -- this is a subject that will be difficult to avoid. While unicast LAG/ECMP can be expressed fairly easily using either pure P4_16 or well-understood externs (e.g. "calculate hash"), the multicast context is a totally different beast. There are several ways to multicast to a LAG or an ECMP group, and we will have to decide whether the PRE should be LAG- and ECMP-aware or not.

In terms of hash algorithm selection/tuning, there are a couple of options, all centered on the fact that hash calculation is an extern: if we instantiate it, then a lot of very clever control-plane APIs can control the algorithm, including the formula itself as well as the list of fields it takes (although that would be a huge backdoor).

We should definitely discuss all of that and agree on something that would allow us to write interesting (although probably not the coolest) programs.
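As one concrete possibility for keeping hashing under P4 program control rather than baking it into the architecture, here is a sketch of a symmetric hash computed with the PSA Hash extern; the header and field names are illustrative:

```p4
#include <core.p4>
#include <psa.p4>

header ipv4_t {
    bit<32> src_addr;
    bit<32> dst_addr;
}

struct headers_t { ipv4_t ipv4; }

control SymmetricHash(inout headers_t hdr) {
    // A P4-controlled hash: the algorithm is fixed by the program,
    // and the field list is ordinary P4 code the program can change.
    Hash<bit<16>>(PSA_HashAlgorithm_t.CRC16) flow_hash;

    apply {
        // 'Sort' the addresses so (A,B) and (B,A) hash identically,
        // giving the symmetric behavior described above.
        bit<32> lo = (hdr.ipv4.src_addr < hdr.ipv4.dst_addr)
                       ? hdr.ipv4.src_addr : hdr.ipv4.dst_addr;
        bit<32> hi = (hdr.ipv4.src_addr < hdr.ipv4.dst_addr)
                       ? hdr.ipv4.dst_addr : hdr.ipv4.src_addr;
        bit<16> h = flow_hash.get_hash({ lo, hi });
        // h can then feed a LAG/ECMP member selection table.
    }
}
```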

jafingerhut commented 7 years ago

I have heard Tom Edsall tell a story of visiting Scott Shenker's class to give a guest lecture on networking, and afterwards a student asked 'How would that work for multicast?', and Scott Shenker said something like 'The quickest way to ruin any perfectly good networking conversation is to ask about multicast.'

I don't have encyclopedic knowledge of this topic, but I believe the best known inexpensive way to do multicast load balancing is to precalculate a small number of multicast groups, e.g. 8 or 16, and load balance over those groups. Each group makes its own independent choice of member for any LAG ports involved, but every multicast packet that picks a given group always goes over the same LAG members.

There are ways to imagine getting fancier than that, but if the decision must be made before the packet buffer, and you want to do it in a programmable way, you need to do ingress recirculation, which takes bandwidth away from newly arriving packets.
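A sketch of the data-plane half of that precalculation approach, assuming the control plane has installed 8 variants of each logical multicast group at consecutive PRE group IDs; the group layout, metadata field, and 32-bit group-ID width are assumptions, while the Hash extern is PSA's:

```p4
#include <core.p4>
#include <psa.p4>

header ipv4_t {
    bit<32> src_addr;
    bit<32> dst_addr;
}

struct headers_t  { ipv4_t ipv4; }
struct metadata_t { bit<32> base_group; }  // set by an earlier table

control MyIngress(inout headers_t  hdr,
                  inout metadata_t meta,
                  in    psa_ingress_input_metadata_t  istd,
                  inout psa_ingress_output_metadata_t ostd) {
    Hash<bit<16>>(PSA_HashAlgorithm_t.CRC16) grp_hash;

    apply {
        // The control plane precomputes 8 variants of each logical
        // multicast group at consecutive PRE group IDs starting at
        // meta.base_group; each variant makes its own independent
        // choice of LAG member for every LAG in the group.
        bit<16> h = grp_hash.get_hash({ hdr.ipv4.src_addr,
                                        hdr.ipv4.dst_addr });
        bit<32> grp = meta.base_group + (bit<32>)(h & 16w7);  // hash % 8
        ostd.multicast_group = (MulticastGroup_t) grp;
    }
}
```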

jafingerhut commented 7 years ago

And to be explicit, packets going to a PRE multicast group would not get back any usable results if they tried to read the current queue length of a multicast group, so they shouldn't bother trying.

vgurevich commented 7 years ago

@jafingerhut , I totally feel you. The problem is that "basic L2 switching" typically requires (at least) the ability to flood broadcast/unknown-unicast/unknown-multicast packets to a VLAN, which is basically multicasting, and VLANs are known to contain not just ports but LAGs. And, of course, once we transition into VXLAN and similar areas, everything goes up a level (i.e. the "LAGs" are now ECMP paths that can themselves sit on top of real LAGs).

Partly, this goes back to the question of PSA goals. I think we want people who use PSA to be able to write at least what is nowadays considered a fairly "standard" data-plane program, but maybe that's too much and we should not try to boil the ocean. I honestly do not know.

jafingerhut commented 7 years ago

@vgurevich Isn't providing a PRE with multicast groups having 2 or 3 'levels of hierarchy', where the leaves are physical ports, a fairly good primitive here? Perhaps as an extension, an implementation could choose to put LAGs at the leaves instead of physical ports, but that would be an optional extension over and above the basic PSA requirements. The 2 or 3 levels of hierarchy in the PRE multicast group lists would be a convenience for the control-plane API, which would then only need to make O(1) changes when ECMP/LAG memberships change, instead of larger ones.

vgurevich commented 7 years ago

@jafingerhut ,

This is definitely a solution as long as you feel comfortable mandating the number of levels, the order of levels, the representation of the aggregates and other things that are needed for the "interesting programs".

The problem is that the PRE is not programmable, so there is a very high chance that different vendors' PREs will all be not-quite-compatible, to put it mildly. That is where we will really need to agree on common functionality.

jafingerhut commented 7 years ago

In the 2017-Aug-07 PSA working group meeting, a decision was made to write up a proposed way for the egress control block to have access to a per-packet metadata field whose value is the length of the packet's queue at the time the packet was enqueued, in some TBD units (sketched below).
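A sketch of how that proposal might look from an egress program's point of view, assuming a hypothetical `queue_depth_at_enqueue` field (name and width invented) were added to PSA's egress input metadata:

```p4
#include <core.p4>
#include <psa.p4>

// Hypothetical INT report header; not part of any standard.
header int_queue_report_t {
    bit<24> queue_depth;   // units TBD, per the discussion above
}

struct headers_t  { int_queue_report_t int_report; }
struct metadata_t {}

control MyEgress(inout headers_t  hdr,
                 inout metadata_t meta,
                 in    psa_egress_input_metadata_t  istd,
                 inout psa_egress_output_metadata_t ostd) {
    apply {
        // Hypothetical field: the target would fill it with the depth
        // of this packet's queue at the moment the packet was
        // enqueued. This is the proposal under discussion, not PSA.
        hdr.int_report.setValid();
        hdr.int_report.queue_depth = istd.queue_depth_at_enqueue;
    }
}
```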

There was no discussion that I recall in that meeting regarding PRE multicast replication configurability, but there should probably be a separate issue/discussion on that topic alone, to see who has an interest in what levels of configurability there.

jafingerhut commented 7 years ago

In the 2017-Oct-18 PSA working group meeting, the topic of making queue lengths visible in some way as part of PSA was brought up again, and there was no strong interest in pushing for its inclusion in the first release of the PSA spec. A P4 applications working group is getting started soon, and its first application area is planned to cover INT (In-band Network Telemetry), so it seems reasonable to leave any extensions required to implement INT to that working group.

Closing this issue.