Pre-2ndWGLC - Open Issue from WG-LC: Issues caused by RD in CT

suehares commented 9 months ago

Swadesh comments:

https://mailarchive.ietf.org/arch/msg/idr/E8XCqoTpccZfZqdFxHjjQ5rr6ck/

============ [Idr] Issues caused by RD in CT draft "Swadesh Agrawal (swaagraw)" swaagraw@cisco.com Tue, 26 September 2023 20:57 UTC

Hi Sue,

BGP CT’s choice of RD in NLRI meant for transport BGP hop-by-hop routing has design limitations that introduce sub-optimal behaviors. This issue has been brought up on the list several times, but has not been resolved or addressed. I am describing the concern here, as requested during the interim.

The CT draft states a reason for the RD in NLRI as “it helps in troubleshooting by uniquely identifying the originator of a route and avoids path-hiding”. Further, section 10 states “Deploying unique RDs is strongly RECOMMENDED”.

With unique RDs, the following problems exist:

Lack of multipath and localized fast convergence within a domain for originator failure (even though an operator has deployed redundant routers for that purpose).
When the issue was brought up, the authors proposed the ‘stripping RD’ and ‘TC, IP’ allocation scheme (Section 10.2) to provide multipath. However, this has the unintended side-effect of advertisement of duplicate/redundant BGP routes with the same forwarding label across the network.

I. It multiplies control plane state on all routers across multiple domains while not providing any useful functionality.

II. Further, it exposes the failure churn for originating routers outside the originating domain, all the way to ingress PEs in other domains. The impact worsens as the number of originators increases (e.g., the Anycast case).

As a response to the issue raised during the adoption call, the CT draft added the table (Figure 7). Rows 4 and 6 of Figure 7 show that failure of an originator (such as an ABR) will result in slow convergence, because the LSP is end to end and the failure must be propagated to the ingress PE before it can converge. To avoid this, "RD stripping" or “TC,EP” label allocation procedures at BNs are required, but these have the redundant routes/churn issue described above. Row 5 shows there are 16 routes, even though there are only 2 labels advertised between them.

At the interim meeting, CT authors referred to two recommendations as workarounds – use same RD, and recommend the origination of CT routes from egress PEs, instead of from ABRs.

I. In such a case, the utility of the RD is very minimal, as an egress PE would be the only node originating its route. In fact, IMHO the WG should question why the cost of VPN import semantics is required at every BGP transport hop, as the rationale of “RD for troubleshooting” does not hold here. (If we pivot to the Anycast case here, the unique RD issues listed above persist.)

II. It’s also fairly common to originate BGP routes at ABRs; this would be a restriction.

In any case, the CT authors did acknowledge the issue as per my understanding.

Each of the options in the table requires the operator to make a choice with its associated sub-optimal behavior. It is a significant deviation in functional and operational behaviors from currently deployed BGP-LU, which does not have these duplicate route/churn issues and provides multipath/active-backup within each local domain.

These issues are a manifestation of signaling RD in NLRI for BGP hop by hop transit routing (unlike VPNs where the import/merge are at provider edge routers). I believe the limitations should be captured clearly in the draft, if they are not going to be addressed. This will allow operators to be aware of the impact of the respective options while doing their network designs.
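The trade-off described above can be illustrated with a minimal sketch, assuming a toy RIB that keys received paths by (RD, endpoint prefix). The names `build_rib`, `ABR1`, `ABR2`, and all RD and prefix values are hypothetical, and the model ignores route targets, best-path selection, and label allocation; it only shows how the RD choice changes the number of distinct NLRIs.

```python
from collections import defaultdict

def build_rib(advertisements):
    """Group received paths by NLRI key (RD, endpoint prefix)."""
    rib = defaultdict(list)
    for rd, prefix, next_hop in advertisements:
        rib[(rd, prefix)].append(next_hop)
    return dict(rib)

# Two redundant originators (hypothetical ABR1 and ABR2) of the same endpoint.
unique_rd = [
    ("192.0.2.1:1", "203.0.113.1/32", "ABR1"),
    ("192.0.2.2:1", "203.0.113.1/32", "ABR2"),
]
same_rd = [
    ("65000:1", "203.0.113.1/32", "ABR1"),
    ("65000:1", "203.0.113.1/32", "ABR2"),
]

rib_unique = build_rib(unique_rd)
rib_same = build_rib(same_rd)

print(len(rib_unique))  # 2 distinct NLRIs, one path each
print(len(rib_same))    # 1 NLRI holding two candidate paths
```

In this toy model, unique RDs yield two separate routes that are each propagated with a single path, whereas a shared RD yields one route with two local candidates, at the cost of the ingress no longer seeing each originator individually.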

References to past discussions:
https://mailarchive.ietf.org/arch/msg/idr/a3zJ4y7eumYTU9Fc-xUJn9shtLU/
https://mailarchive.ietf.org/arch/msg/idr/wGgrSfXxhUU-RRM_sN6a64RK560/

Regards, Swadesh

suehares commented 9 months ago

The section that Swadesh is referring to is section 10.3 in version 22.
The table is now Table 7.

Issue-1 Swadesh and Kaliraj are debating the contents of the table and the conclusion reflected in section 7.4 (version draft-ietf-idr-bgp-ct-12 + -22) in the following text:

"This helps in avoiding BGP CT route churn throughout the CT network when an instability (e.g. link failure) is experienced in a domain. The failure is not propagated further than the BN closest to the failure. If a different label allocation mode is used, impact on end to end convergence should be considered."

Action item-1: Resolution on section 10.3 (-22) and relation to section 7.4

suehares commented 9 months ago

Issue 2 - Whether the tests are realistic.

Swadesh's comment: "BGP-CT-UPDATE-PACKING-TEST results included are for an unrealistic scenario in practice, and also do not cover relevant deployment cases. For example, it captures 1.9 million BGP CT MPLS routes packed in 7851 update messages. That means about 250 routes sharing attributes and packing every update message completely. It seems the test was done with all routes (around 400k) for a given color having exactly the same attributes. This is not a practical example. A more practical case would be to have a packing ratio of, for example, 5-6 routes to a set of attributes."
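The arithmetic behind this comment can be checked with a short sketch. It takes the two figures quoted above (1.9 million routes, 7851 update messages) and assumes, as a simplification, that each update message carries exactly one attribute set and is not otherwise size-limited. The helper `updates_needed` is a hypothetical name introduced for illustration.

```python
import math

total_routes = 1_900_000  # "1.9 million BGP CT MPLS routes" (from the comment)
update_messages = 7851    # update messages reported in the test (from the comment)

# Average number of routes packed into each update in the reported test.
routes_per_update = total_routes / update_messages

def updates_needed(routes, packing_ratio):
    """Updates required if each update carries `packing_ratio` routes (assumption)."""
    return math.ceil(routes / packing_ratio)

updates_at_5 = updates_needed(total_routes, 5)
updates_at_6 = updates_needed(total_routes, 6)

print(round(routes_per_update))  # 242
print(updates_at_5)              # 380000
print(updates_at_6)              # 316667
```

Under these assumptions, the suggested packing ratio of 5-6 routes per attribute set would need roughly 40 to 50 times as many update messages as the reported test.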

Since the Spring WG has not adopted the requirements document (draft-hr-spring-intentaware-routing-using-color), there are two choices: a) simply ask each group to provide a detailed specification of their test plan and results, or b) delay until Spring finishes its work on the requirements and then ask the Spring WG to set a reasonable test plan.

Based on feedback from the authors that time is critical for their customers, the IDR shepherd (that's me) selected option A.

Item-#2 from Swadesh's request is closed.

suehares commented 9 months ago

Issue #3 - Problems in the use of Implicit Null

Swadesh points out that "Non deterministic usage of IMPLICIT NULL : Implicit NULL is a valid MPLS label and indicates no label to push by receiver. Label path to BGP nexthop is still valid/expected. For example in figure 11 and 12 not sure why R3 won’t send MPLS traffic to R4 as stated in last paragraph. Similar is the problem with section 13.2.2.2."

Discussion between Swadesh and Kaliraj follows:

Implicit NULL is a valid MPLS label and indicates no label to push by receiver. Label path to BGP nexthop is still valid/expected.

KV> intra-domain tunneled path to the BGP nexthop may or may-not be labeled. Implicit-Null label carried in BGP-LU/BGP-CT route doesn’t claim anything about the intra-domain tunnel. It just says no BGP-LU/BGP-CT label needs to be pushed in forwarding.
[SA] Thanks for the clarification on the procedure. But when I read the draft, it points towards a new meaning of IMPLICIT NULL. Quoting the exact text in the draft: “R4 will carry the special MPLS Label with value 3 (Implicit-NULL) in RFC 8277 encoding, which tells R1 not to push any MPLS label towards R4”. It would be better to put your response text in the draft.

Section 13.2.2.1 is extending implicit NULL label presence to indicate that originator does not support MPLS. This is not possible as the two cases cannot be distinguished.
KV2> Sure, will clarify the text to say, “Implicit-Null label carried in BGP-LU/BGP-CT route indicates that no BGP-LU/BGP-CT label is pushed in forwarding”.
[SA2] Thanks. But misdelivery of traffic is possible if an MPLS tunnel exists to next hop with this procedure. This should be captured in the draft.
KV> so, there is no ambiguity. Implicit-NULL is only saying no BGP-LU/BGP-CT label needs to be pushed in forwarding.
[SA] Same response as previous point.

For example in figure 11 and 12 not sure why R3 won’t send MPLS traffic to R4 as stated in last paragraph. Similar is the problem with section 13.2.2.2.

KV> as shown in these figures, R4 does not support MPLS. So there can be no MPLS-tunnel from R3->R4 so why would R3 send MPLS traffic to R4? When R3 tries to resolve PNH==R4, it will find no matching MPLS tunnel, and the route will remain Unusable.
[SA] It’s an operational burden to make sure that no router has MPLS path to R4 (MPLS path can be for other purposes). Otherwise there can be mis-forwarding with IMPLICIT-NULL in 8277 style encoding for non MPLS encapsulation signaling (SRv6, UDP) in BGP CT. It should be captured in the draft.
KV2> R4 does not support MPLS. So there can be no MPLS path towards it. There is no operational burden. Thanks for the comments.
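The two positions in this exchange differ only in their premise about whether any MPLS tunnel to R4 exists at R3. A minimal sketch of that dependency follows, assuming a toy per-next-hop tunnel table; `resolve_next_hop`, the table layout, and the tunnel names are hypothetical and are not taken from the draft's resolution procedure.

```python
def resolve_next_hop(next_hop, tunnel_table, wanted_encap="mpls"):
    """Return the first tunnel to next_hop with the wanted encapsulation, or None."""
    for tunnel in tunnel_table.get(next_hop, []):
        if tunnel["encap"] == wanted_encap:
            return tunnel
    return None

# Kaliraj's premise: R4 does not support MPLS, so R3 holds no MPLS tunnel to it.
r3_tunnels_kv = {"R4": [{"encap": "srv6", "name": "t-srv6"}]}

# Swadesh's premise: an MPLS path to R4 exists at R3 for some other purpose.
r3_tunnels_sa = {"R4": [{"encap": "srv6", "name": "t-srv6"},
                        {"encap": "mpls", "name": "t-ldp"}]}

usable_kv = resolve_next_hop("R4", r3_tunnels_kv) is not None
usable_sa = resolve_next_hop("R4", r3_tunnels_sa) is not None

print(usable_kv)  # False: the route stays unusable for MPLS forwarding
print(usable_sa)  # True: the route resolves over the unrelated MPLS tunnel
```

The sketch takes no side; it only shows that the outcome at R3 follows directly from which premise holds in a given network.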

suehares commented 9 months ago

Swadesh's comments on issues in discussion: There were 3 issues called out, none of which has been addressed. [issue-1]

The first one is a significant design limitation, in that the use of unique RDs results in:

a. Lack of multipath and localized fast convergence within a domain for originator failures (even though an operator has deployed redundant routers).
b. To achieve multipath, ‘stripping RD’ and ‘TC, IP’ allocation result in advertisement of duplicate/redundant BGP routes with the same forwarding label,
c. which in turn increases control plane state on all routers upstream across multiple domains, and exposes the failure churn outside the originating domain all the way to ingress PEs in other domains.

This is not a new issue, it has existed since day-1 of CT and still remains, in spite of all the revisions of the draft.

This is not just an editorial issue. It is a significant deviation from currently deployed BGP-LU, which does not have these duplicate route/churn issues and provides multipath/active-backup within each local domain. It is a manifestation of the wrong data model of signaling RD in NLRI for BGP hop-by-hop routes. The limitations need to be captured clearly in the draft as impact of the respective options, if they are not going to be addressed.

[Shepherd's comments:] You are debating a design limitation based on your theoretical opinion. Kaliraj has disagreed with your viewpoint on what the text does and how the design works. There are two ways forward at this point: a) you provide test results that counter Kaliraj's opinion, or b) you accept that there is a difference of opinion.

I do not see further discussion reaching a consensus on this viewpoint. If you have specific changes to the

Swadesh Issue-2. The second issue is that of the inefficiency caused by the choice of the CT NLRI which only supports MPLS labels in the NLRI. Any use other than MPLS, such as SR prefix-SID (label-index) or SRv6 SID means every route needs to be sent in a separate BGP update message with no packing possible. The scale/performance test data completely ignores this issue, and shows data for a non-existent problem.

[Shepherd's comments]: This is Issue-2 above. See my comments regarding this issue.

"Swadesh Issue-3" The 3rd issue is that the draft introduces non-deterministic usage of IMPLICIT NULL. It can result in mis-delivery of traffic, and it is an operational burden to make sure no MPLS path exists to the next hop. This is again a result of CT mandating signaling a label in the NLRI even for non-MPLS encapsulation.

[Shepherd's comments]: Please provide proof of the "misdelivery of traffic" and specific cases.

suehares commented 9 months ago

https://mailarchive.ietf.org/arch/msg/idr/PCq07w7tXWWRy8gIdZroqCWDbYs/

Reference where Kaliraj responds further to Swadesh.

jhaas-pfrc commented 9 months ago

Swadesh comments:

I. In such a case, the utility of the RD is very minimal, as an egress PE would be the only node originating its route. In fact, IMHO the WG should question why the cost of VPN import semantics is required at every BGP transport hop, as the rationale of “RD for troubleshooting” does not hold here. (If we pivot to the Anycast case here, the unique RD issues listed above persist.)

II. It’s also fairly common to originate BGP routes at ABRs; this would be a restriction.

From Section 10.2 of draft-23: "Alternatively, the same RD may be provisioned for multiple originators of the same EP. This mode can be used when the ingress does not require full visibility of all nodes originating an EP."

It seems your point is this is a situation where the ingress doesn't need such visibility. The current -ct procedures permit using the same RD in such circumstances.

It should also be noted that per prior analysis, if RD is effectively a distinct value per TC (color), the signaling impacts are the same as -car. It's understood that in such circumstances that there is duplication of signaling information.

suehares commented 8 months ago

During our CAR/CT editor's discussion, Swadesh gave the example of

          set self
          next-hop

   R1--------R2-------R3
      ======   =======
      tunnel1   tunnel2

      <-----routes with CT

   traffic ---> flow

The point that Swadesh made is: a) how does R2 know what to do with Implicit Null in order to connect the tunnel? b) How does the code for CT and CT for SRv6 connect the two tunnels?

Answer from Kaliraj: Implicit-Null simply says the two tunnels (Tunnel-1 and Tunnel-2) are not MPLS tunnels. There is no label popping, pushing or swapping.

The tunnels can be RFC9012 tunnels without MPLS (e.g. GRE tunnels). The tunnels can be RFC9012 tunnels for SRv6.

Swadesh felt the Implicit label implied more in some implementations of
Kaliraj pointed out that the CT SRv6 implementation treats it just as "not MPLS."

suehares commented 8 months ago

Based on the meeting of the CAR and CT editor (DJ, Swadesh, Kaliraj, and Nats), I will consider issue 3 of Swadesh's comment closed.

suehares commented 8 months ago

Issue 1 discussion:

Swadesh would like to have Figure 7 annotated with the cost of each row of choices. Kaliraj indicates that these costs are described in the text of section 10.3. Part of the editorial review should consider how to link the two directly. Perhaps (1, 2) could link the two better. However, I need the report from Swadesh to deal with this issue.

kalirajv commented 8 months ago

   R1--------R2-------R3
      ======   =======
        tunnel1   tunnel2

Answer from Kaliraj: Implicit-Null simply says the two tunnels (Tunnel-1 and Tunnel-2) are not MPLS tunnels. There is no label popping, pushing or swapping.

Recording what I discussed in the meeting, and in earlier IDR-interim meeting with Ketan too:

When R2 advertises Implicit-Null to R1, it just means, "no BGP-CT label needs to be pushed over the tunnel1". It makes no assumptions on what kind of tunnel the tunnel1 is. It can be any type (MPLS, GRE, others). When R2 is a pure-SRv6 node, it uses implicit-Null in BGP-CT route. When R2 supports both SRv6 and MPLS, it uses real label along with SRv6-SID.

R1 will push MPLS-label towards R2, if it had received a real label in some BGP-family (like SAFI 128, SAFI 76, SAFI 4), or if the tunnel to R1 was an MPLS-tunnel. All this is existing BGP MPLS routing behavior, nothing new or changed for BGP CT.

Bottom line: a BGP CT route can carry multiple encapsulations (just like any regular BGP route). The receiver uses the encap that it supports/desires.
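The behavior described in this comment can be sketched as a small function that decides which MPLS labels the receiver pushes. This is a simplified illustration only: `labels_to_push` and the sample label values are hypothetical, a `tunnel_label` of `None` stands for a non-MPLS tunnel (for example GRE or SRv6), and the only protocol fact relied on is that label value 3 is the reserved Implicit-NULL label.

```python
IMPLICIT_NULL = 3  # reserved MPLS label value meaning "push no label" (RFC 3032)

def labels_to_push(bgp_label, tunnel_label=None):
    """Return the MPLS labels the receiver pushes, outermost first.

    bgp_label:    label carried in the BGP route (3 means Implicit-NULL).
    tunnel_label: transport label of the tunnel to the next hop, or None
                  when the tunnel is not an MPLS tunnel (assumption).
    """
    stack = []
    if tunnel_label is not None and tunnel_label != IMPLICIT_NULL:
        stack.append(tunnel_label)  # MPLS transport tunnel contributes its label
    if bgp_label != IMPLICIT_NULL:
        stack.append(bgp_label)     # a real BGP label is pushed beneath it
    return stack

print(labels_to_push(IMPLICIT_NULL))          # []
print(labels_to_push(IMPLICIT_NULL, 10001))   # [10001]
print(labels_to_push(2000, 10001))            # [10001, 2000]
```

In this model, Implicit-NULL in the BGP route removes only the BGP-signaled label; whether any MPLS label appears at all depends on the tunnel type, which matches the statement that Implicit-Null makes no assumption about tunnel1.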

Ref:
https://www.ietf.org/archive/id/draft-ietf-idr-bgp-ct-26.html#section-6.3
https://www.ietf.org/archive/id/draft-ietf-idr-bgp-ct-26.html#section-11.3.2

suehares commented 8 months ago

Closed based on draft-ietf-idr-bgp-ct-27.
The text around the table provides similar details. I have reviewed this with the editors.

Should the RTG-AD wish to discuss this mail thread, please contact me for an early review.

ietf-wg-idr / draft-ietf-idr-bgp-ct

Pre-2ndWGLC - Open Issue from WG-LC: Issues caused by RD in CT #59