stratum / fabric-tna

The SD-Fabric data plane
https://docs.sd-fabric.org/

MPLS TTL behaviour might be wrong #109

Open Yi-Tseng opened 3 years ago

Yi-Tseng commented 3 years ago

Currently, we set the MPLS TTL to a default value (64). However, we should copy the TTL from the IP header when we push the label. We also need to copy the TTL back into the IP header when we pop the MPLS label.

The original RFC does not give much detail on how to handle the TTL for IP packets: https://tools.ietf.org/html/rfc3031#section-3.23

But there are some rules in RFC3032 (MPLS Label Stack Encoding): https://tools.ietf.org/html/rfc3032#section-2.4.3

2.4.3. IP-dependent rules

   We define the "IP TTL" field to be the value of the IPv4 TTL field,
   or the value of the IPv6 Hop Limit field, whichever is applicable.

   When an IP packet is first labeled, the TTL field of the label stack
   entry MUST BE set to the value of the IP TTL field.  (If the IP TTL
   field needs to be decremented, as part of the IP processing, it is
   assumed that this has already been done.)

   When a label is popped, and the resulting label stack is empty, then
   the value of the IP TTL field SHOULD BE replaced with the outgoing
   TTL value, as defined above.  In IPv4 this also requires modification
   of the IP header checksum.

   It is recognized that there may be situations where a network
   administration prefers to decrement the IPv4 TTL by one as it
   traverses an MPLS domain, instead of decrementing the IPv4 TTL by the
   number of LSP hops within the domain.

Also, there are some explanations on these websites: https://www.ciscopress.com/articles/article.asp?p=680824&seqNum=4 http://wiki.kemot-net.com/mpls-ttl-behavior

They say that on every operation (push/swap/pop) we need to set the outgoing TTL to the TTL of the previous header minus 1.
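To make the push/swap/pop rule concrete, here is a minimal Python sketch of the TTL handling quoted above. It is not fabric-tna code: the `Packet` class and the `push`, `swap` and `pop` helpers are hypothetical names, only a single label is modeled, and TTL expiry and the IPv4 checksum update are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Packet:
    ip_ttl: int
    mpls_ttl: Optional[int] = None  # None means the packet is unlabeled


def push(pkt: Packet) -> None:
    """First label: IP processing decrements the IP TTL, then the label copies it."""
    pkt.ip_ttl -= 1
    pkt.mpls_ttl = pkt.ip_ttl


def swap(pkt: Packet) -> None:
    """Label swap: outgoing MPLS TTL is the incoming MPLS TTL minus 1."""
    pkt.mpls_ttl -= 1


def pop(pkt: Packet) -> None:
    """Last label popped: the outgoing TTL is written back into the IP header."""
    pkt.ip_ttl = pkt.mpls_ttl - 1
    pkt.mpls_ttl = None


pkt = Packet(ip_ttl=64)
push(pkt)  # ip_ttl=63, mpls_ttl=63
swap(pkt)  # mpls_ttl=62
pop(pkt)   # ip_ttl=61, label removed
```

With this model every label operation costs one TTL unit, and the cost becomes visible in the IP header once the label stack is empty.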

ccascone commented 3 years ago

Following the RFC means having the final IP TTL decremented by the number of hops inside the fabric (e.g., ttl = ttl - 2 for packets going from one leaf to the other in a 2x2).

But considering that we use MPLS only inside the fabric (i.e., we don't peer with external MPLS routers) and that Trellis abstracts the whole fabric as one big IP router, do we need to follow the RFC? Or should we just make sure that the IP TTL is decremented by one independently of the number of hops inside the fabric? cc @charlesmcchan @pierventre

charlesmcchan commented 3 years ago

Shouldn't it be -3 instead of -2? Our case should be the same as figure 3-4 on this page

I believe SR does both COPY_OUT and COPY_IN at the first and last hop respectively.

ccascone commented 3 years ago

Yes, it should be -3...
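The -3 can be checked with a short hop-by-hop trace. The sketch below is only an illustration under assumptions: a leaf -> spine -> leaf path, the ingress leaf copying the IP TTL into the label, and the spine doing penultimate hop popping and copying the TTL back. The function name `trace_uniform_php` is made up for this example.

```python
def trace_uniform_php(ip_ttl: int) -> list:
    """Return (switch, ip_ttl, mpls_ttl) after each switch on leaf -> spine -> leaf."""
    steps = []

    # Ingress leaf: route the packet (IP TTL - 1), then push a label that
    # copies the already-decremented IP TTL.
    ip_ttl -= 1
    mpls_ttl = ip_ttl
    steps.append(("ingress leaf", ip_ttl, mpls_ttl))

    # Spine: penultimate hop. Decrement the MPLS TTL, pop the label and
    # copy the result into the IP header.
    mpls_ttl -= 1
    ip_ttl = mpls_ttl
    steps.append(("spine", ip_ttl, None))

    # Egress leaf: plain IP forwarding, one more decrement.
    ip_ttl -= 1
    steps.append(("egress leaf", ip_ttl, None))
    return steps


for switch, ip, mpls in trace_uniform_php(64):
    print(switch, ip, mpls)
```

Starting from an IP TTL of 64, the packet leaves the egress leaf with 61, one decrement per switch on the path.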

I just realized that since we do penultimate hop popping (i.e., spine pops MPLS), without copying the TTL between MPLS and IP, we cannot prevent loops inside the fabric...

I still think that decrementing the IP TTL by the number of hops inside the fabric is wrong. IMO the fabric should behave like one big router between the access devices and the Internet. The fact that we use MPLS tunnels internally is an implementation detail, and the IP TTL should not be affected by the number of switches inside the fabric. Instead, the IP TTL should be frozen when inside the tunnel.

However, using penultimate hop popping doesn't leave us any other choice if we want protection against loops. We should change segmentrouting to support ultimate hop popping (i.e., dest leaf pops MPLS) to be able to detect loops inside the fabric.
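As a rough illustration of the alternative argued for here, the sketch below models ultimate hop popping with the IP TTL frozen inside the tunnel. Everything in it is an assumption made for the example: the function name `forward_through_tunnel`, the choice to charge the single IP decrement at the ingress leaf, and the reuse of 64 as the initial MPLS TTL.

```python
from typing import Optional

DEFAULT_MPLS_TTL = 64  # the default value mentioned at the top of this issue


def forward_through_tunnel(ip_ttl: int, label_hops: int) -> Optional[int]:
    """Return the IP TTL after the fabric, or None if the packet is dropped.

    label_hops is the number of switches that forward on the label after
    the ingress leaf, including the destination leaf that pops it.
    """
    # The whole fabric counts as one IP hop, charged at the ingress leaf.
    ip_ttl -= 1
    if ip_ttl <= 0:
        return None

    # The label gets its own TTL; the IP TTL is not touched in the tunnel.
    mpls_ttl = DEFAULT_MPLS_TTL
    for _ in range(label_hops):
        mpls_ttl -= 1
        if mpls_ttl <= 0:
            return None  # a loop inside the fabric ends here

    # The destination leaf pops the label without copying the TTL back.
    return ip_ttl


print(forward_through_tunnel(64, 2))     # normal leaf -> spine -> leaf path
print(forward_through_tunnel(64, 1000))  # looping packet
```

In this model a packet on a normal path loses exactly one IP TTL unit regardless of the number of switches, while a packet caught in a loop is dropped once the MPLS TTL runs out.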

charlesmcchan commented 3 years ago

What SR does today is completely legit as described in RFC3443 section 3.1.

I find it hard to justify the benefit of making such changes, given the amount of work that needs to be done. We need a stronger reason to prioritize this.

ccascone commented 3 years ago

I agree that we don't have strong reasons to make this change. I just wanted to voice my concern.

I will make the change to fix the TTL behavior such that we comply with RFC3443 section 3.1, but most importantly with segmentrouting flow objectives.