openlcb / documents

The OpenLCB specification: standards, recommended practices and other documentation.

TCP draft: remove chaining #10

Open balazsracz opened 5 years ago

balazsracz commented 5 years ago

The chaining bit from the TCP standard draft is not implemented by any of the stacks. Maybe we should remove it.

RobertPHeller commented 5 years ago

Interesting thought... I was wondering what use the chaining bit served.

RobertPHeller commented 5 years ago

While the TcpTransfersS file mentions chaining, the TcpTransfersTN document does not mention it at all (except where it expands on the format of the prefix header). No one has bothered to implement chaining, but has anyone come up with an actual use case for it, one that would have practical real-world application? Unless someone comes up with a real-world use case, we can probably just drop chaining from the standard, but reserve the bit for some future use (maybe chaining or something else).

bakerstu commented 5 years ago

Yes, I want this feature removed. There is no need for it. The idea was to be able to track the path of a packet, but there are already standard ways of doing this without us having to invent our own.

RobertPHeller commented 5 years ago

Ok, here are some proposed changes to TcpTransferS:

At the bullet at about line 25 on page 1:

And the whole paragraph at about line 75 on page 3 can just be deleted in its entirety.

RobertPHeller commented 5 years ago

And in TcpTransferTN, in the section about the flags field, the sentence about chaining can be deleted.

kiwi64ajs commented 5 years ago

Well given the whole TCP Transfer is something that hasn’t really been implemented and tested very widely yet, I think you’re being a bit quick with the Delete button.

We can flag it for removal but I’d prefer to defer that decision for several months of active use/experience and consideration.

I suggest we make it go first and then review the gaps before we delete too much.

Bob J probably wrote this so I might reach out to him and ask why he thought it was important at the time. That may be a useful perspective. Same goes for the Link Control stuff that is also mentioned or hinted at but is otherwise not really described.

In my mind all this stuff just hasn’t had enough thought and real world exposure to be decisive about what should be in/out yet.

Alex

bakerstu commented 5 years ago

Alex,

I disagree that we need to wait in order to pull these out, but I'm willing to wait for you to get onboard. Removing this is long past due. I agree with Balazs' statements here as a superior way to handle this: https://groups.io/g/openlcb/message/10712

You might have to go back and read the whole thread to get all the context.

RobertPHeller commented 5 years ago

Well, I implemented the Native Tcp transfer some time ago (2016-06-26 according to my Subversion log)...

kiwi64ajs commented 5 years ago

Hi Guys,

Ok, I think I have re-read the relevant threads and refreshed my memory… Interesting to see how far back some of the TCP stuff goes. I saw some stuff from 2011...

I guess my concern was and still is that we don’t ignore the problem of handling loops or declare it done or solved with little apparent testing or proving.

If we remove the chaining stuff (and I don’t disagree that we should) then what do we replace it with? It’s not yet obvious (to me at least) that we have a replacement.

I see a few comments and references to “spanning tree” but no discussion of its application to OpenLCB TCPTransfer.

I see Balazs’ comments in https://groups.io/g/openlcb/message/10712 and in particular points 2 & 3, which seem to make high-level sense, but I think more details need to be fleshed out to really understand what he means. Then we can make a better decision and work towards prototyping and proving it and modifying our docs.

I’ve spent some time today browsing Wikipedia, reading about “Spanning Tree Protocol” (STP), “Rapid Spanning Tree Protocol” (RSTP), “Shortest Path Bridging” (SPB), “Transparent Interconnection of Lots of Links” (TRILL), “Open Shortest Path First” (OSPF) and “Intermediate System to Intermediate System” (IS-IS).

Previously all I knew was that spanning tree handled loops so they didn’t kill the LAN, but I didn’t know how. It’s the how that we need to refine and that I need to know more about.

In case others are interested, look up each of the above in Wikipedia; the first 15 or so pages of this document are also a helpful summary:

https://www.delaat.net/rp/2011-2012/p25/report.pdf

HTH

Alex Shepherd

kiwi64ajs commented 5 years ago

I spent some time googling to find spanning tree software libraries we could leverage that don’t look too complicated and I found these:

https://github.com/adigostin/mstp-lib

https://github.com/ani8897/Spanning-Tree-Protocol

It would be good to hear from Balazs, as maybe he’s already got enough sorted to not need anything like these libraries/projects.

Alex

kiwi64ajs commented 5 years ago

What is our objective with this? Is it:

1) Simply protection from loops, or

2) Protection from loops + better utilisation of all available paths + failover etc.

It seems that STP and RSTP achieved 1) ok, but didn't spread network traffic over all the available paths. Hence the whole pile of effort that went into the subsequent protocols (SPB, TRILL etc.), which better utilise all available network paths.

I guess 1) is a must have, 2) is a bonus but not really a problem we need to solve right now.

However, section "3.3.1 Protocol" of TCPTransferTN.pdf has the bit about using mDNS to locate a hub at address _openlcb-hub._tcp.local and, if none is found, becoming a hub. We might want to have a way for a less capable hub to hand over to a more capable one. E.g. an ESP32 node could act as a hub for a few nodes (maybe 8 or so), but as soon as a Linux node (embedded or full) comes online, it would be good to be able to transfer all the client node connections to the more capable hub. Maybe the lesser hub issues a Drop-Link to all the clients so they know to go look for another hub.

Alex

balazsracz commented 5 years ago

I don't have a solution identified for loop avoidance at this point. Possible loops are also a problem when we do CAN over TCP (via gridconnect), and I would like to have a solution that works for both use-cases and also across mixed CAN+TCP networks.

We also need to be aware of embedded environments, so we should invest the minimum necessary complexity to support our use-case.

Our use-case is loop avoidance, and I don't think we need to take bandwidth expansion etc. into question. One specific question however that we ought to treat -- now or later -- is the discovery and establishment of links based on mDNS.

kiwi64ajs commented 5 years ago

Hi Balazs,

On 24/03/2019, at 1:00 PM, Balazs Racz notifications@github.com wrote:

I don't have a solution identified for loop avoidance at this point.

Ok, from Stuart’s comments it sounded like you did and I might have missed it.

Possible loops are also a problem when we do CAN over TCP (via gridconnect), and I would like to have a solution that works for both use-cases and also across mixed CAN+TCP networks.

Yeah, because we don’t have any notion of Segment ID or anything that would give us a hint that the packet came from “that” CAN segment so don’t send it back, we’re kinda left with just the knowledge of the gateway’s Node ID. During discovery I guess we could notice when the same Node-Ids are heard from multiple Gateways and mark that in our routing tables.
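A minimal sketch of that bookkeeping, assuming a hypothetical `GatewayTable` class and made-up string IDs (none of these names come from the OpenLCB documents): record which gateway each Node ID was heard through, and flag any node that shows up behind more than one gateway as a possible loop.

```python
from collections import defaultdict


class GatewayTable:
    """Hypothetical table of which gateways each Node ID was heard through."""

    def __init__(self):
        self._heard_via = defaultdict(set)

    def heard(self, node_id, gateway_id):
        """Record that node_id was seen in traffic arriving via gateway_id."""
        self._heard_via[node_id].add(gateway_id)

    def multi_path_nodes(self):
        """Return {node_id: sorted gateway IDs} for nodes heard via 2+ gateways."""
        return {
            node: sorted(gateways)
            for node, gateways in self._heard_via.items()
            if len(gateways) > 1
        }


table = GatewayTable()
table.heard("node-1", "gw-A")
table.heard("node-2", "gw-A")
table.heard("node-1", "gw-B")  # same node, second gateway: possible loop
print(table.multi_path_nodes())
```

This only marks suspects; it says nothing yet about which link should be dropped.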

We also need to be aware of embedded environments, so we should invest the minimum necessary complexity to support our use-case.

Yes, only enough complexity but no more.

Our use-case is loop avoidance, and I don't think we need to take bandwidth expansion etc. into question.

Agreed, but just wanted to confirm that. In my reading yesterday I also came across references to Mesh Networking and handling of Routing and Path Traversal, which is probably more complicated than OpenLCB but might offer some useful insights.

The trick with this is to leave room for expansion or future extension - as they did with STP and then RSTP and then the subsequent other layers that still offered a measure of backwards compatibility. Hopefully for a while our Ethernet Hubs will be RPi class devices that can be changed easily as we perfect the logic and protocol.

One specific question however that we ought to treat -- now or later -- is the discovery and establishment of links based on mDNS.

Yes, there are comments in the Tech Note about the mDNS stuff and they say something like “some of this should be moved to the Specification” so we need to do that.

Alex

RobertPHeller commented 5 years ago

OK, it has been about 3 months since the last comment. Has anyone come up with a really good reason to keep the chaining bit? Or any further thoughts?

bobjacobsen commented 1 year ago

The original concept wasn't just loop detection per se, but loop diagnosis: When a loop is detected, what do you tell the non-technical model railroader who's trying to get his LCC setup to work? "The loop is from A to D to G and back to A" gives them information that will be useful when they try to fix it.

Even when there's a CAN segment involved, you can still get useful information: The last CAN segment that was involved. "The loop starts on the CAN segment attached to A, goes through B and goes to C where there's another CAN connection". Yes, really pathological cases like two CAN segments separated by multiple routers won't provide complete information, but at least a starting point is provided. And a general "Only attach one network link on a CAN segment" will get you a long way with this.

Is there another approach to loop detection that allows a single router/gateway to provide that information? That can be built in a low-end piece of hardware?

balazsracz commented 1 year ago

I don't think chaining would detect any loops in the presence of only one CAN segment.

IIUC this is how loop detection would work with chaining:

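def receive_packet(self, packet):
    # Walk every chained prefix header carried by the packet.
    for h in unchain_headers(packet):
        if h.sender_gateway == self.node_id:
            # Our own ID is already in the chain: the packet came back to us.
            raise RuntimeError("openlcb loop detected")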

This would then run on every incoming packet, and it immediately detects a created loop after the first packet has made it around.

The problem is that when there is just one CAN link in the loop, then any outgoing packet would have the entry with "self.node_id" stripped at the time it passes the CAN segment. So the loop detection never triggers. It just happily loops around. Then if it doesn't trigger it will also not determine where the loop is, which limits the diagnostic benefits.
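The difference between the two cases can be simulated in a few lines. This is only a sketch under stated assumptions: the chain is modelled as a plain list of gateway IDs, and `Packet`, `tcp_forward`, `can_forward`, `circulate` and the `CAN` marker are hypothetical names, not part of any OpenLCB stack.

```python
from dataclasses import dataclass, field

CAN = object()  # hypothetical marker for a CAN segment in a hop list


class LoopDetected(Exception):
    def __init__(self, path):
        super().__init__("openlcb loop detected via " + " -> ".join(path))
        self.path = path


@dataclass
class Packet:
    payload: str
    chain: list = field(default_factory=list)  # gateway IDs seen so far


def tcp_forward(gateway_id, packet):
    """TCP gateway: check the chain for our own ID, then append it."""
    if gateway_id in packet.chain:
        raise LoopDetected(packet.chain)
    return Packet(packet.payload, packet.chain + [gateway_id])


def can_forward(packet):
    """CAN segment: the chained headers do not survive the crossing."""
    return Packet(packet.payload, [])


def circulate(hops, laps=3):
    """Send one packet around the given ring of hops a few times."""
    packet = Packet("event")
    for _ in range(laps):
        for hop in hops:
            packet = can_forward(packet) if hop is CAN else tcp_forward(hop, packet)
    return packet


try:
    circulate(["A", "B", "C"])  # all-TCP ring: A sees itself on lap two
except LoopDetected as exc:
    print(exc)

# One CAN segment in the ring: the chain is wiped every lap, nothing triggers.
print(circulate(["A", CAN, "B"]).chain)
```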

IP solves the diagnostic + loop routing problem in a simpler way (TTL), which also would not work in the presence of CAN segments. But TTL can be used for traceroute and it's simpler logic than continuously wrapping and unwrapping packets. The wrapping/unwrapping is also expensive on the CPU, which in the presence of embedded controllers with WiFi is actually something we should consider.

I am not aware of a loop detection that can discover a loop on every packet and works on CAN. All ideas I can think of rely on a specific operation happening on demand, i.e., when a link is established. As an example, sending out the "node initialization complete" message through the new link and waiting for it to (not) come back on a different link would work. The difficulty is knowing the timeout to wait for. We can also define a new message for spanning tree / loop detection, which would contain a unique ID of the gateway node and a 2-byte link ID from the gateway. This would fit on CAN. We could even think about adding a TTL to it.
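A rough sketch of such a probe message, assuming a 6-byte gateway Node ID and an 8-byte CAN data field, so the gateway ID plus a 2-byte link ID fill the payload exactly. The byte layout, `encode_probe`, `decode_probe` and the `Gateway` class are illustrative assumptions, not a proposed wire format, and the TTL is left out because under these assumptions there is no payload byte left for it.

```python
def encode_probe(gateway_id: int, link_id: int) -> bytes:
    """Pack a 48-bit gateway ID and a 16-bit link ID into 8 bytes."""
    if not 0 <= gateway_id < 1 << 48:
        raise ValueError("gateway_id must fit in 6 bytes")
    if not 0 <= link_id < 1 << 16:
        raise ValueError("link_id must fit in 2 bytes")
    return gateway_id.to_bytes(6, "big") + link_id.to_bytes(2, "big")


def decode_probe(data: bytes):
    """Inverse of encode_probe: return (gateway_id, link_id)."""
    if len(data) != 8:
        raise ValueError("probe payload must be exactly 8 bytes")
    return int.from_bytes(data[:6], "big"), int.from_bytes(data[6:], "big")


class Gateway:
    def __init__(self, gateway_id: int):
        self.gateway_id = gateway_id

    def probe_for(self, link_id: int) -> bytes:
        """Payload to send out on a newly established link."""
        return encode_probe(self.gateway_id, link_id)

    def on_probe(self, data: bytes, arrived_on_link: int):
        """Return (sent_link, arrived_link) if our own probe came back
        on a different link, which means those two links close a loop."""
        gateway_id, sent_link = decode_probe(data)
        if gateway_id == self.gateway_id and sent_link != arrived_on_link:
            return (sent_link, arrived_on_link)
        return None


gw = Gateway(0x050101011234)   # made-up Node ID
probe = gw.probe_for(2)        # sent out on new link 2
print(len(probe), gw.on_probe(probe, arrived_on_link=1))
```

Because the link pair is reported, the gateway that detects the loop can at least name the two connections involved.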

bobjacobsen commented 1 year ago

I’m not wedded to chaining. But I am interested in making sure that the average model railroader can figure out what went wrong when a loop is created. That means some way of identifying the network components that take part in the loop, not just detecting that the loop exists.

I agree that sending a Node Initialized message across each new link will detect loops when they happen, even with CAN present. Timeouts are not an issue: If it comes back, there’s a loop, and you’ve detected it, even if some time has gone by. (But not every gateway/router will detect the loop; likely only the last one to join the loop will detect it.) A well-known automatically routed event could also be used, which has some advantages of simplicity.

After detection, the question is: what do you tell the user about how to fix it?

1) If it’s an all-CAN loop created by CAN-CAN gateways, there’s not much to say. The user (hopefully) knows he’s got no TCP segments, and therefore needs to look through how his CAN segments are connected. In general, he just has to disconnect the last gateway/router he connected. Everything else should continue to be connected through the looping CAN segments.

2) If there’s only one CAN link (and generally the user should know that) and the loop is through that, chaining will tell you which gateway(s)/router(s) are causing it, but you really don’t need that because you solve it by putting only one gateway/router on the CAN network.

3) If the loop is entirely in the TCP segments, chaining will tell you that. Since you can’t see their links physically, and there’s been a desire for automatic connection of TCP-CAN gateways/routers, some kind of diagnostics for that case would be really good.

4) If there are multiple CAN links present and more than one is involved, you’ll get some TCP node(s) in the loop with CAN at each end. But that’s enough to get started: Look at those CAN segments and find one that has two gateways/routers on it and remove all but one.

(3) is really the worrisome case. There really needs to be something concrete that will allow the user to deal with it.

As an alternative to chaining every message, you could perhaps create a specific message type that gets chained. Sort of a forward-going traceroute, where you send this out and look for it to return with the routing appended. Properly defined, that could work over CAN too. And it’s not much of a load: When you see messages of this type go by, look for the one with the “last” bit set. Set that one to “middle”, and then send yours with “last” set. That helps with case (1) and I think handles the rest too.
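The “last”/“middle” handling could look roughly like the sketch below. It assumes each trace message carries one gateway ID and one position flag; `TraceEntry`, `forward_trace` and `loop_path` are hypothetical names, and the list stands in for a sequence of separate messages on the wire.

```python
from dataclasses import dataclass, replace

MIDDLE, LAST = "middle", "last"


@dataclass(frozen=True)
class TraceEntry:
    gateway_id: str
    position: str  # MIDDLE or LAST


def forward_trace(entries, my_id):
    """Demote the current 'last' entry to 'middle', then append ours as 'last'."""
    out = [replace(e, position=MIDDLE) if e.position == LAST else e for e in entries]
    out.append(TraceEntry(my_id, LAST))
    return out


def loop_path(entries, my_id):
    """If my_id is already in the trace, return the route around the loop."""
    ids = [e.gateway_id for e in entries]
    if my_id in ids:
        return ids[ids.index(my_id):] + [my_id]
    return None


trace = []
for gateway in ["A", "D", "G"]:
    trace = forward_trace(trace, gateway)

# The trace returns to A, which can now report the route to the user.
print(" -> ".join(loop_path(trace, "A")))
```

The returned list is the kind of "from A to D to G and back to A" message that could be shown to the user.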

As an alternative to having diagnostic information, would it be sufficient to just refuse to create loops? When a gateway/router detects that it’s part of a loop, it takes the looping connection offline and lights a light. Something like that would be needed in any case to be able to read out diagnostic information. The problem with totally relying on this, instead of giving the user information to permanently remove the loop, is that the network will likely come up in different topologies each time it’s brought up. That doesn’t seem desirable. And it’s hard to make sure that only one link is cut, which is necessary to avoid partitioning.