[Feature Request]: add indication of missing hops (no shared channel) to traceroute responses

ianmcorvidae commented 1 month ago

Platform

NRF52, ESP32, RP2040, Linux Native

Description

(this would presumably also require a corresponding protobuf change)

Right now, it's possible to get a somewhat misleading traceroute result if the outbound traceroute goes via a node that doesn't, or can't, decode the message. Since that node (or nodes) can't decode the message, it also can't add itself to the RouteDiscovery, so its hop is excluded from the final result.

While we can't know the identity of these nodes, with the existence of hopStart we could be aware that some number of hops are not represented in the returned value and let the requesting client know.

For a path like this:

A -> B -> C (can't/doesn't decode message) -> D -> E

the traceroute would show B and D only, but if the hop limit was 3, then E will see the message with hopStart of 3 and hopLimit of 0 and could see that there's one missing hop in the traceroute.

I'd propose we add either the count of total hops (in the example, 3) or the count of missing hops (in the example, 1) to the RouteDiscovery. Adding the missing hops would probably usually be 0, keeping the size of the packet a little smaller, while adding the total hops would mean the field is present in any traceroute that isn't to a direct neighbor, but would allow distinguishing between a traceroute where E is on older firmware that excludes the new field from a traceroute where E is actually aware of all hops.

EDIT: After some discussion below, it seems like a good option here might be to have firmware detect missing hops via this method at each hop, and if any node detects a missing link, add a dummy hop to the existing structure with a nodeid of 0xFFFFFFFF. Existing clients will simply see it as an unknown nodes, but anyone who wants to can understand that value to mean a node detected missing hops instead.

GUVWAF commented 1 month ago

I'm not entirely sure why it needs to be added to the RouteDiscovery if you can already deduce it from the information in the MeshPacket (hopStart - hopLimit)?

ianmcorvidae commented 1 month ago

Not back at node A -- this is proposing sending back the data derived from hopStart / hopLimit of the outbound traceroute, not of the reply. The traceroute from A to E may have one missing link, but the packet getting back from E to A after the traceroute has completed could go a totally different way (and often will, from what I've seen), so the hopStart and hopLimit of the packet when it gets back to A aren't meaningful in interpreting the path

GUVWAF commented 1 month ago

I see, makes sense.

So there's already a ticket on the 3.0 board to also add the route back to the traceroute, and the SNR per link. I first wanted it to do for 3.0 only, because it would give wrong information if not all nodes update. However, since we now can derive the actual hops it took, we could at least show the number of unknown hops on the way back. In this way we can gradually introduce it (and maybe don't show it yet in the phone clients). Downside is that it increases the packet size, but currently a traceroute is pretty small and I think SNR can be encoded with 1 byte. What do you think?

ianmcorvidae commented 1 month ago

That seems neat too, though I wasn't even thinking of reverse traceroutes -- just detecting missing hops on the current single-direction ones (the reverse direction being mentioned in my prior response because, since it can take a different route, it can't be used for calculating missing hops in the original traceroute)

ianmcorvidae commented 1 month ago

It also occurs to me that my original request might be better served by each node along the way calculating missing hops, since it would allow more information about where exactly the missing hops are.

I might try to draw up a diagram with the various pieces of information being proposed in different places to clarify what exactly we're all talking about :sweat_smile:

ianmcorvidae commented 1 month ago

First, how traces currently work: current-trace

A does not know (and can't know, with the information it has on the response) about the missing hop (C) between B & D

As proposed here: update-trace

Here, node E detects the missing hop and adds it to the RouteDiscovery, so A knows that there was a missing hop somewhere on the traceroute.

An even better version, maybe: all-hops-trace

Here, node D (the first node after the missing hop) detects the missing hop and notes it somewhere associated with itself in the RouteDiscovery. Node E doesn't need to add anything, because all hops are accounted for, and node A knows that there is a missing hop and that it's somewhere before node D (with all updated firmware, immediately before D).

If the SNR per link is added, these missing hops would presumably be added in a similar place in this last option, since they're per-hop information. Another thing that could be added per-hop is whether a given hop was via MQTT or not, so traceroutes could identify MQTT gateway nodes.

ianmcorvidae commented 1 month ago

Hopefully those diagrams are helpful at all, they're pretty slapped-together obviously. The core observation is that nodes on the outbound path are able to detect if the trace went by way of any nodes that couldn't add themselves to the list, but the originating node can't know that after the traceroute returns unless the outbound nodes or traceroute destination add it to the payload in some way, particularly since the response might go via a different path. SNR and MQTT are similarly things that can't be determined by the originating node unless they're added by the outbound-path nodes.

Bidirectional traceroutes would be another addition to this, so that nodes F and G are also eventually included in the list, but that's likely a bigger change.

ianmcorvidae commented 1 month ago

The other complication is whether to add a "total hops" indicator or just a "missing hops" one.

If it's "total hops", then the field would always be present with updated firmware, which would make it easy to tell if nodes D or E are adding information or not, in the proposed changes. But it'd increase packet size. Having only a "missing hops" field would mean that each missing hop would only need to be accounted for once (by D or E, in the example), but if a node doesn't have updated firmware and doesn't add the indication, it would appear that there were no missing hops, even if there actually were any.

My feeling is that adding only missing-hops information is better since nodes tend to eventually update and folks are used to (or at least somewhat used to) the current behavior where those hops are totally absent already anyway.

noon92 commented 1 month ago

Adding SNR for each hop would greatly increase the ability to troubleshoot unstable connections.

GUVWAF commented 3 weeks ago

@ianmcorvidae Thanks for the detailed explanation. So what if a node that detects there is a missing hop adds a dummy node number 0xFFFFFF to the route? Then we don't need to add a new field and it might be more efficient (depends a bit on the overhead of the repeated protobuf field versus a new field). I think this will immediately work for apps as well, because they don't know a node with node number 0xFFFFFF.

ianmcorvidae commented 3 weeks ago

That does seem like a good way to do it, and it's easily detectable by clients that want to mark it as a missing hop (vs. a known hop from an unknown node). I guess there remains the potential for the missing hops to get added the wrong place (in my example, if D has too-old firmware, E would add the missing hop after D rather than before), but that's something that would correct itself over time and it's still better information than we currently have!

The other improvements like bidirectional traceroutes and SNR would still be good of course, but I like that idea for covering just the original topic of this ticket.

GUVWAF commented 1 week ago

Implemented in the firmware by #4056.

meshtastic / firmware

[Feature Request]: add indication of missing hops (no shared channel) to traceroute responses #3942

Platform

Description