mavlink / qgroundcontrol

Cross-platform ground control station for drones (Android, iOS, Mac OS, Linux, Windows)
http://qgroundcontrol.io
3.3k stars 3.62k forks source link

Erroneous packet loss estimation with multiple MAVLink components. #6940

Open edwinhayes opened 6 years ago

edwinhayes commented 6 years ago

Our aircraft has quite a few different MAVLink components.

While there are multiple components talking, the MAVLink message loss rate reported by QGroundControl is very high. If I have [an intermediary MAVLink router] filter out messages that aren't from the flight controller, then once there is only one component talking, the loss rate drops to zero.

Since the link in question is UDP over a physical 10/100 ethernet cable, I'm pretty sure there aren't many messages being dropped - since the flight controller is significantly more talkative than all of our other components combined, I don't think it's traffic volume related. I suspect the reported packet loss is because the packet loss estimation isn't taking into account that different MAVLink nodes maintain the own individual sequence numbers?

Running daily build, looks like 06-09-2018, and latest beta Arducopter (although stable Arducopter v3.5.7 was the same, so probably unrelated).

dogmaphobic commented 6 years ago

It seems you are repackaging the messages instead of bridging them unchanged. In doing so, you are altering the message sequence, which is what QGC used to look for "gaps". When bridging MAVLink messages, these messages should be forwarded unchanged.

dogmaphobic commented 6 years ago

Also make sure not to re-use component IDs. Each component should have its own component ID. If your "bridge" sends messages of its own, it should use a unique compID. The same applies for anything else that sends MAVLink messages within the same sysID.

edwinhayes commented 6 years ago

As far as I am aware, we are bridging the messages; if not, that would be a bug which has appeared in our routing software at some indeterminate point. That actually is possible, since until recently we've been filtering out messages quite aggressively to obtain more sensible user experience when using Mission Planner, so we might not have noticed. Are you able to confirm that QGroundControl definitely behaves correctly with multiple components at the moment?

edwinhayes commented 6 years ago

In a cursory test, if I save a telemetry log in QGroundControl, and load it into Mission Planner, the playback shows multiple component IDs, and no significant packet loss. That implies that at least the source address isn't being repackaged, and I think also the sequence number?

On the other hand, I've discovered that our internal log file manipulation utility can't parse the QGroundControl TLOG files for some reason, so I'll have to follow up on that!

dogmaphobic commented 6 years ago

The sequence number is stored within the mavlink_status_t structure. There is one instance of this structure per channel. Depending on how you implemented MAVLink support, the declaration of this structure could also be duplicated (where you end up with sequence numbers being stored in two or more places, leading to wrong sequence and other issues).

The most common error is when messages are being processed and forwarded by a repeater of some sort. This repeater would receive and parse a given message and instead of forwarding the original data buffer unchanged, it repackages it and forwards it to down the line. This will cause the message header to be rebuilt (through the ultimate use of mavlink_finalize_message_chan()), using a different sequence number.

Another error is when a component sends messages using some other component's compID. Say you have a repeater/bridge, which may send its own messages (using its own compID) but it also sends messages on behalf of other components using those other components' compIDs. In that case, it would be generating a message using a locally defined sequence counter that would not match the actual sequence in place within the original component's MAVLink instance.

The less likely error is having multiple instances of mavlink_status_t (say, by including multiple instances of mavlink_helpers.h or one of its parents.)

I've seen many cases of all of the 3 errors above. With their frequency in the order I listed. The first two errors would all mostly work fine with the exception of keeping track of the sequence numbers. The third case would cause more obvious issues.

QGC's internal TLOG files contain the entire MAVLink stream (every message received) with a time stamp header inserted right before each message.

edwinhayes commented 6 years ago

Yep, I understand. Our routing software should be forwarding the messages unchanged; messages which are coming in through one instance of router look fine - I can view them and the sequence number persists across each individual cID as expected.

So if there is an issue on my side, it would have to be that for some reason, the messages on the outgoing side of the router are for some reason being packaged differently.

This certainly didn't used to be the case, so I'm looking to see if something has changed - could be related to changes made to support MAVLink V2.0 I guess. Regardless, it's worrying and I'm investigating, but it's non-trivial because if it is an issue, it's apparently not visible in my usual diagnostic utility.

hamishwillee commented 6 years ago

@dogmaphobic The rule about not modifying packages is captured at the end of the routing doc overview here:

Forwarded messages must not be changed/repackaged by the forwarding system (the original message is passed to the new link).

To fix the multiple instances error, shouldn't we add #pragma once or include guards to the C library?

If not, what would you add as a note to the Using C library docs?

edwinhayes commented 6 years ago

Ok, a quick test: on my machine MAVROS sits behind a router instance, just the same as QGS sits behind a [different] instance. Looks like MAVROS sees the sequence numbers of messages from different component IDs as persisting individually as well, so I think that implies my router is behaving normally.

edwinhayes commented 6 years ago

Having just updated to the latest (OSX) daily build, I looking at this again.

QGC receives heartbeats from non-flight controller components fine: the lost message count doesn't go up, and the MAVLink inspector shows the heartbeat message frequency to be faster than you'd expect from the flight controller alone.

Then if I have my companion computer send a single MAVLink status text message, QGC receives the message fine, but the number of lost messages increases by 254.