mavlink / MAVSDK

API and library for MAVLink compatible systems written in C++17
https://mavsdk.mavlink.io
BSD 3-Clause "New" or "Revised" License

Log download via UDP over ethernet stalls #2091

Open · nickatblueflite opened this issue 1 year ago

nickatblueflite commented 1 year ago

Hardware Setup: Holybro 6X <- eth -> Jetson Xavier

Using the code out of the logfile_download.py example, I have no trouble downloading the full set of files with a serial connection.

However, when connecting over an eth connection, downloads start fine but then always stall and never recover.

In fact, the "connection" goes down fully, and the only way to reconnect is to reboot the pix. I've run the mavsdk_server manually to watch its output, and the only thing that happens is that the "heartbeats timed out" message appears and then nothing.

The connection seems stable otherwise and works well when using mavlink-router to forward to a GCS over the network.
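
For context, a minimal sketch of the two connection modes being compared, assuming the MAVSDK Python API and the device/port mentioned later in this thread:

import asyncio
from mavsdk import System

async def main():
    drone = System()
    # Works reliably over USB serial:
    # await drone.connect(system_address="serial:///dev/ttyACM0")
    # Stalls mid-download over ethernet:
    await drone.connect(system_address="udp://:10001")

asyncio.run(main())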

nickatblueflite commented 1 year ago

I'm not sure if this is relevant, but looking at the mavlink status (with a vanilla mavsdk connection only) on the autopilot, there seems to be a lot of rx loss on the eth connection. I don't know exactly what this means or how to interpret it:

instance #1:
    mavlink chan: #1
    type:        GENERIC LINK OR RADIO
    flow control: OFF
    rates:
      tx: 988.0 B/s
      txerr: 0.0 B/s
      tx rate mult: 1.000
      tx rate max: 100000 B/s
      rx: 21.0 B/s
      rx loss: 54.4%
    Received Messages:
      sysid:205, compid:205, Total: 56 (lost: 3044)
        msgid:    4, Rate:  0.0 Hz, last 4.27s ago
        msgid:    0, Rate:  1.0 Hz, last 0.31s ago
    FTP enabled: YES, TX enabled: YES
    mode: Normal
    MAVLink version: 2
    transport protocol: UDP (10001, remote port: 10001)
    Broadcast enabled: YES
    ping statistics:
      last: 0.88 ms
      mean: 0.89 ms
      max: 1.15 ms
      min: 0.81 ms
      dropped packets: 0
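
For what it's worth, "rx loss" is derived from MAVLink's per-packet sequence counter; a rough sketch of the bookkeeping (an assumption about the mechanism, not PX4's actual source):

def update_rx_loss(prev_seq, seq, lost, received):
    # MAVLink's seq field is a uint8 that increments by 1 per packet and wraps.
    expected = (prev_seq + 1) & 0xFF
    lost += (seq - expected) & 0xFF   # packets that apparently never arrived
    received += 1
    return lost, received

# rx loss % ~= 100 * lost / (lost + received)
# 54.4% here means the autopilot thinks roughly half the companion's packets
# never arrive -- or, per the comment further down, that the sender's
# sequence numbers are simply not monotonic.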

nickatblueflite commented 1 year ago

Here is the mavlink status when running the connection over ethernet through mavlink-router to QGC, no rx loss:

instance #1:
    GCS heartbeat valid
    mavlink chan: #1
    type:        GENERIC LINK OR RADIO
    flow control: OFF
    rates:
      tx: 983.9 B/s
      txerr: 0.0 B/s
      tx rate mult: 1.000
      tx rate max: 100000 B/s
      rx: 21.0 B/s
      rx loss: 0.0%
    Received Messages:
      sysid:255, compid:190, Total: 80 (lost: 0)
        msgid:  126, Rate:  0.0 Hz, last 0.02s ago
        msgid:    4, Rate:  0.1 Hz, last 3.54s ago
        msgid:    0, Rate:  1.0 Hz, last 0.59s ago
    FTP enabled: YES, TX enabled: YES
    mode: Normal
    MAVLink version: 2
    transport protocol: UDP (10001, remote port: 10001)
    Broadcast enabled: YES
    ping statistics:
      last: 2.11 ms
      mean: 2.14 ms
      max: 2.37 ms
      min: 1.76 ms
      dropped packets: 0

julianoes commented 1 year ago

Can you paste the output of mavsdk_server? Mostly so I can see which version you're running...

And thanks for creating the issue.

nickatblueflite commented 1 year ago
[11:42:08|Info ] MAVSDK version: v1.4.16 (mavsdk_impl.cpp:20)
[11:42:08|Info ] Waiting to discover system on udp://:10001... (connection_initiator.h:20)
[11:42:09|Info ] New system on: 10.223.129.102:10001 (with sysid: 1) (udp_connection.cpp:192)
[11:42:09|Debug] New: System ID: 1 Comp ID: 1 (mavsdk_impl.cpp:496)
[11:42:09|Debug] Component Autopilot (1) added. (system_impl.cpp:377)
[11:42:09|Debug] Discovered 1 component(s) (system_impl.cpp:578)
[11:42:09|Info ] System discovered (connection_initiator.h:63)
[11:42:09|Info ] Server started (grpc_server.cpp:53)
[11:42:09|Info ] Server set to listen on 0.0.0.0:50051 (grpc_server.cpp:54)
(this is when the log download code starts)
[11:42:27|Debug] Received all entries (log_files_impl.cpp:169)
[11:42:28|Debug] Received all entries (log_files_impl.cpp:169)
[11:42:28|Debug] 46080 B of 10106446 B (630.742 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 92160 B of 10106446 B (672.571 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 138240 B of 10106446 B (676.016 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 184320 B of 10106446 B (672.264 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 230400 B of 10106446 B (667.227 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 276480 B of 10106446 B (667.346 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 322560 B of 10106446 B (666.647 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 368640 B of 10106446 B (658.622 kiB/s) (log_files_impl.cpp:424)
...
[11:42:42|Debug] 9722880 B of 10106446 B (646.863 kiB/s) (log_files_impl.cpp:424)
[11:42:42|Debug] 9768960 B of 10106446 B (646.869 kiB/s) (log_files_impl.cpp:424)
[11:42:42|Debug] 9815040 B of 10106446 B (647.01 kiB/s) (log_files_impl.cpp:424)
[11:42:42|Debug] 9861120 B of 10106446 B (647.023 kiB/s) (log_files_impl.cpp:424)
[11:42:43|Debug] 9907200 B of 10106446 B (647.12 kiB/s) (log_files_impl.cpp:424)
[11:42:43|Debug] 9953280 B of 10106446 B (638.553 kiB/s) (log_files_impl.cpp:424)
[11:42:46|Info ] heartbeats timed out (system_impl.cpp:272)

Nothing else ever prints after this.

julianoes commented 1 year ago

Interesting! Thanks.

Note, I have this on my todo list, and it's definitely something I want to look into because it could be a bug on the MAVSDK side, and those I always want to get rid of. However, I can't give you an exact date for when I'll get to it; I'm hoping for next week.

If this is an urgent or blocking issue for you, consider sponsoring me to move the task up my todo list.

shiftynick commented 1 year ago

Here is a little more info:

Scenarios that have this failure mode:

pix -> UDP (eth) -> Jetson -> mavsdk connect to udp://:10001 (our port setup on the pix)

pix -> UDP (eth) -> Jetson -> mavlink-router connect to udp://:10001 (our port setup on the pix) -> mavsdk connect to mavlink-router udp://:10002 (our port setup for mavlink-router)

pix -> USB (serial) -> Jetson -> mavlink-router connect to serial:///dev/ttyACM0 -> mavsdk connect to mavlink-router udp://:10002 (our port setup for mavlink-router)

Scenarios that work:

pix -> USB (serial) -> Jetson -> mavsdk connect to serial:///dev/ttyACM0

So it turns out the issue still manifests even without the eth connection in the loop. If we connect the pix to the Jetson via USB (serial) but run mavlink-router between it and MAVSDK, we get the same issue. So perhaps it's UDP related?

julianoes commented 1 year ago

I'm trying to reproduce this and so far I've set up:

For the last case, this is my mavlink-router config:

[General]
DebugLogLevel = Debug

[UartEndpoint bob]
Device = /dev/ttyACM0
Baud = 921600

# send to 127.0.0.1:11000
[UdpEndpoint charlie]
Mode = Normal
Address = 127.0.0.1
Port = 14540
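
For completeness, a config file like this is typically passed to the router at startup (assuming mavlink-routerd is installed; the path is a placeholder):

mavlink-routerd -c /path/to/main.conf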

In order to connect over USB I had to connect via QGC first, so that MAVLink on USB is started; then I switched over to mavlink-router.

The files I downloaded were between 5MB and 24MB.

So, I unfortunately can't reproduce your problem. I suggest that you give me more details on how I could reproduce this. I don't have a Jetson though.

And one more thought, could it have to do with latency and a timeout being triggered?

there seems to be a lot of rx loss on the eth connection.

My guess is that this is just MAVSDK not sending the sequence numbers correctly which makes the stats misleading. See https://github.com/mavlink/MAVSDK/issues/1467.

julianoes commented 1 year ago

I also tried the log file download integration test with low timeouts and lossy connections and both still worked. :thinking:

shiftynick commented 1 year ago

Interestingly I'm having success with the USB scenario today, not immediately clear why.

My previous testing was on 1.13, I moved up to 1.14beta to see if that would resolve the issues over eth only.

So far I still have the same issue. Here is as much info as I can think to include for my exact setup:

PX4 netman config:

nsh> netman show
DEVICE=eth0
BOOTPROTO=static
NETMASK=255.255.255.0
IPADDR=10.223.129.102
ROUTER=10.223.129.1
DNS=10.223.129.1

PX4 MAV_2 config:

MAV_2_BROADCAST = Always Broadcast
MAV_2_CONFIG = Ethernet
MAV_2_FLOW_CTRL = Auto-detected
MAV_2_FORWARD = Disabled
MAV_2_MODE = Normal
MAV_2_RATE = 100000
MAV_2_REMOTE_PRT = 14540
MAV_2_UDP_PORT = 14540

Jetson IP config:

ip address flush dev eth0
ip route flush dev eth0
ip address add 10.223.129.101/24 brd + dev eth0
ip route add 10.223.129.1 dev eth0

mavsdk_server command:

sudo /usr/local/lib/python3.11/dist-packages/mavsdk/bin/mavsdk_server udp://:14540 -p 50204 --sysid 204 --compid 204

test.py:

#!/usr/bin/env python3

import asyncio
from mavsdk import System
import sys

async def run():
    drone = System(mavsdk_server_address="localhost", port=50204, sysid=204, compid=204)
    await drone.connect()

    print("Waiting for drone to connect...")
    async for state in drone.core.connection_state():
        if state.is_connected:
            print(f"-- Connected to drone!")
            break

    entries = await get_entries(drone)
    for entry in entries:
        await download_log(drone, entry)

async def download_log(drone, entry):
    date_without_colon = entry.date.replace(":", "-")
    filename = f"./log-{date_without_colon}.ulog"
    print(f"Downloading: log {entry.id} from {entry.date} to {filename}")
    previous_progress = -1
    async for progress in drone.log_files.download_log_file(entry, filename):
        new_progress = round(progress.progress*100)
        if new_progress != previous_progress:
            sys.stdout.write(f"\r{new_progress} %")
            sys.stdout.flush()
            previous_progress = new_progress
    print()

async def get_entries(drone):
    entries = await drone.log_files.get_entries()
    for entry in entries:
        print(f"Log {entry.id} from {entry.date}")
    return entries

if __name__ == "__main__":
    # Run the asyncio loop
    asyncio.run(run())

Maybe something sticks out?

julianoes commented 1 year ago

One thing to try out would be to launch mavsdk_server using gdb and then press Ctrl+C when it stalls and then do thread apply all bt to check where it hangs. I expect that we won't see too much though because it's a release build.
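
Concretely, the suggested gdb session would look something like this (a sketch; the mavsdk_server path and connection URL depend on your install):

gdb --args mavsdk_server udp://:14540
(gdb) run
(wait for the stall, then press Ctrl+C)
(gdb) thread apply all bt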

Edit: I'm just creating a debug build and will attach it here as soon as it's done.
Edit 2: The build is taking longer; I need to fix some build dependencies with Python future first. I'll continue tomorrow.

shiftynick commented 1 year ago

thread apply all bt is not returning anything for me, but running bt on the current thread gives:

(gdb) bt
#0  0x0000fffff7acdb28 in syscall ()
#1  0x0000fffff7865a08 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#2  0x0000fffff786590c in AbslInternalPerThreadSemWait_lts_20210324 ()
#3  0x0000fffff7865368 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#4  0x0000fffff7837ee0 in gpr_cv_wait ()
#5  0x0000fffff74f4418 in grpc::Server::Wait() ()
#6  0x0000fffff6fdeac4 in main ()

It really feels like MAVSDK is just stuck listening, and that maybe the problem is on the PX4 side?

The only way to recover after this is to reboot the pix.

julianoes commented 1 year ago

The only way to recover after this is to reboot the pix.

Surely restarting mavsdk_server would fix it?

Oh, the Pixhawk is stuck. Interesting, I wish I could reproduce it.

julianoes commented 1 year ago

mavsdk_server-debug.zip

This is an aarch64 musl debug build that works on my RPi4 (64-bit).

julianoes commented 1 year ago

Ok, I've copied your commit and settings exactly on the Pixhawk and RPi4, and it downloads my 4 test logs fine (although I get a warning about "Ignoring wrong offset (log_files_impl.cpp:365)").

julianoes commented 1 year ago

Would be interesting to get the output of things like top, uorb top, mavlink status, etc. if you happen to have console access to the Pixhawk. At least mavlink status would work just over MAVLink shell.

shiftynick commented 1 year ago

Ok, I've copied your commit and settings exactly on the Pixhawk and RPi4, and it downloads my 4 test logs fine (although I get a warning about "Ignoring wrong offset (log_files_impl.cpp:365)").

I do get lots of those warnings as well.

shiftynick commented 1 year ago

mavsdk_server-debug.zip

This is an aarch64 musl debug build that works on my RPi4 (64-bit).

This build works on the Jetson as well. I get a little more info from the bt, but still nothing from thread apply all bt:

[05:15:31|Debug] 875520 B of 1176816 B (523.409 kiB/s) (log_files_impl.cpp:424)
[05:15:31|Error] Ignoring wrong offset (log_files_impl.cpp:365)
[05:15:31|Debug] 921600 B of 1176816 B (520.279 kiB/s) (log_files_impl.cpp:424)
[05:15:31|Error] Ignoring wrong offset (log_files_impl.cpp:365)
[05:15:31|Debug] 967680 B of 1176816 B (481.654 kiB/s) (log_files_impl.cpp:424)
[05:15:31|Error] Ignoring wrong offset (log_files_impl.cpp:365)
[05:15:34|Info ] heartbeats timed out (system_impl.cpp:272)
^C
Thread 1 "mavsdk_server2" received signal SIGINT, Interrupt.
0x0000fffff767a000 in syscall ()
(gdb) bt
#0  0x0000fffff767a000 in syscall ()
#1  0x0000fffff7400600 in absl::lts_20210324::synchronization_internal::FutexImpl::WaitUntil(std::atomic<int>*, int, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#2  0x0000fffff7400350 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#3  0x0000fffff7400034 in AbslInternalPerThreadSemWait_lts_20210324 ()
#4  0x0000fffff73fe82c in absl::lts_20210324::synchronization_internal::PerThreadSem::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#5  0x0000fffff73f7ab8 in absl::lts_20210324::Mutex::DecrementSynchSem(absl::lts_20210324::Mutex*, absl::lts_20210324::base_internal::PerThreadSynch*, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#6  0x0000fffff73fd7fc in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#7  0x0000fffff73fd9a0 in absl::lts_20210324::CondVar::Wait(absl::lts_20210324::Mutex*) ()
#8  0x0000fffff73a83c0 in gpr_cv_wait ()
#9  0x0000fffff6ec1138 in grpc::CoreCodegen::gpr_cv_wait(long*, long*, gpr_timespec) ()
#10 0x0000fffff6ea3b04 in grpc::internal::CondVar::Wait(grpc::internal::Mutex*) ()
#11 0x0000fffff6ea1e98 in grpc::Server::Wait() ()
#12 0x0000fffff6762558 in mavsdk::mavsdk_server::GrpcServer::wait() ()
#13 0x0000fffff66da8f0 in MavsdkServer::Impl::wait() ()
#14 0x0000fffff66c22a4 in MavsdkServer::wait() ()
#15 0x0000fffff66c2180 in mavsdk_server_attach ()
#16 0x0000fffff66c179c in main ()
(gdb) thread apply all bt

shiftynick commented 1 year ago

The only way to recover after this is to reboot the pix.

Surely restarting mavsdk_server would fix it?

Oh, the Pixhawk is stuck. Interesting, I wish I could reproduce it.

Yes, after restarting, mavsdk_server never connects to the pix again over eth until the pix is rebooted. If I have a secondary USB connection to QGC, that one stays up.

shiftynick commented 1 year ago

Would be interesting to get the output of things like top, uorb top, mavlink status, etc. if you happen to have console access to the Pixhawk. At least mavlink status would work just over MAVLink shell.

From QGC shell after the download freezes:


nsh> top

 PID COMMAND                   CPU(ms) CPU(%)  USED/STACK PRIO(BASE) STATE FD
   0 Idle Task                  554604 72.065   272/  768   0 (  0)  READY  3
   1 hpwork                          0  0.000   292/ 1224 249 (249)  w:sem  3
   2 lpwork                          0  0.000   860/ 1576  50 ( 50)  w:sem  3
   3 nsh_main                        0  0.000  1972/ 2896 100 (100)  w:sem  4
   4 wq:manager                      0  0.000   428/ 1232 255 (255)  w:sem  3
   5 Telnet daemon                   0  0.000   556/ 1984 100 (100)  w:sem  1
   6 netinit                         0  0.000   588/ 2024  49 ( 49)  w:sem  4
1623 mavlink_shell                   0  0.000  1076/ 2000 100 (100)  w:sem  4
 247 wq:hp_default                   0  0.461   888/ 1872 237 (237)  w:sem  3
 253 dataman                         0  0.000   852/ 1280  90 ( 90)  w:sem  4
 255 wq:lp_default                   0  0.231  1084/ 1896 205 (205)  w:sem  3
 297 wq:ttyS5                        0  0.200  1108/ 1704 229 (229)  w:sem  3
 331 wq:I2C1                         0  0.000  1024/ 2312 246 (246)  w:sem  3
 334 wq:I2C2                         0  0.172   840/ 2312 245 (245)  w:sem  3
 376 wq:SPI3                         3  1.722  1384/ 2368 251 (251)  w:sem  3
 410 wq:SPI2                         3  1.775  1840/ 2368 252 (252)  w:sem  3
 429 wq:SPI1                         2  1.057  1696/ 2368 253 (253)  w:sem  3
 434 wq:I2C4                         0  0.194  1116/ 2312 243 (243)  w:sem  3
 564 wq:nav_and_controllers          3  1.550  1248/ 2216 242 (242)  w:sem  3
 565 wq:rate_ctrl                    1  0.731  1440/ 3120 255 (255)  w:sem  3
 566 wq:INS0                         7  3.608  4316/ 5976 241 (241)  w:sem  3
 569 wq:INS1                         7  3.873  4316/ 5976 240 (240)  w:sem  3
 571 wq:INS2                         7  3.885  4316/ 5976 239 (239)  w:sem  3
 574 commander                       1  0.557  1380/ 3192 140 (140)  w:sig  5
 976 mavlink_rcv_if0                 0  0.188  2292/ 5976 175 (175)  w:sem  4
 817 gps                             0  0.037  1228/ 1656 205 (205)  w:sem  4
 975 mavlink_if0                     2  1.047  1980/ 3048 100 (100)  w:sig  4
1608 log_writer_file                 0  0.000   388/ 1144  60 ( 60)  w:sem  3
1116 mavlink_if1                     0  0.000  2124/ 3048 100 (100)  w:sem  5
1127 mavlink_rcv_if1                 0  0.000  2732/ 5976 175 (175)  w:sem  5
1235 navigator                       0  0.058  1032/ 1896 105 (105)  w:sem  6
1548 logger                          0  0.169  2532/ 3616 230 (230)  w:sem  3
1624 netinit                         0  0.000   508/ 2024  49 ( 49)  w:sem  4
1617 wq:uavcan                       1  0.794  2244/ 3600 236 (236)  w:sem  3
1621 mavlink_if2                     9  4.644  1980/ 3144 100 (100)  READY  6
1622 mavlink_rcv_if2                 0  0.193  2572/ 5976 175 (175)  w:sem  6
1638 top                             0  0.007  2036/ 4056 237 (237)  RUN    3

Processes: 37 total, 3 running, 34 sleeping
CPU usage: 27.16% tasks, 0.77% sched, 72.06% idle
DMA Memory: 5120 total, 1536 used 1536 peak
Uptime: 762.860s total, 554.604s idle

nsh> uorb top

nsh> uorb top -1

update: 1s, topics: 283, total publications: 14320, 1570.2 kB/s
TOPIC NAME                     INST #SUB RATE #Q SIZE
actuator_armed                    0   19    2  1   16 
battery_status                    0    9    2  1  168 
cpuload                           0    6    2  1   16 
estimator_attitude                0    1  199  1   56 
estimator_attitude                1    2  203  1   56 
estimator_attitude                2    1  200  1   56 
estimator_attitude                3    1  199  1   56 
estimator_attitude                4    1  203  1   56 
estimator_attitude                5    1  200  1   56 
estimator_event_flags             0    1    1  1   56 
estimator_event_flags             1    1    1  1   56 
estimator_event_flags             2    1    1  1   56 
estimator_event_flags             3    1    1  1   56 
estimator_event_flags             4    1    1  1   56 
estimator_event_flags             5    1    1  1   56 
estimator_innovation_test_ratios  0    1   99  1  152 
estimator_innovation_test_ratios  1    1  101  1  152 
estimator_innovation_test_ratios  2    1  100  1  152 
estimator_innovation_test_ratios  3    1   99  1  152 
estimator_innovation_test_ratios  4    1  101  1  152 
estimator_innovation_test_ratios  5    1  100  1  152 
estimator_innovation_variances    0    1   99  1  152 
estimator_innovation_variances    1    1  101  1  152 
estimator_innovation_variances    2    1  100  1  152 
estimator_innovation_variances    3    1   99  1  152 
estimator_innovation_variances    4    1  101  1  152 
estimator_innovation_variances    5    1  100  1  152 
estimator_innovations             0    1   99  1  152 
estimator_innovations             1    1  101  1  152 
estimator_innovations             2    1  100  1  152 
estimator_innovations             3    1   99  1  152 
estimator_innovations             4    1  101  1  152 
estimator_innovations             5    1  100  1  152 
estimator_local_position          0    1   99  1  168 
estimator_local_position          1    2  101  1  168 
estimator_local_position          2    1  100  1  168 
estimator_local_position          3    1   99  1  168 
estimator_local_position          4    1  101  1  168 
estimator_local_position          5    1  100  1  168 
estimator_odometry                1    1  101  1  112 
estimator_selector_status         0    8    1  1  160 
estimator_sensor_bias             0    1    1  1  120 
estimator_sensor_bias             1    5    1  1  120 
estimator_sensor_bias             2    1    1  1  120 
estimator_sensor_bias             3    1    1  1  120 
estimator_sensor_bias             4    1    1  1  120 
estimator_sensor_bias             5    1    1  1  120 
estimator_states                  0    1   99  1  216 
estimator_states                  1    1  101  1  216 
estimator_states                  2    1  100  1  216 
estimator_states                  3    1   99  1  216 
estimator_states                  4    1  101  1  216 
estimator_states                  5    1  100  1  216 
estimator_status                  0    5   99  1  120 
estimator_status                  1    9  101  1  120 
estimator_status                  2    5  100  1  120 
estimator_status                  3    3   99  1  120 
estimator_status                  4    3  101  1  120 
estimator_status                  5    3  100  1  120 
estimator_status_flags            0    1    1  1   96 
estimator_status_flags            1    2    1  1   96 
estimator_status_flags            2    1    1  1   96 
estimator_status_flags            3    1    1  1   96 
estimator_status_flags            4    1    1  1   96 
estimator_status_flags            5    1    1  1   96 
failsafe_flags                    0    1    2  1   88 
failure_detector_status           0    1    2  1   24 
magnetometer_bias_estimate        0    2   20  1   64 
position_setpoint_triplet         0    6   21  1  224 
px4io_status                      0    1    1  1  144 
rtl_time_estimate                 0    5    1  1   24 
sensor_accel                      0    5  199  8   48 
sensor_accel                      1    6  406  8   48 
sensor_accel                      2    5  401  8   48 
sensor_baro                       0    6   23  4   32 
sensor_baro                       1    3   23  4   32 
sensor_combined                   0    7  203  1   48 
sensor_gyro                       0    6  399  8   48 
sensor_gyro                       1    5  406  8   48 
sensor_gyro                       2    5  401  8   48 
sensor_gyro_fifo                  0    1  406  4  224 
sensor_mag                        0    6   20  4   40 
sensor_preflight_mag              0    1   22  1   16 
sensors_status_imu                0    3  203  1   96 
system_power                      0    3  100  1   40 
telemetry_status                  0    2    1  1   88 
telemetry_status                  2    1    1  1   88 
vehicle_acceleration              0    1  203  1   32 
vehicle_air_data                  0   21   15  1   40 
vehicle_angular_velocity          0    9  406  1   40 
vehicle_attitude                  0   12  203  1   56 
vehicle_control_mode              0   15    2  1   24 
vehicle_imu                       0    6  199  1   56 
vehicle_imu                       1    6  203  1   56 
vehicle_imu                       2    6  200  1   56 
vehicle_imu_status                0    6   10  1  136 
vehicle_imu_status                1    7    9  1  136 
vehicle_imu_status                2    6    9  1  136 
vehicle_local_position            0   32  101  1  168 
vehicle_magnetometer              0    5   13  1   40 
vehicle_odometry                  0    1  101  1  112 
vehicle_status                    0   39    2  1   72 

nsh> mavlink status

instance #0:
    mavlink chan: #0
    type:        GENERIC LINK OR RADIO
    flow control: ON
    rates:
      tx: 341.6 B/s
      txerr: 0.0 B/s
      tx rate mult: 0.332
      tx rate max: 1200 B/s
      rx: 0.0 B/s
      rx loss: 0.0%
    FTP enabled: YES, TX enabled: YES
    mode: Normal
    Forwarding: On
    MAVLink version: 1
    transport protocol: serial (/dev/ttyS6 @57600)

instance #1:
    mavlink chan: #1
    type:        GENERIC LINK OR RADIO
    flow control: OFF
    rates:
      tx: 279627.2 B/s
      txerr: 0.0 B/s
      tx rate mult: 1.000
      tx rate max: 100000 B/s
      rx: 212.8 B/s
      rx loss: 59.9%
    Received Messages:
      sysid:204, compid:204, Total: 63 (lost: 3774)
    FTP enabled: YES, TX enabled: YES
    mode: Normal
    Forwarding: Off
    MAVLink version: 2
    transport protocol: UDP (10001, remote port: 10001)
    Broadcast enabled: YES
    ping statistics:
      last: 1.41 ms
      mean: 1.25 ms
      max: 1.41 ms
      min: 0.64 ms
      dropped packets: 1

instance #2:
    GCS heartbeat valid
    mavlink chan: #2
    type:        USB CDC
    flow control: ON
    rates:
      tx: 20947.9 B/s
      txerr: 0.0 B/s
      tx rate mult: 1.000
      tx rate max: 100000 B/s
      rx: 46.9 B/s
      rx loss: 0.0%
    Received Messages:
      sysid:255, compid:190, Total: 1614 (lost: 0)
        msgid:  126, Rate:  0.1 Hz, last 0.04s ago
        msgid:    0, Rate:  1.0 Hz, last 0.19s ago
        msgid:    4, Rate:  1.0 Hz, last 0.97s ago
    FTP enabled: YES, TX enabled: YES
    mode: Config
    Forwarding: On
    MAVLink version: 2
    transport protocol: serial (/dev/ttyACM0 @2000000)
    ping statistics:
      last: 0.43 ms
      mean: 0.56 ms
      max: 211.51 ms
      min: 0.27 ms
      dropped packets: 0

shiftynick commented 1 year ago

Unless something sticks out above, I'll try to get the setup running on macOS or an Ubuntu VM to make sure it's not the Jetson. The only other thing I can think of is the network configuration.

julianoes commented 1 year ago

The only thing I see is that the MAVLink UDP receiver goes to 0 CPU, so presumably it stops receiving anything.

1127 mavlink_rcv_if1 0 0.000 2732/ 5976 175 (175) w:sem 5
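
One way to confirm that from the companion side would be a bare-socket probe that takes MAVSDK and mavlink-router out of the picture entirely (a hypothetical diagnostic, using the UDP port from the MAV_2 settings above; run it with mavsdk_server stopped so the port is free):

import socket

# Listen on the MAVLink UDP port and see whether the autopilot is still
# transmitting anything at all after the stall.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 14540))
sock.settimeout(5.0)
try:
    data, addr = sock.recvfrom(1024)
    print(f"still receiving from {addr}: {len(data)} bytes")
except socket.timeout:
    print("nothing received in 5 s -- the autopilot side has gone silent")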

shiftynick commented 1 year ago

I have found an interesting new failure mode that takes MAVSDK out of the loop (I think).

Pix -> eth udp to Jetson -> mavlink-router to Halo LTE Network (udp) -> Halo Mavlink Forward to Local computer over VPN udp -> Local QGC -> QGC log download

Logs start to download but eventually stall and require a pix restart.

Unless Halo is using MAVSDK or QGC internally?

Still working on a setup without the Jetson in between.

shiftynick commented 1 year ago

More success modes:

Pix -> eth to macOS -> QGC log download: WORKS
Pix -> eth to macOS -> mavsdk-based log download: WORKS

It's starting to really seem like it's an issue with the Jetson networking setup/stack.

shiftynick commented 1 year ago

Finally got Wireshark with MAVLink decoding hooked up. When the stall happens, the Jetson continually sends out LOG_REQUEST_DATA packets but gets no response from the pix; in fact, the pix stops sending anything at all.

Does this sound like it's actually a PX4 issue??

10.223.129.101 is the Jetson, 10.223.129.102 is the pixhawk

(screenshot: Wireshark capture of the Jetson/Pixhawk MAVLink traffic)

julianoes commented 1 year ago

Interesting! So what's the last response by the Pixhawk that you see? Or what's the first message that triggers the Pixhawk to stop?

shiftynick commented 1 year ago

wireshark full jetson setup fail capture.pcapng.zip

Here is the full Wireshark capture. You can see the LOG_DATA and LOG_REQUEST_DATA messages going back and forth normally for a while; then the LOG_DATA messages decrease and the LOG_REQUEST_DATA messages increase, until eventually the LOG_DATA messages stop coming.
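
For reference, this exchange is MAVLink's log-transfer protocol: the client sends LOG_REQUEST_DATA (log id, byte offset, byte count) and the autopilot answers with LOG_DATA chunks. A minimal sketch of one round trip with pymavlink (assuming pymavlink is installed; the log id and port are placeholders):

from pymavlink import mavutil

# Listen where the autopilot is sending (port from the MAV_2 settings above).
conn = mavutil.mavlink_connection("udpin:0.0.0.0:14540")
conn.wait_heartbeat()

# Request the first 90 bytes (one LOG_DATA worth) of log id 0.
conn.mav.log_request_data_send(
    conn.target_system, conn.target_component,
    0,      # id: log id from a previous LOG_ENTRY
    0,      # ofs: byte offset into the log
    90)     # count: number of bytes requested
msg = conn.recv_match(type="LOG_DATA", blocking=True, timeout=5)
print(msg)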

I've been trying some PX4 builds with PX4_WARNs in various places to see whether the LOG_REQUEST_DATA messages are coming through the stack, and so far I have been able to determine that the MavlinkReceiver is not seeing anything on the fds in the MavlinkReceiver::run() loop.

int ret = poll(&fds[0], 1, timeout);

ret is always 0 after the stall happens, so it looks like the issue is deeper in the network stack? I don't even know where to begin to look into that...
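
To make the symptom concrete, here is a rough Python analogue of that receive loop (an illustration, not PX4's C++), where poll() returning no events is the equivalent of ret == 0:

import select
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 14540))        # placeholder MAVLink port
poller = select.poll()
poller.register(sock, select.POLLIN)

while True:
    events = poller.poll(1000)       # timeout in milliseconds
    if not events:
        # The stalled case: the socket never becomes readable again,
        # even though packets are visible on the wire.
        continue
    data, addr = sock.recvfrom(2048)
    # ... hand `data` to the MAVLink parser ...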

julianoes commented 1 year ago

Ok, I'm having a look at the wireshark dump, thanks for that.

Random thought: does it also happen when remote and local port numbers are not the same? They generally don't have to be, and I'm wondering if that could cause problems somewhere.

julianoes commented 1 year ago

10.223.129.101 is the Jetson, 10.223.129.102 is the pixhawk

So it looks like there is also 10.223.129.103 and all traffic is forwarded to that one. Not sure if that matters or not, but does it also happen without that forwarding on?

I don't even know where to begin to look into that...

Me neither. I've never had to look into the NuttX network stack so far.

This would be a start: https://cwiki.apache.org/confluence/display/NUTTX/Networking

shiftynick commented 1 year ago

Random thought: does it also happen when remote and local port numbers are not the same? They generally don't have to be, and I'm wondering if that could cause problems somewhere.

It does happen with different port numbers.

So it looks like there is also 10.223.129.103 and all traffic is forwarded to that one. Not sure if that matters or not, but does it also happen without that forwarding on?

It does happen without this as well.

This would be a start: https://cwiki.apache.org/confluence/display/NUTTX/Networking

Will see what I can find. Do you know who might be most familiar with this part of the stack that I could reach out to?

julianoes commented 1 year ago

When it comes to NuttX, that would be @davids5.

shiftynick commented 1 year ago

One more interesting data point: I was able to replicate the issue on an older Jetson Nano as well, except the Nano runs much, much longer before it stalls. And interestingly, it maintains a faster and much more consistent download speed while it's working.