nickatblueflite opened 1 year ago
I'm not sure if this is relevant, but looking at the mavlink status (with a vanilla mavsdk connection only) on the autopilot, there seems to be a lot of rx loss on the eth connection. I don't know exactly what this means or how to interpret it:
instance #1:
mavlink chan: #1
type: GENERIC LINK OR RADIO
flow control: OFF
rates:
tx: 988.0 B/s
txerr: 0.0 B/s
tx rate mult: 1.000
tx rate max: 100000 B/s
rx: 21.0 B/s
rx loss: 54.4%
Received Messages:
sysid:205, compid:205, Total: 56 (lost: 3044)
msgid: 4, Rate: 0.0 Hz, last 4.27s ago
msgid: 0, Rate: 1.0 Hz, last 0.31s ago
FTP enabled: YES, TX enabled: YES
mode: Normal
MAVLink version: 2
transport protocol: UDP (10001, remote port: 10001)
Broadcast enabled: YES
ping statistics:
last: 0.88 ms
mean: 0.89 ms
max: 1.15 ms
min: 0.81 ms
dropped packets: 0
Here is the mavlink status when running the connection over ethernet through mavlink-router to QGC, no rx loss:
instance #1:
GCS heartbeat valid
mavlink chan: #1
type: GENERIC LINK OR RADIO
flow control: OFF
rates:
tx: 983.9 B/s
txerr: 0.0 B/s
tx rate mult: 1.000
tx rate max: 100000 B/s
rx: 21.0 B/s
rx loss: 0.0%
Received Messages:
sysid:255, compid:190, Total: 80 (lost: 0)
msgid: 126, Rate: 0.0 Hz, last 0.02s ago
msgid: 4, Rate: 0.1 Hz, last 3.54s ago
msgid: 0, Rate: 1.0 Hz, last 0.59s ago
FTP enabled: YES, TX enabled: YES
mode: Normal
MAVLink version: 2
transport protocol: UDP (10001, remote port: 10001)
Broadcast enabled: YES
ping statistics:
last: 2.11 ms
mean: 2.14 ms
max: 2.37 ms
min: 1.76 ms
dropped packets: 0
Can you paste the output of mavsdk_server, mostly so I can see the version of that one...
And thanks for creating the issue.
[11:42:08|Info ] MAVSDK version: v1.4.16 (mavsdk_impl.cpp:20)
[11:42:08|Info ] Waiting to discover system on udp://:10001... (connection_initiator.h:20)
[11:42:09|Info ] New system on: 10.223.129.102:10001 (with sysid: 1) (udp_connection.cpp:192)
[11:42:09|Debug] New: System ID: 1 Comp ID: 1 (mavsdk_impl.cpp:496)
[11:42:09|Debug] Component Autopilot (1) added. (system_impl.cpp:377)
[11:42:09|Debug] Discovered 1 component(s) (system_impl.cpp:578)
[11:42:09|Info ] System discovered (connection_initiator.h:63)
[11:42:09|Info ] Server started (grpc_server.cpp:53)
[11:42:09|Info ] Server set to listen on 0.0.0.0:50051 (grpc_server.cpp:54)
** this is when the log download code starts
[11:42:27|Debug] Received all entries (log_files_impl.cpp:169)
[11:42:28|Debug] Received all entries (log_files_impl.cpp:169)
[11:42:28|Debug] 46080 B of 10106446 B (630.742 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 92160 B of 10106446 B (672.571 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 138240 B of 10106446 B (676.016 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 184320 B of 10106446 B (672.264 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 230400 B of 10106446 B (667.227 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 276480 B of 10106446 B (667.346 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 322560 B of 10106446 B (666.647 kiB/s) (log_files_impl.cpp:424)
[11:42:28|Debug] 368640 B of 10106446 B (658.622 kiB/s) (log_files_impl.cpp:424)
...
[11:42:42|Debug] 9722880 B of 10106446 B (646.863 kiB/s) (log_files_impl.cpp:424)
[11:42:42|Debug] 9768960 B of 10106446 B (646.869 kiB/s) (log_files_impl.cpp:424)
[11:42:42|Debug] 9815040 B of 10106446 B (647.01 kiB/s) (log_files_impl.cpp:424)
[11:42:42|Debug] 9861120 B of 10106446 B (647.023 kiB/s) (log_files_impl.cpp:424)
[11:42:43|Debug] 9907200 B of 10106446 B (647.12 kiB/s) (log_files_impl.cpp:424)
[11:42:43|Debug] 9953280 B of 10106446 B (638.553 kiB/s) (log_files_impl.cpp:424)
[11:42:46|Info ] heartbeats timed out (system_impl.cpp:272)
Nothing else ever prints after this.
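Side note, in case it helps to read those Debug lines: the byte counts map directly to percent complete. A quick throwaway parser (purely illustrative, not part of MAVSDK):

```python
import re

# Matches MAVSDK's log_files_impl.cpp progress lines, e.g.
# "[11:42:28|Debug] 46080 B of 10106446 B (672.571 kiB/s) (log_files_impl.cpp:424)"
PROGRESS_RE = re.compile(r"(\d+) B of (\d+) B \(([\d.]+) kiB/s\)")

def parse_progress(line):
    """Return (done_bytes, total_bytes, percent) or None for non-progress lines."""
    m = PROGRESS_RE.search(line)
    if not m:
        return None
    done, total = int(m.group(1)), int(m.group(2))
    return done, total, 100.0 * done / total
```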
Interesting! Thanks.
Note, I have this on my todo list, and it's definitely something I want to look into because it could be a bug on the MAVSDK side which I always want to get rid of. However, I can't give you an exact date when I'll get to it, I'm hoping for next week.
If this is an urgent or blocking issue for you, consider sponsoring me to move the task up my todo list.
Here is a little more info:
Scenarios that have this failure mode:
pix -> UDP (eth) -> Jetson -> mavsdk connect to udp://:10001 (our port setup on the pix)
pix -> UDP (eth) -> Jetson -> mavlink-router connect to udp://:10001 (our port setup on the pix) -> mavsdk connect to mavlink-router udp://:10002 (our port setup for mavlink-router)
pix -> USB (serial) -> Jetson -> mavlink-router connect to serial:///dev/ttyACM0 -> mavsdk connect to mavlink-router udp://:10002 (our port setup for mavlink-router)
Scenarios that work:
pix -> USB (serial) -> Jetson -> mavsdk connect to serial:///dev/ttyACM0
So it turns out the issue still manifests even without the eth connection in the loop. If we connect the pix to the jetson via USB (serial) but run mavlink-router between the pix and mavsdk, we get the same issue. So perhaps it's UDP related?
I'm trying to reproduce this and so far I've set up:
For the last case, this is my mavlink-router config:
[General]
DebugLogLevel = Debug
[UartEndpoint bob]
Device = /dev/ttyACM0
Baud = 921600
# send to 127.0.0.1:11000
[UdpEndpoint charlie]
Mode = Normal
Address = 127.0.0.1
Port = 14540
In order to connect over USB I had to connect via QGC first, so that MAVLink on USB is started, then I switched over to mavlink-router.
The files I downloaded were between 5MB and 24MB.
So, I unfortunately can't reproduce your problem. I suggest that you give me more details on how I could reproduce this. I don't have a Jetson though.
And one more thought, could it have to do with latency and a timeout being triggered?
there seems to be a lot of rx loss on the eth connection.
My guess is that this is just MAVSDK not sending the sequence numbers correctly which makes the stats misleading. See https://github.com/mavlink/MAVSDK/issues/1467.
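For context, the rx loss figure is derived from gaps in MAVLink's 8-bit packet sequence number. A simplified sketch of the counting, just to build intuition (PX4's actual implementation in the mavlink module differs in detail):

```python
def count_lost(seqs):
    """Count messages lost between consecutive MAVLink sequence numbers.

    MAVLink carries an 8-bit sequence number that wraps at 256; a jump
    larger than 1 between consecutive packets from the same sender is
    counted as loss.
    """
    lost = 0
    for prev, cur in zip(seqs, seqs[1:]):
        gap = (cur - prev) & 0xFF  # wrap-around-safe delta
        if gap > 1:
            lost += gap - 1
    return lost
```

A sender that doesn't increment its sequence number correctly inflates this counter even when no packets were actually dropped, which is why the stats can be misleading.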
I also tried the log file download integration test with low timeouts and lossy connections and both still worked. :thinking:
Interestingly, I'm having success with the USB scenario today; it's not immediately clear why.
My previous testing was on 1.13. I moved up to 1.14beta to see if that would resolve the remaining eth-only issues.
So far I still have the same issue. Here is as much info as I can think to include for my exact setup:
Pixhawk 6X with PX4 v1.14beta2 -> eth to mavsdk_server v1.4.16 -> test.py
HolyBro 6X
Jetson Xavier NX
PX4 1.14.beta2 [18898f1]
PX4 netman config:
nsh> netman show
DEVICE=eth0
BOOTPROTO=static
NETMASK=255.255.255.0
IPADDR=10.223.129.102
ROUTER=10.223.129.1
DNS=10.223.129.1
PX4 MAV2:
MAV_2_BROADCAST = Always Broadcast
MAV_2_CONFIG = Ethernet
MAV_2_FLOW_CTRL = Auto-detected
MAV_2_FORWARD = Disabled
MAV_2_MODE = Normal
MAV_2_RATE = 100000
MAV_2_REMOTE_PRT = 14540
MAV_2_UDP_PORT = 14540
Jetson IP config:
ip address flush dev eth0
ip route flush dev eth0
ip address add 10.223.129.101/24 brd + dev eth0
ip route add 10.223.129.1 dev eth0
mavsdk_server command:
sudo /usr/local/lib/python3.11/dist-packages/mavsdk/bin/mavsdk_server udp://:14540 -p 50204 --sysid 204 --compid 204
test.py:
#!/usr/bin/env python3

import asyncio
import sys

from mavsdk import System


async def run():
    drone = System(mavsdk_server_address="localhost", port=50204, sysid=204, compid=204)
    await drone.connect()

    print("Waiting for drone to connect...")
    async for state in drone.core.connection_state():
        if state.is_connected:
            print("-- Connected to drone!")
            break

    entries = await get_entries(drone)
    for entry in entries:
        await download_log(drone, entry)


async def download_log(drone, entry):
    date_without_colon = entry.date.replace(":", "-")
    filename = f"./log-{date_without_colon}.ulog"
    print(f"Downloading: log {entry.id} from {entry.date} to (unknown)")
    previous_progress = -1
    async for progress in drone.log_files.download_log_file(entry, filename):
        new_progress = round(progress.progress * 100)
        if new_progress != previous_progress:
            sys.stdout.write(f"\r{new_progress} %")
            sys.stdout.flush()
            previous_progress = new_progress
    print()


async def get_entries(drone):
    entries = await drone.log_files.get_entries()
    for entry in entries:
        print(f"Log {entry.id} from {entry.date}")
    return entries


if __name__ == "__main__":
    # Run the asyncio loop
    asyncio.run(run())
Maybe something sticks out?
One thing to try would be to launch mavsdk_server under gdb, press Ctrl+C when it stalls, and then run thread apply all bt
to check where it hangs. I expect we won't see too much though, because it's a release build.
Edit: I'm creating a debug build and will attach it here as soon as it's done. Edit 2: The build is taking longer; I need to fix some build dependencies with Python future first. I'll continue tomorrow.
thread apply all bt is not returning anything for me, but running bt on the current thread gives:
(gdb) bt
#0 0x0000fffff7acdb28 in syscall ()
#1 0x0000fffff7865a08 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#2 0x0000fffff786590c in AbslInternalPerThreadSemWait_lts_20210324 ()
#3 0x0000fffff7865368 in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#4 0x0000fffff7837ee0 in gpr_cv_wait ()
#5 0x0000fffff74f4418 in grpc::Server::Wait() ()
#6 0x0000fffff6fdeac4 in main ()
It really feels like MAVSDK is just stuck listening; maybe the problem is on the PX4 side?
The only way to recover after this is to reboot the pix.
The only way to recover after this is to reboot the pix.
Surely restarting mavsdk_server would fix it?
Oh, the Pixhawk is stuck. Interesting, I wish I could reproduce it.
This is an aarch64 musl debug build that works on my RPi4 (64-bit).
Ok, I've copied exactly your commit and settings on the Pixhawk and RPi4, and it downloads my 4 test logs OK (although I get warnings about "Ignoring wrong offset (log_files_impl.cpp:365)").
Would be interesting to get the output of things like top, uorb top, mavlink status, etc. if you happen to have console access to the Pixhawk. At least mavlink status would work just over the MAVLink shell.
Ok, I've copied exactly your commit and settings on the Pixhawk and RPi4, and it downloads my 4 test logs OK (although I get warnings about "Ignoring wrong offset (log_files_impl.cpp:365)").
I do get lots of those warnings as well.
This is an aarch64 musl debug build that works on my RPi4 (64-bit).
This build works on the Jetson as well. I get a little more info from bt, but still nothing from thread apply all bt:
[05:15:31|Debug] 875520 B of 1176816 B (523.409 kiB/s) (log_files_impl.cpp:424)
[05:15:31|Error] Ignoring wrong offset (log_files_impl.cpp:365)
[05:15:31|Debug] 921600 B of 1176816 B (520.279 kiB/s) (log_files_impl.cpp:424)
[05:15:31|Error] Ignoring wrong offset (log_files_impl.cpp:365)
[05:15:31|Debug] 967680 B of 1176816 B (481.654 kiB/s) (log_files_impl.cpp:424)
[05:15:31|Error] Ignoring wrong offset (log_files_impl.cpp:365)
[05:15:34|Info ] heartbeats timed out (system_impl.cpp:272)
^C
Thread 1 "mavsdk_server2" received signal SIGINT, Interrupt.
0x0000fffff767a000 in syscall ()
(gdb) bt
#0 0x0000fffff767a000 in syscall ()
#1 0x0000fffff7400600 in absl::lts_20210324::synchronization_internal::FutexImpl::WaitUntil(std::atomic<int>*, int, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#2 0x0000fffff7400350 in absl::lts_20210324::synchronization_internal::Waiter::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#3 0x0000fffff7400034 in AbslInternalPerThreadSemWait_lts_20210324 ()
#4 0x0000fffff73fe82c in absl::lts_20210324::synchronization_internal::PerThreadSem::Wait(absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#5 0x0000fffff73f7ab8 in absl::lts_20210324::Mutex::DecrementSynchSem(absl::lts_20210324::Mutex*, absl::lts_20210324::base_internal::PerThreadSynch*, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#6 0x0000fffff73fd7fc in absl::lts_20210324::CondVar::WaitCommon(absl::lts_20210324::Mutex*, absl::lts_20210324::synchronization_internal::KernelTimeout) ()
#7 0x0000fffff73fd9a0 in absl::lts_20210324::CondVar::Wait(absl::lts_20210324::Mutex*) ()
#8 0x0000fffff73a83c0 in gpr_cv_wait ()
#9 0x0000fffff6ec1138 in grpc::CoreCodegen::gpr_cv_wait(long*, long*, gpr_timespec) ()
#10 0x0000fffff6ea3b04 in grpc::internal::CondVar::Wait(grpc::internal::Mutex*) ()
#11 0x0000fffff6ea1e98 in grpc::Server::Wait() ()
#12 0x0000fffff6762558 in mavsdk::mavsdk_server::GrpcServer::wait() ()
#13 0x0000fffff66da8f0 in MavsdkServer::Impl::wait() ()
#14 0x0000fffff66c22a4 in MavsdkServer::wait() ()
#15 0x0000fffff66c2180 in mavsdk_server_attach ()
#16 0x0000fffff66c179c in main ()
(gdb) thread apply all bt
The only way to recover after this is to reboot the pix.
Surely restarting mavsdk_server would fix it?
Oh, the Pixhawk is stuck. Interesting, I wish I could reproduce it.
Yes, restarting mavsdk_server never connects with the pix again over eth until rebooting the pix. If I have a secondary USB connection to QGC that stays up.
Would be interesting to get the output of things like top, uorb top, mavlink status, etc. if you happen to have console access to the Pixhawk. At least mavlink status would work just over the MAVLink shell.
From QGC shell after the download freezes:
PID COMMAND CPU(ms) CPU(%) USED/STACK PRIO(BASE) STATE FD
0 Idle Task 554604 72.065 272/ 768 0 ( 0) READY 3
1 hpwork 0 0.000 292/ 1224 249 (249) w:sem 3
2 lpwork 0 0.000 860/ 1576 50 ( 50) w:sem 3
3 nsh_main 0 0.000 1972/ 2896 100 (100) w:sem 4
4 wq:manager 0 0.000 428/ 1232 255 (255) w:sem 3
5 Telnet daemon 0 0.000 556/ 1984 100 (100) w:sem 1
6 netinit 0 0.000 588/ 2024 49 ( 49) w:sem 4
1623 mavlink_shell 0 0.000 1076/ 2000 100 (100) w:sem 4
247 wq:hp_default 0 0.461 888/ 1872 237 (237) w:sem 3
253 dataman 0 0.000 852/ 1280 90 ( 90) w:sem 4
255 wq:lp_default 0 0.231 1084/ 1896 205 (205) w:sem 3
297 wq:ttyS5 0 0.200 1108/ 1704 229 (229) w:sem 3
331 wq:I2C1 0 0.000 1024/ 2312 246 (246) w:sem 3
334 wq:I2C2 0 0.172 840/ 2312 245 (245) w:sem 3
376 wq:SPI3 3 1.722 1384/ 2368 251 (251) w:sem 3
410 wq:SPI2 3 1.775 1840/ 2368 252 (252) w:sem 3
429 wq:SPI1 2 1.057 1696/ 2368 253 (253) w:sem 3
434 wq:I2C4 0 0.194 1116/ 2312 243 (243) w:sem 3
564 wq:nav_and_controllers 3 1.550 1248/ 2216 242 (242) w:sem 3
565 wq:rate_ctrl 1 0.731 1440/ 3120 255 (255) w:sem 3
566 wq:INS0 7 3.608 4316/ 5976 241 (241) w:sem 3
569 wq:INS1 7 3.873 4316/ 5976 240 (240) w:sem 3
571 wq:INS2 7 3.885 4316/ 5976 239 (239) w:sem 3
574 commander 1 0.557 1380/ 3192 140 (140) w:sig 5
976 mavlink_rcv_if0 0 0.188 2292/ 5976 175 (175) w:sem 4
817 gps 0 0.037 1228/ 1656 205 (205) w:sem 4
975 mavlink_if0 2 1.047 1980/ 3048 100 (100) w:sig 4
1608 log_writer_file 0 0.000 388/ 1144 60 ( 60) w:sem 3
1116 mavlink_if1 0 0.000 2124/ 3048 100 (100) w:sem 5
1127 mavlink_rcv_if1 0 0.000 2732/ 5976 175 (175) w:sem 5
1235 navigator 0 0.058 1032/ 1896 105 (105) w:sem 6
1548 logger 0 0.169 2532/ 3616 230 (230) w:sem 3
1624 netinit 0 0.000 508/ 2024 49 ( 49) w:sem 4
1617 wq:uavcan 1 0.794 2244/ 3600 236 (236) w:sem 3
1621 mavlink_if2 9 4.644 1980/ 3144 100 (100) READY 6
1622 mavlink_rcv_if2 0 0.193 2572/ 5976 175 (175) w:sem 6
1638 top 0 0.007 2036/ 4056 237 (237) RUN 3
Processes: 37 total, 3 running, 34 sleeping
CPU usage: 27.16% tasks, 0.77% sched, 72.06% idle
DMA Memory: 5120 total, 1536 used 1536 peak
Uptime: 762.860s total, 554.604s idle
nsh> uorb top
nsh> uorb top -1
update: 1s, topics: 283, total publications: 14320, 1570.2 kB/s
TOPIC NAME INST #SUB RATE #Q SIZE
actuator_armed 0 19 2 1 16
battery_status 0 9 2 1 168
cpuload 0 6 2 1 16
estimator_attitude 0 1 199 1 56
estimator_attitude 1 2 203 1 56
estimator_attitude 2 1 200 1 56
estimator_attitude 3 1 199 1 56
estimator_attitude 4 1 203 1 56
estimator_attitude 5 1 200 1 56
estimator_event_flags 0 1 1 1 56
estimator_event_flags 1 1 1 1 56
estimator_event_flags 2 1 1 1 56
estimator_event_flags 3 1 1 1 56
estimator_event_flags 4 1 1 1 56
estimator_event_flags 5 1 1 1 56
estimator_innovation_test_ratios 0 1 99 1 152
estimator_innovation_test_ratios 1 1 101 1 152
estimator_innovation_test_ratios 2 1 100 1 152
estimator_innovation_test_ratios 3 1 99 1 152
estimator_innovation_test_ratios 4 1 101 1 152
estimator_innovation_test_ratios 5 1 100 1 152
estimator_innovation_variances 0 1 99 1 152
estimator_innovation_variances 1 1 101 1 152
estimator_innovation_variances 2 1 100 1 152
estimator_innovation_variances 3 1 99 1 152
estimator_innovation_variances 4 1 101 1 152
estimator_innovation_variances 5 1 100 1 152
estimator_innovations 0 1 99 1 152
estimator_innovations 1 1 101 1 152
estimator_innovations 2 1 100 1 152
estimator_innovations 3 1 99 1 152
estimator_innovations 4 1 101 1 152
estimator_innovations 5 1 100 1 152
estimator_local_position 0 1 99 1 168
estimator_local_position 1 2 101 1 168
estimator_local_position 2 1 100 1 168
estimator_local_position 3 1 99 1 168
estimator_local_position 4 1 101 1 168
estimator_local_position 5 1 100 1 168
estimator_odometry 1 1 101 1 112
estimator_selector_status 0 8 1 1 160
estimator_sensor_bias 0 1 1 1 120
estimator_sensor_bias 1 5 1 1 120
estimator_sensor_bias 2 1 1 1 120
estimator_sensor_bias 3 1 1 1 120
estimator_sensor_bias 4 1 1 1 120
estimator_sensor_bias 5 1 1 1 120
estimator_states 0 1 99 1 216
estimator_states 1 1 101 1 216
estimator_states 2 1 100 1 216
estimator_states 3 1 99 1 216
estimator_states 4 1 101 1 216
estimator_states 5 1 100 1 216
estimator_status 0 5 99 1 120
estimator_status 1 9 101 1 120
estimator_status 2 5 100 1 120
estimator_status 3 3 99 1 120
estimator_status 4 3 101 1 120
estimator_status 5 3 100 1 120
estimator_status_flags 0 1 1 1 96
estimator_status_flags 1 2 1 1 96
estimator_status_flags 2 1 1 1 96
estimator_status_flags 3 1 1 1 96
estimator_status_flags 4 1 1 1 96
estimator_status_flags 5 1 1 1 96
failsafe_flags 0 1 2 1 88
failure_detector_status 0 1 2 1 24
magnetometer_bias_estimate 0 2 20 1 64
position_setpoint_triplet 0 6 21 1 224
px4io_status 0 1 1 1 144
rtl_time_estimate 0 5 1 1 24
sensor_accel 0 5 199 8 48
sensor_accel 1 6 406 8 48
sensor_accel 2 5 401 8 48
sensor_baro 0 6 23 4 32
sensor_baro 1 3 23 4 32
sensor_combined 0 7 203 1 48
sensor_gyro 0 6 399 8 48
sensor_gyro 1 5 406 8 48
sensor_gyro 2 5 401 8 48
sensor_gyro_fifo 0 1 406 4 224
sensor_mag 0 6 20 4 40
sensor_preflight_mag 0 1 22 1 16
sensors_status_imu 0 3 203 1 96
system_power 0 3 100 1 40
telemetry_status 0 2 1 1 88
telemetry_status 2 1 1 1 88
vehicle_acceleration 0 1 203 1 32
vehicle_air_data 0 21 15 1 40
vehicle_angular_velocity 0 9 406 1 40
vehicle_attitude 0 12 203 1 56
vehicle_control_mode 0 15 2 1 24
vehicle_imu 0 6 199 1 56
vehicle_imu 1 6 203 1 56
vehicle_imu 2 6 200 1 56
vehicle_imu_status 0 6 10 1 136
vehicle_imu_status 1 7 9 1 136
vehicle_imu_status 2 6 9 1 136
vehicle_local_position 0 32 101 1 168
vehicle_magnetometer 0 5 13 1 40
vehicle_odometry 0 1 101 1 112
vehicle_status 0 39 2 1 72
nsh> mavlink status
instance #0:
mavlink chan: #0
type: GENERIC LINK OR RADIO
flow control: ON
rates:
tx: 341.6 B/s
txerr: 0.0 B/s
tx rate mult: 0.332
tx rate max: 1200 B/s
rx: 0.0 B/s
rx loss: 0.0%
FTP enabled: YES, TX enabled: YES
mode: Normal
Forwarding: On
MAVLink version: 1
transport protocol: serial (/dev/ttyS6 @57600)
instance #1:
mavlink chan: #1
type: GENERIC LINK OR RADIO
flow control: OFF
rates:
tx: 279627.2 B/s
txerr: 0.0 B/s
tx rate mult: 1.000
tx rate max: 100000 B/s
rx: 212.8 B/s
rx loss: 59.9%
Received Messages:
sysid:204, compid:204, Total: 63 (lost: 3774)
FTP enabled: YES, TX enabled: YES
mode: Normal
Forwarding: Off
MAVLink version: 2
transport protocol: UDP (10001, remote port: 10001)
Broadcast enabled: YES
ping statistics:
last: 1.41 ms
mean: 1.25 ms
max: 1.41 ms
min: 0.64 ms
dropped packets: 1
instance #2:
GCS heartbeat valid
mavlink chan: #2
type: USB CDC
flow control: ON
rates:
tx: 20947.9 B/s
txerr: 0.0 B/s
tx rate mult: 1.000
tx rate max: 100000 B/s
rx: 46.9 B/s
rx loss: 0.0%
Received Messages:
sysid:255, compid:190, Total: 1614 (lost: 0)
msgid: 126, Rate: 0.1 Hz, last 0.04s ago
msgid: 0, Rate: 1.0 Hz, last 0.19s ago
msgid: 4, Rate: 1.0 Hz, last 0.97s ago
FTP enabled: YES, TX enabled: YES
mode: Config
Forwarding: On
MAVLink version: 2
transport protocol: serial (/dev/ttyACM0 @2000000)
ping statistics:
last: 0.43 ms
mean: 0.56 ms
max: 211.51 ms
min: 0.27 ms
dropped packets: 0
Unless something sticks out above, I'll try to get the setup running on macOS or an Ubuntu VM to make sure it's not the Jetson; the only other thing I can think of is the network configuration.
The only thing I see is that the MAVLink UDP receiver goes to 0 CPU, so presumably it stops receiving anything.
1127 mavlink_rcv_if1 0 0.000 2732/ 5976 175 (175) w:sem 5
I have found an interesting new failure mode that takes mavsdk out of the loop (I think).
Pix -> eth udp to Jetson -> mavlink-router to Halo LTE Network (udp) -> Halo Mavlink Forward to Local computer over VPN udp -> Local QGC -> QGC log download
Logs start to download but eventually stall and require pix restart
Unless Halo is using mavsdk or QGC?
Still working on a setup without the jetson in between
More success modes:
Pix -> eth to macOS -> QGC log download WORKS
Pix -> eth to macOS -> mavsdk based log download WORKS
It's starting to really seem like it's an issue with the Jetson networking setup/stack.
Finally got Wireshark with MAVLink dissection hooked up. When the stall happens, the Jetson continually sends out LOG_REQUEST_DATA packets but gets no response from the pix; in fact, the pix stops sending anything at all.
Does this sound like it's actually a PX4 issue??
10.223.129.101 is the Jetson, 10.223.129.102 is the pixhawk
Interesting! So what's the last response by the Pixhawk that you see? Or what's the first message that triggers the Pixhawk to stop?
wireshark full jetson setup fail capture.pcapng.zip
Here is the full Wireshark capture. You can see the LOG_DATA and LOG_REQUEST_DATA going back and forth normally for a while, then the LOG_DATA messages decrease and the LOG_REQUEST_DATA messages increase until eventually the LOG_DATA messages stop coming.
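That pattern is easy to check mechanically once the capture is exported as (timestamp, message name) pairs. This little helper is hypothetical (the field layout and the 2-second threshold are my choices, not from any tool):

```python
def find_stall(events, gap=2.0):
    """Return the timestamp of the last LOG_DATA before a stall, or None.

    A stall here means a LOG_REQUEST_DATA arrives more than `gap` seconds
    after the most recent LOG_DATA, i.e. the requester is retrying while
    the autopilot has gone silent. `events` is a list of
    (timestamp_seconds, message_name) tuples, e.g. exported from Wireshark.
    """
    last_data = None
    for t, name in events:
        if name == "LOG_DATA":
            last_data = t
        elif name == "LOG_REQUEST_DATA" and last_data is not None:
            if t - last_data > gap:
                return last_data
    return None
```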
I've been trying some PX4 builds with PX4_WARNs in various places to see whether the LOG_REQUEST_DATA messages are making it through the stack, and so far I've determined that the MavlinkReceiver is not seeing anything on the fds in the MavlinkReceiver::run() loop:
int ret = poll(&fds[0], 1, timeout);
ret is always 0 after the stall happens, so it looks like the issue is deeper in the network stack. I don't even know where to begin to look into that...
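To picture what those PX4_WARNs showed, here is a loopback-only Python sketch of the same poll pattern (nothing PX4-specific; sockets and timeouts are arbitrary):

```python
import select
import socket

def wait_readable(sock, timeout):
    """Return True if a datagram is pending on sock within timeout seconds.

    This mirrors the poll(&fds[0], 1, timeout) call in PX4's
    MavlinkReceiver::run(): ret > 0 means data pending, ret == 0 means
    the receiver is starving.
    """
    readable, _, _ = select.select([sock], [], [], timeout)
    return bool(readable)

# Two loopback UDP sockets standing in for the Pixhawk-side receiver and
# the Jetson-side sender (ports picked by the OS, no hardware involved).
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))
tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

tx.sendto(b"LOG_REQUEST_DATA", rx.getsockname())
assert wait_readable(rx, 0.5)      # traffic flowing: poll() > 0
rx.recvfrom(1024)
assert not wait_readable(rx, 0.2)  # peer silent: poll() == 0, forever
```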
Ok, I'm having a look at the wireshark dump, thanks for that.
Random thought: does it also happen when remote and local port numbers are not the same? They generally don't have to be, and I'm wondering if that could cause problems somewhere.
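For anyone puzzled by that thought: UDP local and remote ports are independent in general, which a quick loopback demonstration shows (illustrative only, unrelated to the PX4 configuration):

```python
import socket

# Two UDP sockets, each bound to its own OS-chosen ephemeral port.
a = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
a.bind(("127.0.0.1", 0))           # local port X
b = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
b.bind(("127.0.0.1", 0))           # local port Y, distinct from X

b.sendto(b"hello", a.getsockname())
data, addr = a.recvfrom(16)
assert data == b"hello"
assert addr == b.getsockname()     # sender address carries b's port, not a's
a.sendto(b"ack", addr)             # replying works despite unequal ports
assert b.recvfrom(16)[0] == b"ack"
```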
10.223.129.101 is the Jetson, 10.223.129.102 is the pixhawk
So it looks like there is also 10.223.129.103 and all traffic is forwarded to that one. Not sure if that matters or not, but does it also happen without that forwarding on?
I don't even know where to begin to look into that...
Me neither. I've never had to look into the NuttX network stack so far.
This would be a start: https://cwiki.apache.org/confluence/display/NUTTX/Networking
Random thought: does it also happen when remote and local port numbers are not the same? They generally don't have to be, and I'm wondering if that could cause problems somewhere.
It does happen with different port numbers.
So it looks like there is also 10.223.129.103 and all traffic is forwarded to that one. Not sure if that matters or not, but does it also happen without that forwarding on?
It does happen without this as well.
This would be a start: https://cwiki.apache.org/confluence/display/NUTTX/Networking
Will see what I can find. Do you know who might be most familiar with this part of the stack that I could reach out to?
When it comes to NuttX, that would be @davids5.
One more interesting data point: I was able to replicate the issue on an older Jetson Nano as well, except on the Nano it runs much, much longer before it stalls. Interestingly, it also maintains a faster and much more consistent download speed while it is working.
Hardware Setup: Holybro 6X <- eth -> Jetson Xavier
Using the code out of the logfile_download.py example I have no trouble downloading the full set of files with a serial connection.
However, when connecting over an eth connection, downloads will start fine, then always stall and never recover.
In fact, the "connection" goes down fully and the only way to reconnect is to reboot the pix. I've run the mavsdk_server manually to watch its output and the only thing that happens is the heartbeats timed out message appears and then nothing.
The connection seems stable otherwise and works well when using mavlink-router to forward to a GCS over the network.