Open rayw-dronesense opened 2 months ago
Hm, any chance you can get a backtrace with more info? It looks like it happens in some lock but I'm not sure which one and when.
Sure, here's the Tombstone for it
Slightly different tombstone for it
So I nailed down the commit that introduced this particular crash, went through the different versions on the v2.12 branch on MAVSDK and saw that this: https://github.com/mavlink/MAVSDK/commit/71126ac267a10d1c5a7fcb7a3c963688b2b32277
Reverting this change while on the 2.12.6 tag made this particular crash go away.
But I do see the intent of the change was to improve connecting to systems, so we probably don't want to remove this but instead fix this somehow...
Great find, sorry for being slow here.
Given you seem to be able to build the server and try out different versions, any chance you can build debug mode and get more output where it crashes?
Or if you have hints how I can reproduce it with the example apps that would work too.
Problem is I'm not familiar with the Java stuff but I do want to help fix this bug.
Hi @julianoes - I attached a custom example app (https://github.com/user-attachments/files/17066260/ReproMavsdkCrash.zip) that can easily repro the issue.
It can be done by,
Thanks for that. I can reproduce it, but I'm not sure what to make of it yet.
It looks like it crashes here: https://github.com/mavlink/MAVSDK/blob/27c5c30749076d205f8f5e46622171051b255386/src/mavsdk/core/log.h#L70-L83
But that doesn't look right if you ask me as the thombstone suggests pthread_mutex_lock.
Ok, so now I realized that you're starting 3 servers, one on each UDP port. Are you actually connecting 3 vehicles? Or is this just to connect three ports? If it is to do 3 ports but one vehicle, you might want to add some mavlink-router or MAVSDK based forwarder in-between.
In any case, the crash happens for the one where nothing is connected! I'll have a closer look into this.
I've finally managed to copy in my debug build mavsdk_server.so and added some printfs.
It turns out the _should_exit
flag here is not set to true
, for whatever mystery reason:
https://github.com/mavlink/MAVSDK/blob/v2.12/src/mavsdk_server/src/connection_initiator.h#L35
After plenty of head scratching I figured out what was going on, here is the fix: https://github.com/mavlink/MAVSDK/pull/2417.
Fix coming with https://github.com/mavlink/MAVSDK/releases/tag/v2.12.9.
@JonasVautherin do you mind making a Java release once the artifacts are ready? Thanks!
I guess I should wait for 2.12.10 with https://github.com/mavlink/MAVSDK/pull/2421?
@JonasVautherin no https://github.com/mavlink/MAVSDK/pull/2421 is only on main, not v2.12.
That's done in mavsdk_server:2.1.3 :+1:
I see some crashes with "ReproMavsdkCrash" app even after bumping mavsdk-server to 2.1.3. Crashes happen after some consecutive server start/stop calls, or if app is relaunched after a successful drone connection without stopping the server first.
@daniel-sales ok I will have to try to reproduce that again.
Ok, so now I realized that you're starting 3 servers, one on each UDP port. Are you actually connecting 3 vehicles? Or is this just to connect three ports? If it is to do 3 ports but one vehicle, you might want to add some mavlink-router or MAVSDK based forwarder in-between.
I'm still curious on this @rayw-dronesense and @daniel-sales.
@julianoes On my end I'm starting only one server on port (14551).
Regarding the connection issue, it is reproductible with the @rayw-dronesense's example app if you:
ports
list at MAVLinkService.kt
)Here's the stack trace:
2024-10-08 09:22:36.406 5635-5718 Mavsdk com.example.repromavcrash I MAVSDK version: v2.12.9
2024-10-08 09:22:36.414 5635-5718 MAVSDK-Server com.example.repromavcrash D Running mavsdk_server with connection url: udp://:14551
2024-10-08 09:22:36.418 5635-5991 Mavsdk com.example.repromavcrash I New system on: 127.0.0.1:47005 (with system ID: 1)
2024-10-08 09:22:36.418 5635-5991 Mavsdk com.example.repromavcrash D New system ID: 1 Comp ID: 1
2024-10-08 09:22:36.425 5635-5991 Mavsdk com.example.repromavcrash D Component Autopilot (1) added.
2024-10-08 09:22:36.444 5635-5991 Mavsdk com.example.repromavcrash D New system ID: 42 Comp ID: 190
2024-10-08 09:22:36.449 5635-5991 Mavsdk com.example.repromavcrash D Component Ground station (190) added.
2024-10-08 09:22:36.449 5635-5991 Mavsdk com.example.repromavcrash D Discovered 1 component(s)
2024-10-08 09:22:36.510 5635-5991 Mavsdk com.example.repromavcrash D Component Camera 1 (100) added.
2024-10-08 09:22:36.516 5635-5718 Mavsdk com.example.repromavcrash I System discovered
2024-10-08 09:22:36.529 5635-5718 Mavsdk com.example.repromavcrash I Server started
2024-10-08 09:22:36.529 5635-5718 Mavsdk com.example.repromavcrash I Server set to listen on 0.0.0.0:42723
2024-10-08 09:22:36.529 5635-5718 MAVSDK-Server com.example.repromavcrash D mavsdk_server is now running, listening on port 42723
2024-10-08 09:22:36.530 5635-5718 MAVLinkService com.example.repromavcrash E Got drone connection io.mavsdk.System@e7d61f5 on port 42723
2024-10-08 09:22:36.552 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-08 09:22:36.554 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-08 09:22:36.557 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-08 09:22:36.558 5635-5748 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-08 09:22:36.580 5635-6011 libc com.example.repromavcrash A Fatal signal 6 (SIGABRT), code -6 in tid 6011 (DefaultDispatch)
Ok, so this is a SIGABORT and happens when something is connected, not disconnected like before.
I can't reproduce this just yet. When you say "close the app", how do you do that? I'm using the emulator and hit stop, then I can just relaunch.
My guess is that you run into a bind issue with port already bound by the previous run.
I can't reproduce this just yet. When you say "close the app", how do you do that?
Hitting AndroidStudio "stop" button or using Android device UI to close the app. App crashes on next "start server" attempt.
My guess is that you run into a bind issue with port already bound by the previous run.
That makes sense. The crashes go away when I revert PR#2388, so maybe the new autopilot discovery method is interfering on the automatic port release by system when app is destroyed?
I honestly don't see how that would be related. And I could not reproduce it that way I'm afraid.
I just made more tests here and indeed, issue is not reproductible with px4-gazebo-headless simulator. However, it is easily reproductible on my real drone setup. So here is the detailed info for more reliable repro steps:
Crash happening on "start server" command after a few start/stop loops, or on next "start server" after closing the app without manual server stop.
Stack trace
2024-10-11 10:52:57.492 8080-8267 Mavsdk com.example.repromavcrash I MAVSDK version: v2.12.9
2024-10-11 10:52:57.493 8080-8267 MAVSDK-Server com.example.repromavcrash D Running mavsdk_server with connection url: udp://:14551
2024-10-11 10:52:57.502 8080-8272 Mavsdk com.example.repromavcrash I New system on: 127.0.0.1:43748 (with system ID: 1)
2024-10-11 10:52:57.502 8080-8272 Mavsdk com.example.repromavcrash D New system ID: 1 Comp ID: 1
2024-10-11 10:52:57.503 8080-8272 Mavsdk com.example.repromavcrash D Component Autopilot (1) added.
2024-10-11 10:52:57.529 8080-8272 Mavsdk com.example.repromavcrash D New system ID: 42 Comp ID: 190
2024-10-11 10:52:57.530 8080-8272 Mavsdk com.example.repromavcrash D Component Ground station (190) added.
2024-10-11 10:52:57.530 8080-8272 Mavsdk com.example.repromavcrash D Discovered 1 component(s)
2024-10-11 10:52:57.595 8080-8267 Mavsdk com.example.repromavcrash I System discovered
2024-10-11 10:52:57.611 8080-8272 Mavsdk com.example.repromavcrash D Component Unsupported component (68) added.
2024-10-11 10:52:57.614 8080-8272 Mavsdk com.example.repromavcrash D Component Camera 1 (100) added.
2024-10-11 10:52:57.628 8080-8267 Mavsdk com.example.repromavcrash I Server started
2024-10-11 10:52:57.628 8080-8267 Mavsdk com.example.repromavcrash I Server set to listen on 0.0.0.0:34613
2024-10-11 10:52:57.628 8080-8267 MAVSDK-Server com.example.repromavcrash D mavsdk_server is now running, listening on port 34613
2024-10-11 10:52:57.631 8080-8267 MAVLinkService com.example.repromavcrash E Got drone connection io.mavsdk.System@5322373 on port 34613
2024-10-11 10:52:57.641 8080-8267 System.err com.example.repromavcrash W SLF4J(W): No SLF4J providers were found.
2024-10-11 10:52:57.641 8080-8267 System.err com.example.repromavcrash W SLF4J(W): Defaulting to no-operation (NOP) logger implementation
2024-10-11 10:52:57.641 8080-8267 System.err com.example.repromavcrash W SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
2024-10-11 10:52:57.705 8080-8292 art com.example.repromavcrash W Before Android 4.1, method double java.util.concurrent.ThreadLocalRandom.internalNextDouble(double, double) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.705 8080-8292 art com.example.repromavcrash W Before Android 4.1, method int java.util.concurrent.ThreadLocalRandom.internalNextInt(int, int) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.705 8080-8292 art com.example.repromavcrash W Before Android 4.1, method long java.util.concurrent.ThreadLocalRandom.internalNextLong(long, long) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.785 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-11 10:52:57.787 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-11 10:52:57.795 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: false
2024-10-11 10:52:57.796 296-321 BoardControl pid-296 E Unable to create sockfd: Connection refused
2024-10-11 10:52:57.799 8080-8296 MAVLinkService com.example.repromavcrash E Connection state changed: true
2024-10-11 10:52:57.826 8080-8299 libc com.example.repromavcrash A Fatal signal 6 (SIGABRT), code -6 in tid 8299 (DefaultDispatch)
2024-10-11 10:52:57.827 232-232 <no-tag> pid-232 W debuggerd: handling request: pid=8080 uid=10064 gid=10064 tid=8299
2024-10-11 10:52:57.913 8302-8302 DEBUG pid-8302 A *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2024-10-11 10:52:57.914 8302-8302 DEBUG pid-8302 A Build fingerprint: 'CubePilot/song/song:7.1.2/N2G47J/RU01240301:user/test-keys'
2024-10-11 10:52:57.914 8302-8302 DEBUG pid-8302 A Revision: '0'
2024-10-11 10:52:57.914 8302-8302 DEBUG pid-8302 A ABI: 'arm64'
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A pid: 8080, tid: 8299, name: DefaultDispatch >>> com.example.repromavcrash <<<
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x0 0000000000000000 x1 000000000000206b x2 0000000000000006 x3 0000000000000008
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x4 000000000000006e x5 0000000000000000 x6 0000007d09f70110 x7 0000000000000000
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x8 0000000000000083 x9 ffffffffffffffdf x10 0000000000000000 x11 0000000000000001
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x12 ffffffffffffffff x13 0000000000000020 x14 000000000000000c x15 0000000000000000
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x16 0000007d44dd3ee0 x17 0000007d44d7d364 x18 0000007d423f07e0 x19 0000007d09f704f8
2024-10-11 10:52:57.915 8302-8302 DEBUG pid-8302 A x20 0000000000000006 x21 0000007d09f70450 x22 0000000000000000 x23 0000007d2526e8c0
2024-10-11 10:52:57.916 8302-8302 DEBUG pid-8302 A x24 0000007d2526e8a0 x25 0000007d39b00fa0 x26 0000007d39a252d0 x27 0000000000000000
2024-10-11 10:52:57.916 8302-8302 DEBUG pid-8302 A x28 0000000000000000 x29 0000007d09f6fc60 x30 0000007d44d7a80c
2024-10-11 10:52:57.916 8302-8302 DEBUG pid-8302 A sp 0000007d09f6fc40 pc 0000007d44d7d36c pstate 0000000060000000
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A
backtrace:
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #00 pc 000000000006b36c /system/lib64/libc.so (tgkill+8)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #01 pc 0000000000068808 /system/lib64/libc.so (pthread_kill+64)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #02 pc 0000000000023d68 /system/lib64/libc.so (raise+24)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #03 pc 000000000001c7ec /system/lib64/libc.so (abort+52)
2024-10-11 10:52:57.930 8302-8302 DEBUG pid-8302 A #04 pc 0000000000003e0c /data/app/com.example.repromavcrash-1/base.apk (offset 0x147d000)
---------------------------- PROCESS ENDED (8080) for package com.example.repromavcrash ----------------------------
Ok, any chance you can debug this yourself by building the .so file in debug mode and copying it in as these instructions: https://github.com/mavlink/MAVSDK-Java?tab=readme-ov-file#debugging-without-pushing-to-maven
That's essentially what I did to debug this. It should hopefully tell us which abort call is happening.
Environment
(Does not occur with 1.3.1)
Repro
Stack trace