mavlink / MAVSDK-Java

MAVSDK client for Java.
76 stars 44 forks source link

MAVSDK 2.1.0 Seg Fault Crashes When Tearing Down Server #180

Open rayw-dronesense opened 2 months ago

rayw-dronesense commented 2 months ago

Environment

(Does not occur with 1.3.1)

Repro

  1. Install example app (attached as ReproMavsdkCrash.zip)
  2. Start the server
  3. Connect the Simulator to the app (on port 14540)
  4. Wait until aircraft attitude is shown in Logcat
  5. Click "stop server"
  6. Crash

Stack trace

2024-09-19 16:36:17.585 14170-14336 MAVLinkService          com.example.repromavcrash            E  Got attitude from aircraft: -1.2388214 -1.231313 93.332565
2024-09-19 16:36:17.595 14170-14336 MAVLinkService          com.example.repromavcrash            E  Got attitude from aircraft: -1.2398055 -1.2350508 93.33456
2024-09-19 16:36:17.601 14170-14336 MAVLinkService          com.example.repromavcrash            E  Got attitude from aircraft: -1.2399287 -1.2368649 93.33481
2024-09-19 16:36:17.631   273-13156 MVE_OMX_IF              media.codec                          E  Buffer abandoned. buffer_id 6f9. (mve_omx_common_component.c:1987)
2024-09-19 16:36:17.700 14170-14208 MAVLinkService          com.example.repromavcrash            E  Server 14551 no longer active.
2024-09-19 16:36:17.700 14170-14215 libc                    com.example.repromavcrash            A  Fatal signal 11 (SIGSEGV), code 1, fault addr 0x16a in tid 14215 (DefaultDispatch)
2024-09-19 16:36:17.701   230-230   <no-tag>                debuggerd64                          W  debuggerd: handling request: pid=14170 uid=10064 gid=10064 tid=14215
2024-09-19 16:36:17.739 14347-14347 DEBUG                   pid-14347                            A  *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2024-09-19 16:36:17.739 14347-14347 DEBUG                   pid-14347                            A  Build fingerprint: 'CubePilot/song/song:7.1.2/N2G47J/RU01240522:user/test-keys'
2024-09-19 16:36:17.739 14347-14347 DEBUG                   pid-14347                            A  Revision: '0'
2024-09-19 16:36:17.739 14347-14347 DEBUG                   pid-14347                            A  ABI: 'arm64'
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A  pid: 14170, tid: 14215, name: DefaultDispatch  >>> com.example.repromavcrash <<<
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A  signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x16a
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x0   000000000000016a  x1   00000074d905f398  x2   0000000000000000  x3   00000074d8f5d4e8
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x4   00000074d8f5d460  x5   00000075146bef4c  x6   00000074d8f5d450  x7   ffffffffffffffff
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x8   00000074d905f3d0  x9   fc83543f6d416706  x10  ffffffffc4653600  x11  0000000000000000
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x12  0000000000000001  x13  00000074d905ef88  x14  0000000000000041  x15  0000000000000000
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x16  00000074db55b220  x17  00000075146bf968  x18  000000750fbf07e0  x19  000000000000016a
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x20  00000074d905f3d0  x21  0000000000000022  x22  0000000000000000  x23  0000000000000000
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x24  00000074db2d5928  x25  00000074d905f3e8  x26  00000074db2d1258  x27  00000074db2d1398
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      x28  00000074db2d1450  x29  00000074d905f380  x30  00000074db526ee0
2024-09-19 16:36:17.740 14347-14347 DEBUG                   pid-14347                            A      sp   00000074d905f380  pc   00000075146bf968  pstate 0000000060000000
2024-09-19 16:36:17.743 14347-14347 DEBUG                   pid-14347                            A  
                                                                                                    backtrace:
2024-09-19 16:36:17.743 14347-14347 DEBUG                   pid-14347                            A      #00 pc 0000000000068968  /system/lib64/libc.so (pthread_mutex_lock)
2024-09-19 16:36:17.743 14347-14347 DEBUG                   pid-14347                            A      #01 pc 00000000000c5edc  /data/app/com.example.repromavcrash-1/base.apk (offset 0x8000)
julianoes commented 2 months ago

Hm, any chance you can get a backtrace with more info? It looks like it happens in some lock but I'm not sure which one and when.

rayw-dronesense commented 2 months ago

Sure, here's the Tombstone for it

tombstone_09.zip

rayw-dronesense commented 2 months ago

Slightly different tombstone for it

tombstone_07.zip

rayw-dronesense commented 1 month ago

So I nailed down the commit that introduced this particular crash, went through the different versions on the v2.12 branch on MAVSDK and saw that this: https://github.com/mavlink/MAVSDK/commit/71126ac267a10d1c5a7fcb7a3c963688b2b32277

Reverting this change while on the 2.12.6 tag made this particular crash go away.

But I do see the intent of the change was to improve connecting to systems, so we probably don't want to remove this but instead fix this somehow...

julianoes commented 1 month ago

Great find, sorry for being slow here.

Given you seem to be able to build the server and try out different versions, any chance you can build debug mode and get more output where it crashes?

julianoes commented 1 month ago

Or if you have hints how I can reproduce it with the example apps that would work too.

Problem is I'm not familiar with the Java stuff but I do want to help fix this bug.

rayw-dronesense commented 1 month ago

Hi @julianoes - I attached a custom example app (https://github.com/user-attachments/files/17066260/ReproMavsdkCrash.zip) that can easily repro the issue.

It can be done by,

  1. Start example app
  2. Click "start server"
  3. Connect the Gazeo-Headless-PX4 sim to the app
  4. wait for attitude to start showing up
  5. Hit the "stop server" button
  6. Reproduced crash
julianoes commented 1 month ago

Thanks for that. I can reproduce it, but I'm not sure what to make of it yet.

It looks like it crashes here: https://github.com/mavlink/MAVSDK/blob/27c5c30749076d205f8f5e46622171051b255386/src/mavsdk/core/log.h#L70-L83

But that doesn't look right if you ask me as the thombstone suggests pthread_mutex_lock.

julianoes commented 1 month ago

Ok, so now I realized that you're starting 3 servers, one on each UDP port. Are you actually connecting 3 vehicles? Or is this just to connect three ports? If it is to do 3 ports but one vehicle, you might want to add some mavlink-router or MAVSDK based forwarder in-between.

In any case, the crash happens for the one where nothing is connected! I'll have a closer look into this.

julianoes commented 1 month ago

I've finally managed to copy in my debug build mavsdk_server.so and added some printfs.

It turns out the _should_exit flag here is not set to true, for whatever mystery reason:

https://github.com/mavlink/MAVSDK/blob/v2.12/src/mavsdk_server/src/connection_initiator.h#L35

julianoes commented 1 month ago

After plenty of head scratching I figured out what was going on, here is the fix: https://github.com/mavlink/MAVSDK/pull/2417.

julianoes commented 1 month ago

Fix coming with https://github.com/mavlink/MAVSDK/releases/tag/v2.12.9.

@JonasVautherin do you mind making a Java release once the artifacts are ready? Thanks!

JonasVautherin commented 1 month ago

I guess I should wait for 2.12.10 with https://github.com/mavlink/MAVSDK/pull/2421?

julianoes commented 1 month ago

@JonasVautherin no https://github.com/mavlink/MAVSDK/pull/2421 is only on main, not v2.12.

JonasVautherin commented 1 month ago

That's done in mavsdk_server:2.1.3 :+1:

daniel-sales commented 1 month ago

I see some crashes with "ReproMavsdkCrash" app even after bumping mavsdk-server to 2.1.3. Crashes happen after some consecutive server start/stop calls, or if app is relaunched after a successful drone connection without stopping the server first.

julianoes commented 1 month ago

@daniel-sales ok I will have to try to reproduce that again.

Ok, so now I realized that you're starting 3 servers, one on each UDP port. Are you actually connecting 3 vehicles? Or is this just to connect three ports? If it is to do 3 ports but one vehicle, you might want to add some mavlink-router or MAVSDK based forwarder in-between.

I'm still curious on this @rayw-dronesense and @daniel-sales.

daniel-sales commented 1 month ago

@julianoes On my end I'm starting only one server on port (14551).

Regarding the connection issue, it is reproductible with the @rayw-dronesense's example app if you:

  1. launch the app (I left only 14551 on ports list at MAVLinkService.kt)
  2. start server
  3. wait for drone connection and attitude logs
  4. close the app without hitting "stop server"
  5. relaunch the app

Here's the stack trace:

2024-10-08 09:22:36.406  5635-5718  Mavsdk                  com.example.repromavcrash            I  MAVSDK version: v2.12.9
2024-10-08 09:22:36.414  5635-5718  MAVSDK-Server           com.example.repromavcrash            D  Running mavsdk_server with connection url: udp://:14551
2024-10-08 09:22:36.418  5635-5991  Mavsdk                  com.example.repromavcrash            I  New system on: 127.0.0.1:47005 (with system ID: 1)
2024-10-08 09:22:36.418  5635-5991  Mavsdk                  com.example.repromavcrash            D  New system ID: 1 Comp ID: 1
2024-10-08 09:22:36.425  5635-5991  Mavsdk                  com.example.repromavcrash            D  Component Autopilot (1) added.
2024-10-08 09:22:36.444  5635-5991  Mavsdk                  com.example.repromavcrash            D  New system ID: 42 Comp ID: 190
2024-10-08 09:22:36.449  5635-5991  Mavsdk                  com.example.repromavcrash            D  Component Ground station (190) added.
2024-10-08 09:22:36.449  5635-5991  Mavsdk                  com.example.repromavcrash            D  Discovered 1 component(s)
2024-10-08 09:22:36.510  5635-5991  Mavsdk                  com.example.repromavcrash            D  Component Camera 1 (100) added.
2024-10-08 09:22:36.516  5635-5718  Mavsdk                  com.example.repromavcrash            I  System discovered
2024-10-08 09:22:36.529  5635-5718  Mavsdk                  com.example.repromavcrash            I  Server started
2024-10-08 09:22:36.529  5635-5718  Mavsdk                  com.example.repromavcrash            I  Server set to listen on 0.0.0.0:42723
2024-10-08 09:22:36.529  5635-5718  MAVSDK-Server           com.example.repromavcrash            D  mavsdk_server is now running, listening on port 42723
2024-10-08 09:22:36.530  5635-5718  MAVLinkService          com.example.repromavcrash            E  Got drone connection io.mavsdk.System@e7d61f5 on port 42723
2024-10-08 09:22:36.552  5635-5748  MAVLinkService          com.example.repromavcrash            E  Connection state changed: false
2024-10-08 09:22:36.554  5635-5748  MAVLinkService          com.example.repromavcrash            E  Connection state changed: true
2024-10-08 09:22:36.557  5635-5748  MAVLinkService          com.example.repromavcrash            E  Connection state changed: false
2024-10-08 09:22:36.558  5635-5748  MAVLinkService          com.example.repromavcrash            E  Connection state changed: true
2024-10-08 09:22:36.580  5635-6011  libc                    com.example.repromavcrash            A  Fatal signal 6 (SIGABRT), code -6 in tid 6011 (DefaultDispatch)
julianoes commented 1 month ago

Ok, so this is a SIGABORT and happens when something is connected, not disconnected like before.

julianoes commented 1 month ago

I can't reproduce this just yet. When you say "close the app", how do you do that? I'm using the emulator and hit stop, then I can just relaunch.

My guess is that you run into a bind issue with port already bound by the previous run.

daniel-sales commented 1 month ago

I can't reproduce this just yet. When you say "close the app", how do you do that?

Hitting AndroidStudio "stop" button or using Android device UI to close the app. App crashes on next "start server" attempt.

My guess is that you run into a bind issue with port already bound by the previous run.

That makes sense. The crashes go away when I revert PR#2388, so maybe the new autopilot discovery method is interfering on the automatic port release by system when app is destroyed?

julianoes commented 1 month ago

I honestly don't see how that would be related. And I could not reproduce it that way I'm afraid.

daniel-sales commented 1 month ago

I just made more tests here and indeed, issue is not reproductible with px4-gazebo-headless simulator. However, it is easily reproductible on my real drone setup. So here is the detailed info for more reliable repro steps:

Crash happening on "start server" command after a few start/stop loops, or on next "start server" after closing the app without manual server stop.

Stack trace

2024-10-11 10:52:57.492  8080-8267  Mavsdk                  com.example.repromavcrash            I  MAVSDK version: v2.12.9
2024-10-11 10:52:57.493  8080-8267  MAVSDK-Server           com.example.repromavcrash            D  Running mavsdk_server with connection url: udp://:14551
2024-10-11 10:52:57.502  8080-8272  Mavsdk                  com.example.repromavcrash            I  New system on: 127.0.0.1:43748 (with system ID: 1)
2024-10-11 10:52:57.502  8080-8272  Mavsdk                  com.example.repromavcrash            D  New system ID: 1 Comp ID: 1
2024-10-11 10:52:57.503  8080-8272  Mavsdk                  com.example.repromavcrash            D  Component Autopilot (1) added.
2024-10-11 10:52:57.529  8080-8272  Mavsdk                  com.example.repromavcrash            D  New system ID: 42 Comp ID: 190
2024-10-11 10:52:57.530  8080-8272  Mavsdk                  com.example.repromavcrash            D  Component Ground station (190) added.
2024-10-11 10:52:57.530  8080-8272  Mavsdk                  com.example.repromavcrash            D  Discovered 1 component(s)
2024-10-11 10:52:57.595  8080-8267  Mavsdk                  com.example.repromavcrash            I  System discovered
2024-10-11 10:52:57.611  8080-8272  Mavsdk                  com.example.repromavcrash            D  Component Unsupported component (68) added.
2024-10-11 10:52:57.614  8080-8272  Mavsdk                  com.example.repromavcrash            D  Component Camera 1 (100) added.
2024-10-11 10:52:57.628  8080-8267  Mavsdk                  com.example.repromavcrash            I  Server started
2024-10-11 10:52:57.628  8080-8267  Mavsdk                  com.example.repromavcrash            I  Server set to listen on 0.0.0.0:34613
2024-10-11 10:52:57.628  8080-8267  MAVSDK-Server           com.example.repromavcrash            D  mavsdk_server is now running, listening on port 34613
2024-10-11 10:52:57.631  8080-8267  MAVLinkService          com.example.repromavcrash            E  Got drone connection io.mavsdk.System@5322373 on port 34613
2024-10-11 10:52:57.641  8080-8267  System.err              com.example.repromavcrash            W  SLF4J(W): No SLF4J providers were found.
2024-10-11 10:52:57.641  8080-8267  System.err              com.example.repromavcrash            W  SLF4J(W): Defaulting to no-operation (NOP) logger implementation
2024-10-11 10:52:57.641  8080-8267  System.err              com.example.repromavcrash            W  SLF4J(W): See https://www.slf4j.org/codes.html#noProviders for further details.
2024-10-11 10:52:57.705  8080-8292  art                     com.example.repromavcrash            W  Before Android 4.1, method double java.util.concurrent.ThreadLocalRandom.internalNextDouble(double, double) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.705  8080-8292  art                     com.example.repromavcrash            W  Before Android 4.1, method int java.util.concurrent.ThreadLocalRandom.internalNextInt(int, int) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.705  8080-8292  art                     com.example.repromavcrash            W  Before Android 4.1, method long java.util.concurrent.ThreadLocalRandom.internalNextLong(long, long) would have incorrectly overridden the package-private method in java.util.Random
2024-10-11 10:52:57.785  8080-8296  MAVLinkService          com.example.repromavcrash            E  Connection state changed: false
2024-10-11 10:52:57.787  8080-8296  MAVLinkService          com.example.repromavcrash            E  Connection state changed: true
2024-10-11 10:52:57.795  8080-8296  MAVLinkService          com.example.repromavcrash            E  Connection state changed: false
2024-10-11 10:52:57.796   296-321   BoardControl            pid-296                              E  Unable to create sockfd: Connection refused
2024-10-11 10:52:57.799  8080-8296  MAVLinkService          com.example.repromavcrash            E  Connection state changed: true
2024-10-11 10:52:57.826  8080-8299  libc                    com.example.repromavcrash            A  Fatal signal 6 (SIGABRT), code -6 in tid 8299 (DefaultDispatch)
2024-10-11 10:52:57.827   232-232   <no-tag>                pid-232                              W  debuggerd: handling request: pid=8080 uid=10064 gid=10064 tid=8299
2024-10-11 10:52:57.913  8302-8302  DEBUG                   pid-8302                             A  *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
2024-10-11 10:52:57.914  8302-8302  DEBUG                   pid-8302                             A  Build fingerprint: 'CubePilot/song/song:7.1.2/N2G47J/RU01240301:user/test-keys'
2024-10-11 10:52:57.914  8302-8302  DEBUG                   pid-8302                             A  Revision: '0'
2024-10-11 10:52:57.914  8302-8302  DEBUG                   pid-8302                             A  ABI: 'arm64'
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A  pid: 8080, tid: 8299, name: DefaultDispatch  >>> com.example.repromavcrash <<<
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A  signal 6 (SIGABRT), code -6 (SI_TKILL), fault addr --------
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A      x0   0000000000000000  x1   000000000000206b  x2   0000000000000006  x3   0000000000000008
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A      x4   000000000000006e  x5   0000000000000000  x6   0000007d09f70110  x7   0000000000000000
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A      x8   0000000000000083  x9   ffffffffffffffdf  x10  0000000000000000  x11  0000000000000001
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A      x12  ffffffffffffffff  x13  0000000000000020  x14  000000000000000c  x15  0000000000000000
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A      x16  0000007d44dd3ee0  x17  0000007d44d7d364  x18  0000007d423f07e0  x19  0000007d09f704f8
2024-10-11 10:52:57.915  8302-8302  DEBUG                   pid-8302                             A      x20  0000000000000006  x21  0000007d09f70450  x22  0000000000000000  x23  0000007d2526e8c0
2024-10-11 10:52:57.916  8302-8302  DEBUG                   pid-8302                             A      x24  0000007d2526e8a0  x25  0000007d39b00fa0  x26  0000007d39a252d0  x27  0000000000000000
2024-10-11 10:52:57.916  8302-8302  DEBUG                   pid-8302                             A      x28  0000000000000000  x29  0000007d09f6fc60  x30  0000007d44d7a80c
2024-10-11 10:52:57.916  8302-8302  DEBUG                   pid-8302                             A      sp   0000007d09f6fc40  pc   0000007d44d7d36c  pstate 0000000060000000
2024-10-11 10:52:57.930  8302-8302  DEBUG                   pid-8302                             A  
                                                                                                    backtrace:
2024-10-11 10:52:57.930  8302-8302  DEBUG                   pid-8302                             A      #00 pc 000000000006b36c  /system/lib64/libc.so (tgkill+8)
2024-10-11 10:52:57.930  8302-8302  DEBUG                   pid-8302                             A      #01 pc 0000000000068808  /system/lib64/libc.so (pthread_kill+64)
2024-10-11 10:52:57.930  8302-8302  DEBUG                   pid-8302                             A      #02 pc 0000000000023d68  /system/lib64/libc.so (raise+24)
2024-10-11 10:52:57.930  8302-8302  DEBUG                   pid-8302                             A      #03 pc 000000000001c7ec  /system/lib64/libc.so (abort+52)
2024-10-11 10:52:57.930  8302-8302  DEBUG                   pid-8302                             A      #04 pc 0000000000003e0c  /data/app/com.example.repromavcrash-1/base.apk (offset 0x147d000)
---------------------------- PROCESS ENDED (8080) for package com.example.repromavcrash ----------------------------
julianoes commented 1 month ago

Ok, any chance you can debug this yourself by building the .so file in debug mode and copying it in as these instructions: https://github.com/mavlink/MAVSDK-Java?tab=readme-ov-file#debugging-without-pushing-to-maven

That's essentially what I did to debug this. It should hopefully tell us which abort call is happening.