openzipkin / zipkin

Zipkin is a distributed tracing system
https://zipkin.io/
Apache License 2.0

Docker image not working on arm64 when Elasticsearch storage is used #3532

Closed rogierslag closed 11 months ago

rogierslag commented 1 year ago

Describe the Bug

When starting the default Docker image on an arm64 Mac with the Elasticsearch storage enabled, the JVM segfaults:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000020820, pid=1, tid=7
#
# JRE version: OpenJDK Runtime Environment (17.0.5+8) (build 17.0.5+8-alpine-r2)
# Java VM: OpenJDK 64-Bit Server VM (17.0.5+8-alpine-r2, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  0x0000000000020820
#
# Core dump will be written. Default location: /zipkin/core (max size 1000 kB). To ensure a full core dump, try "ulimit -c unlimited" before starting Java again
#
# If you would like to submit a bug report, please visit:
#   https://gitlab.alpinelinux.org/alpine/aports/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

For size reasons, I've moved the entire dump to a gist: https://gist.github.com/rogierslag/2fc39fb959c09f04bf12f8470284902f

Steps to Reproduce

On an arm64 Mac with Docker installed, run the following command:

$ docker run --rm -e "STORAGE_TYPE=elasticsearch" -e "ES_HOSTS=http://es.dev:9200" -e "ES_INDEX=zipkin" openzipkin/zipkin:2.24.1

It crashes immediately with the following output:


                  oo
                 oooo
                oooooo
               oooooooo
              oooooooooo
             oooooooooooo
           ooooooo  ooooooo
          oooooo     ooooooo
         oooooo       ooooooo
        oooooo   o  o   oooooo
       oooooo   oo  oo   oooooo
     ooooooo  oooo  oooo  ooooooo
    oooooo   ooooo  ooooo  ooooooo
   oooooo   oooooo  oooooo  ooooooo
  oooooooo      oo  oo      oooooooo
  ooooooooooooo oo  oo ooooooooooooo
      oooooooooooo  oooooooooooo
          oooooooo  oooooooo
              oooo  oooo

     ________ ____  _  _____ _   _
    |__  /_ _|  _ \| |/ /_ _| \ | |
      / / | || |_) | ' / | ||  \| |
     / /_ | ||  __/| . \ | || |\  |
    |____|___|_|   |_|\_\___|_| \_|

:: version 2.24.1 :: commit 5e3402a ::

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000020820, pid=1, tid=7
#
# JRE version: OpenJDK Runtime Environment (17.0.5+8) (build 17.0.5+8-alpine-r2)
# Java VM: OpenJDK 64-Bit Server VM (17.0.5+8-alpine-r2, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  0x0000000000020820
#
# No core dump will be written. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /zipkin/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://gitlab.alpinelinux.org/alpine/aports/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

On a Linux machine, the same command boots Zipkin just fine. On the same Mac, Zipkin also boots fine without the Elasticsearch settings, using the following Docker command:

$ docker run --rm openzipkin/zipkin:2.24.1

Expected Behaviour

Zipkin boots normally on the Mac, as it does on Linux.

DevLomoSE commented 1 year ago

Same with LePotato board

Amlogic S905X SoC

Linux host 6.1.26-05272-g26c406245a2c #1 SMP PREEMPT_DYNAMIC Thu Apr 27 10:15:40 UTC 2023 aarch64 GNU/Linux

[ ... ]
:: version 2.24.1 :: commit 5e3402a ::

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000000000020820, pid=1, tid=7
#
# JRE version: OpenJDK Runtime Environment (17.0.5+8) (build 17.0.5+8-alpine-r2)
# Java VM: OpenJDK 64-Bit Server VM (17.0.5+8-alpine-r2, mixed mode, tiered, compressed oops, compressed class ptrs, serial gc, linux-aarch64)
# Problematic frame:
# C  0x0000000000020820
#
# Core dump will be written. Default location: /zipkin/core
#
# An error report file with more information is saved as:
# /zipkin/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
#   https://gitlab.alpinelinux.org/alpine/aports/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#

codefromthecrypt commented 11 months ago

I will look into this once I finish some maintenance backlog, which might fix it.

codefromthecrypt commented 11 months ago

This is still an issue. It crashes in Netty's tcnative integration, while the same command does not crash on the raw host with STORAGE_TYPE=elasticsearch ES_HOSTS=http://es.dev:9200 ES_INDEX=zipkin

$ docker run --entrypoint /bin/sh -it --rm  openzipkin/zipkin:2.25.1
~ $ STORAGE_TYPE=elasticsearch ES_HOSTS=http://es.dev:9200 ES_INDEX=zipkin /usr/local/bin/start-zipkin

                  oo
                 oooo
                oooooo
               oooooooo
              oooooooooo
             oooooooooooo
           ooooooo  ooooooo
          oooooo     ooooooo
         oooooo       ooooooo
        oooooo   o  o   oooooo
       oooooo   oo  oo   oooooo
     ooooooo  oooo  oooo  ooooooo
    oooooo   ooooo  ooooo  ooooooo
   oooooo   oooooo  oooooo  ooooooo
  oooooooo      oo  oo      oooooooo
  ooooooooooooo oo  oo ooooooooooooo
      oooooooooooo  oooooooooooo
          oooooooo  oooooooo
              oooo  oooo

     ________ ____  _  _____ _   _
    |__  /_ _|  _ \| |/ /_ _| \ | |
      / / | || |_) | ' / | ||  \| |
     / /_ | ||  __/| . \ | || |\  |
    |____|___|_|   |_|\_\___|_| \_|

:: version 2.25.1 :: commit 82c3d7a ::

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00000000000226a0, pid=194, tid=195
#
# JRE version: OpenJDK Runtime Environment (21.0.1+12) (build 21.0.1+12-alpine-r0)
# Java VM: OpenJDK 64-Bit Server VM (21.0.1+12-alpine-r0, mixed mode, tiered, compressed oops, compressed class ptrs, g1 gc, linux-aarch64)
# Problematic frame:
# C  [libnetty_tcnative_linux_aarch_6410182490547859919359.so+0x2330c]  init_have_lse_atomics+0xc
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /zipkin/core.194)
#
# An error report file with more information is saved as:
# /zipkin/hs_err_pid194.log
[1.331s][warning][os] Loading hsdis library failed
#
# If you would like to submit a bug report, please visit:
#   https://gitlab.alpinelinux.org/alpine/aports/issues
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
#
Aborted (core dumped)
~ $ find . -name \*tcnative\*
./BOOT-INF/lib/netty-tcnative-classes-2.0.61.Final.jar
./BOOT-INF/lib/netty-tcnative-boringssl-static-2.0.61.Final-linux-x86_64.jar
./BOOT-INF/lib/netty-tcnative-boringssl-static-2.0.61.Final-osx-aarch_64.jar
./BOOT-INF/lib/netty-tcnative-boringssl-static-2.0.61.Final-windows-x86_64.jar
./BOOT-INF/lib/netty-tcnative-boringssl-static-2.0.61.Final-osx-x86_64.jar
./BOOT-INF/lib/netty-tcnative-boringssl-static-2.0.61.Final-linux-aarch_64.jar
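
A note on which of these jars is relevant: on an arm64 Mac the container still runs Linux, so the native library that gets loaded is the linux-aarch_64 one (as the crash frame name shows), not the osx-aarch_64 one. The sketch below maps `uname` output to the classifier suffix used in the jar names above. The function name `tcnative_classifier` is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: map kernel name and machine type to the classifier
# suffix seen in the netty-tcnative-boringssl-static jar names.
tcnative_classifier() {
  # $1: kernel name as printed by `uname -s`
  # $2: machine type as printed by `uname -m`
  case "$1" in
    Linux)  os=linux ;;
    Darwin) os=osx ;;
    *)      os=unknown ;;
  esac
  case "$2" in
    aarch64|arm64) arch=aarch_64 ;;
    x86_64|amd64)  arch=x86_64 ;;
    *)             arch=unknown ;;
  esac
  printf '%s-%s\n' "$os" "$arch"
}

# Print the classifier for the machine (or container) this runs in.
tcnative_classifier "$(uname -s)" "$(uname -m)"
```

Run inside the container, this helps confirm which of the listed jars matches the crashing library.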

I opened https://gitlab.alpinelinux.org/alpine/aports/-/issues/15582 with Alpine just in case there's something they can look into.

Meanwhile, @openzipkin/armeria is there a way to disable tcnative via a flag? Also, if anyone has a chance to help with this, it would be appreciated. I fear that if we can't resolve this, we will need to remove boringssl, as M1 Mac is a primary use case.

codefromthecrypt commented 11 months ago

@rogierslag @DevLomoSE meanwhile, you can switch to the zipkin-slim image which excludes boringssl. As long as you aren't using messaging you can use this instead.
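
A minimal sketch of that workaround, reusing the settings from the original report. The tag in `IMAGE_TAG` is an assumption (it presumes a zipkin-slim tag matching the version in use was published), and the function name `zipkin_slim_cmd` is hypothetical. The function only prints the command so it can be reviewed before running.

```shell
# Assumption: a zipkin-slim tag matching the affected release exists.
IMAGE_TAG="${IMAGE_TAG:-2.24.1}"

# Hypothetical helper: print the docker command for the slim image,
# with the same Elasticsearch settings as the failing command.
zipkin_slim_cmd() {
  printf '%s\n' "docker run --rm -e STORAGE_TYPE=elasticsearch -e ES_HOSTS=http://es.dev:9200 -e ES_INDEX=zipkin openzipkin/zipkin-slim:${IMAGE_TAG}"
}

zipkin_slim_cmd
# To actually start the container, run the printed command,
# or: eval "$(zipkin_slim_cmd)"
```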

anuraaga commented 11 months ago

@codefromthecrypt I believe you can set the useOpenSsl flag to false to disable tcnative, and it will use the JDK for TLS instead

https://github.com/line/armeria/blob/main/core/src/main/java/com/linecorp/armeria/common/Flags.java#L547
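
One way this might look with the Docker image, as an untested sketch. It assumes two things not confirmed in this thread: that the image's start script passes the `JAVA_OPTS` environment variable to the JVM, and that the Armeria flag is read from the `com.linecorp.armeria.useOpenSsl` system property. The function name `zipkin_jdk_tls_cmd` is hypothetical, and it only prints the command.

```shell
# Assumption: Armeria reads this flag from a JVM system property.
ARMERIA_OPTS="-Dcom.linecorp.armeria.useOpenSsl=false"

# Hypothetical helper: print a docker command that passes the flag through
# JAVA_OPTS (assumed to be honored by the image's start script).
zipkin_jdk_tls_cmd() {
  printf '%s\n' "docker run --rm -e JAVA_OPTS=${ARMERIA_OPTS} -e STORAGE_TYPE=elasticsearch -e ES_HOSTS=http://es.dev:9200 -e ES_INDEX=zipkin openzipkin/zipkin:2.25.1"
}

zipkin_jdk_tls_cmd
# To actually start the container, run the printed command.
```

Whether this avoids the crash depends on the native library not being loaded at all when the flag is off, which nobody in the thread has verified.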

codefromthecrypt commented 11 months ago

I noticed that the version of netty-tcnative-boringssl-static brought in by armeria is one patch behind... testing locally for luck

codefromthecrypt commented 11 months ago

upgrading seems to work, I'll raise a PR

codefromthecrypt commented 11 months ago

netty has a PR on this, but it looks like the build is broken: https://github.com/netty/netty/pull/13724 cc @trustin in case you can help put some 🔥 under it!

codefromthecrypt commented 11 months ago

master no longer crashes, so the next patch out today will fix it in a formal release

$ docker run --rm -e "STORAGE_TYPE=elasticsearch" -e "ES_HOSTS=http://es.dev:9200" -e "ES_INDEX=zipkin" ghcr.io/openzipkin/zipkin:master

codefromthecrypt commented 11 months ago

https://github.com/openzipkin/zipkin/releases/tag/2.25.2