scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

scylla_setup: scylla_kernel_check failed: Please upgrade to a newer kernel version. (iotune coredump) #4392

Closed amoskong closed 5 years ago

amoskong commented 5 years ago

Installation details
Scylla version (or git commit hash): 666.development-0.20190403.0dc0a6025-1
Cluster size: 1
OS (RHEL/CentOS/Ubuntu/AWS AMI): Debian 9, Ubuntu 16.04, Ubuntu 18.04

2019-04-03 04:30:39,573 process          L0333 INFO | Running '/usr/lib/scylla/scylla_setup --nic ens3 --disks /dev/vdb --no-cpuscaling-setup'
2019-04-03 04:30:42,608 process          L0420 DEBUG| [stdout] Please upgrade to a newer kernel version.
2019-04-03 04:30:42,608 process          L0420 DEBUG| [stdout]  see http://www.scylladb.com/kb/kb-fs-not-qualified-aio/ for details
2019-04-03 04:30:42,682 process          L0420 DEBUG| [stderr] Traceback (most recent call last):
2019-04-03 04:30:42,682 process          L0420 DEBUG| [stderr]   File "/usr/lib/scylla/scylla_setup", line 236, in <module>
2019-04-03 04:30:42,683 process          L0420 DEBUG| [stderr]     run('/usr/lib/scylla/scylla_kernel_check')
2019-04-03 04:30:42,683 process          L0420 DEBUG| [stderr]   File "/usr/lib/scylla/scylla_util.py", line 276, in run
2019-04-03 04:30:42,683 process          L0420 DEBUG| [stderr]     return subprocess.check_call(cmd, shell=shell, stdout=stdout, stderr=stderr, env=scylla_env)
2019-04-03 04:30:42,683 process          L0420 DEBUG| [stderr]   File "/usr/lib/python3.5/subprocess.py", line 581, in check_call
2019-04-03 04:30:42,683 process          L0420 DEBUG| [stderr]     raise CalledProcessError(retcode, cmd)
2019-04-03 04:30:42,683 process          L0420 DEBUG| [stderr] subprocess.CalledProcessError: Command '['/usr/lib/scylla/scylla_kernel_check']' returned non-zero exit status 252
2019-04-03 04:30:42,712 stacktrace       L0036 ERROR| 
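
One possible reading of the exit status 252 in the traceback above, offered as an assumption and not something this log confirms: if a Python script passes a negative subprocess return code (for example -4, meaning "killed by signal 4") to sys.exit(), the value is truncated to 8 bits, so -4 becomes 252. The helper below is a minimal sketch; guess_signal is a hypothetical name, and it only lists the signals an 8-bit exit status could correspond to under two common conventions.

```python
import signal


def guess_signal(status: int) -> dict:
    """Map an 8-bit exit status to the signal names it might encode.

    Two conventions are tried (both are guesses about the caller):
      * "shell":   status == 128 + N, as reported by POSIX shells
      * "wrapped": status == 256 - N, what sys.exit(-N) turns into
    """
    guesses = {}
    for label, number in (("shell", status - 128), ("wrapped", 256 - status)):
        try:
            guesses[label] = signal.Signals(number).name
        except ValueError:
            # Not a valid signal number on this platform; skip it.
            pass
    return guesses


if __name__ == "__main__":
    print(guess_signal(252))
```

On Linux, where SIGILL is signal 4, the "wrapped" convention maps 252 to SIGILL, which would be consistent with the "Illegal instruction" crashes reported later in this thread.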

@syuu1228 @roydahan

amoskong commented 5 years ago

apt-get upgrade can only update the kernel to 4.4.0-145-generic.

So I looked up the latest available kernel and installed it manually:

# apt-cache search linux-image
# apt-get install linux-image-4.15.0-47-generic
# uname -a
Linux ubuntu1604 4.15.0-47-generic #50~16.04.1-Ubuntu SMP Fri Mar 15 16:06:21 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

The kernel check still fails after upgrading the Ubuntu 16.04 kernel to 4.15.0-47.

amoskong commented 5 years ago

I found the problem:

root@ubuntu1604:/tmp/new# mkdir /var/tmp/mnt
root@ubuntu1604:/tmp/new# dd if=/dev/zero of=/var/tmp/kernel-check.img bs=1M count=128
root@ubuntu1604:/tmp/new# mkfs.xfs /var/tmp/kernel-check.img
root@ubuntu1604:/tmp/new# sudo mount /var/tmp/kernel-check.img /var/tmp/mnt -o loop
root@ubuntu1604:/tmp/new# sudo iotune --fs-check --evaluation-directory /var/tmp/mnt
Illegal instruction (core dumped)
root@ubuntu1604:/tmp/new# ls /var/crash/
_opt_scylladb_libreloc_ld.so.0.crash
root@ubuntu1604:/tmp/new# apport-unpack /var/crash/_opt_scylladb_libreloc_ld.so.0.crash /tmp/new
root@ubuntu1604:/tmp/new# ls -l /tmp/new
total 2176
-rw-r--r-- 1 root root       5 Apr  3 15:37 Architecture
-rw-r--r-- 1 root root 2154496 Apr  3 15:37 CoreDump
-rw-r--r-- 1 root root      24 Apr  3 15:37 Date
-rw-r--r-- 1 root root      12 Apr  3 15:37 DistroRelease
-rw-r--r-- 1 root root      28 Apr  3 15:37 ExecutablePath
-rw-r--r-- 1 root root      10 Apr  3 15:37 ExecutableTimestamp
-rw-r--r-- 1 root root       1 Apr  3 15:37 _LogindSession
-rw-r--r-- 1 root root       5 Apr  3 15:37 ProblemType
-rw-r--r-- 1 root root     102 Apr  3 15:37 ProcCmdline
-rw-r--r-- 1 root root       5 Apr  3 15:37 ProcCwd
-rw-r--r-- 1 root root     264 Apr  3 15:37 ProcEnviron
-rw-r--r-- 1 root root   14117 Apr  3 15:37 ProcMaps
-rw-r--r-- 1 root root    1288 Apr  3 15:37 ProcStatus
-rw-r--r-- 1 root root       1 Apr  3 15:37 Signal
-rw-r--r-- 1 root root      30 Apr  3 15:37 Uname
-rw-r--r-- 1 root root       4 Apr  3 15:37 UserGroups

I failed to get a backtrace from the core dump; I will update later.
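
The manual steps above can be collected into a small script. This is only a sketch under assumptions: kernel_check_commands and run_step are hypothetical names, the commands are copied from the transcript above, and nothing is executed unless run_step is called explicitly (the real commands need root). The useful part is run_step, which distinguishes a normal non-zero exit from death by signal, since Python's subprocess module reports the latter as a negative return code.

```python
import shlex
import signal
import subprocess


def kernel_check_commands(image="/var/tmp/kernel-check.img",
                          mountpoint="/var/tmp/mnt",
                          size_mb=128):
    """Return the manual reproduction steps as argv lists (nothing is run)."""
    return [
        ["mkdir", mountpoint],
        ["dd", "if=/dev/zero", f"of={image}", "bs=1M", f"count={size_mb}"],
        ["mkfs.xfs", image],
        ["mount", image, mountpoint, "-o", "loop"],
        ["iotune", "--fs-check", "--evaluation-directory", mountpoint],
    ]


def run_step(argv):
    """Run one command and describe how it ended."""
    proc = subprocess.run(argv)
    if proc.returncode < 0:
        # subprocess reports death-by-signal as a negative return code.
        return f"killed by {signal.Signals(-proc.returncode).name}"
    return f"exit status {proc.returncode}"


if __name__ == "__main__":
    # Dry run: only print the commands; call run_step() on each to execute.
    for argv in kernel_check_commands():
        print(shlex.join(argv))
```

With this shape, a crash like the one above would presumably surface as "killed by SIGILL" for the iotune step instead of an opaque non-zero status.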

amoskong commented 5 years ago

scylla --version will also cause a coredump:

root@ubuntu1604:/tmp/new# rm /var/crash/*
root@ubuntu1604:/tmp/new# scylla --version
Illegal instruction (core dumped)
root@ubuntu1604:/tmp/new# ls /var/crash
_opt_scylladb_libreloc_ld.so.0.crash

amoskong commented 5 years ago

iotune coredump:

I have installed the scylla-server-dbg package. Which debug packages need to be installed to decode the backtrace?

root@ubuntu1604:/tmp/new# gdb /opt/scylladb/libexec/iotune.bin CoreDump
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/scylladb/libexec/iotune.bin...BFD: /usr/lib/debug/.build-id/0f/0c145706eb5335fb5e41dc0fab393550b0a222.debug: unable to initialize decompress status for section .debug_aranges
BFD: /usr/lib/debug/.build-id/0f/0c145706eb5335fb5e41dc0fab393550b0a222.debug: unable to initialize decompress status for section .debug_aranges

warning: File "/usr/lib/debug/.build-id/0f/0c145706eb5335fb5e41dc0fab393550b0a222.debug" has no build-id, file skipped
(no debugging symbols found)...done.

warning: core file may not match specified executable file.
[New LWP 5454]
Error while mapping shared library sections:
`/usr/bin/iotune': not in executable format: File format not recognized

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/usr/bin/iotune /opt/scylladb/bin/../libexec/iotune.bin --fs-check --evaluation'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x000000000050a5de in ?? ()
(gdb) bt
#0  0x000000000050a5de in ?? ()
#1  0x000000000050f12e in malloc ()
#2  0x00007fd1db8f2a8a in ?? () from /opt/scylladb/bin/../libreloc/libstdc++.so.6
#3  0x00007fd1dc0c2e0a in ?? ()
#4  0x000000000000000e in ?? ()
#5  0x0000000000000004 in ?? ()
#6  0x00007ffc07c19bb0 in ?? ()
#7  0x00007ffc07c19bd8 in ?? ()
#8  0x00007fd1dc0df180 in ?? ()
#9  0x00007fd1dc0c2f0a in ?? ()
#10 0x0000000000000000 in ?? ()

amoskong commented 5 years ago

scylla --version coredump:

scylla-test@amos-ubuntu16:~/ubuntu16-crash$ gdb /opt/scylladb/libexec/scylla.bin CoreDump
GNU gdb (Ubuntu 8.2-0ubuntu1~16.04.1) 8.2
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/scylladb/libexec/scylla.bin...Reading symbols from /usr/lib/debug/.build-id/68/f5c63a2e409bb94fd894e5fd9845baab65a243.debug...done.
done.

warning: core file may not match specified executable file.
warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Loadable section ".note.gnu.property" outside of ELF segments

warning: Unable to find libthread_db matching inferior's thread library, thread debugging will not be available.
Core was generated by `/usr/bin/scylla /opt/scylladb/bin/../libexec/scylla.bin --version'.
Program terminated with signal SIGILL, Illegal instruction.
#0  0x0000000004080013 in seastar::memory::allocate (size=37) at ../../src/core/memory.cc:1254
1254    ../../src/core/memory.cc: No such file or directory.
(gdb) bt
#0  0x0000000004080013 in seastar::memory::allocate (size=37) at ../../src/core/memory.cc:1254
#1  malloc (n=37) at ../../src/core/memory.cc:1504
#2  0x00007f509b2e2160 in set_binding_values.part () from /opt/scylladb/bin/../libreloc/libc.so.6
#3  0x00007f509b2e2405 in bindtextdomain () from /opt/scylladb/bin/../libreloc/libc.so.6
#4  0x00007f509acdc22b in ?? () from /opt/scylladb/bin/../libreloc/libgpg-error.so.0
#5  0x00007f509c9b8e0a in ?? ()
#6  0x000000000000002e in ?? ()
#7  0x0000000000000002 in ?? ()
#8  0x00007ffce4be2bb0 in ?? ()
#9  0x00007ffce4be2bc8 in ?? ()
#10 0x00007f509c9d5180 in ?? ()
#11 0x00007f509c9b8f0a in ?? ()
#12 0x0000000000000000 in ?? ()
(gdb)

amoskong commented 5 years ago

/CC @glommer @avikivity

amoskong commented 5 years ago

ubuntu16-crash-scylla.tar.gz ubuntu16-crash-iotune.tar.gz

$ ls ubuntu16-crash-iotune
Architecture        DistroRelease       ProblemType     ProcEnviron     Signal          _LogindSession
CoreDump        ExecutablePath      ProcCmdline     ProcMaps        Uname
Date            ExecutableTimestamp ProcCwd         ProcStatus      UserGroups

$ ls ubuntu16-crash-scylla
Architecture        DistroRelease       ProblemType     ProcEnviron     Signal          _LogindSession
CoreDump        ExecutablePath      ProcCmdline     ProcMaps        Uname
Date            ExecutableTimestamp ProcCwd         ProcStatus      UserGroups

tzach commented 5 years ago

@amoskong does this issue happen on 3.0 as well?

amoskong commented 5 years ago

This issue doesn't exist in the latest 3.0.

avikivity commented 5 years ago

These kernels should be good enough for fsqual, so something else happened.

avikivity commented 5 years ago

I was unable to reproduce (building seastar's iotune in dbuild and running it on an xfs filesystem). Can you provide access to a scylla.deb that fails?

amoskong commented 5 years ago

The VM has already been recovered for new testing. I will reproduce the problem on a GCE instance and send you the IP soon.

avikivity commented 5 years ago

It's enough for me to get the .deb you used.

amoskong commented 5 years ago

I installed Scylla from https://s3.amazonaws.com/downloads.scylladb.com/deb/unstable/stable/master/127/scylladb/scylla.list

amoskong commented 5 years ago

I failed to reproduce this issue (the core dump) on a new GCE instance. However, I can still reproduce both the scylla --version core dump and the iotune core dump 100% of the time on the Ubuntu 16.04 VM used by artifact-test.

I will send you the VM details (IP/password) over Slack.

syuu1228 commented 5 years ago

I was not able to reproduce this on my local Ubuntu 18.04 bare-metal environment or in an Ubuntu 16.04 Docker instance. It likely occurs only on the VM that artifact-test uses, not in all environments.

root@3af6f67f5074:/etc/apt/sources.list.d# scylla_kernel_check
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following additional packages will be installed:
  libreadline5
Suggested packages:
  xfsdump acl attr quota
The following NEW packages will be installed:
  libreadline5 xfsprogs
0 upgraded, 2 newly installed, 0 to remove and 6 not upgraded.
Need to get 696 kB of archives.
After this operation, 3744 kB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu xenial/main amd64 libreadline5 amd64 5.2+dfsg-3build1 [99.5 kB]
Get:2 http://archive.ubuntu.com/ubuntu xenial-updates/main amd64 xfsprogs amd64 4.3.0+nmu1ubuntu1.1 [597 kB]
Fetched 696 kB in 1s (364 kB/s)
debconf: delaying package configuration, since apt-utils is not installed
Selecting previously unselected package libreadline5:amd64.
(Reading database ... 13356 files and directories currently installed.)
Preparing to unpack .../libreadline5_5.2+dfsg-3build1_amd64.deb ...
Unpacking libreadline5:amd64 (5.2+dfsg-3build1) ...
Selecting previously unselected package xfsprogs.
Preparing to unpack .../xfsprogs_4.3.0+nmu1ubuntu1.1_amd64.deb ...
Unpacking xfsprogs (4.3.0+nmu1ubuntu1.1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Setting up libreadline5:amd64 (5.2+dfsg-3build1) ...
Setting up xfsprogs (4.3.0+nmu1ubuntu1.1) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
WARN  2019-04-05 08:27:42,856 [shard 0] iotune - Available space on filesystem at /var/tmp/mnt: 124 MB: is less than recommended: 10 GB
INFO  2019-04-05 08:27:42,856 [shard 0] iotune - /var/tmp/mnt passed sanity checks
This is a supported kernel version.
root@3af6f67f5074:/etc/apt/sources.list.d#

syuu1228 commented 5 years ago

I was not able to reproduce this on an Ubuntu 16.04 VM on VirtualBox (via Vagrant).

$ sudo scylla_kernel_check
WARN  2019-04-05 08:44:32,603 [shard 0] iotune - Available space on filesystem at /var/tmp/mnt: 124 MB: is less than recommended: 10 GB
INFO  2019-04-05 08:44:32,604 [shard 0] iotune - /var/tmp/mnt passed sanity checks
This is a supported kernel version.

syuu1228 commented 5 years ago

@amoskong is there a way to launch the artifacts VM locally to reproduce the problem? I can see the error in the Jenkins log, but I can't find a way to reproduce it.

amoskong commented 5 years ago

Avi already reproduced the problem on the artifact-test VM (Ubuntu 16.04), but I don't know whether he found the root cause.

I can prepare a single Ubuntu 16.04 VM for you on Monday.

avikivity commented 5 years ago

The problem is that scylla gets compiled with new instructions (I saw a BMI2 instruction, but others are also present), so running on an older machine fails. The idea was to build with -march=westmere, but this was probably lost in the Great Cmake Translation.

@hakuch any ideas? ./configure.py --enable-dpdk was supposed to limit dpdk to westmere, but likely this got lost, and dpdk added flags to build for the host machine. Once we ran a test on an older machine, the problem showed up.
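
A quick way to check whether a given machine advertises one of the newer instruction sets is to look at the flags line of /proc/cpuinfo. The sketch below assumes x86 Linux; cpu_flags and missing_features are hypothetical names, and the default list contains only bmi2 because that is the one instruction set named above. Since other new instructions were also present, an empty result from this check would not prove the binary is safe to run.

```python
def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the feature flags from /proc/cpuinfo-style text (x86 Linux)."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def missing_features(cpuinfo_text: str, required=("bmi2",)) -> list:
    """Return the required features the CPU does not advertise, sorted."""
    return sorted(set(required) - cpu_flags(cpuinfo_text))
```

On a real machine this could be called as missing_features(open("/proc/cpuinfo").read()); a result of ["bmi2"] would suggest the host is older than what the binary was built for.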

avikivity commented 5 years ago

Here's a stanza from a seastar build.ninja:

build CMakeFiles/seastar.dir/src/core/prometheus.cc.o: CXX_COMPILER__seastar ../../src/core/prometheus.cc || cmake_object_order_depends_target_seastar
  DEFINES = -DFMT_SHARED -DSEASTAR_HAS_MEMBARRIER -DSEASTAR_HAVE_ASAN_FIBER_SUPPORT -DSEASTAR_HAVE_DPDK -DSEASTAR_HAVE_GCC6_CONCEPTS -DSEASTAR_HAVE_HWLOC -DSEASTAR_HAVE_LZ4_COMPRESS_DEFAULT -DSEASTAR_HAVE_NUMA -DSEASTAR_TYPE_ERASE_MORE -DSEASTAR_USE_STD_OPTIONAL_VARIANT_STRINGVIEW
  DEP_FILE = CMakeFiles/seastar.dir/src/core/prometheus.cc.o.d
  FLAGS = -O1   -std=gnu++17 -U_FORTIFY_SOURCE -fvisibility=hidden -UNDEBUG -Wall -Werror -Wno-error=deprecated-declarations -gz -Wno-error -march=westmere -fconcepts -march=native
  INCLUDES = -I../../include -Igen/include -I../../src -Igen/src -isystem _cooking/installed/include/dpdk
  OBJECT_DIR = CMakeFiles/seastar.dir
  OBJECT_FILE_DIR = CMakeFiles/seastar.dir/src/core

-march=native overrode -march=westmere, so the executables fail on any machine older than the build machine.
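
To spot this kind of override in a generated build.ninja, one can scan a FLAGS line for repeated -march options. The sketch below assumes the behaviour described above, namely that the last -march on the command line is the one that takes effect; march_options and effective_march are hypothetical helper names, and the FLAGS string is copied from the stanza above.

```python
import shlex


def march_options(flags: str) -> list:
    """Return every -march value on a compiler command line, in order."""
    prefix = "-march="
    return [tok[len(prefix):] for tok in shlex.split(flags)
            if tok.startswith(prefix)]


def effective_march(flags: str):
    """Return the -march value that takes effect (the last one), or None."""
    values = march_options(flags)
    return values[-1] if values else None


FLAGS = ("-O1   -std=gnu++17 -U_FORTIFY_SOURCE -fvisibility=hidden -UNDEBUG "
         "-Wall -Werror -Wno-error=deprecated-declarations -gz -Wno-error "
         "-march=westmere -fconcepts -march=native")

if __name__ == "__main__":
    print(march_options(FLAGS), effective_march(FLAGS))
```

A build check could flag any FLAGS line where march_options returns more than one distinct value, or where the effective value is native.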

hakuch commented 5 years ago

@avikivity, I see the issue. Finddpdk.cmake transitively applies -march=native. I'll change it to -march=westmere.

hakuch commented 5 years ago

I opened https://github.com/scylladb/seastar/issues/630 and sent a patch to the Seastar mailing list.

avikivity commented 5 years ago

Fixed by 704600f829ab34b3a290acc3c66798d0abc1fabe.