Open kolyshkin opened 7 years ago
I tested this couple of times and did not see this crash. Is this crash reproducible for you? If so, could you tell specific instructions on how to reproduce it? Were you able to collect a core dump from the crash?
What I did was running docker-stress test a few times (with no failures), like this:
./docker-stress --containers 50 -c 200 -k 3s
./docker-stress --containers 500 -c 200 -k 3s
./docker-stress -c 20 -k 3s
and then I tried building docker daemon, probably something like this:
go get github.com/moby/moby
cd $GOPATH/src/github.com/moby/moby
make test-integration-cli
The coredump is eaten by apport, there is nothing in /var/crash :(
I will try to reproduce the issue next week.
It was pretty easy to reproduce. Here are the steps on a digital ocean droplet running Ubuntu 16.04:
On the first terminal:
dd if=/dev/zero of=/dev/sda bs=1K count=10
cd git/lcsfs/lcfs
./lcfs daemon /dev/sda /var/lib/docker /lcfs -f -m -r -t -p -v
On a second terminal:
dockerd -s vfs --experimental
On a third terminal:
cd git/lcfs/plugin
docker build -t rootfsimage .
mkdir rootfs
id=$(docker create rootfsimage)
docker export "$id" | tar -x -C rootfs/
docker plugin create portworx/lcfs .
docker plugin enable portworx/lcfs
rm -rf rootfs
docker rm -vf "$id"
docker rmi rootfsimage
docker plugin ls
second terminal again:
^C
dockerd -s portworx/lcfs --experimental
third terminal again:
cd go/src/github.com/docker/docker/
make test-integration-cli
and it fails at the same step as last time:
Step 43/43 : COPY . /go/src/github.com/docker/docker
---> bbf45185d1b9
Successfully built bbf45185d1b9
open /var/lib/docker/image/portworx/lcfs/.tmp-repositories.json827429382: no such file or directory
Here's gdb trace:
root@ubuntu-2gb-nyc1-01:~/git/lcfs/lcfs# gdb lcfs /tmp/cores/core.lcfs.18821.ubuntu-2gb-nyc1-01.1503749695
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from lcfs...done.
[New LWP 14805]
[New LWP 18821]
[New LWP 18822]
[New LWP 18827]
[New LWP 18829]
[New LWP 18830]
[New LWP 14802]
[New LWP 14804]
[New LWP 14803]
[New LWP 14792]
[New LWP 14793]
[New LWP 14797]
[New LWP 14808]
[New LWP 19047]
[New LWP 14807]
[New LWP 18825]
[New LWP 19046]
[New LWP 18826]
[New LWP 27151]
[New LWP 18998]
[New LWP 7853]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./lcfs daemon /dev/sda /var/lib/docker /lcfs -f -m -r -t -p -v'.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f0c0005a428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f0bf834a700 (LWP 14805))]
(gdb) bt full
#0 0x00007f0c0005a428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
resultvar = 0
pid = 18821
selftid = 14805
#1 0x00007f0c0005c02a in __GI_abort () at abort.c:89
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x4, sa_sigaction = 0x4}, sa_mask = {__val = {0, 4051324750011852341, 139689385563952, 0, 0, 0, 0, 21474836480, 139689518111616, 139689385564104,
139689385563904, 139689518125584, 139689518139232, 0, 139689518125584, 4386328}}, sa_flags = 22167552, sa_restorer = 0x42ee18}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007f0c00052bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x42ee18 "zfs->fs_super->sb_flags & LC_SUPER_INIT", file=file@entry=0x42eaf5 "fs.c",
line=line@entry=308, function=function@entry=0x42f2a0 <__PRETTY_FUNCTION__.7725> "lc_removeLayers") at assert.c:92
str = 0xf695f70 ""
total = 4096
#3 0x00007f0c00052c82 in __GI___assert_fail (assertion=0x42ee18 "zfs->fs_super->sb_flags & LC_SUPER_INIT", file=0x42eaf5 "fs.c", line=308,
function=0x42f2a0 <__PRETTY_FUNCTION__.7725> "lc_removeLayers") at assert.c:101
No locals.
#4 0x000000000041097e in lc_removeLayers (gfs=0x24e8000, fs=0x25f8d80, gindex=98) at fs.c:308
zfs = 0x6c3db00
__PRETTY_FUNCTION__ = "lc_removeLayers"
#5 0x0000000000410cce in lc_getLayerForRemoval (gfs=0x24e8000, root=27584547717764810, fsp=0x7f0bf8349a08) at fs.c:368
ino = 120522
gindex = 98
fs = 0x25f8d80
__PRETTY_FUNCTION__ = "lc_getLayerForRemoval"
__func__ = "lc_getLayerForRemoval"
#6 0x0000000000426c28 in lc_removeRoot (rfs=0x25f8000, dir=0x2aeea80, ino=120522, rmdir=true, fsp=0x7f0bf8349a08) at layer.c:222
root = 27584547717764810
#7 0x0000000000419ed9 in lc_dirRemoveName (fs=0x25f8000, dir=0x2aeea80, name=0x7f0bf8349a70 "e557391afb5dd28f1545d6797488714f17c2536576b3a1d65f42e298515a1177", rmdir=true, fsp=0x7f0bf8349a08,
layer=true) at dir.c:579
ino = 120522
parent = 1023
dirent = 0x1bc74450
prev = 0x4e61860
gfs = 0x24e8000
len = 64
err = -130770480
rfs = 0x24e8000
__PRETTY_FUNCTION__ = "lc_dirRemoveName"
__func__ = "lc_dirRemoveName"
#8 0x0000000000426e13 in lc_deleteLayer (req=0x267b780, gfs=0x24e8000, name=0x7f0bf8349a70 "e557391afb5dd28f1545d6797488714f17c2536576b3a1d65f42e298515a1177") at layer.c:264
fs = 0x0
rfs = 0x25f8000
bfs = 0x0
zfs = 0x0
extents = 0x0
pdir = 0x2aeea80
start = {tv_sec = 1503749695, tv_usec = 866516}
err = 0
root = 16877
---Type <return> to continue, or q <return> to quit---
__func__ = "lc_deleteLayer"
#9 0x000000000040a0b7 in lc_ioctl (req=0x267b780, ino=1023, cmd=1077936231, arg=0xc82012bc00, fi=0x7f0bf8349b80, flags=16, in_buf=0x1f4e6048, in_bufsz=64, out_bufsz=0) at fops.c:1530
name = "e557391afb5dd28f1545d6797488714f17c2536576b3a1d65f42e298515a1177"
layer = 0x1 <error: Cannot access memory at address 0x1>
parent = 0x0
gfs = 0x24e8000
value = 34621422135551202
len = 140514
op = 103
__PRETTY_FUNCTION__ = "lc_ioctl"
__func__ = "lc_ioctl"
#10 0x00007f0c00835f32 in do_ioctl (req=<optimized out>, nodeid=<optimized out>, inarg=<optimized out>) at fuse_lowlevel.c:1758
flags = <optimized out>
in_buf = <optimized out>
fi = {flags = 0, writepage = 0, direct_io = 0, keep_cache = 0, flush = 0, nonseekable = 0, flock_release = 0, padding = 0, fh = 45017728, lock_owner = 0, poll_events = 0}
#11 0x00007f0c00838cd1 in fuse_session_process_buf_int (se=0x24f4380, buf=0xf0b8840, ch=<optimized out>) at fuse_lowlevel.c:2546
bufv = {count = 1, idx = 0, off = 0, buf = {{size = 136, flags = (unknown: 0), mem = 0x1f4e6000, fd = 26, pos = 0}}}
tmpbuf = {count = 1, idx = 0, off = 0, buf = {{size = 80, flags = (unknown: 0), mem = 0x0, fd = -1, pos = 0}}}
in = 0x1f4e6000
inarg = 0x1f4e6028
req = 0x267b780
mbuf = 0x0
err = <optimized out>
res = <optimized out>
#12 0x00007f0c00834371 in fuse_do_work (data=0xf0b8820) at fuse_loop_mt.c:159
isforget = 0
res = 136
w = 0xf0b8820
mt = 0x7ffed2a9bde0
#13 0x00007f0c003f66ba in start_thread (arg=0x7f0bf834a700) at pthread_create.c:333
__res = <optimized out>
pd = 0x7f0bf834a700
now = <optimized out>
unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139689385568000, 1383958741476702548, 0, 139689410742991, 8388608, 40172064, -1359173059717367468, -1356481400723080876}, mask_was_saved = 0}},
priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
not_first_call = <optimized out>
pagesize_m1 = <optimized out>
sp = <optimized out>
freesize = <optimized out>
__PRETTY_FUNCTION__ = "start_thread"
#14 0x00007f0c0012c3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb)
Note I'm using bleeding edge docker build:
# grep docker /etc/apt/sources.list
deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable edge test
# deb-src [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable edge test
root@ubuntu-2gb-nyc1-01:~/go/src/github.com/docker/docker# dpkg -l docker-ce
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name Version Architecture Description
+++-===========================================-==========================-==========================-============================================================================================
ii docker-ce 17.07.0~ce~rc4-0~ubuntu amd64 Docker: the open-source application container engine
root@ubuntu-2gb-nyc1-01:~/go/src/github.com/docker/docker# docker --version
Docker version 17.07.0-ce-rc4, build fe143e3
Let me know if there's anything I can help with
I could not reproduce the crash on my system with the version of docker I have installed on it. Let me try with the latest version of docker. The assertion in LCFS is fired because a read-write layer does not have a parent init layer, as expected. Normally, Docker creates an init layer before creating a read-write root image layer for a container.
By the way, just curious why you are not using our installation scripts for setting up lcfs. That is a single line command which will set up docker with lcfs and you don't have to run individual commands manually to get going.
Seems like I have a similar problem. lcfs seems to work for about 200 layers and then constantly crashes with a segfault. SMP machine with 4 sockets and 64 threads .. waybe a threading issue ?
I ran latest LCFS (git commit 99b157db0ab731d) and got into this while running "make" on docker daemon sources
This is what I saw on the client:
and here are dockerd logs:
Please let me know if there's anything else I can provide.