portworx / lcfs

LCFS Graph driver for Docker
Apache License 2.0

lcfs crash #47

Open kolyshkin opened 7 years ago

kolyshkin commented 7 years ago

I ran the latest LCFS (git commit 99b157db0ab731d) and ran into this crash while running "make" on the docker daemon sources:

# ./lcfs daemon /dev/sda /var/lib/docker /lcfs -f -m -r -t -p -v
.....
lcfs: fs.c:308: lc_removeLayers: Assertion `zfs->fs_super->sb_flags & LC_SUPER_INIT' failed.
Aborted (core dumped)

This is what I saw on the client:

Step 42/43 : ENTRYPOINT hack/dind
 ---> Running in 137aeada4f4d
 ---> 98b9d7df394f
Removing intermediate container 137aeada4f4d
Step 43/43 : COPY . /go/src/github.com/docker/docker
 ---> ba3aea4d5aa7
Successfully built ba3aea4d5aa7
open /var/lib/docker/image/portworx/lcfs/.tmp-repositories.json026490387: no such file or directory
Makefile:113: recipe for target 'build' failed
make: *** [build] Error 1

and here are the dockerd logs:

INFO[9097] Layer sha256:d173fd590788d5c19e591594e8d6f28524d41626bbfb4d717228dc50db6b5674 cleaned up 
INFO[9104] time="2017-08-25T14:58:09Z" level=error msg="err software caused connection abort\n"   plugin=de91e35b7c8c14d0bb3190c519dd10eb0d5a5b88b4e78f91f635d4367c52f5b0
ERRO[9104] Error removing mounted layer 3ff6814c3cb6dc6e75732aa9738d29fd62bd7cc182b07f9050c5905a6a4d73a5: GraphDriver.Remove: software caused connection abort 
ERRO[9104] Failed to release RWLayer: GraphDriver.Remove: software caused connection abort 
ERRO[9104] failed to unmount previous build image sha256:2baf993621acd4dc745c97ff3127609dbf68ec73c9fedb8278fa029763d177f1: GraphDriver.Remove: software caused connection abort 
INFO[9104] time="2017-08-25T14:58:09Z" level=error msg="err transport endpoint is not connected\n"   plugin=de91e35b7c8c14d0bb3190c519dd10eb0d5a5b88b4e78f91f635d4367c52f5b0
ERRO[9104] Failed to unmount RWLayer: GraphDriver.Put: transport endpoint is not connected 
ERRO[9104] failed to unmount previous build image sha256:e1573ce8644fc2781917e5edf824b154f1492476f202b0f35340dc7119ad743f: GraphDriver.Put: transport endpoint is not connected 
INFO[9104] time="2017-08-25T14:58:09Z" level=error msg="err transport endpoint is not connected\n"   plugin=de91e35b7c8c14d0bb3190c519dd10eb0d5a5b88b4e78f91f635d4367c52f5b0
ERRO[9104] Failed to unmount RWLayer: GraphDriver.Put: transport endpoint is not connected 
ERRO[9104] failed to unmount previous build image : GraphDriver.Put: transport endpoint is not connected 
INFO[9104] time="2017-08-25T14:58:09Z" level=error msg="err transport endpoint is not connected\n"   plugin=de91e35b7c8c14d0bb3190c519dd10eb0d5a5b88b4e78f91f635d4367c52f5b0
ERRO[9104] Failed to unmount RWLayer: GraphDriver.Put: transport endpoint is not connected 
ERRO[9104] failed to unmount previous build image sha256:bdf6818dde703755f0a724e9a72579529fa66cdde79b1a5a9037184894dd6108: GraphDriver.Put: transport endpoint is not connected 
INFO[9104] time="2017-08-25T14:58:09Z" level=error msg="err transport endpoint is not connected\n"   plugin=de91e35b7c8c14d0bb3190c519dd10eb0d5a5b88b4e78f91f635d4367c52f5b0
ERRO[9104] Failed to unmount RWLayer: GraphDriver.Put: transport endpoint is not connected 
ERRO[9104] failed to unmount previous build image sha256:98b9d7df394f644af4a772e74a710f196aacd92806756881f797b2a98b52cb2e: GraphDriver.Put: transport endpoint is not connected 

Please let me know if there's anything else I can provide.

jjobi commented 7 years ago

I tested this a couple of times and did not see the crash. Is this crash reproducible for you? If so, could you give specific instructions on how to reproduce it? Were you able to collect a core dump from the crash?

kolyshkin commented 7 years ago

What I did was run the docker-stress test a few times (with no failures), like this:

./docker-stress --containers 50 -c 200 -k 3s
./docker-stress --containers 500 -c 200 -k 3s
./docker-stress -c 20 -k 3s

and then I tried building the docker daemon, probably something like this:

go get github.com/moby/moby
cd $GOPATH/src/github.com/moby/moby
make test-integration-cli

The core dump was eaten by apport; there is nothing in /var/crash :(
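
(For anyone else hitting this, one way to keep apport from swallowing the dump is to point kernel.core_pattern somewhere else. The /tmp/cores path below is an assumption on my part, though it matches the core file name that shows up later in this thread.)

mkdir -p /tmp/cores
sysctl -w kernel.core_pattern=/tmp/cores/core.%e.%p.%h.%t   # %e=executable, %p=pid, %h=hostname, %t=timestamp
ulimit -c unlimited   # in the shell that starts lcfs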

I will try to reproduce the issue next week.

kolyshkin commented 7 years ago

It was pretty easy to reproduce. Here are the steps, on a DigitalOcean droplet running Ubuntu 16.04:

On the first terminal:

dd if=/dev/zero of=/dev/sda bs=1K count=10    # wipe any previous superblock so lcfs formats the device fresh
cd git/lcfs/lcfs
./lcfs daemon /dev/sda /var/lib/docker /lcfs -f -m -r -t -p -v

On a second terminal:

dockerd -s vfs --experimental

On a third terminal:

cd git/lcfs/plugin
docker build -t rootfsimage .
mkdir rootfs
id=$(docker create rootfsimage)
docker export "$id" | tar -x -C rootfs/
docker plugin create portworx/lcfs .
docker plugin enable portworx/lcfs
rm -rf rootfs
docker rm -vf "$id"
docker rmi rootfsimage
docker plugin ls

On the second terminal again:

^C
dockerd -s portworx/lcfs --experimental

On the third terminal again:

cd go/src/github.com/docker/docker/
make test-integration-cli

and it fails at the same step as last time:

Step 43/43 : COPY . /go/src/github.com/docker/docker
 ---> bbf45185d1b9
Successfully built bbf45185d1b9
open /var/lib/docker/image/portworx/lcfs/.tmp-repositories.json827429382: no such file or directory

Here's the gdb trace:

root@ubuntu-2gb-nyc1-01:~/git/lcfs/lcfs# gdb lcfs /tmp/cores/core.lcfs.18821.ubuntu-2gb-nyc1-01.1503749695
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.5) 7.11.1
Copyright (C) 2016 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from lcfs...done.
[New LWP 14805]
[New LWP 18821]
[New LWP 18822]
[New LWP 18827]
[New LWP 18829]
[New LWP 18830]
[New LWP 14802]
[New LWP 14804]
[New LWP 14803]
[New LWP 14792]
[New LWP 14793]
[New LWP 14797]
[New LWP 14808]
[New LWP 19047]
[New LWP 14807]
[New LWP 18825]
[New LWP 19046]
[New LWP 18826]
[New LWP 27151]
[New LWP 18998]
[New LWP 7853]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `./lcfs daemon /dev/sda /var/lib/docker /lcfs -f -m -r -t -p -v'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007f0c0005a428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54  ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
[Current thread is 1 (Thread 0x7f0bf834a700 (LWP 14805))]
(gdb) bt full
#0  0x00007f0c0005a428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
        resultvar = 0
        pid = 18821
        selftid = 14805
#1  0x00007f0c0005c02a in __GI_abort () at abort.c:89
        save_stage = 2
        act = {__sigaction_handler = {sa_handler = 0x4, sa_sigaction = 0x4}, sa_mask = {__val = {0, 4051324750011852341, 139689385563952, 0, 0, 0, 0, 21474836480, 139689518111616, 139689385564104, 
              139689385563904, 139689518125584, 139689518139232, 0, 139689518125584, 4386328}}, sa_flags = 22167552, sa_restorer = 0x42ee18}
        sigs = {__val = {32, 0 <repeats 15 times>}}
#2  0x00007f0c00052bd7 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x42ee18 "zfs->fs_super->sb_flags & LC_SUPER_INIT", file=file@entry=0x42eaf5 "fs.c", 
    line=line@entry=308, function=function@entry=0x42f2a0 <__PRETTY_FUNCTION__.7725> "lc_removeLayers") at assert.c:92
        str = 0xf695f70 ""
        total = 4096
#3  0x00007f0c00052c82 in __GI___assert_fail (assertion=0x42ee18 "zfs->fs_super->sb_flags & LC_SUPER_INIT", file=0x42eaf5 "fs.c", line=308, 
    function=0x42f2a0 <__PRETTY_FUNCTION__.7725> "lc_removeLayers") at assert.c:101
No locals.
#4  0x000000000041097e in lc_removeLayers (gfs=0x24e8000, fs=0x25f8d80, gindex=98) at fs.c:308
        zfs = 0x6c3db00
        __PRETTY_FUNCTION__ = "lc_removeLayers"
#5  0x0000000000410cce in lc_getLayerForRemoval (gfs=0x24e8000, root=27584547717764810, fsp=0x7f0bf8349a08) at fs.c:368
        ino = 120522
        gindex = 98
        fs = 0x25f8d80
        __PRETTY_FUNCTION__ = "lc_getLayerForRemoval"
        __func__ = "lc_getLayerForRemoval"
#6  0x0000000000426c28 in lc_removeRoot (rfs=0x25f8000, dir=0x2aeea80, ino=120522, rmdir=true, fsp=0x7f0bf8349a08) at layer.c:222
        root = 27584547717764810
#7  0x0000000000419ed9 in lc_dirRemoveName (fs=0x25f8000, dir=0x2aeea80, name=0x7f0bf8349a70 "e557391afb5dd28f1545d6797488714f17c2536576b3a1d65f42e298515a1177", rmdir=true, fsp=0x7f0bf8349a08, 
    layer=true) at dir.c:579
        ino = 120522
        parent = 1023
        dirent = 0x1bc74450
        prev = 0x4e61860
        gfs = 0x24e8000
        len = 64
        err = -130770480
        rfs = 0x24e8000
        __PRETTY_FUNCTION__ = "lc_dirRemoveName"
        __func__ = "lc_dirRemoveName"
#8  0x0000000000426e13 in lc_deleteLayer (req=0x267b780, gfs=0x24e8000, name=0x7f0bf8349a70 "e557391afb5dd28f1545d6797488714f17c2536576b3a1d65f42e298515a1177") at layer.c:264
        fs = 0x0
        rfs = 0x25f8000
        bfs = 0x0
        zfs = 0x0
        extents = 0x0
        pdir = 0x2aeea80
        start = {tv_sec = 1503749695, tv_usec = 866516}
        err = 0
        root = 16877
        __func__ = "lc_deleteLayer"
#9  0x000000000040a0b7 in lc_ioctl (req=0x267b780, ino=1023, cmd=1077936231, arg=0xc82012bc00, fi=0x7f0bf8349b80, flags=16, in_buf=0x1f4e6048, in_bufsz=64, out_bufsz=0) at fops.c:1530
        name = "e557391afb5dd28f1545d6797488714f17c2536576b3a1d65f42e298515a1177"
        layer = 0x1 <error: Cannot access memory at address 0x1>
        parent = 0x0
        gfs = 0x24e8000
        value = 34621422135551202
        len = 140514
        op = 103
        __PRETTY_FUNCTION__ = "lc_ioctl"
        __func__ = "lc_ioctl"
#10 0x00007f0c00835f32 in do_ioctl (req=<optimized out>, nodeid=<optimized out>, inarg=<optimized out>) at fuse_lowlevel.c:1758
        flags = <optimized out>
        in_buf = <optimized out>
        fi = {flags = 0, writepage = 0, direct_io = 0, keep_cache = 0, flush = 0, nonseekable = 0, flock_release = 0, padding = 0, fh = 45017728, lock_owner = 0, poll_events = 0}
#11 0x00007f0c00838cd1 in fuse_session_process_buf_int (se=0x24f4380, buf=0xf0b8840, ch=<optimized out>) at fuse_lowlevel.c:2546
        bufv = {count = 1, idx = 0, off = 0, buf = {{size = 136, flags = (unknown: 0), mem = 0x1f4e6000, fd = 26, pos = 0}}}
        tmpbuf = {count = 1, idx = 0, off = 0, buf = {{size = 80, flags = (unknown: 0), mem = 0x0, fd = -1, pos = 0}}}
        in = 0x1f4e6000
        inarg = 0x1f4e6028
        req = 0x267b780
        mbuf = 0x0
        err = <optimized out>
        res = <optimized out>
#12 0x00007f0c00834371 in fuse_do_work (data=0xf0b8820) at fuse_loop_mt.c:159
        isforget = 0
        res = 136
        w = 0xf0b8820
        mt = 0x7ffed2a9bde0
#13 0x00007f0c003f66ba in start_thread (arg=0x7f0bf834a700) at pthread_create.c:333
        __res = <optimized out>
        pd = 0x7f0bf834a700
        now = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {139689385568000, 1383958741476702548, 0, 139689410742991, 8388608, 40172064, -1359173059717367468, -1356481400723080876}, mask_was_saved = 0}}, 
          priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
        pagesize_m1 = <optimized out>
        sp = <optimized out>
        freesize = <optimized out>
        __PRETTY_FUNCTION__ = "start_thread"
#14 0x00007f0c0012c3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
No locals.
(gdb) 

kolyshkin commented 7 years ago

Note I'm using a bleeding-edge docker build:

# grep docker /etc/apt/sources.list
deb [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable edge test
# deb-src [arch=amd64] https://download.docker.com/linux/ubuntu xenial stable edge test
root@ubuntu-2gb-nyc1-01:~/go/src/github.com/docker/docker# dpkg -l docker-ce
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                        Version                    Architecture               Description
+++-===========================================-==========================-==========================-============================================================================================
ii  docker-ce                                   17.07.0~ce~rc4-0~ubuntu    amd64                      Docker: the open-source application container engine
root@ubuntu-2gb-nyc1-01:~/go/src/github.com/docker/docker# docker --version
Docker version 17.07.0-ce-rc4, build fe143e3

kolyshkin commented 7 years ago

Let me know if there's anything I can help with.

jjobi commented 7 years ago

I could not reproduce the crash on my system with the version of docker I have installed on it. Let me try with the latest version of docker. The assertion in LCFS fires because a read-write layer does not have the parent init layer it expects. Normally, Docker creates an init layer before creating a read-write root image layer for a container.
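
Roughly, the sequence is: for every container, the layer store asks the graph driver to create an "<id>-init" layer first and then the container's read-write "<id>" layer on top of it. As an illustration only (made-up IDs, and assuming the vfs driver from the repro steps above, where each layer is a plain directory), you would see pairs like:

ls /var/lib/docker/vfs/dir/
f0e1d2c3b4a5...        <- container read-write layer
f0e1d2c3b4a5...-init   <- its init layer, created first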

By the way, just curious why you are not using our installation scripts for setting up lcfs. That is a single-line command which sets up docker with lcfs, so you don't have to run individual commands manually to get going.

dulom commented 6 years ago

Seems like I have a similar problem. lcfs works for about 200 layers and then consistently crashes with a segfault. This is an SMP machine with 4 sockets and 64 threads... maybe a threading issue?
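
If it is a race, one way to chase it would be rebuilding lcfs with ThreadSanitizer and re-running the workload. A sketch only; whether the lcfs Makefile honors CFLAGS/LDFLAGS overrides like this is an assumption:

make clean
make CFLAGS="-g -O1 -fsanitize=thread" LDFLAGS="-fsanitize=thread"   # instrument for data-race detection
./lcfs daemon /dev/sda /var/lib/docker /lcfs -f                      # run in the foreground and watch for TSan reports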