vmware-archive / vsphere-storage-for-docker

vSphere Storage for Docker
https://vmware.github.io/vsphere-storage-for-docker
Apache License 2.0
251 stars 95 forks source link

[vFile] file server service spends one minute to start #1954

Closed luomiao closed 6 years ago

luomiao commented 6 years ago

After file server service is created, the time for the service state from Starting to Running is approximate 1 minute. This was not the case in previous testing. Some new changes (might be on the file server side) has introduced this long startup time. Need to root the cause and find a solution if it's possible.

lipingxue commented 6 years ago

Check the commit history here https://github.com/dperson/samba/commits/master On Oct.21, some change is made to use Alpine as the base image (before that change, the base image is based on debian). I also did some test and verified that this change causes the samba server take longer time (1 minutes) to start.

I manually started a samba service with image based on alpline and debian.

image based on debian

root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol3_service
ID                  NAME                 IMAGE                           NODE                       DESIRED STATE       CURRENT STATE              ERROR               PORTS
ma48oovvwbgf        testvol3_service.1   lipingxue/samba-debian:latest   sc-rdops-vm18-dhcp-57-89   Running             Preparing 19 seconds ago                       
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol3_service
ID                  NAME                 IMAGE                           NODE                       DESIRED STATE       CURRENT STATE              ERROR               PORTS
ma48oovvwbgf        testvol3_service.1   lipingxue/samba-debian:latest   sc-rdops-vm18-dhcp-57-89   Running             Preparing 27 seconds ago                       
root@sc-rdops-vm18-dhcp-57-89:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                           PORTS
gizbw0jq64fn        testvol1            replicated          1/1                 dperson/samba:latest            *:10445->445/tcp
bulughgw2nup        testvol2_service    replicated          1/1                 lipingxue/samba:latest          *:10446->446/tcp
i0kig1efb9e2        testvol3_service    replicated          0/1                 lipingxue/samba-debian:latest   *:10447->447/tcp
root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol3_service
ID                  NAME                 IMAGE                           NODE                       DESIRED STATE       CURRENT STATE           ERROR               PORTS
ma48oovvwbgf        testvol3_service.1   lipingxue/samba-debian:latest   sc-rdops-vm18-dhcp-57-89   Running             Starting 1 second ago                       
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                           PORTS
gizbw0jq64fn        testvol1            replicated          1/1                 dperson/samba:latest            *:10445->445/tcp
bulughgw2nup        testvol2_service    replicated          1/1                 lipingxue/samba:latest          *:10446->446/tcp
i0kig1efb9e2        testvol3_service    replicated          1/1                 lipingxue/samba-debian:latest   *:10447->447/tcp

We can see that the service is in "preparing state" for 27 second and then in "starting state" for 1 second and then go to the "running state".

image based on alpine

root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol4_service 
ID                  NAME                 IMAGE                    NODE                       DESIRED STATE       CURRENT STATE             ERROR               PORTS
il7oooauxz7h        testvol4_service.1   lipingxue/samba:latest   sc-rdops-vm18-dhcp-57-89   Running             Starting 17 seconds ago                       
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                           PORTS
gizbw0jq64fn        testvol1            replicated          1/1                 dperson/samba:latest            *:10445->445/tcp
bulughgw2nup        testvol2_service    replicated          1/1                 lipingxue/samba:latest          *:10446->446/tcp
i0kig1efb9e2        testvol3_service    replicated          1/1                 lipingxue/samba-debian:latest   *:10447->447/tcp
szpco6zxlct9        testvol4_service    replicated          0/1                 lipingxue/samba:latest          *:10448->448/tcp
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol4_service
ID                  NAME                 IMAGE                    NODE                       DESIRED STATE       CURRENT STATE             ERROR               PORTS
il7oooauxz7h        testvol4_service.1   lipingxue/samba:latest   sc-rdops-vm18-dhcp-57-89   Running             Starting 42 seconds ago                       
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                           PORTS
gizbw0jq64fn        testvol1            replicated          1/1                 dperson/samba:latest            *:10445->445/tcp
bulughgw2nup        testvol2_service    replicated          1/1                 lipingxue/samba:latest          *:10446->446/tcp
i0kig1efb9e2        testvol3_service    replicated          1/1                 lipingxue/samba-debian:latest   *:10447->447/tcp
szpco6zxlct9        testvol4_service    replicated          0/1                 lipingxue/samba:latest          *:10448->448/tcp
root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol4_service
ID                  NAME                 IMAGE                    NODE                       DESIRED STATE       CURRENT STATE             ERROR               PORTS
il7oooauxz7h        testvol4_service.1   lipingxue/samba:latest   sc-rdops-vm18-dhcp-57-89   Running             Starting 49 seconds ago                       
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                           PORTS
gizbw0jq64fn        testvol1            replicated          1/1                 dperson/samba:latest            *:10445->445/tcp
bulughgw2nup        testvol2_service    replicated          1/1                 lipingxue/samba:latest          *:10446->446/tcp
i0kig1efb9e2        testvol3_service    replicated          1/1                 lipingxue/samba-debian:latest   *:10447->447/tcp
szpco6zxlct9        testvol4_service    replicated          0/1                 lipingxue/samba:latest          *:10448->448/tcp
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol4_service
ID                  NAME                 IMAGE                    NODE                       DESIRED STATE       CURRENT STATE             ERROR               PORTS
il7oooauxz7h        testvol4_service.1   lipingxue/samba:latest   sc-rdops-vm18-dhcp-57-89   Running             Starting 56 seconds ago                       
root@sc-rdops-vm18-dhcp-57-89:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                           PORTS
gizbw0jq64fn        testvol1            replicated          1/1                 dperson/samba:latest            *:10445->445/tcp
bulughgw2nup        testvol2_service    replicated          1/1                 lipingxue/samba:latest          *:10446->446/tcp
i0kig1efb9e2        testvol3_service    replicated          1/1                 lipingxue/samba-debian:latest   *:10447->447/tcp
szpco6zxlct9        testvol4_service    replicated          0/1                 lipingxue/samba:latest          *:10448->448/tcp
root@sc-rdops-vm18-dhcp-57-89:~# docker service ps testvol4_service
ID                  NAME                 IMAGE                    NODE                       DESIRED STATE       CURRENT STATE                    ERROR               PORTS
il7oooauxz7h        testvol4_service.1   lipingxue/samba:latest   sc-rdops-vm18-dhcp-57-89   Running             Running less than a second ago                       
root@sc-rdops-vm18-dhcp-57-89:~# 
root@sc-rdops-vm18-dhcp-57-89:~# docker service ls
ID                  NAME                MODE                REPLICAS            IMAGE                           PORTS
gizbw0jq64fn        testvol1            replicated          1/1                 dperson/samba:latest            *:10445->445/tcp
bulughgw2nup        testvol2_service    replicated          1/1                 lipingxue/samba:latest          *:10446->446/tcp
i0kig1efb9e2        testvol3_service    replicated          1/1                 lipingxue/samba-debian:latest   *:10447->447/tcp
szpco6zxlct9        testvol4_service    replicated          1/1                 lipingxue/samba:latest          *:10448->448/tcp

However, for alpine based image, it is in "starting state" for 1 minutes and then go to "running state".

lipingxue commented 6 years ago

Filed an issue https://github.com/dperson/samba/issues/86 in dperson/samba to track this.

lipingxue commented 6 years ago

Filed an issue to request a build based on debian image. https://github.com/dperson/samba/issues/89

tusharnt commented 6 years ago

@lipingxue looks like dperson/samba#89 is closed. What are the next steps for us?

diegolima commented 6 years ago

Hello!

I've been performing tests with the vfile plugin and found out that the Alpine-based dperson/samba image suffers a severe performance hit when running as a service with public ports forwarded to it. I'll detail my test scenario and results below, but as a quick TLDR, using an older dperson/samba image, based on debian, gets us rid of all issues. In my case I built an image from commit https://github.com/dperson/samba/commit/c72d17eee35f0016c495d316f13ad3db7848837b and pushed it to a private repository as dperson/samba:latest, forcing my servers to use that instead of the Alpine-based image. Later commits should work as well as long as they're from before the change to Alpine.

For my test case: on a swarm-enabled server with both vfile and vsphere plugins installed and configured I created a new volume and tried to run a container mounting it, which failed:

[root@lab-dock05 admin]# docker volume create --driver vfile lab-dock-timeout -o size=1gb
lab-dock-timeout

[root@lab-dock05 admin]# docker run -ti -v lab-dock-timeout:/mnt ubuntu bash
docker: Error response from daemon: VolumeDriver.Mount: Failed to blocking wait for Mounted state. Error: Timeout reached; BlockingWait is not complete.
See 'docker run --help'.

[root@lab-dock05 admin]# docker service ls
ID                  NAME                          MODE                REPLICAS            IMAGE               PORTS
plk74i10fm8e        vFileServerlab-dock-timeout   replicated          0/1                 dperson/samba       *:0->445/tcp

As you can see there are no healthy replicas of the Samba service. Please note that setting VFILE_TIMEOUT_IN_SECOND on the plugin didn't solve the problem, as all it seems to do is to wait longer for the Samba service to start.

If you take a look at https://github.com/dperson/samba/blob/master/Dockerfile the healthcheck is a smbclient command with a timeout of 15 seconds, which isn't nearly enough to complete on the alpine-based image:

[root@lab-dock05 admin]# docker service ps vFileServerlab-dock-timeout
ID                  NAME                            IMAGE               NODE                        DESIRED STATE       CURRENT STATE             ERROR               PORTS
pev3c03ra6tg        vFileServerlab-dock-timeout.1   dperson/samba       lab-dock05.domain.org.br   Running             Starting 10 seconds ago

[root@lab-dock05 admin]# docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS                             PORTS                           NAMES
155755f0145f        dperson/samba:latest   "samba.sh -s share..."   20 seconds ago      Up 13 seconds (health: starting)   139/tcp, 137-138/udp, 445/tcp   vFileServerlab-dock-timeout.1.pev3c03ra6tge8ltvv1ssj8n4

[root@lab-dock05 admin]# docker exec -ti 155755f0145f bash
bash-4.3# time smbclient -L '\\localhost\' -U 'guest%' -m SMB3

        Sharename       Type      Comment
        ---------       ----      -------
        share1          Disk
        IPC$            IPC       IPC Service (Samba Server)

        Server               Comment
        ---------            -------

        Workgroup            Master
        ---------            -------

real    1m0.421s
user    0m0.023s
sys     0m0.007s

Since it took 1 minute for the smbclient to run (!) the healthcheck will always fail and the service will never become healthy. Manually increasing the healthcheck timeout to something like 120s works too, but I don't believe that waiting 1 minute for a connection to get established is acceptable.

After building the image from https://github.com/dperson/samba/commit/c72d17eee35f0016c495d316f13ad3db7848837b and launching it as a service using the same parameters as vfile does, I got a very different result:

docker service create --health-timeout 90s --name samba --mount type=bind,source=/tmp,target=/mount -p445:445 -d docker-registry.domain.org.br/dperson/samba -s "share1;/mount;yes;no;no;all;root;root" -u "root;badpass"

[root@lab-dock05 admin]# docker service ps samba
ID                  NAME                IMAGE                                                 NODE                        DESIRED STATE       CURRENT STATE             ERROR                              PORTS
hirg6c8k4ac1        samba.1             docker-registry.domain.org.br/dperson/samba:latest   lab-dock05.domain.org.br   Running             Running 28 seconds ago

[root@lab-dock05 admin]# docker exec -ti b0f535d6b833 bash

root@b0f535d6b833:/# apt-get update && apt-get install -y smbclient
(ommited for brevity)

root@b0f535d6b833:/# time smbclient -L '\\localhost\' -U 'guest%' -m SMB3
WARNING: The "syslog" option is deprecated

        Sharename       Type      Comment
        ---------       ----      -------
        share1          Disk
        IPC$            IPC       IPC Service (Samba 4.5.12-Debian)

        Server               Comment
        ---------            -------

        Workgroup            Master
        ---------            -------

real    0m0.204s
user    0m0.035s
sys     0m0.010s

As you can see the healthcheck itself isn't the problem, as the service becomes healthy despite the healthcheck. The problem is that the new Alpine-based image seems to suffer from performance problems when running on swarm (I didn't really dig deep enough as to why this is happening, but it isn't really my focus right now).

IMHO it might be beneficial to fork dperson/samba and maintain a separate, debian-based Samba image until the current problems are sorted.

luomiao commented 6 years ago

@diegolima Thank you for your detailed evaluation! Yes we also observed that the health check in some case just cannot pass and will make the SMB server never up running. Thank you for locating the reason! We are currently looking for solutions such as replace with other SMB server image or forking a separate image as you suggested. will have an update soon!

luomiao commented 6 years ago

Has been resolved with https://github.com/vmware/vsphere-storage-for-docker/pull/2001