minio / minio

The Object Store for AI Data Infrastructure

In distributed deployment, Minio responds slowly after one node with data is down #4887

Closed xinfengliu closed 6 years ago

xinfengliu commented 6 years ago

Following "minio server --help", I set up a simple 4-node distributed Minio where each node has 1 disk. I pushed some data to it and then shut down one node; after that, Minio responds very slowly, both from the browser and from the Minio client.

Expected Behavior

Minio should keep working normally after 1 node is shut down, since the remaining 3 nodes still satisfy the (4/2 + 1 = 3) quorum.

Current Behavior

Before shutting down the node:

[docker@docker-ee-54 ~]$ time mc ls test54/dtr/docker
[2017-09-04 16:46:55 CST]     0B registry/
real    0m0.017s
user    0m0.004s
sys 0m0.005s

After shutting down the node, there is no response even after 4 minutes:

[docker@docker-ee-54 ~]$ time mc ls test54/dtr/docker
^C
real    4m7.454s
user    0m0.006s
sys 0m0.003s

Possible Solution

Hopefully Minio can quickly identify the dead node and stop waiting on it.

Steps to Reproduce (for bugs)

  1. Set up Minio on 4 nodes

    export MINIO_ACCESS_KEY=minio
    export MINIO_SECRET_KEY=miniostorage
    nohup /usr/local/bin/minio server \
    http://192.168.105.54/data/ \
    http://192.168.105.55/data/ \
    http://192.168.105.56/data/ \
    http://192.168.105.57/data/ > /tmp/minio.log 2>&1 &
  2. Push some data (I'm using Minio as the Docker DTR storage backend; the bucket name is "dtr")

    [docker@docker-ee-51 ~]$ docker push 192.168.105.51:8443/admin/alpine
    The push refers to a repository [192.168.105.51:8443/admin/alpine]
    5bef08742407: Pushed 
    latest: digest: sha256:0930dd4cc97ed5771ebe9be9caf3e8dc5341e0b5e32e8fb143394d7dfdfa100e size: 528
  3. Shut down one node

    [docker@docker-ee-57 foo]$ sudo shutdown -h now
  4. Observe the behavior of Minio. Minio starts to respond very slowly when accessing objects under the "dtr" bucket. In addition, pushing new Docker images such as "nginx" fails with errors.

    [docker@docker-ee-51 ~]$ docker push 192.168.105.51:8443/admin/nginx:stable-alpine
    The push refers to a repository [192.168.105.51:8443/admin/nginx]
    1690bc77acd5: Retrying in 1 second 
    cbeb94c1f91a: Retrying in 1 second 
    4bbeb364e643: Retrying in 1 second 
    040fd7841192: Retrying in 1 second 
    received unexpected HTTP status: 504 Gateway Time-out

    From dtr-registry-xxx container logs, I can see

    10.1.0.8 - - [04/Sep/2017:06:00:52 +0000] "HEAD /v2/admin/alpine/blobs/sha256:7328f6f8b41890597575cbaadc884e7386ae0acc53b747401ebce5cf0d624560 HTTP/1.1" 404 157 "" "docker/17.06.1-ee-2 go/go1.8.3 git-commit/8e43158 kernel/3.10.0-514.26.1.el7.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/17.06.1-ee-2 \\(linux\\))"
    ...t be enabled. Not submitting scan for repo: admin/nginx","time":"2017-09-04T08:23:46.19330931Z"}

    Meanwhile, some data related to the "nginx" image did get uploaded to the Minio data directory (after shutting down 1 Minio node):

    [docker@docker-ee-55 ~]$ ls -ltr /data/dtr/docker/registry/v2/repositories/admin/nginx/_uploads/
    total 0
    drwxrwxr-x. 4 docker docker 41 Sep  4 14:59 e46b6069-419a-46d3-bd59-32a7e689a36a
    drwxrwxr-x. 4 docker docker 41 Sep  4 15:00 bd08e3c4-ee65-423c-b1b8-63f0a86a9f2c
    drwxrwxr-x. 4 docker docker 41 Sep  4 15:01 b56c3983-fe77-4a9b-8f06-016c48306a44
    drwxrwxr-x. 4 docker docker 41 Sep  4 15:02 d2f70a30-5b66-4190-9343-6a36fd3a8cb2
    drwxrwxr-x. 4 docker docker 41 Sep  4 15:07 cd98738d-b255-4cf6-b292-c91a901e36fe
    ....

Context

I'm evaluating Minio as HA storage for Docker DTR.

Your Environment

harshavardhana commented 6 years ago

@xinfengliu we have mitigated the issue differently in the master branch - you can test with the "minio/minio:edge" image and let us know.

commit 98b62cbec8fda7f2628c49a8d736e65da48819aa
Author: Frank Wessels <fwessels@xs4all.nl>
Date:   Fri Aug 11 11:38:46 2017 -0700

    Implement an offline mode for a distributed node (#4646)

    Implement an offline mode for remote storage to cache the
    offline status of a node in order to prevent network calls
    that are bound to fail. After a time interval an attempt
    will be made to restore the connection and mark the node
    as online if successful.

    Fixes #4183

The previously reported issue is #4183.
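
For illustration, here is a minimal Go sketch of the idea described in that commit: cache a remote node's offline status so calls that are bound to fail can be skipped, and retry only after an interval. The type and method names are hypothetical, not minio's actual implementation:

package remote

import (
	"sync"
	"time"
)

// offlineCache (hypothetical) remembers that a remote node failed, so further
// network calls can fail fast instead of timing out, and allows a fresh
// connection attempt once a retry interval has elapsed.
type offlineCache struct {
	mu          sync.Mutex
	offline     bool
	lastFailure time.Time
	retryAfter  time.Duration
}

// shouldDial reports whether a call to the node should be attempted now.
func (c *offlineCache) shouldDial() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.offline {
		return true
	}
	// The node is marked offline: probe it again only after the retry interval.
	return time.Since(c.lastFailure) >= c.retryAfter
}

// markOffline records a failed call so subsequent calls are skipped.
func (c *offlineCache) markOffline() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.offline = true
	c.lastFailure = time.Now()
}

// markOnline clears the cached state after a successful call.
func (c *offlineCache) markOnline() {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.offline = false
}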

xinfengliu commented 6 years ago

@harshavardhana Thank you very much for the pointer.

Today I built a new Minio binary from the master branch, cleared the existing Minio storage, then ran the test again. But things seem to be worse: after shutting down one node, Minio operations never come back, e.g. "mc admin info" (I waited over 10 minutes).

By the way, I also tested the scenario of just killing the Minio process on one node, with both the current stable version (2017-08-05T00:00:53Z) and a new build from minio:master. Minio operations are fine in that scenario: "mc admin info" immediately marks the node as offline, and docker push operations also work well.

fwessels commented 6 years ago

> By the way, I also tested the scenario of just killing the Minio process on one node, with both the current stable version (2017-08-05T00:00:53Z) and a new build from minio:master. Minio operations are fine in that scenario: "mc admin info" immediately marks the node as offline, and docker push operations also work well.

Hello @xinfengliu, you say you also tested the "scenario of just killing the minio process on one node", which works fine as you report.

What is the difference between "shutting down one node" and "killing the minio process on one node"? Obviously, if you fully shut down a node, the minio instance on it is killed as a side effect, so that instance is no longer running, meaning requests to it will not work.

Note that, unless you use something like a load balancer in between, mc is connected to a single minio process. If this process is taken down while the other minio processes are still up, you would need to reconfigure mc to connect to one of the minio servers that is still running.

Let us know if this addresses your issue or whether you have any other questions.

xinfengliu commented 6 years ago

Hi @fwessels, I have 4 hosts (nodes); each host runs one Minio instance. When you just kill the Minio process on one host, the host OS is still alive, so testing Minio service liveness returns a result very quickly:

[docker@docker-ee-54 ~]$ time curl http://docker-ee-56:9000 #minio service is down, host OS is live
curl: (7) Failed connect to docker-ee-56:9000; Connection refused

real    0m0.008s
user    0m0.001s
sys 0m0.004s

When you shut down a node, the host OS is down, so testing Minio service liveness depends on the underlying TCP/IP stack. In my environment this can take over 2 minutes (if the host OS was shut down just now) or around 8 seconds (if the host OS has been down for a long time).

[docker@docker-ee-54 ~]$ time curl http://docker-ee-55:9000    # the host is down
curl: (7) Failed connect to docker-ee-55:9000; No route to host

real    2m23.098s
user    0m0.003s
sys 0m0.013s
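
As a point of comparison, a minimal Go sketch (the hostname and the 3-second timeout are just placeholders) of bounding that wait with an explicit dial timeout instead of relying on the OS TCP defaults:

package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	// A plain connect to a powered-off host is at the mercy of the OS TCP
	// stack and can hang for minutes; net.DialTimeout caps the wait.
	conn, err := net.DialTimeout("tcp", "docker-ee-55:9000", 3*time.Second)
	if err != nil {
		fmt.Println("node treated as offline:", err)
		return
	}
	defer conn.Close()
	fmt.Println("node is reachable")
}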

I also noticed this issue when using an NFS mount as Minio's storage: if I shut down the NFS server, Minio did not realize the NFS server was down for a long time, whereas if I manually unmounted that NFS directory, Minio noticed immediately.

As for the load balancer: yes, I do use a load balancer (nginx, layer 4) in front of the 4 Minio instances. (Sorry I didn't mention it in previous comments.)

fwessels commented 6 years ago

There is not much we can improve if the host OS is down (shut down); those timeouts are beyond our control.

Great that you are using a load balancer. Obviously, when one of the minio instances is down, it may take a short time for the load balancer to detect that the instance is down and start skipping it.

The NFS mount looks like a different issue to me; feel free to file a separate issue for it.

morph027 commented 6 years ago

Came across the same issue while applying rolling updates to the cluster...

While all operations tend to be slow, I've also come across this (vps13 was really offline at this time):

/ # mc admin info minio

●  localhost:9000
   Uptime : online since 24 minutes ago
  Version : 2017-08-05T00:00:53Z
   Region : 
 SQS ARNs : <none>
  Network : Incoming 32MiB, Outgoing 408KiB
  Storage : Total 91GiB, Free 83GiB, Online Disks: 3, Offline Disks: 1

●  vps13.example.com:9000
   Uptime : Server is offline
    Error : dial-http vps13.example.com:9000/minio/admin: dial tcp 10.0.0.170:9000: getsockopt: connection timed out

●  vps27.example.com:9000
   Uptime : Server is offline
    Error : unexpected EOF

●  vps99.example.com:9000
   Uptime : Server is offline
    Error : unexpected EOF

morph027 commented 6 years ago

Anything we can do here? A workaround at the OS level? This is a showstopper for my current project :cry:

Nodes can go down; that is why I want a clustered solution... But right now, I can't use the cluster until the node has been fixed.

morph027 commented 6 years ago

OK, I came up with this little hack... The idea is to circumvent the TCP timeout, so let's just reject port-9000 packets to the (network-wise) offline node via iptables (and of course allow them again once it's back online):

#!/bin/bash

##
## Requirements:
## -------------
##
## - fping
## - iptables

NODES=( "minio1.example.com" "minio2.example.com" "minio3.example.com" "minio4.example.com" )

for NODE in "${NODES[@]}"
do
  if [ "$NODE" != "$HOSTNAME" ]; then
    RULE="-d $NODE -p tcp --dport 9000 -j REJECT --reject-with tcp-reset"
    if ! fping -c1 -t500 "$NODE" >/dev/null 2>&1; then
      # Node unreachable: reject outgoing traffic immediately instead of
      # waiting for the long TCP connect timeout. Skip if the rule is
      # already in place so repeated timer runs don't stack duplicates.
      if ! iptables -C OUTPUT $RULE >/dev/null 2>&1; then
        logger -t minio-circuit-breaker "disabling $NODE"
        iptables -A OUTPUT $RULE
      fi
    else
      # Node reachable again: remove the reject rule if we added one earlier.
      if iptables -C OUTPUT $RULE >/dev/null 2>&1; then
        logger -t minio-circuit-breaker "re-enabling $NODE"
        iptables -D OUTPUT $RULE
      fi
    fi
  fi
done

Log output:

Sep 14 17:36:00 minio4.example.com systemd[1]: Starting minio circuit breaker...
Sep 14 17:36:00 minio4.example.com minio-circuit-breaker[8531]: disabling minio3.example.com
Sep 14 17:36:00 minio4.example.com systemd[1]: Started minio circuit breaker.
Sep 14 17:36:09 minio4.example.com minio[1775]: time="2017-09-14T17:36:09+02:00" level=error msg="Unable to fetch disk info for &cmd.retryStorage{remoteStorage:(*cmd.networkStorage)(0xc42015a898), maxRetryAttempts:1, retryUnit:1000000, retryCap:5000000}" cause="disk not found" source="[xl-v1.go:199:getDisksInfo()]"
Sep 14 17:36:11 minio4.example.com minio[1775]: time="2017-09-14T17:36:11+02:00" level=error msg="Unable to fetch disk info for &cmd.retryStorage{remoteStorage:(*cmd.networkStorage)(0xc42015a898), maxRetryAttempts:1, retryUnit:1000000, retryCap:5000000}" cause="disk not found" source="[xl-v1.go:199:getDisksInfo()]"
Sep 14 17:36:14 minio4.example.com minio[1775]: time="2017-09-14T17:36:14+02:00" level=error msg="Unable to fetch disk info for &cmd.retryStorage{remoteStorage:(*cmd.networkStorage)(0xc42015a898), maxRetryAttempts:1, retryUnit:1000000, retryCap:5000000}" cause="disk not found" source="[xl-v1.go:199:getDisksInfo()]"
Sep 14 17:37:00 minio4.example.com systemd[1]: Starting minio circuit breaker...
Sep 14 17:37:00 minio4.example.com minio-circuit-breaker[8569]: re-enabling minio3.example.com
Sep 14 17:37:00 minio4.example.com systemd[1]: Started minio circuit breaker.

Node is offline:

# time mc admin info minio
●  localhost:9000
   Uptime : online since 42 minutes ago
  Version : 2017-08-05T00:00:53Z
   Region : 
 SQS ARNs : <none>
  Network : Incoming 349KiB, Outgoing 355KiB
  Storage : Total 91GiB, Free 85GiB, Online Disks: 3, Offline Disks: 1

●  minio2.example.com:9000
   Uptime : Server is offline
    Error : dial-http minio2.example.com:9000/minio/admin: dial tcp 10.0.0.170:9000: getsockopt: connection refused

●  minio3.example.com:9000
   Uptime : online since 41 minutes ago
  Version : 2017-08-05T00:00:53Z
   Region : 
 SQS ARNs : <none>
  Network : Incoming 248KiB, Outgoing 176KiB
  Storage : Total 91GiB, Free 85GiB, Online Disks: 3, Offline Disks: 1

●  minio4.example.com:9000
   Uptime : online since 15 minutes ago
  Version : 2017-08-05T00:00:53Z
   Region : 
 SQS ARNs : <none>
  Network : Incoming 146KiB, Outgoing 68KiB
  Storage : Total 91GiB, Free 85GiB, Online Disks: 3, Offline Disks: 1

real    0m 1.91s
user    0m 0.08s
sys 0m 0.00s

Not sure if you cool golang guys can just hack a fancy daemon which does a thing like this.

morph027 commented 6 years ago

Just for the record, the systemd timer unit:

[Unit]
Description=check minio cluster nodes every minute

[Timer]
AccuracySec=1
RemainAfterElapse=no
OnCalendar=minutely

[Install]
WantedBy=timers.target

And the matching oneshot service unit:

[Unit]
Description=minio circuit breaker

[Service]
Type=oneshot
ExecStart=/usr/bin/minio-circuit-breaker.sh

fwessels commented 6 years ago

@morph027 You need a minimum of N/2 servers to be online in order to be able to read, and N/2+1 servers to be able to write.

So (on a 4-server cluster) you would want to restart the servers one after another for a rolling update, so that write quorum is never lost.
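
For concreteness, here is a throwaway Go snippet spelling out that arithmetic for the 4-server case (purely illustrative, not minio code):

package main

import "fmt"

func main() {
	n := 4                 // servers in the cluster
	readQuorum := n / 2    // minimum servers online to read
	writeQuorum := n/2 + 1 // minimum servers online to write
	online := n - 1        // one server taken down for a rolling update
	fmt.Printf("read quorum %d, write quorum %d, online %d\n", readQuorum, writeQuorum, online)
	fmt.Println("reads possible: ", online >= readQuorum)  // true
	fmt.Println("writes possible:", online >= writeQuorum) // true
}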

morph027 commented 6 years ago

Sure, but this was bugging me more in terms of an unplanned outage of a server. But my workaround helps for planned maintenance too ;)

harshavardhana commented 6 years ago

The PR is merged and this issue should be fixed as of #5840

lock[bot] commented 4 years ago

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.