milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Milvus components graceful stop done #32638

Open cpla13 opened 1 week ago

cpla13 commented 1 week ago

Is there an existing issue for this?

Environment

- Milvus version: V.2.3.2-Dev
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):java milvus v2.3.4
- OS(Ubuntu or CentOS): CentOS
- CPU/Memory:  cpu/32g
- GPU: 
- Others:

Current Behavior

Milvus goes down partway through operation; only the standalone service crashes while MinIO and etcd stay healthy. After a restart it runs for a while and then goes down again. See the attached logs.

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

2.log (attached log file)

Anything else?

No response

github-actions[bot] commented 1 week ago

The title and description of this issue contain Chinese. Please use English to describe your issue.

cpla13 commented 1 week ago

docker-compose.yml

version: '3.5'
services:
  etcd:
    container_name: milvus-etcd-232
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3

  minio:
    container_name: milvus-minio-232
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9071:9001"
      - "9070:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone-232
    image: milvusdb/milvus:v2.3.2
    command: ["milvus", "run", "standalone"]
    security_opt:
    - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19535:19530"
      - "9095:9091"
    depends_on:
      - "etcd"
      - "minio"

xiaofan-luan commented 1 week ago

  1. The reason Milvus goes down is a MinIO read timeout.
  2. There seem to be many small files, and MinIO has reached its throughput limit.

Try updating to 2.3.13 and bringing the Docker deployment back up to see if it works. If it starts successfully, trigger a compaction to reduce the segment count.
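Once the instance is back up, the compaction can be triggered from the SDK. Below is a minimal pymilvus sketch; the collection name is a placeholder, and the port is the host-mapped 19535 from the compose file below (the Java SDK should expose an equivalent manual-compaction call).

from pymilvus import connections, Collection

# Connect to the standalone instance via the host-mapped port from docker-compose.
connections.connect(host="localhost", port="19535")

collection = Collection("my_collection")    # placeholder collection name
collection.compact()                        # ask Milvus to merge small segments
collection.wait_for_compaction_completed()  # block until the compaction plan finishes
print(collection.get_compaction_state())    # inspect the resulting state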

cpla13 commented 1 week ago

OK, thanks for your reply. I want to bring the original version back up and run the compaction myself, then observe it for a period of time.

xiaofan-luan commented 1 week ago

Let me know if there are any other issues causing the compaction to fail.

yanliang567 commented 1 week ago

@cpla13 also, please make sure the etcd service is running against SSD volumes. I saw a few logs saying that etcd writes are slow, which could be one of the reasons Milvus exits.

[2024/04/25 03:57:56.665 +00:00] [WARN] [etcd/etcd_kv.go:648] ["Slow etcd operation save"] ["time spent"=7.000848603s] [key=by-dev/kv/gid/timestamp]
[2024/04/25 03:57:56.665 +00:00] [WARN] [rootcoord/root_coord.go:236] ["failed to update tso"] [error="etcdserver: request timed out"] [errorVerbose="etcdserver: request timed out\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/tso.(*timestampOracle).saveTimestamp\n  | \t/go/src/github.com/milvus-io/milvus/internal/tso/tso.go:98\n  | github.com/milvus-io/milvus/internal/tso.(*timestampOracle).UpdateTimestamp\n  | \t/go/src/github.com/milvus-io/milvus/internal/tso/tso.go:201\n  | github.com/milvus-io/milvus/internal/tso.(*GlobalTSOAllocator).UpdateTSO\n  | \t/go/src/github.com/milvus-io/milvus/internal/tso/global_allocator.go:100\n  | github.com/milvus-io/milvus/internal/rootcoord.(*Core).tsLoop\n  | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:235\n  | runtime.goexit\n  | \t/usr/local/
yanliang567 commented 1 week ago

/assign @cpla13 /unassign

cpla13 commented 1 day ago

We are using a mechanical hard drive, and the data consists of only 60,000 768-dimensional vectors.

xiaofan-luan commented 1 day ago

You have to run etcd on SSD drives. It seems that etcd spent 7 seconds on a single operation.
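As a rough way to confirm the disk is the bottleneck, one can time synced writes on the volume backing etcd; etcd fsyncs its write-ahead log on every update, so consistently high fsync latency here matches the "Slow etcd operation save" warnings above. A minimal sketch, assuming the default ./volumes/etcd directory from the compose file:

import os
import time

DATA_DIR = "./volumes/etcd"  # assumed ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd from the compose file
probe_path = os.path.join(DATA_DIR, "fsync_probe.tmp")

latencies = []
with open(probe_path, "wb") as f:
    for _ in range(100):
        start = time.perf_counter()
        f.write(os.urandom(4096))  # one 4 KiB page, roughly the size of a small WAL entry
        f.flush()
        os.fsync(f.fileno())       # force the write to disk, as etcd's WAL does
        latencies.append(time.perf_counter() - start)
os.remove(probe_path)

latencies.sort()
print(f"p50 fsync: {latencies[49] * 1000:.2f} ms, p99 fsync: {latencies[98] * 1000:.2f} ms")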

cpla13 commented 1 day ago

3.zip — here is the latest downtime log (in the attached zip package). The error shown is of a different type; please help me look into it @xiaofan-luan.

Schema in Attu:

privileges; Array<VarChar(250)>[500]
file_path; VarChar(800)
phash; VarChar(150)
module; VarChar(50)
belong_id; VarChar(100)
type; VarChar(50)
content; FloatVector(768); _default_idx_106 (metric_type: IP, nlist: 1024)
tags; Array<VarChar(250)>[500]
file_type; VarChar(50)
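
For context, the schema listed above corresponds roughly to the pymilvus definition below. This is only an illustrative sketch: the primary-key field does not appear in the Attu dump, so the id field is an assumption, and the index type is assumed to be an IVF variant since the dump only shows metric_type IP and nlist 1024.

from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

fields = [
    # The primary key is not visible in the Attu dump; an auto-generated INT64 id is assumed.
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="privileges", dtype=DataType.ARRAY, element_type=DataType.VARCHAR,
                max_capacity=500, max_length=250),
    FieldSchema(name="file_path", dtype=DataType.VARCHAR, max_length=800),
    FieldSchema(name="phash", dtype=DataType.VARCHAR, max_length=150),
    FieldSchema(name="module", dtype=DataType.VARCHAR, max_length=50),
    FieldSchema(name="belong_id", dtype=DataType.VARCHAR, max_length=100),
    FieldSchema(name="type", dtype=DataType.VARCHAR, max_length=50),
    FieldSchema(name="content", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="tags", dtype=DataType.ARRAY, element_type=DataType.VARCHAR,
                max_capacity=500, max_length=250),
    FieldSchema(name="file_type", dtype=DataType.VARCHAR, max_length=50),
]

collection = Collection("my_collection", CollectionSchema(fields))  # placeholder collection name
collection.create_index(
    field_name="content",
    index_params={
        "index_type": "IVF_FLAT",  # assumed; the dump only shows metric_type IP and nlist 1024
        "metric_type": "IP",
        "params": {"nlist": 1024},
    },
)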

yanliang567 commented 1 day ago

@cpla13 are you running etcd against SSD volumes this time? It seems that communication with etcd is still very slow, which causes the Milvus components to disconnect from it.

[2024/04/30 06:46:38.347 +00:00] [WARN] [tso/tso.go:178] ["clock offset is huge, check network latency and clock skew"] [jet-lag=49.058196849s] [prev-physical=2024/04/30 06:45:49.289 +00:00] [now=2024/04/30 06:46:38.347 +00:00]
...
[2024/04/30 06:46:45.349 +00:00] [WARN] [etcd/etcd_kv.go:648] ["Slow etcd operation save"] ["time spent"=7.001496076s] [key=by-dev/kv/gid/timestamp]
...
[2024/04/30 06:46:58.085 +00:00] [WARN] [sessionutil/session_util.go:499] ["fail to retry keepAliveOnce"] [serverName=rootcoord] [LeaseID=7587877353474758497] [error="attempt #0: context deadline exceeded: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
cpla13 commented 1 day ago

@yanliang567 It is still a mechanical hard drive. Next time, I plan to use an SSD.

cpla13 commented 1 day ago

@yanliang567 What is the cause of this problem? I have not used this DataType.

[2024/04/30 06:46:57.934 +00:00] [WARN] [datanode/compaction_executor.go:98] ["compaction task failed"] [planID=449341343607604738] [error="unknown shema DataType"] [errorVerbose="unknown shema DataType\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/datanode.init\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/compactor.go:51\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6525\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.main\n | \t/usr/local/go/src/runtime/proc.go:233\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) unknown shema DataType\nError types: (1) withstack.withStack (2) errutil.leafError"]