Open cpla13 opened 1 week ago
The title and description of this issue contains Chinese. Please use English to describe your issue.
docker-compose.yml

```yaml
version: '3.5'

services:
  etcd:
    container_name: milvus-etcd-232
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3

  minio:
    container_name: milvus-minio-232
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    ports:
      - "9071:9001"
      - "9070:9000"
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
    command: minio server /minio_data --console-address ":9001"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3

  standalone:
    container_name: milvus-standalone-232
    image: milvusdb/milvus:v2.3.2
    command: ["milvus", "run", "standalone"]
    security_opt:
      - seccomp:unconfined
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9091/healthz"]
      interval: 30s
      start_period: 90s
      timeout: 20s
      retries: 3
    ports:
      - "19535:19530"
      - "9095:9091"
    depends_on:
      - "etcd"
      - "minio"
```
Try updating to 2.3.13 and bring the docker deployment up to see if it works. If it starts successfully, trigger a compaction to reduce the segment number.
- the reason milvus went down is a minio read timeout
- there seem to be many small files, and minio has reached its throughput limit
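The suggested upgrade only needs a tag change on the standalone service in the compose file above (assuming a `v2.3.13` image tag exists; the rest of the service definition stays as-is):

```yaml
  standalone:
    container_name: milvus-standalone-232
    image: milvusdb/milvus:v2.3.13   # was v2.3.2
```

Then recreate the container with `docker compose up -d`.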
ok, thanks for your reply. I want to bring up the original version and run compaction myself, then observe for a period of time.
let me know if there are any other issues causing compaction to fail
@cpla13 also please make sure the etcd service is running against SSD volumes. I saw a few logs saying etcd writes are slow, which could be one of the reasons milvus exits.
[2024/04/25 03:57:56.665 +00:00] [WARN] [etcd/etcd_kv.go:648] ["Slow etcd operation save"] ["time spent"=7.000848603s] [key=by-dev/kv/gid/timestamp]
[2024/04/25 03:57:56.665 +00:00] [WARN] [rootcoord/root_coord.go:236] ["failed to update tso"] [error="etcdserver: request timed out"] [errorVerbose="etcdserver: request timed out\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/tso.(*timestampOracle).saveTimestamp\n | \t/go/src/github.com/milvus-io/milvus/internal/tso/tso.go:98\n | github.com/milvus-io/milvus/internal/tso.(*timestampOracle).UpdateTimestamp\n | \t/go/src/github.com/milvus-io/milvus/internal/tso/tso.go:201\n | github.com/milvus-io/milvus/internal/tso.(*GlobalTSOAllocator).UpdateTSO\n | \t/go/src/github.com/milvus-io/milvus/internal/tso/global_allocator.go:100\n | github.com/milvus-io/milvus/internal/rootcoord.(*Core).tsLoop\n | \t/go/src/github.com/milvus-io/milvus/internal/rootcoord/root_coord.go:235\n | runtime.goexit\n | \t/usr/local/
/assign @cpla13 /unassign
We are using a mechanical hard drive; the data consists of only 60,000 768-dimensional vectors.
you have to run etcd on SSD drives. It seems etcd spent 7 seconds on a single operation.
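For scale, etcd typically flags operations over roughly 100 ms as slow, while the warning above reports a single save taking about 7 s. A quick way to confirm the magnitude when scanning the attached logs (a sketch; the regex simply matches the `"time spent"` field in the warning format shown above):

```python
import re

# One of the slow-operation warnings from the attached log:
line = ('[2024/04/25 03:57:56.665 +00:00] [WARN] [etcd/etcd_kv.go:648] '
        '["Slow etcd operation save"] ["time spent"=7.000848603s] '
        '[key=by-dev/kv/gid/timestamp]')

# Pull the reported duration (in seconds) out of the "time spent" field.
m = re.search(r'"time spent"=([\d.]+)s', line)
seconds = float(m.group(1))
print(seconds)  # 7.000848603
```

Running the same pattern over the whole log file would show how often etcd stalls, not just that it stalled once.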
3.zip — here is the latest downtime log (in the attached zip package). The error displayed is of a different type; please help me look into it @xiaofan-luan

Schema in Attu:
- privileges; Array<VarChar(250)>[500]
- file_path; VarChar(800)
- phash; VarChar(150)
- module; VarChar(50)
- belong_id; VarChar(100)
- type; VarChar(50)
- content; FloatVector(768); index _default_idx_106 (metric_type: IP, nlist: 1024)
- tags; Array<VarChar(250)>[500]
- file_type; VarChar(50)
@cpla13 are you running etcd against SSD volumes this time? It seems communication with etcd is still very slow, which causes milvus components to disconnect from it.
[2024/04/30 06:46:38.347 +00:00] [WARN] [tso/tso.go:178] ["clock offset is huge, check network latency and clock skew"] [jet-lag=49.058196849s] [prev-physical=2024/04/30 06:45:49.289 +00:00] [now=2024/04/30 06:46:38.347 +00:00]
...
[2024/04/30 06:46:45.349 +00:00] [WARN] [etcd/etcd_kv.go:648] ["Slow etcd operation save"] ["time spent"=7.001496076s] [key=by-dev/kv/gid/timestamp]
...
[2024/04/30 06:46:58.085 +00:00] [WARN] [sessionutil/session_util.go:499] ["fail to retry keepAliveOnce"] [serverName=rootcoord] [LeaseID=7587877353474758497] [error="attempt #0: context deadline exceeded: attempt #1: etcdserver: requested lease not found: attempt #2: etcdserver: requested lease not found"]
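The reported jet-lag in the first warning is just the gap between its `prev-physical` and `now` timestamps, i.e. the TSO clock stalled for ~49 s while etcd was unresponsive. A quick check (sketch):

```python
from datetime import datetime

# Timestamps taken verbatim from the "clock offset is huge" warning above.
fmt = "%Y/%m/%d %H:%M:%S.%f"
prev = datetime.strptime("2024/04/30 06:45:49.289", fmt)
now = datetime.strptime("2024/04/30 06:46:38.347", fmt)

offset = (now - prev).total_seconds()
print(offset)  # 49.058 — matches jet-lag=49.058196849s at millisecond precision
```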
@yanliang567 It is a mechanical hard drive; next time I plan to use an SSD.
@yanliang567 What is the cause of this problem? I did not use that DataType.
[2024/04/30 06:46:57.934 +00:00] [WARN] [datanode/compaction_executor.go:98] ["compaction task failed"] [planID=449341343607604738] [error="unknown shema DataType"] [errorVerbose="unknown shema DataType\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/datanode.init\n | \t/go/src/github.com/milvus-io/milvus/internal/datanode/compactor.go:51\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6525\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:6502\n | runtime.main\n | \t/usr/local/go/src/runtime/proc.go:233\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) unknown shema DataType\nError types: (1) withstack.withStack (2) errutil.leafError"]
Is there an existing issue for this?
Environment
Current Behavior
milvus crashes partway through; only the standalone container goes down, while minio and etcd stay healthy. After a restart it runs for a while and then crashes again. See the attached logs.
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
2.log (log file attached)
Anything else?
No response