milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0
30.3k stars 2.91k forks source link

[Bug]: Catastropic Loss of collections in etcd #36094

Open Karlusk opened 1 month ago

Karlusk commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.4 GPU
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): Windows using Docker image
- CPU/Memory: Gen 11 Intel
- GPU: 3090
- Others:

Current Behavior

I was reinstalling a different instance of WSL independent of docker-desktop or docker-desktop-data.

Milvus said I had an unexpected shutdown, when the etcd server started it was configured with a new cluster state: "initial-cluster-state":"new" kvstore was "restored" with a "current-rev":1

and then it went into auto compaction mode: auto-compaction-mode: revision, auto-compaction-retention: 1µs

I lost all of my production collections. Not sure why this happened.

I get it's on me to have not set the compaction mode to something more reasonable like 24h, but I guess I didnt expect to roll back to rev:1

I included the log file (which is still there).

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

2024-09-04 21:49:02 {"level":"info","ts":"2024-09-05T01:49:02.270Z","caller":"mvcc/kvstore_compaction.go:66","msg":"finished scheduled compaction","compact-revision":2264221,"took":"1.522456ms","hash":2329894398} 2024-09-04 21:49:02 {"level":"info","ts":"2024-09-05T01:49:02.270Z","caller":"mvcc/hash.go:137","msg":"storing new hash","hash":2329894398,"revision":2264221,"compact-revision":2264116} 2024-09-04 21:49:03 {"level":"info","ts":"2024-09-05T01:49:03.948Z","caller":"osutil/interrupt_unix.go:64","msg":"received signal; shutting down","signal":"terminated"} 2024-09-04 21:49:03 {"level":"info","ts":"2024-09-05T01:49:03.952Z","caller":"embed/etcd.go:371","msg":"closing etcd server","name":"default","data-dir":"/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"]} 2024-09-04 21:49:03 {"level":"info","ts":"2024-09-05T01:49:03.961Z","caller":"etcdserver/server.go:1465","msg":"skipped leadership transfer for single voting member cluster","local-member-id":"8e9e05c52164694d","current-leader-member-id":"8e9e05c52164694d"} 2024-09-04 21:49:03 {"level":"info","ts":"2024-09-05T01:49:03.981Z","caller":"embed/etcd.go:566","msg":"stopping serving peer traffic","address":"127.0.0.1:2380"} 2024-09-04 21:49:03 {"level":"info","ts":"2024-09-05T01:49:03.984Z","caller":"embed/etcd.go:571","msg":"stopped serving peer traffic","address":"127.0.0.1:2380"} 2024-09-04 21:49:03 {"level":"info","ts":"2024-09-05T01:49:03.984Z","caller":"embed/etcd.go:373","msg":"closed etcd server","name":"default","data-dir":"/etcd","advertise-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"]} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.099Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_MODE","variable-value":"revision"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.104Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_AUTO_COMPACTION_RETENTION","variable-value":"1000"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.104Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_QUOTA_BACKEND_BYTES","variable-value":"4294967296"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.104Z","caller":"flags/flag.go:113","msg":"recognized and used environment variable","variable-name":"ETCD_SNAPSHOT_COUNT","variable-value":"50000"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.104Z","caller":"etcdmain/etcd.go:73","msg":"Running: ","args":["etcd","-advertise-client-urls=http://127.0.0.1:2379","-listen-client-urls","http://0.0.0.0:2379","--data-dir","/etcd"]} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.104Z","caller":"embed/etcd.go:124","msg":"configuring peer listeners","listen-peer-urls":["http://localhost:2380"]} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.107Z","caller":"embed/etcd.go:132","msg":"configuring client listeners","listen-client-urls":["http://0.0.0.0:2379"]} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.107Z","caller":"embed/etcd.go:306","msg":"starting an etcd server","etcd-version":"3.5.5","git-sha":"19002cfc6","go-version":"go1.16.15","go-os":"linux","go-arch":"amd64","max-cpu-set":20,"max-cpu-available":20,"member-initialized":false,"name":"default","data-dir":"/etcd","wal-dir":"","wal-dir-dedicated":"","member-dir":"/etcd/member","force-new-cluster":false,"heartbeat-interval":"100ms","election-timeout":"1s","initial-election-tick-advance":true,"snapshot-count":50000,"snapshot-catchup-entries":5000,"initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[],"cors":["*"],"host-whitelist":["*"],"initial-cluster":"default=http://localhost:2380","initial-cluster-state":"new","initial-cluster-token":"etcd-cluster","quota-backend-bytes":4294967296,"max-request-bytes":1572864,"max-concurrent-streams":4294967295,"pre-vote":true,"initial-corrupt-check":false,"corrupt-check-time-interval":"0s","compact-check-time-enabled":false,"compact-check-time-interval":"1m0s","auto-compaction-mode":"revision","auto-compaction-retention":"1µs","auto-compaction-interval":"1µs","discovery-url":"","discovery-proxy":"","downgrade-check-interval":"5s"} 2024-09-04 22:39:29 {"level":"warn","ts":1725503969.107887,"caller":"fileutil/fileutil.go:57","msg":"check file permission","error":"directory \"/etcd\" exist, but the permission is \"drwxr-xr-x\". The recommended permission is \"-rwx------\" to prevent possible unprivileged access to the data"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.110Z","caller":"etcdserver/backend.go:81","msg":"opened backend db","path":"/etcd/member/snap/db","took":"2.872926ms"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.118Z","caller":"etcdserver/raft.go:494","msg":"starting local member","local-member-id":"8e9e05c52164694d","cluster-id":"cdf818194e3a8c32"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.118Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d switched to configuration voters=()"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.118Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became follower at term 0"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.118Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"newRaft 8e9e05c52164694d [peers: [], term: 0, commit: 0, applied: 0, lastindex: 0, lastterm: 0]"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.118Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became follower at term 1"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.118Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d switched to configuration voters=(10276657743932975437)"} 2024-09-04 22:39:29 {"level":"warn","ts":"2024-09-05T02:39:29.119Z","caller":"auth/store.go:1233","msg":"simple token is not cryptographically signed"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.119Z","caller":"mvcc/kvstore.go:393","msg":"kvstore restored","current-rev":1} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.119Z","caller":"etcdserver/quota.go:117","msg":"enabled backend quota","quota-name":"v3-applier","quota-size-bytes":4294967296,"quota-size":"4.3 GB"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.119Z","caller":"etcdserver/server.go:854","msg":"starting etcd server","local-member-id":"8e9e05c52164694d","local-server-version":"3.5.5","cluster-version":"to_be_decided"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.119Z","caller":"etcdserver/server.go:738","msg":"started as single-node; fast-forwarding election ticks","local-member-id":"8e9e05c52164694d","forward-ticks":9,"forward-duration":"900ms","election-ticks":10,"election-timeout":"1s"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.121Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d switched to configuration voters=(10276657743932975437)"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.121Z","caller":"membership/cluster.go:421","msg":"added member","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","added-peer-id":"8e9e05c52164694d","added-peer-peer-urls":["http://localhost:2380"]} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.122Z","caller":"embed/etcd.go:275","msg":"now serving peer/client/metrics","local-member-id":"8e9e05c52164694d","initial-advertise-peer-urls":["http://localhost:2380"],"listen-peer-urls":["http://localhost:2380"],"advertise-client-urls":["http://127.0.0.1:2379"],"listen-client-urls":["http://0.0.0.0:2379"],"listen-metrics-urls":[]} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.122Z","caller":"embed/etcd.go:584","msg":"serving peer traffic","address":"127.0.0.1:2380"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.122Z","caller":"embed/etcd.go:556","msg":"cmux::serve","address":"127.0.0.1:2380"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d is starting a new election at term 1"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became pre-candidate at term 1"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d received MsgPreVoteResp from 8e9e05c52164694d at term 1"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became candidate at term 2"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d received MsgVoteResp from 8e9e05c52164694d at term 2"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"8e9e05c52164694d became leader at term 2"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","logger":"raft","caller":"etcdserver/zap_raft.go:77","msg":"raft.node: 8e9e05c52164694d elected leader 8e9e05c52164694d at term 2"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","caller":"etcdserver/server.go:2563","msg":"setting up initial cluster version using v2 API","cluster-version":"3.5"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.219Z","caller":"membership/cluster.go:584","msg":"set initial cluster version","cluster-id":"cdf818194e3a8c32","local-member-id":"8e9e05c52164694d","cluster-version":"3.5"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.220Z","caller":"api/capability.go:75","msg":"enabled capabilities for version","cluster-version":"3.5"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.220Z","caller":"embed/serve.go:100","msg":"ready to serve client requests"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.220Z","caller":"etcdserver/server.go:2587","msg":"cluster version is updated","cluster-version":"3.5"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.220Z","caller":"etcdserver/server.go:2054","msg":"published local member to cluster through raft","local-member-id":"8e9e05c52164694d","local-member-attributes":"{Name:default ClientURLs:[http://127.0.0.1:2379]}","request-path":"/0/members/8e9e05c52164694d/attributes","cluster-id":"cdf818194e3a8c32","publish-timeout":"7s"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.220Z","caller":"etcdmain/main.go:44","msg":"notifying init daemon"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.220Z","caller":"etcdmain/main.go:50","msg":"successfully notified init daemon"} 2024-09-04 22:39:29 {"level":"info","ts":"2024-09-05T02:39:29.222Z","caller":"embed/serve.go:146","msg":"serving client traffic insecurely; this is strongly discouraged!","address":"[::]:2379"}

Anything else?

No response

yanliang567 commented 1 month ago

@Karlusk could you please share more details about how you reinstall milvus? did you change any configurations for milvus deployment? also please attach the completed milvus log files for investigation. /assign @Karlusk /unassign

Karlusk commented 1 month ago

I didnt reinstall milvus. I was reinstalling a different WSL that had nothing to do with docker where milvus is. I didnt want to touch my milvus.

The log you see is what I have of the event.

Karlusk commented 1 month ago

/assign @yanliang567

xiaofan-luan commented 1 month ago

I'm not very familiar with WSL but personally the guess is etcd brought up with wrong configuration of the disk.

if you are running on standalone mode, I guess the data disk milvus used is /var/lib/milvus and etcd use a seperate dir default.etcd I guess (unless you modify it)

you will need to find the right etcd data dir

Karlusk commented 1 month ago

It's fine, the data is gone. I didnt modify anything with milvus or docker, I was installing a completely different WSL.

Milvus somehow thought that it needed to "initial-cluster-state":"new" kvstore was "restored" with a "current-rev":1 and then proceeded to compact the database back to 0 size.

There is no other directory, it chewed up the one I was using.

I just dont know what caused this behavior. I guess I need to just focus on a better backup strategy.

Thanks

/assign @yanliang567

yanliang567 commented 1 month ago

@Karlusk maybe you could try the backup & restore tool: https://github.com/zilliztech/milvus-backup

Karlusk commented 1 month ago

I will try it out. Thanks for your hard work! I really do like Milvus