Closed furiousteabag closed 1 year ago
@xiaofan-luan I saw your answers on a similar issue; this might be related.
/assign @aoiasd
Is there a way to enable zstd on an old Milvus version?
@SmirnovAlexander can you explain a little bit about your scenario?
@SmirnovAlexander Meanwhile, did you have a chance to check your disk utilization? It seems that IO is very slow and each mq operation takes more than 100ms.
It seems you have many collections and partitions, which might be the major issue making rocksmq slow. But why this did not happen on 2.2.6 remains to be investigated.
@SmirnovAlexander Meanwhile, did you have a chance to check your disk utilization? It seems that IO is very slow and each mq operation takes more than 100ms.
iotop
Total DISK READ: 7.15 K/s | Total DISK WRITE: 526.43 K/s
Current DISK READ: 7.15 K/s | Current DISK WRITE: 237.47 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
245666 be/4 root 4.29 K/s 4.29 K/s 0.00 % 1.61 % minio server /minio_data
190 be/3 root 0.00 B/s 22.89 K/s 0.00 % 0.89 % [jbd2/xvda1-8]
247076 be/4 root 1464.86 B/s 1464.86 B/s 0.00 % 0.60 % minio server /minio_data
245663 be/4 root 1464.86 B/s 1464.86 B/s 0.00 % 0.55 % minio server /minio_data
246515 be/4 root 0.00 B/s 4.29 K/s 0.00 % 0.52 % milvus run standalone
246399 be/4 root 0.00 B/s 1464.86 B/s 0.00 % 0.14 % milvus run standalone
245614 be/4 root 0.00 B/s 1464.86 B/s 0.00 % 0.09 % etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
245809 be/4 root 0.00 B/s 7.15 K/s 0.00 % 0.07 % etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
246439 be/4 root 0.00 B/s 5.72 K/s 0.00 % 0.02 % milvus run standalone
258966 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.02 % [kworker/u30:0-flush-202:0]
246355 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.02 % milvus run standalone
246388 be/4 root 0.00 B/s 4.29 K/s 0.00 % 0.00 % milvus run standalone
639 be/4 root 0.00 B/s 55.79 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
654 be/4 root 0.00 B/s 61.51 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
1502 be/4 root 0.00 B/s 30.04 K/s 0.00 % 0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
245839 be/4 root 0.00 B/s 4.29 K/s 0.00 % 0.00 % milvus run standalone
iostat
avg-cpu: %user %nice %system %iowait %steal %idle
15.46 0.00 2.15 0.20 53.94 28.25
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 7.92 182.18 35.64 81.82 3.00 23.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 5.54
- How many collections and partitions do you have?
- What kind of disk are you running on? Is it an SSD or a normal disk?
- Any possibility you could run perf and find out which function takes the CPU? It is hard for us to reproduce this in our environment; the only thing we have found is that rocksdb may take CPU to do compaction.
sudo perf record -p 245817 -g
where 245817 is the PID of the milvus run standalone process. Then I ran sudo perf report:
Samples: 153K of event 'cpu-clock:pppH', Event count (approx.): 38449750000
Children Self Command Shared Object Symbol
+ 96.09% 0.00% milvus milvus [.] runtime.goexit.abi0
+ 44.02% 0.00% milvus milvus [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*consumer).Chan.func1.1
+ 44.01% 0.07% milvus milvus [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*client).consume
+ 42.42% 0.35% milvus milvus [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*client).deliver
+ 41.26% 0.44% milvus milvus [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/server.(*rocksmq).Consume
+ 23.68% 0.00% milvus milvus [.] github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start.func1
+ 23.62% 0.48% milvus milvus [.] github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work
+ 22.54% 0.15% milvus milvus [.] runtime.asmcgocall.abi0
+ 19.17% 8.92% milvus [vdso] [.] __vdso_clock_gettime
+ 16.86% 0.05% milvus milvus [.] github.com/tecbot/gorocksdb.(*Iterator).Seek
+ 16.81% 0.05% milvus milvus [.] github.com/tecbot/gorocksdb.(*Iterator).Seek.func1
+ 16.66% 0.04% milvus milvus [.] github.com/tecbot/gorocksdb._Cfunc_rocksdb_iter_seek.abi0
+ 16.25% 0.02% milvus milvus [.] rocksdb_iter_seek
+ 16.16% 0.29% milvus milvus [.] rocksdb::DBIter::Seek
+ 15.27% 0.31% milvus milvus [.] rocksdb::MergingIterator::Seek
+ 14.10% 0.00% milvus [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
+ 14.02% 0.12% milvus [kernel.kallsyms] [k] do_syscall_64
+ 11.76% 0.05% milvus milvus [.] github.com/milvus-io/milvus/internal/mq/msgstream/mqwrapper/rmq.(*rmqProducer).Send
+ 11.07% 0.26% milvus milvus [.] rocksdb::BlockBasedTableIterator::SeekImpl
+ 10.33% 0.09% milvus milvus [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*producer).Send
+ 10.27% 0.11% milvus milvus [.] runtime.nanotime1.abi0
+ 10.15% 0.20% milvus milvus [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/server.(*rocksmq).Produce
+ 9.64% 0.03% milvus milvus [.] github.com/milvus-io/milvus/internal/util/retry.Do
+ 9.35% 0.05% milvus milvus [.] github.com/milvus-io/milvus/internal/datanode.(*ddNode).Operate
+ 9.09% 0.01% milvus milvus [.] github.com/milvus-io/milvus/internal/datanode.(*ddNode).Operate.func2
+ 9.08% 0.02% milvus milvus [.] github.com/milvus-io/milvus/internal/datanode.(*ddNode).forwardDeleteMsg
Everything is measured while not querying Milvus and with high CPU usage like this:
0[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 72.5%] 4[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 68.8%]
1[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 68.0%] 5[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 85.6%]
2[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 68.1%] 6[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 71.4%]
3[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 69.2%] 7[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 62.6%]
Mem[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| 10.2G/31.4G] Tasks: 120, 1091 thr; 8 running
Swp[ 0K/0K] Load average: 22.70 17.65 15.84
Uptime: 18:34:00
PID USER PRI NI VIRT RES SHR S CPU%▽MEM% TIME+ Command
245817 root 20 0 15.2G 5580M 108M S 348. 17.4 7h07:36 milvus run standalone
Seems to be an issue related to rocksmq; might be related to the rmq client.
We've seen it take a very long time to get the time tick channel lock, which is a little weird.
If some other logs/measurements could help to debug let me know!
It seems that under your environment the CPU steal is very high? Is that normal?
And in the perf line "+ 19.17% 8.92% milvus [vdso] [.] __vdso_clock_gettime", __vdso_clock_gettime seems to take much more CPU than expected.
It seems that under your environment the CPU steal is very high? Is that normal?
When I stop standalone, CPU steal becomes nearly zero:
avg-cpu: %user %nice %system %iowait %steal %idle
1.00 0.00 0.50 0.00 0.12 98.38
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 38.00 184.00 4.00 9.52 0.79 4.84 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 0.80
Could you check how many goroutines and threads there are when you start?
avg-cpu: %user %nice %system %iowait %steal %idle
15.46 0.00 2.15 0.20 53.94 28.25
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
xvda 0.00 0.00 0.00 0.00 0.00 0.00 7.92 182.18 35.64 81.82 3.00 23.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 5.54
This is what I get from GPT4: The output you provided appears to come from iostat, a command line tool used in Unix-based systems to monitor system input/output device loading by observing the time the devices are active in relation to their average transfer rates.
The high %steal value in the avg-cpu section is noteworthy. In a virtualized computing environment, "steal time" refers to the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.
This high value (53.94% in your case) indicates that your VM (Virtual Machine) is waiting for the physical CPU because it's being used significantly by other VMs on the host machine. This could lead to sluggish or uneven performance on your VM.
Is there any possibility to migrate your Milvus Docker deployment to another physical node and give it a try?
From the code and the very short log you offered, the only uncommon behaviour is that Milvus spent a lot of time on getting locks:
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3090] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=100] ["produce total elapse"=3190]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3190] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3190]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3190] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3190]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.699 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3300] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=1] ["produce total elapse"=3301]
milvus-standalone | [2023/06/11 10:27:32.828 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3388] ["alloc elapse"=0] ["write elapse"=18] ["updatePage elapse"=24] ["produce total elapse"=3430]
But the tricky part is that none of the IO stages ("write elapse", "updatePage elapse") are slow.
The code is in rocksmq_impl.go, in Produce.
Another problem is that __vdso_clock_gettime takes too much CPU, which is weird. And from GPT-4, this might also be a VM issue: "The vdso_clock_gettime is a function provided by the Linux kernel that applications can use to get the current time, without the cost of a system call. It is faster because it doesn't require the context switch that a traditional system call does.
In normal conditions, this function shouldn't take a significant amount of CPU time. If it does, it may be a sign that either an application is calling this function excessively, or there may be some other underlying issue. Here are a few things you could consider:
Excessive calls: An application might be calling __vdso_clock_gettime excessively. This might occur in applications that do a lot of logging with timestamps or that calculate time differences frequently. You might want to look at your applications and see if they're making an excessive number of calls to get the current time. Profiling the applications may help in identifying this.
High Precision Requirements: Some applications might require high precision timers which can lead to increased CPU utilization. If high precision is not required, you might want to modify the application to reduce the precision requirement.
CPU contention: If your application is running in a virtualized environment, CPU contention with other VMs may cause the high CPU usage. You can discuss with your provider or consider migrating to a different VM or even a dedicated server.
Hardware or VM issues: If none of the above points help, it might be worth checking for any hardware issues or issues with the VM (if you are in a virtualized environment), though this is less likely.
Remember, high CPU usage is not always a problem. It might just mean that the CPU is being utilized efficiently. However, if the system is lagging, you notice significant slowdowns, or other tasks are being starved of CPU time, then it's worth investigating. "
There is something we could improve, which is removing the data timetick channel, because I've seen many duplicated reports on the timetick channel; but that doesn't seem to be the root cause to me.
/assign @yanliang567
Please also try to reproduce the scenario in house: 1) with 256 shards, 2) with 100+ collections writing some data, 3) restart the standalone server.
Keep this in a loop and see whether we can hit the high CPU usage.
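For illustration only, an in-house repro along those lines might look roughly like this with pymilvus; the collection count, shard number, and data volumes below are placeholders, not the exact parameters from this thread:
from pymilvus import connections, utility, Collection, CollectionSchema, FieldSchema, DataType
import random

connections.connect(host="localhost", port="19530")

# simple schema: auto-id primary key + a 128-dim float vector
schema = CollectionSchema([
    FieldSchema("pk", DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema("vec", DataType.FLOAT_VECTOR, dim=128),
])

# create 100+ collections and write some data into each
for i in range(120):
    name = f"repro_{i}"
    coll = Collection(name, schema, shards_num=2) if not utility.has_collection(name) else Collection(name)
    coll.insert([[[random.random() for _ in range(128)] for _ in range(1000)]])
    coll.flush()

# then restart the standalone server and watch whether CPU usage climbs while the system is idle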
Could you check how many goroutines and threads there are when you start?
When I just started standalone:
$ ps -eLf | grep "milvus run standalone" | wc -l
24
Then it steadily grows to 339.
After connecting via pymilvus:
$ ps -eLf | grep "milvus run standalone" | wc -l
445
I haven't found info about how to check the number of goroutines of a running process.
Is it a possibility to migrate your milvus docker to an another physical node and give a try?
It would be kind of hard, and I don't think that is the issue, because it is a clean AWS VM with almost nothing on board except Docker. Also, previously everything was working fine on it.
I also rebooted, but it didn't help.
From the code and the very short log you offered, the only uncommon behaviour is that Milvus spent a lot of time on getting locks
In the initial message, I also attached a 40k-line log file. I can log more if needed!
400 seems to be an OK number. Goroutines can be shown with pprof, but that should not be a key issue.
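For reference, a quick way to read the goroutine count from the pprof endpoint, assuming the standalone exposes it on port 9091 as in the wget commands further down in this thread:
$ curl -s "http://localhost:9091/debug/pprof/goroutine?debug=1" | head -n 1
goroutine profile: total <N>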
Excessive calls: An application might be calling __vdso_clock_gettime excessively. This might occur in applications that do a lot of logging with timestamps or that calculate time differences frequently. You might want to look at your applications and see if they're making an excessive number of calls to get the current time. Profiling the applications may help in identifying this.
Hell yeah, milvus has a ton of logs. Might it be an issue?
Seems to be similar to #23309.
This is how you can enable pprof: https://github.com/milvus-io/milvus/issues/13978
With pprof you could see more details about: 1) where rocksmq is blocked, 2) who is holding the topic lock and why, 3) which part is consuming most of the CPU on the Go side.
$ wget "http://localhost:9091/debug/pprof/goroutine?debug=2" -O pprof.txt
: pprof.txt
$ go tool pprof http://localhost:9091/debug/pprof/heap
: pprof.milvus.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
$ wget -O trace.txt "http://localhost:9091/debug/pprof/trace?seconds=5"
: trace.txt
$ go tool pprof "http://localhost:9091/debug/pprof/profile?seconds=30"
: pprof.milvus.samples.cpu.001.pb.gz
$ go tool pprof -http=localhost:9999 pprof.milvus.samples.cpu.001.pb.gz
:
Should I measure something else?
Try to change proxy.timeTickInterval from 200 to 1000 and see if it works.
Some observations (screenshot omitted):
... and rocksdb takes the rest of the CPU.
Try to increase rocksmq.lrucacheratio (default 0.06, the rocksdb cache memory ratio in milvus.yaml) and see if it helps?
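For reference, a minimal sketch of how these two tweaks could look in milvus.yaml, assuming the 2.2.x key layout (the values are examples only; verify keys and defaults against your own config file):
proxy:
  timeTickInterval: 1000  # ms; the stock config uses 200
rocksmq:
  lrucacheratio: 0.1      # rocksdb cache memory ratio; the stock config uses 0.06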
Try to change proxy.timeTickInterval from 200 to 1000 and see if it works.
I have changed it, and now even a single collection is unable to load. It just hangs forever on .load().
Here is the log: log.txt
Upd: I have loaded the collections after deleting /var/lib/milvus/rdb_data and /var/lib/milvus/rdb_data_meta_kv and restarting Milvus.
Upd2: Seems to work well so far! Will observe it further for a couple of days and get back. Also, there are way fewer logs from standalone now.
Thanks, the pprof gives us a lot of information. And good to know your cluster recovered. We will be working on figuring out the reason for the CPU usage. It seems that rocksmq consumption is slow and many goroutines are waiting on data consumption.
/assign @ThreadDao
Any chance we could reproduce it in house?
I think the major reason is that rocksmq holds the topic lock. And because this cluster has too many collections, when the cluster starts cold, all of the consumers try to pull data from rocksmq at once, which makes the system hard to recover; it stays in a loop trying to catch up with new data.
I believe we have reproduced it in house; @aoiasd is working on it.
Good to know!
Try to change proxy.timeTickInterval from 200 to 1000 and see if it works.
Still works great; might consider this as a solution.
Seems to work well so far! Will observe it further for a couple of days and get back. Also, there are way fewer logs from standalone now.
Hi, can I change this timeTickInterval parameter without deleting the data? @SmirnovAlexander @xiaofan-luan
You can do so; this won't change any data. Meanwhile, you can wait for the next Milvus release, in which we will fix the issue.
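For example, if the standalone runs via docker-compose with a locally mounted milvus.yaml (as the issue author describes), the change can be applied roughly like this; the service name "standalone" and the mount are assumptions based on the official compose file:
$ # edit the mounted milvus.yaml and set proxy.timeTickInterval to 1000
$ docker compose restart standalone   # data under /var/lib/milvus is left untouched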
We have refined the idle scenario on master (removed the time tick channel) and in 2.3. For version 2.2, I will optimize rocksmq and try to do something helpful. This issue is caused by rocksmq CPU usage increasing with the amount of data: an idle Milvus still sends time tick messages through rocksmq, so CPU usage grows. Therefore, reducing the interval of time tick messages, or reducing the interval and size of rocksmq retention, can help reduce CPU usage.
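As a rough sketch, the rocksmq retention settings referred to here live under the rocksmq section of a 2.2.x milvus.yaml; the key names below come from that config family and the values are purely illustrative, so verify them against your version:
rocksmq:
  retentionTimeInMinutes: 4320  # time-based retention of consumed messages
  retentionSizeInMB: 8192       # size-based retention cap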
Hi, does this mean that this problem will be improved when I upgrade to 2.2.12?
The situation has gotten worse in recent days as the amount of data and the number of collections have grown @aoiasd @yanliang567
how many collections do you have?
Collection numbers do significantly affect CPU usage, especially on standalone deployments, from our tests. We recommend using < 1000 collections. Also, if we know more about your use case, maybe we can give more advice.
I have about 500 collections; the amount of data in each collection varies, with a maximum of 5,000 entities per collection.
I encountered the same problem when I had hundreds of collections.
2.2.12 may have improved this, via https://github.com/milvus-io/milvus/pull/24011 (#24011).
@xiaofan-luan @aoiasd
When I upgraded from 2.2.11 to 2.2.12, my Milvus standalone couldn't even start properly; Docker keeps scrolling the following logs:
[2023/08/12 17:48:33.950 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.950 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.950 +00:00] [WARN] [server/rocksmq_impl.go:768] ["rocksmq consume too slowly"] [topic=by-dev-rootcoord-dml_0] ["get lock elapse"=232] ["iterator elapse"=0] ["moveConsumePosTime elapse"=0] ["total consume elapse"=232]
[2023/08/12 17:48:33.951 +00:00] [WARN] [server/rocksmq_impl.go:768] ["rocksmq consume too slowly"] [topic=by-dev-rootcoord-dml_0] ["get lock elapse"=233] ["iterator elapse"=0] ["moveConsumePosTime elapse"=0] ["total consume elapse"=233]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.955 +00:00] [WARN] [server/rocksmq_impl.go:768] ["rocksmq consume too slowly"] [topic=by-dev-rootcoord-dml_0] ["get lock elapse"=237] ["iterator elapse"=0] ["moveConsumePosTime elapse"=0] ["total consume elapse"=237]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
And my cpu usage didn't improve:
What should I do now?
Is there an existing issue for this?
Environment
Current Behavior
Hi Milvus team!
After I upgraded Milvus from v2.2.6 to v2.2.9, I noticed significant performance issues. Search through collections became ~10-15 times slower. I noticed that at almost every moment Milvus is heavily utilizing the CPU (I have an 8-core machine, and milvus run standalone takes 400%-500% of it) even when there are no requests (no insertion, no search, etc.). Sometimes it stops utilizing the CPU that much, and in those moments performance becomes as fast as it was previously, but this happens rarely (roughly 1 minute of good performance for every 10 minutes of bad performance because of the loaded CPU). I started digging and found a lot of these lines in the Milvus logs:
So I've tried changing the MQ to Pulsar. I commented out the rocksmq-related rows in milvus.yaml and added Pulsar to docker-compose.yml. I had no success: even though everything looked fine in the Milvus logs, when I tried to connect via the Python SDK it said that Milvus was not ready yet.
Seeking your help! Maybe changing the broker will help, but it is better to understand what is wrong with the default rocksmq.
P.S. I can't roll back, because when I try I get an error on Milvus startup:
My milvus.yml is almost the same as the one provided by you (milvus-io/milvus/v2.2.9/configs/milvus.yaml), with the only difference being the authorizationEnabled option. Here is my docker-compose.yml:
Expected Behavior
I expect low CPU utilization on idle and fast responses.
Steps To Reproduce
No response
Milvus Log
log.txt
Anything else?
This one is similar: https://github.com/milvus-io/milvus/issues/22571