milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: High CPU usage on idle #24812

Closed furiousteabag closed 1 year ago

furiousteabag commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: v2.2.9
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): rocksmq
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus v2.2.9
- OS(Ubuntu or CentOS): Debian
- CPU/Memory: Intel Xeon E5-2686 v4 (8) @ 2.299GHz / 32GB
- GPU: N/A

Current Behavior

Hi Milvus team!

After I upgraded Milvus from v2.2.6 to v2.2.9, I noticed significant performance issues: searches through collections became ~10-15 times slower. Almost all the time Milvus heavily utilizes the CPU (I have an 8-core machine, and milvus run standalone takes 400%-500% of it) even when there are no requests (no insertion, no search, etc.). Sometimes it stops utilizing the CPU that much, and in those moments performance becomes as fast as it was previously, but this happens rarely (about 1 minute of good performance for every 10 minutes of bad performance due to the loaded CPU).

I started digging and found a lot of these lines in Milvus logs:

milvus-standalone  | [2023/06/11 10:12:20.758 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=1190] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=1190]
milvus-standalone  | [2023/06/11 10:12:20.775 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=1238] ["alloc elapse"=0] ["write elapse"=17] ["updatePage elapse"=0] ["produce total elapse"=1255]
milvus-standalone  | [2023/06/11 10:12:20.776 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=1255] ["alloc elapse"=0] ["write elapse"=1] ["updatePage elapse"=0] ["produce total elapse"=1256]
milvus-standalone  | [2023/06/11 10:12:20.776 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=1254] ["alloc elapse"=0] ["write elapse"=1] ["updatePage elapse"=0] ["produce total elapse"=1255]

So I tried changing the MQ to Pulsar. I commented out the rocksmq-related lines in milvus.yaml and added Pulsar to docker-compose.yml:

  pulsar:
    container_name: milvus-pulsar
    image: apachepulsar/pulsar:2.8.2
    command: bin/pulsar standalone --no-functions-worker --no-stream-storage
    ports:
      - "6650:6650"
      - "8080:8080"

I had no success: even though everything looked fine in the Milvus logs, when I tried to connect via the Python SDK it said that Milvus was not ready yet.

Seeking your help! Maybe changing the broker will help, but it would be better to understand what is wrong with the default rocksmq.

P.S. I can't roll back, because when I try I get an error on Milvus startup:

milvus-standalone | [2023/06/10 17:21:09.906 +00:00] [INFO] [server/rocksmq_impl.go:154] ["Start rocksmq "] ["max proc"=8] [parallism=1] ["lru cache"=2020363714]
milvus-standalone | panic: Not implemented: Unsupported compression method for this build: ZSTD
milvus-standalone |
milvus-standalone | goroutine 1 [running]:
milvus-standalone | github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc0006efe58, 0x1, {0x0, 0x0})
milvus-standalone | /go/src/github.com/milvus-io/milvus/cmd/roles/roles.go:214 +0xd95
milvus-standalone | github.com/milvus-io/milvus/cmd/milvus.(*run).execute(0xc0005bf9e0, {0xc000124060?, 0x3, 0x3}, 0xc00013e7e0)
milvus-standalone | /go/src/github.com/milvus-io/milvus/cmd/milvus/run.go:112 +0x66e
milvus-standalone | github.com/milvus-io/milvus/cmd/milvus.RunMilvus({0xc000124060?, 0x3, 0x3})
milvus-standalone | /go/src/github.com/milvus-io/milvus/cmd/milvus/milvus.go:60 +0x21e
milvus-standalone | main.main()
milvus-standalone | /go/src/github.com/milvus-io/milvus/cmd/main.go:26 +0x2e

My milvus.yaml is almost the same as the one you provide at milvus-io/milvus/v2.2.9/configs/milvus.yaml, with the only difference being the authorizationEnabled option.
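
If I remember the config layout correctly, that change amounts to roughly this (path from memory, so treat it as approximate):

  common:
    security:
      authorizationEnabled: true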

Here is my docker-compose.yml:

  etcd:
    container_name: milvus-etcd
    image: quay.io/coreos/etcd:v3.5.5
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/etcd:/etcd
      - /etc/milvus.yaml:/milvus/configs/milvus.yaml
    command: etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    restart: always

  minio:
    container_name: milvus-minio
    image: minio/minio:RELEASE.2023-03-20T20-16-18Z
    environment:
      MINIO_ACCESS_KEY: minioadmin
      MINIO_SECRET_KEY: minioadmin
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/minio:/minio_data
      - /etc/milvus.yaml:/milvus/configs/milvus.yaml
    command: minio server /minio_data
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 30s
      timeout: 20s
      retries: 3
    restart: always

  standalone:
    container_name: milvus-standalone
    image: milvusdb/milvus:v2.2.9
    command: ["milvus", "run", "standalone"]
    environment:
      ETCD_ENDPOINTS: etcd:2379
      MINIO_ADDRESS: minio:9000
    volumes:
      - ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus:/var/lib/milvus
      - /etc/milvus.yaml:/milvus/configs/milvus.yaml
    ports:
      - "19530:19530"
      - "9091:9091"
    depends_on:
      - "etcd"
      - "minio"
    logging:
      options:
        max-size: "100m"
    restart: always

  attu:
    container_name: milvus-attu
    image: zilliz/attu:latest
    depends_on:
      - "standalone"
    environment:
      MILVUS_URL: standalone:19530
    ports:
      - "8082:3000"
    restart: always

networks:
  default:
    name: milvus

Expected Behavior

I expect low CPU utilization on idle and fast responses.

Steps To Reproduce

No response

Milvus Log

log.txt

Anything else?

This one is similar https://github.com/milvus-io/milvus/issues/22571

furiousteabag commented 1 year ago

@xiaofan-luan saw your answers on similar issue, might be related

xiaofan-luan commented 1 year ago

/assign @aoiasd Is there a way to enable ZSTD on an old Milvus version?

xiaofan-luan commented 1 year ago

@SmirnovAlexander can you explain a little bit about your scenario?

  1. How many collections and partitions do you have?
  2. What kind of disk are you running on? Is it an SSD or a normal disk?
  3. Is there any chance you could run perf and find out which function takes the CPU? It is hard for us to reproduce this in our environment; the only thing we found is that RocksDB may take CPU to do compaction.
xiaofan-luan commented 1 year ago

@SmirnovAlexander In the meantime, did you have a chance to check your disk utilization? It seems that IO is very slow and each MQ operation takes more than 100 ms.

xiaofan-luan commented 1 year ago

It seems you have many collections and partitions; that might be the major issue making rocksmq slow. But why this did not happen on 2.2.6 remains to be investigated.

furiousteabag commented 1 year ago

@SmirnovAlexander In the meantime, did you have a chance to check your disk utilization? It seems that IO is very slow and each MQ operation takes more than 100 ms.

iotop

Total DISK READ:         7.15 K/s | Total DISK WRITE:       526.43 K/s                                                                                                                                                                    [0/0]
Current DISK READ:       7.15 K/s | Current DISK WRITE:     237.47 K/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
 245666 be/4 root        4.29 K/s    4.29 K/s  0.00 %  1.61 % minio server /minio_data
    190 be/3 root        0.00 B/s   22.89 K/s  0.00 %  0.89 % [jbd2/xvda1-8]
 247076 be/4 root     1464.86 B/s 1464.86 B/s  0.00 %  0.60 % minio server /minio_data
 245663 be/4 root     1464.86 B/s 1464.86 B/s  0.00 %  0.55 % minio server /minio_data
 246515 be/4 root        0.00 B/s    4.29 K/s  0.00 %  0.52 % milvus run standalone
 246399 be/4 root        0.00 B/s 1464.86 B/s  0.00 %  0.14 % milvus run standalone
 245614 be/4 root        0.00 B/s 1464.86 B/s  0.00 %  0.09 % etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
 245809 be/4 root        0.00 B/s    7.15 K/s  0.00 %  0.07 % etcd -advertise-client-urls=http://127.0.0.1:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
 246439 be/4 root        0.00 B/s    5.72 K/s  0.00 %  0.02 % milvus run standalone
 258966 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.02 % [kworker/u30:0-flush-202:0]
 246355 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.02 % milvus run standalone
 246388 be/4 root        0.00 B/s    4.29 K/s  0.00 %  0.00 % milvus run standalone
    639 be/4 root        0.00 B/s   55.79 K/s  0.00 %  0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
    654 be/4 root        0.00 B/s   61.51 K/s  0.00 %  0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
   1502 be/4 root        0.00 B/s   30.04 K/s  0.00 %  0.00 % dockerd -H fd:// --containerd=/run/containerd/containerd.sock
 245839 be/4 root        0.00 B/s    4.29 K/s  0.00 %  0.00 % milvus run standalone

iostat

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.46    0.00    2.15    0.20   53.94   28.25

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
xvda             0.00      0.00     0.00   0.00    0.00     0.00    7.92    182.18    35.64  81.82    3.00    23.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.02   5.54
furiousteabag commented 1 year ago
  1. How many collections and partitions do you have?
  2. What kind of disk are you running on? Is it an SSD or a normal disk?
  1. I have 220 collections, with 1 default partition in each collection.
  2. I am running on a 1 TB SSD.
furiousteabag commented 1 year ago
  1. Is there any chance you could run perf and find out which function takes the CPU? It is hard for us to reproduce this in our environment; the only thing we found is that RocksDB may take CPU to do compaction.
  1. I ran sudo perf record -p 245817 -g, where 245817 is the PID of the milvus run standalone process. Then I ran sudo perf report:
    Samples: 153K of event 'cpu-clock:pppH', Event count (approx.): 38449750000
    Children      Self  Command       Shared Object        Symbol
    +   96.09%     0.00%  milvus        milvus               [.] runtime.goexit.abi0                                                                                                                                                              ◆
    +   44.02%     0.00%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*consumer).Chan.func1.1                                                                                           ▒
    +   44.01%     0.07%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*client).consume                                                                                                  ▒
    +   42.42%     0.35%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*client).deliver                                                                                                  ▒
    +   41.26%     0.44%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/server.(*rocksmq).Consume                                                                                                 ▒
    +   23.68%     0.00%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).Start.func1                                                                                                       ▒
    +   23.62%     0.48%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/util/flowgraph.(*nodeCtx).work                                                                                                              ▒
    +   22.54%     0.15%  milvus        milvus               [.] runtime.asmcgocall.abi0                                                                                                                                                          ▒
    +   19.17%     8.92%  milvus        [vdso]               [.] __vdso_clock_gettime                                                                                                                                                             ▒
    +   16.86%     0.05%  milvus        milvus               [.] github.com/tecbot/gorocksdb.(*Iterator).Seek                                                                                                                                     ▒
    +   16.81%     0.05%  milvus        milvus               [.] github.com/tecbot/gorocksdb.(*Iterator).Seek.func1                                                                                                                               ▒
    +   16.66%     0.04%  milvus        milvus               [.] github.com/tecbot/gorocksdb._Cfunc_rocksdb_iter_seek.abi0                                                                                                                        ▒
    +   16.25%     0.02%  milvus        milvus               [.] rocksdb_iter_seek                                                                                                                                                                ▒
    +   16.16%     0.29%  milvus        milvus               [.] rocksdb::DBIter::Seek                                                                                                                                                            ▒
    +   15.27%     0.31%  milvus        milvus               [.] rocksdb::MergingIterator::Seek                                                                                                                                                   ▒
    +   14.10%     0.00%  milvus        [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe                                                                                                                                                   ▒
    +   14.02%     0.12%  milvus        [kernel.kallsyms]    [k] do_syscall_64                                                                                                                                                                    ▒
    +   11.76%     0.05%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/mq/msgstream/mqwrapper/rmq.(*rmqProducer).Send                                                                                              ▒
    +   11.07%     0.26%  milvus        milvus               [.] rocksdb::BlockBasedTableIterator::SeekImpl                                                                                                                                       ▒
    +   10.33%     0.09%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/client.(*producer).Send                                                                                                   ▒
    +   10.27%     0.11%  milvus        milvus               [.] runtime.nanotime1.abi0                                                                                                                                                           ▒
    +   10.15%     0.20%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/mq/mqimpl/rocksmq/server.(*rocksmq).Produce                                                                                                 ▒
    +    9.64%     0.03%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/util/retry.Do                                                                                                                               ▒
    +    9.35%     0.05%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/datanode.(*ddNode).Operate                                                                                                                  ▒
    +    9.09%     0.01%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/datanode.(*ddNode).Operate.func2                                                                                                            ▒
    +    9.08%     0.02%  milvus        milvus               [.] github.com/milvus-io/milvus/internal/datanode.(*ddNode).forwardDeleteMsg
furiousteabag commented 1 year ago

Everything was measured while not querying Milvus and while CPU usage was high, like this:

                                                                                                                                                                                                                                          [0/1]
    0[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                       72.5%]     4[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                           68.8%]
    1[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                            68.0%]     5[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||         85.6%]
    2[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                             68.1%]     6[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                         71.4%]
    3[|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                           69.2%]     7[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                  62.6%]
  Mem[||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                                    10.2G/31.4G]   Tasks: 120, 1091 thr; 8 running
  Swp[                                                                                                        0K/0K]   Load average: 22.70 17.65 15.84
                                                                                                                       Uptime: 18:34:00

    PID USER      PRI  NI  VIRT   RES   SHR S CPU%▽MEM%   TIME+  Command
 245817 root       20   0 15.2G 5580M  108M S 348. 17.4  7h07:36 milvus run standalone
xiaofan-luan commented 1 year ago

Seems to be an issue related to rocksmq; might be related to the rmq client.

xiaofan-luan commented 1 year ago

We've seen it take a very long time to get the time tick channel lock, which is a little weird.

furiousteabag commented 1 year ago

If any other logs or measurements would help with debugging, let me know!

xiaofan-luan commented 1 year ago

If any other logs or measurements would help with debugging, let me know!

It seems that under your environment the CPU steal is very high. Is that normal?

Also, in "+ 19.17% 8.92% milvus [vdso] [.] __vdso_clock_gettime", __vdso_clock_gettime seems to take much more CPU than expected.

furiousteabag commented 1 year ago

It seems that under your environment the CPU steal is very high. Is that normal?

When I stop standalone, CPU steal becomes nearly zero:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.50    0.00    0.12   98.38

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
xvda             0.00      0.00     0.00   0.00    0.00     0.00   38.00    184.00     4.00   9.52    0.79     4.84    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.03   0.80
xiaofan-luan commented 1 year ago

It seems that under your environment the CPU steal is very high. Is that normal?

When I stop standalone, CPU steal becomes nearly zero:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.00    0.00    0.50    0.00    0.12   98.38

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
xvda             0.00      0.00     0.00   0.00    0.00     0.00   38.00    184.00     4.00   9.52    0.79     4.84    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.03   0.80

Could you check how many goroutines and threads there are when you start?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          15.46    0.00    2.15    0.20   53.94   28.25

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz     f/s f_await  aqu-sz  %util
xvda             0.00      0.00     0.00   0.00    0.00     0.00    7.92    182.18    35.64  81.82    3.00    23.00    0.00      0.00     0.00   0.00    0.00     0.00    0.00    0.00    0.02   5.54

This is what I get from GPT4: The output you provided appears to come from iostat, a command line tool used in Unix-based systems to monitor system input/output device loading by observing the time the devices are active in relation to their average transfer rates.

The high %steal value in the avg-cpu section is noteworthy. In a virtualized computing environment, "steal time" refers to the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.

This high value (53.94% in your case) indicates that your VM (Virtual Machine) is waiting for the physical CPU because it's being used significantly by other VMs on the host machine. This could lead to sluggish or uneven performance on your VM.

xiaofan-luan commented 1 year ago

Is it possible to migrate your Milvus docker to another physical node and give it a try?

From the code and the very short log you offered, the only uncommon behaviour is that Milvus spends a lot of time on getting locks.

xiaofan-luan commented 1 year ago

milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3090] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=100] ["produce total elapse"=3190]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3190] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3190]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3190] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3190]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.585 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3189] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3189]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.586 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3188] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=3188]
milvus-standalone | [2023/06/11 10:27:32.699 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3300] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=1] ["produce total elapse"=3301]
milvus-standalone | [2023/06/11 10:27:32.828 +00:00] [WARN] [server/rocksmq_impl.go:638] ["rocksmq produce too slowly"] [topic=by-dev-datacoord-timetick-channel] ["get lock elapse"=3388] ["alloc elapse"=0] ["write elapse"=18] ["updatePage elapse"=24] ["produce total elapse"=3430]

xiaofan-luan commented 1 year ago

But the tricky part is that none of the IO stages ("write elapse", "updatePage elapse") are slow.

xiaofan-luan commented 1 year ago

The relevant code is the Produce function in rocksmq_impl.go.

xiaofan-luan commented 1 year ago

Another problem is that __vdso_clock_gettime takes too much CPU, which is weird. And from GPT4, this might also be a VM issue: "The vdso_clock_gettime is a function provided by the Linux kernel that applications can use to get the current time, without the cost of a system call. It is faster because it doesn't require the context switch that a traditional system call does.

In normal conditions, this function shouldn't take a significant amount of CPU time. If it does, it may be a sign that either an application is calling this function excessively, or there may be some other underlying issue. Here are a few things you could consider:

Excessive calls: An application might be calling __vdso_clock_gettime excessively. This might occur in applications that do a lot of logging with timestamps or that calculate time differences frequently. You might want to look at your applications and see if they're making an excessive number of calls to get the current time. Profiling the applications may help in identifying this.

High Precision Requirements: Some applications might require high precision timers which can lead to increased CPU utilization. If high precision is not required, you might want to modify the application to reduce the precision requirement.

CPU contention: If your application is running in a virtualized environment, CPU contention with other VMs may cause the high CPU usage. You can discuss with your provider or consider migrating to a different VM or even a dedicated server.

Hardware or VM issues: If none of the above points help, it might be worth checking for any hardware issues or issues with the VM (if you are in a virtualized environment), though this is less likely.

Remember, high CPU usage is not always a problem. It might just mean that the CPU is being utilized efficiently. However, if the system is lagging, you notice significant slowdowns, or other tasks are being starved of CPU time, then it's worth investigating. "

xiaofan-luan commented 1 year ago

There is something we could improve, which is removing the data timetick channel, because I've seen many duplicated reports on the timetick channel; but that doesn't seem to be the root cause to me.

xiaofan-luan commented 1 year ago

/assign @yanliang567

please also try to reproduce the scenario in house: 1) with 256 shards, 2) with 100+ collections writing some data, 3) restarting the standalone server.

Keep this in a loop and see whether we can make CPU usage go high.

furiousteabag commented 1 year ago

Could you check how many goroutines and threads there are when you start?

When I just started standalone:

$ ps -eLf | grep "milvus run standalone" | wc -l
24

Then it steadily grows to 339.

After I connected via pymilvus:

$ ps -eLf | grep "milvus run standalone" | wc -l
445

I haven't found information on how to check the number of goroutines of a running process.

furiousteabag commented 1 year ago

Is it possible to migrate your Milvus docker to another physical node and give it a try?

It would be kind of hard, and I don't think that is the issue, because it is a clean AWS VM with almost nothing on board except Docker. Also, everything was previously working fine on it.

I also rebooted, but it didn't help.

furiousteabag commented 1 year ago

From the code and the very short log you offered, the only uncommon behaviour is that Milvus spends a lot of time on getting locks.

In the initial message I also attached a 40k-line log file. I can log more if needed!

xiaofan-luan commented 1 year ago

400 seems to be an OK number. Goroutines can be shown with pprof, but this should not be a key issue.
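
If the metrics port (9091, as exposed in your compose file) is reachable, the goroutine count can also be read from the pprof endpoint; a rough sketch (the total shown is just an example):

$ curl -s "http://localhost:9091/debug/pprof/goroutine?debug=1" | head -n 1
goroutine profile: total 1234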

furiousteabag commented 1 year ago

Excessive calls: An application might be calling __vdso_clock_gettime excessively. This might occur in applications that do a lot of logging with timestamps or that calculate time differences frequently. You might want to look at your applications and see if they're making an excessive number of calls to get the current time. Profiling the applications may help in identifying this.

Hell yeah, Milvus has a ton of logs. Might that be the issue?

xiaofan-luan commented 1 year ago

seems to be similar as #23309

xiaofan-luan commented 1 year ago

This is how you can enable pprof: https://github.com/milvus-io/milvus/issues/13978

With pprof you could see more details about: 1) where rocksmq is blocked, 2) who is holding the topic lock and why, 3) which part is consuming most of the CPU on the Go side.

furiousteabag commented 1 year ago

$ wget "http://localhost:9091/debug/pprof/goroutine?debug=2" -O pprof.txt: pprof.txt

$ go tool pprof http://localhost:9091/debug/pprof/heap: pprof.milvus.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz

$ wget -O trace.txt "http://localhost:9091/debug/pprof/trace?seconds=5": trace.txt

$ go tool pprof "http://localhost:9091/debug/pprof/profile?seconds=30": pprof.milvus.samples.cpu.001.pb.gz

$ go tool pprof -http=localhost:9999 pprof.milvus.samples.cpu.001.pb.gz: image

furiousteabag commented 1 year ago

Should I measure something else?

xiaofan-luan commented 1 year ago

Try changing proxy.timeTickInterval from 200 to 1000 and see if it works.

Some observations:

  1. Since you have many collections, timetick consumes a lot of CPU, but I guess it shouldn't be that high.
  2. Changing timeTickInterval should reduce how frequently time ticks are sent, and thus reduce CPU usage (see the snippet below).
  3. Time takes a lot of CPU, which is weird; we need to check what really happens: image
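
For reference, the change goes into milvus.yaml under the proxy section, roughly like this (layout as in the stock 2.2.x config):

  proxy:
    timeTickInterval: 1000 # ms, default is 200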
xiaofan-luan commented 1 year ago

And rocksdb takes the rest of the CPU.

Try increasing rocksmq.lrucacheratio (the rocksdb cache memory ratio, default 0.06) and see if it helps?
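
That would be something like this in milvus.yaml (0.06 is the default; the value to try here is only a guess):

  rocksmq:
    lrucacheratio: 0.1 # rocksdb cache memory ratio, default 0.06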

furiousteabag commented 1 year ago

Try changing proxy.timeTickInterval from 200 to 1000 and see if it works.

I have changed it, and now even a single collection is unable to load; it just hangs forever on .load(). Here is the log: log.txt

Upd: I managed to load collections after deleting /var/lib/milvus/rdb_data and /var/lib/milvus/rdb_data_meta_kv and restarting Milvus.

Upd2: Seems to work well so far! Will observe it further for a couple of days and get back. Also, there are far fewer logs from standalone now.
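
For reference, the reset in the first update was roughly the following (host paths assume the default volume mapping from my docker-compose.yml, where /var/lib/milvus maps to ${DOCKER_VOLUME_DIRECTORY:-.}/volumes/milvus):

$ docker compose stop standalone
$ sudo rm -rf ./volumes/milvus/rdb_data ./volumes/milvus/rdb_data_meta_kv
$ docker compose start standalone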

xiaofan-luan commented 1 year ago

Thanks, the pprof gives us a lot of information, and it's good to know your cluster recovered. We will work on figuring out the reason for the CPU usage. It seems that rocksmq consumption is slow and many goroutines are waiting on data consumption.

yanliang567 commented 1 year ago

/assign @ThreadDao any chance we could reproduce it in house?

xiaofan-luan commented 1 year ago

I think the major reason is that rocksmq holds the topic lock. And because this cluster has too many collections, when the cluster starts from cold, all of the consumers try to pull data from rocksmq at once, which makes it hard for the system to recover; it stays in a loop trying to catch up with new data.

yanliang567 commented 1 year ago

I believe we have reproduced it in house; @aoiasd is working on it.

xiaofan-luan commented 1 year ago

I believe we have reproduced it in house; @aoiasd is working on it.

Good to know!

furiousteabag commented 1 year ago

Try changing proxy.timeTickInterval from 200 to 1000 and see if it works.

Still works great; this might be considered a solution.

darkerin commented 1 year ago

Seems to work well so far! Will observe it further for a couple of days and get back. Also, there are far fewer logs from standalone now.

Hi, can I change this timeTickInterval parameter without deleting the data? @SmirnovAlexander @xiaofan-luan

xiaofan-luan commented 1 year ago

You can do so; this won't change any data. Meanwhile, you can wait for the next Milvus release, where we will fix the issue.

aoiasd commented 1 year ago

We have refined the idle scenario on master (removed the time tick channel) and in 2.3. For version 2.2, I will optimize rocksmq and try to do something helpful. This issue is caused by the fact that rocksmq CPU usage grows with data, and an idle Milvus keeps sending time tick messages through rocksmq, so CPU usage grows too. So sending time tick messages less frequently (i.e. a larger timeTickInterval), or reducing the retention period and size of rocksmq, could help reduce CPU usage.
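
For 2.2.x, the relevant retention knobs should live in the rocksmq section of milvus.yaml, roughly as below (defaults quoted from memory, so double-check against your config):

  rocksmq:
    retentionTimeInMinutes: 4320 # default, 3 days; lowering it keeps less data around
    retentionSizeInMB: 8192 # default size cap of retained messages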

darkerin commented 1 year ago

We have refined the idle scenario on master (removed the time tick channel) and in 2.3. For version 2.2, I will optimize rocksmq and try to do something helpful. This issue is caused by the fact that rocksmq CPU usage grows with data, and an idle Milvus keeps sending time tick messages through rocksmq, so CPU usage grows too. So sending time tick messages less frequently (i.e. a larger timeTickInterval), or reducing the retention period and size of rocksmq, could help reduce CPU usage.

Hi, does this mean that this problem will be improved when I upgrade to 2.2.12?

darkerin commented 1 year ago

The situation has gotten worse in recent days as the amount of data and the number of collections have grown @aoiasd @yanliang567

image
xiaofan-luan commented 1 year ago

The situation has gotten worse in recent days as the amount of data and the number of collections have grown @aoiasd @yanliang567

image

how many collections do you have?

The number of collections does significantly affect CPU usage, especially on standalone, based on our tests. We recommend using fewer than 1000 collections. Also, if we knew more about your use case, maybe we could give more advice.

darkerin commented 1 year ago

The situation has gotten worse in recent days as the amount of data and the number of collections have grown @aoiasd @yanliang567

image

how many collections do you have?

The number of collections does significantly affect CPU usage, especially on standalone, based on our tests. We recommend using fewer than 1000 collections. Also, if we knew more about your use case, maybe we could give more advice.

I have about 500 collections, the amount of data in each collection varies, with a maximum of 5000 entities per collection.

yud11 commented 1 year ago

I encountered the same problem when I had hundreds of collections.

aoiasd commented 1 year ago

2.2.12 may improve this, via https://github.com/milvus-io/milvus/pull/24011.

darkerin commented 1 year ago

2.2.12 may improve this, via #24011.

@xiaofan-luan @aoiasd

When I upgraded from 2.2.11 to 2.2.12, my Milvus standalone can't even start properly; Docker keeps scrolling the following logs:

[2023/08/12 17:48:33.950 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.950 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.950 +00:00] [WARN] [server/rocksmq_impl.go:768] ["rocksmq consume too slowly"] [topic=by-dev-rootcoord-dml_0] ["get lock elapse"=232] ["iterator elapse"=0] ["moveConsumePosTime elapse"=0] ["total consume elapse"=232]
[2023/08/12 17:48:33.951 +00:00] [WARN] [server/rocksmq_impl.go:768] ["rocksmq consume too slowly"] [topic=by-dev-rootcoord-dml_0] ["get lock elapse"=233] ["iterator elapse"=0] ["moveConsumePosTime elapse"=0] ["total consume elapse"=233]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.952 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.954 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.955 +00:00] [WARN] [server/rocksmq_impl.go:768] ["rocksmq consume too slowly"] [topic=by-dev-rootcoord-dml_0] ["get lock elapse"=237] ["iterator elapse"=0] ["moveConsumePosTime elapse"=0] ["total consume elapse"=237]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]
[2023/08/12 17:48:33.956 +00:00] [WARN] [flowgraph/node.go:204] ["Invalid operate message input"] ["input length"=0]

And my cpu usage didn't improve:

image

What should I do now?