milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [db] Standalone oomkilled after running all db test cases several times #24168

Closed ThreadDao closed 1 year ago

ThreadDao commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.0-20230516-d7bc8afe
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka):  rocksmq   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.2.9.dev8
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Deploy standalone with the following resource config:

    Limits:
      cpu:     2
      memory:  6Gi
    Requests:
      cpu:      10m
      memory:   128Mi
  2. Debug and run all the db test cases.

    cd /tests/python_client
    pip install -r requirements.txt 
    pytest testcases/test_database.py
  3. Run a utility case:

    def test_revoke_user_after_delete_user(self, host, port):
        """
        target: test revoke user with deleted user
        method: 1. create user -> create a role -> role add user
                2. delete user
                3. revoke the deleted user
        expected: revoke successfully
        """
        # root connect
        self.connection_wrap.connect(host=host, port=port, user=ct.default_user,
                                     password=ct.default_password, check_task=ct.CheckTasks.ccr)
    
        # create user
        user_name = cf.gen_unique_str("user")
        u, _ = self.utility_wrap.create_user(user=user_name, password=ct.default_password)
    
        # create a role and bind user
        role_name = cf.gen_unique_str("role")
        self.utility_wrap.init_role(role_name)
        self.utility_wrap.create_role()
        self.utility_wrap.role_add_user(user_name)
    
        # delete user
        self.utility_wrap.delete_user(user_name)
    
        # remove user successfully
        self.utility_wrap.role_remove_user(user_name)
    
        # get role users and verify user removed
        role_users, _ = self.utility_wrap.role_get_users()
        assert user_name not in role_users
  4. The standalone pod is OOMKilled. The weird thing is that there are only 7 collections, and every collection is empty (no insertions):

    
    c.list_collections()
    ['db_Vb2tsSQR', 'db_xuHut7cZ', 'db_QOMN4Rjq', 'db_XTMwuDnD', 'db_4uO2Icsv', 'db_3S8QG4lr', 'db_D3198izZ']
    for i in c.list_collections():
        print(i, c.get_collection_stats(i))

    db_D3198izZ {'row_count': 0}
    db_Vb2tsSQR {'row_count': 0}
    db_xuHut7cZ {'row_count': 0}
    db_QOMN4Rjq {'row_count': 0}
    db_XTMwuDnD {'row_count': 0}
    db_4uO2Icsv {'row_count': 0}
    db_3S8QG4lr {'row_count': 0}


standalone oomkilled:
![image](https://github.com/milvus-io/milvus/assets/27288593/bb9fa9e4-7e4b-4a92-9743-ebb6dfdc46ee)

![image](https://github.com/milvus-io/milvus/assets/27288593/fd7071fa-62bc-4021-b03e-1446f5ea7eaf)

It is worth mentioning that I ran the cases many times. I'm not sure whether the memory growth is related to the frequent creation and deletion of databases:
def test_create_db_exceeds_max_num(self):
    """
    target: test db num exceeds max num
    method: create many dbs and exceeds max
    expected: exception
    """
    self._connect()
    dbs, _ = self.database_wrap.list_database()

    # the max num (64) does not include the default db
    for i in range(ct.max_database_num + 1 - len(dbs)):
        self.database_wrap.create_database(cf.gen_unique_str(prefix))

    # there are now ct.max_database_num-1 dbs (the default is not included)
    error = {ct.err_code: 1,
             ct.err_msg: f"database number ({ct.max_database_num + 1}) exceeds max configuration ({ct.max_database_num})"}
    self.database_wrap.create_database(cf.gen_unique_str(prefix), check_task=CheckTasks.err_res, check_items=error)

### Expected Behavior

_No response_

### Steps To Reproduce

_No response_

### Milvus Log

memory pprof:  pyroscope.io/application-name=zong-db-milvus-standalone

pod:

zong-db-etcd-0                              1/1   Running   0               4h52m
zong-db-milvus-standalone-c775f6d94-pshxl   1/1   Running   7 (3m15s ago)   4h49m
zong-db-minio-95fb5b866-zss9c               1/1   Running   0               4h52m
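The restart count in that listing (7 for the standalone pod) is what points at repeated OOMKills. A throwaway sketch, not part of the issue, that flags restarted pods from default `kubectl get pods` output (columns NAME READY STATUS RESTARTS AGE are assumed):

```python
def restarted_pods(kubectl_output: str) -> dict:
    """Return {pod_name: restart_count} for pods with at least one restart."""
    flagged = {}
    for line in kubectl_output.strip().splitlines():
        parts = line.split()
        if len(parts) < 5 or parts[0] == "NAME":  # skip header/short lines
            continue
        # "7 (3m15s ago)" splits on whitespace so parts[3] is just "7"
        restarts = int(parts[3])
        if restarts > 0:
            flagged[parts[0]] = restarts
    return flagged
```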



### Anything else?

_No response_
yanliang567 commented 1 year ago

I think db creation is all about metadata; it should not lead to OOM. /assign @jaime0815 /unassign

jaime0815 commented 1 year ago

It seems RocksDB has used a lot of memory: the SST files in the data directory have grown to 11G.

/var/lib/milvus/rdb_data:
total 11G
-rw-r--r-- 1 root root  66M May 17 06:34 003435.sst
-rw-r--r-- 1 root root  66M May 17 03:16 002822.sst
-rw-r--r-- 1 root root  66M May 17 00:37 002275.sst
-rw-r--r-- 1 root root  66M May 16 20:10 001496.sst
-rw-r--r-- 1 root root  66M May 16 12:59 000306.sst
-rw-r--r-- 1 root root  66M May 16 12:30 000212.sst
-rw-r--r-- 1 root root  66M May 16 12:48 000267.sst
-rw-r--r-- 1 root root  66M May 16 12:21 000178.sst
-rw-r--r-- 1 root root  66M May 16 13:13 000361.sst
-rw-r--r-- 1 root root  66M May 16 13:09 000341.sst
-rw-r--r-- 1 root root  66M May 16 12:25 000194.sst
-rw-r--r-- 1 root root  66M May 16 13:09 000343.sst
-rw-r--r-- 1 root root  66M May 16 13:04 000325.sst
-rw-r--r-- 1 root root  66M May 16 12:53 000285.sst
-rw-r--r-- 1 root root  66M May 16 12:37 000231.sst
-rw-r--r-- 1 root root  66M May 16 12:20 000177.sst
-rw-r--r-- 1 root root  66M May 16 12:42 000247.sst
-rw-r--r-- 1 root root  66M May 16 13:57 000531.sst
-rw-r--r-- 1 root root  66M May 16 12:17 000158.sst
-rw-r--r-- 1 root root  66M May 16 11:47 000051.sst
-rw-r--r-- 1 root root  66M May 16 13:04 000326.sst
-rw-r--r-- 1 root root  66M May 16 11:44 000038.sst
-rw-r--r-- 1 root root  66M May 16 11:57 000097.sst
-rw-r--r-- 1 root root  66M May 16 12:37 000230.sst
-rw-r--r-- 1 root root  66M May 16 11:50 000064.sst
-rw-r--r-- 1 root root  66M May 16 12:08 000126.sst
-rw-r--r-- 1 root root  66M May 16 11:41 000026.sst
-rw-r--r-- 1 root root  66M May 16 12:12 000142.sst
-rw-r--r-- 1 root root  66M May 16 11:57 000098.sst
-rw-r--r-- 1 root root  66M May 16 11:44 000039.sst
-rw-r--r-- 1 root root  66M May 16 12:08 000125.sst
-rw-r--r-- 1 root root  66M May 17 06:34 003436.sst
-rw-r--r-- 1 root root  65M May 17 07:15 003531.sst
-rw-r--r-- 1 root root  65M May 16 21:54 001814.sst
-rw-r--r-- 1 root root  65M May 16 20:37 001619.sst
....
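That total can be reproduced with a small sketch that sums the sizes of `.sst` files under a directory (the `/var/lib/milvus/rdb_data` path is the rocksmq data directory shown above; adjust as needed):

```python
from pathlib import Path

def total_sst_bytes(data_dir: str) -> int:
    """Sum the sizes of all RocksDB .sst files under data_dir, recursively."""
    return sum(p.stat().st_size for p in Path(data_dir).rglob("*.sst"))

# Example usage against the directory from the listing above:
# print(f"{total_sst_bytes('/var/lib/milvus/rdb_data') / 2**30:.1f} GiB")
```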

Milvus failed to start because producing to and consuming from the stream was too slow.

[2023/05/17 06:57:12.545 +00:00] [WARN] [server/rocksmq_impl.go:628] ["rocksmq produce too slowly"] [topic=zong-db-rootcoord-delta_0] ["get lock elapse"=8226] ["alloc elapse"=0] ["write elapse"=1] ["updatePage elapse"=0] ["produce total elapse"=8227]
[2023/05/17 06:57:12.545 +00:00] [WARN] [server/rocksmq_impl.go:628] ["rocksmq produce too slowly"] [topic=zong-db-rootcoord-delta_0] ["get lock elapse"=8221] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=8221]
[2023/05/17 06:57:12.545 +00:00] [WARN] [server/rocksmq_impl.go:628] ["rocksmq produce too slowly"] [topic=zong-db-rootcoord-delta_0] ["get lock elapse"=8212] ["alloc elapse"=0] ["write elapse"=1] ["updatePage elapse"=0] ["produce total elapse"=8213]
[2023/05/17 06:57:12.545 +00:00] [WARN] [server/rocksmq_impl.go:628] ["rocksmq produce too slowly"] [topic=zong-db-rootcoord-delta_0] ["get lock elapse"=8212] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=8212]
[2023/05/17 06:57:12.545 +00:00] [WARN] [server/rocksmq_impl.go:628] ["rocksmq produce too slowly"] [topic=zong-db-rootcoord-delta_0] ["get lock elapse"=8212] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=8212]
[2023/05/17 06:57:12.546 +00:00] [WARN] [server/rocksmq_impl.go:628] ["rocksmq produce too slowly"] [topic=zong-db-rootcoord-delta_0] ["get lock elapse"=8212] ["alloc elapse"=0] ["write elapse"=0] ["updatePage elapse"=0] ["produce total elapse"=8212]

[2023/05/17 06:57:12.550 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=3.116324127s]
[2023/05/17 06:57:12.550 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=3.037529922s]
[2023/05/17 06:57:12.550 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=3.027311795s]
[2023/05/17 06:57:12.550 +00:00] [DEBUG] [rmq/rmq_producer.go:47] ["tr/send msg to stream"] [msg="send msg to stream done"] [duration=2.783017938s]
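For triage, the slow-produce warnings can be summarized with a quick script (a sketch, not part of Milvus; the field format is taken from the log lines above):

```python
import re

# Matches the structured field in rocksmq warn logs, e.g.
# ["produce total elapse"=8227]
ELAPSE_RE = re.compile(r'"produce total elapse"=(\d+)')

def produce_elapses(log_text: str) -> list:
    """Extract every 'produce total elapse' value from rocksmq log text."""
    return [int(m) for m in ELAPSE_RE.findall(log_text)]
```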
jaime0815 commented 1 year ago

related to https://github.com/milvus-io/milvus/issues/24106, the retention mechanism doesn't work.

ThreadDao commented 1 year ago

Maybe the cause is the case test_create_collection_exceeds_per_db, which creates max_collections_per_db=65536 collections. Just my guess.

jaime0815 commented 1 year ago

It seems like a RocksDB memory leak.

liangjw commented 1 year ago

Any idea about this error? My Milvus version: v2.0.2 standalone. Only 100 rows of test data, 2 collections...

yanliang567 commented 1 year ago

> Any idea about this error? My Milvus version: v2.0.2 standalone. Only 100 rows of test data, 2 collections...

I don't think you hit the same issue, as v2.0.2 does not include the database feature discussed here. Please retry with the latest v2.2.8.

ThreadDao commented 1 year ago

@jaime0815 image: 2.2.0-20230524-e8545777. I guess the reason is that so many collections (up to 65536) are created; the amount of inserted data is negligible.


zong-db-1-etcd-0                                                  1/1     Running       0               4d2h
zong-db-1-milvus-standalone-84f7c587bc-hbbwt                      1/1     Running       1 (9m27s ago)   48m
zong-db-1-minio-7457dc9fdb-dfpns                                  1/1     Running       0               4d2h
jaime0815 commented 1 year ago

The prof option of jemalloc is enabled, which causes significant performance degradation. https://github.com/milvus-io/milvus/blob/15368f5e752cc5c152b9954dd4a55d0f79926e27/internal/core/thirdparty/jemalloc/CMakeLists.txt#L50

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
