milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: can't find collection name and datanode panic: send on closed channel while inserting data into multi partitions concurrently #23357

Closed · ThreadDao closed this issue 1 year ago

ThreadDao commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version: bigsheeper-2304-memory-sync-3849f72-20230410
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    pulsar
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. deploy milvus with config:

     components:
       dataNode:
         paused: false
         replicas: 1
         resources:
           limits:
             cpu: "2"
             memory: 8Gi
           requests:
             cpu: "1"
             memory: 2Gi
       disableMetric: false
       indexNode:
         paused: false
         replicas: 1
         resources:
           limits:
             cpu: "1"
             memory: 4Gi
           requests:
             cpu: "1"
             memory: 2Gi
       metricInterval: ""
       paused: false
       queryNode:
         paused: false
         replicas: 1
         resources:
           limits:
             cpu: "2"
             memory: 4Gi
           requests:
             cpu: "1"
             memory: 2Gi
     config:
       dataNode:
         memory:
           forceSyncPeriod: 2
           forceSyncSegmentRatio: 0.5
           watermarkCluster: 0.3
       log:
         level: debug
  2. test steps:
    1. Create collection -> create index -> insert 5k data -> load
    2. Concurrent tasks: concurrent_scene_insert_partition (spawn 200 users, each user: create partition -> insert data -> flush partition -> drop partition)
    3. An exception was returned on an insert:
      [2023-04-11 15:53:09,739 - ERROR - fouram]: RPC error: [batch_insert], <MilvusException: (code=1, message=can't find collection name:fouram_AW3BshwJ)>, <Time:{'RPC start': '2023-04-11 15:53:02.191323', 'RPC error': '2023-04-11 15:53:09.739466'}> (decorators.py:108)
      [2023-04-11 15:53:09,740 - ERROR - fouram]: Traceback (most recent call last):
      File "/home/zong/zong/projects/fouram/client/util/api_request.py", line 33, in inner_wrapper
      res = func(*args, **kwargs)
      File "/home/zong/zong/projects/fouram/client/util/api_request.py", line 70, in api_request
      return func(*arg, **kwargs)
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/orm/collection.py", line 430, in insert
      res = conn.batch_insert(self._name, entities, partition_name,
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 109, in handler
      raise e
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 105, in handler
      return func(*args, **kwargs)
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 136, in handler
      ret = func(self, *args, **kwargs)
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 85, in handler
      raise e
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/decorators.py", line 50, in handler
      return func(self, *args, **kwargs)
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 388, in batch_insert
      raise err 
      File "/home/zong/zong/.virtualenvs/fouram/lib/python3.8/site-packages/pymilvus/client/grpc_handler.py", line 384, in batch_insert
      raise MilvusException(response.status.error_code, response.status.reason)
      pymilvus.exceptions.MilvusException: <MilvusException: (code=1, message=can't find collection name:fouram_AW3BshwJ)>
      (api_request.py:48)
      [2023-04-11 15:53:09,740 - ERROR - fouram]: (api_response) : <MilvusException: (code=1, message=can't find collection name:fouram_AW3BshwJ)> (api_request.py:49)

Expected Behavior

No response

Steps To Reproduce

Case:

    @pytest.mark.locust
    def test_concurrent_insert_partitions_sync(self, input_params: InputParamsBase):
        """
        :test steps:
            1. concurrent test and calculation of RT and QPS
        """
        data_size = "5k"

        # concurrent tasks
        concurrent_tasks = [
            ConcurrentParams.params_scene_insert_partition(weight=10, data_size='5k', ni=500, timeout=120),
            ConcurrentParams.params_release(weight=2)
        ]

        # concurrent params
        default_case_params = ConcurrentParams().params_scene_concurrent(
            concurrent_tasks, concurrent_number=[200], during_time="30m", interval=10, dim=128, dataset_size=data_size,
            ni_per=5000)

        self.concurrency_template(input_params=input_params, cpu=dp.min_cpu, mem=8, deploy_mode=CLUSTER,
                                  old_version_format=False,
                                  case_callable_obj=ConcurrentClientBase().scene_concurrent_locust,
                                  default_case_params=default_case_params)
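
The case above drives an internal benchmarking framework (`fouram`), so the actual client calls are hidden behind `ConcurrentParams`. For readers without that framework, here is a rough sketch of what each of the 200 concurrent users does (create partition -> insert -> flush -> drop partition), written against the community Go SDK (`milvus-sdk-go/v2`). This is an illustration of the workload shape only, not the original test code: the address, collection name, field name, dimension, and batch size are placeholders, and the exact SDK calls should be double-checked against the SDK version in use.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"sync"
	"time"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
	"github.com/milvus-io/milvus-sdk-go/v2/entity"
)

const (
	milvusAddr = "localhost:19530" // placeholder address
	collName   = "fouram_demo"     // placeholder: assumed to exist with an auto-ID PK and a 128-dim vector field
	vecField   = "float_vector"    // placeholder vector field name
	dim        = 128
	batchSize  = 500 // matches ni=500 in the case params
	users      = 200 // concurrent "users", as in the locust scenario
)

// oneUser mimics a single concurrent user:
// create partition -> insert -> flush -> drop partition.
func oneUser(ctx context.Context, c client.Client, id int) error {
	partition := fmt.Sprintf("p_%d_%d", id, time.Now().UnixNano())

	if err := c.CreatePartition(ctx, collName, partition); err != nil {
		return fmt.Errorf("create partition: %w", err)
	}

	// Build a small batch of random vectors for the insert.
	vectors := make([][]float32, batchSize)
	for i := range vectors {
		v := make([]float32, dim)
		for j := range v {
			v[j] = rand.Float32()
		}
		vectors[i] = v
	}
	vecCol := entity.NewColumnFloatVector(vecField, dim, vectors)

	if _, err := c.Insert(ctx, collName, partition, vecCol); err != nil {
		return fmt.Errorf("insert: %w", err)
	}
	if err := c.Flush(ctx, collName, false); err != nil {
		return fmt.Errorf("flush: %w", err)
	}
	return c.DropPartition(ctx, collName, partition)
}

func main() {
	ctx := context.Background()
	c, err := client.NewGrpcClient(ctx, milvusAddr)
	if err != nil {
		panic(err)
	}
	defer c.Close()

	var wg sync.WaitGroup
	for i := 0; i < users; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := oneUser(ctx, c, id); err != nil {
				fmt.Println("user", id, "error:", err)
			}
		}(i)
	}
	wg.Wait()
}
```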

### Milvus Log

Client logs:
[test_concurrent_insert_partitions_sync.zip](https://github.com/milvus-io/milvus/files/11198241/test_concurrent_insert_partitions_sync.zip)

Datanode panic logs:
[dn_pre.log](https://github.com/milvus-io/milvus/files/11198452/dn_pre.log)

[2023/04/11 08:24:05.595 +00:00] [DEBUG] [datanode/channel_meta.go:701] ["getChannelCheckpoint for segment"] [segmentID=440716061767050448] [isCurIBEmpty=true] [isCurDBEmpty=true] [len(hisIB)=1] [len(hisDB)=0] [newChannelCpTs=440716750761754626]
[2023/04/11 08:24:05.595 +00:00] [DEBUG] [datanode/channel_meta.go:701] ["getChannelCheckpoint for segment"] [segmentID=440716061766443353] [isCurIBEmpty=true] [isCurDBEmpty=true] [len(hisIB)=0] [len(hisDB)=0] [newChannelCpTs=440716750761754626]
[2023/04/11 08:24:05.595 +00:00] [DEBUG] [datanode/channel_meta.go:701] ["getChannelCheckpoint for segment"] [segmentID=440716061766241384] [isCurIBEmpty=true] [isCurDBEmpty=true] [len(hisIB)=0] [len(hisDB)=0] [newChannelCpTs=440716750761754626]
[2023/04/11 08:24:05.595 +00:00] [DEBUG] [datanode/channel_meta.go:701] ["getChannelCheckpoint for segment"] [segmentID=440716061767051221] [isCurIBEmpty=true] [isCurDBEmpty=true] [len(hisIB)=1] [len(hisDB)=0] [newChannelCpTs=440716750761754626]
[2023/04/11 08:24:05.600 +00:00] [INFO] [datanode/flow_graph_time_tick_node.go:115] ["UpdateChannelCheckpoint success"] [channel=fouramf-6s79w-op-73-7527-rootcoord-dml_2_440716061758175227v0] [cpTs=440716750761754626] [cpTime=2023/04/11 08:16:24.046 +00:00]
[2023/04/11 08:24:05.600 +00:00] [INFO] [flowgraph/node.go:112] ["flow graph node closed"] [nodeName=ttNode-fouramf-6s79w-op-73-7527-rootcoord-dml_2_440716061758175227v0]
panic: send on closed channel

goroutine 5381 [running]:
github.com/milvus-io/milvus/internal/datanode.(*DataNode).FlushSegments.func1({0xc009fa8000, 0x82, 0xc0074c3e80?}, 0x1)
    /go/src/github.com/milvus-io/milvus/internal/datanode/data_node.go:650 +0x545
github.com/milvus-io/milvus/internal/datanode.(*DataNode).FlushSegments(0xc001265560, {0xc001c398c0?, 0x13fb5c5?}, 0xc009f06a00)
    /go/src/github.com/milvus-io/milvus/internal/datanode/data_node.go:665 +0x6f0
github.com/milvus-io/milvus/internal/distributed/datanode.(*Server).FlushSegments(0xc000892500, {0x42744d8, 0xc00a019710}, 0x10?)
    /go/src/github.com/milvus-io/milvus/internal/distributed/datanode/service.go:356 +0xc5
github.com/milvus-io/milvus/internal/proto/datapb._DataNode_FlushSegments_Handler.func1({0x42744d8, 0xc00a019710}, {0x3ce6540?, 0xc009f06a00})
    /go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:7135 +0x78
github.com/milvus-io/milvus/internal/util/logutil.UnaryTraceLoggerInterceptor({0x42744d8?, 0xc00a019620?}, {0x3ce6540, 0xc009f06a00}, 0x425e4e0?, 0xc00a660828)
    /go/src/github.com/milvus-io/milvus/internal/util/logutil/grpc_interceptor.go:22 +0x49
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x42744d8?, 0xc00a019620?}, {0x3ce6540?, 0xc009f06a00?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware/tracing/opentracing.UnaryServerInterceptor.func1({0x42744d8, 0xc00a019590}, {0x3ce6540, 0xc009f06a00}, 0xc001da7160?, 0xc001da7180)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/tracing/opentracing/server_interceptors.go:38 +0x16a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x42744d8?, 0xc00a019590?}, {0x3ce6540?, 0xc009f06a00?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:25 +0x3a
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x42744d8, 0xc00a019590}, {0x3ce6540, 0xc009f06a00}, 0xc002545af0?, 0x3ab2660?)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware@v1.3.0/chain.go:34 +0xbf
github.com/milvus-io/milvus/internal/proto/datapb._DataNode_FlushSegments_Handler({0x3d60aa0?, 0xc000892500}, {0x42744d8, 0xc00a019590}, 0xc0085afec0, 0xc000e5afc0)
    /go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:7137 +0x138
google.golang.org/grpc.(*Server).processUnaryRPC(0xc00063d180, {0x4284d10, 0xc0014f9ba0}, 0xc005f82b40, 0xc000e5b140, 0x57603c8, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1283 +0xcfd
google.golang.org/grpc.(*Server).handleStream(0xc00063d180, {0x4284d10, 0xc0014f9ba0}, 0xc005f82b40, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:1620 +0xa1b
google.golang.org/grpc.(*Server).serveStreams.func1.2()
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:922 +0x98
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc@v1.46.0/server.go:920 +0x28a
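
The key line is `panic: send on closed channel`, raised from `(*DataNode).FlushSegments` (data_node.go:650) right after the "flow graph node closed" log. That suggests the goroutine handling the FlushSegments RPC is sending its result on a channel that a concurrent teardown path has already closed. For context only, here is a minimal, self-contained sketch of that generic race in plain Go (this is not Milvus code):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal illustration of the generic "send on closed channel" race:
// a sender goroutine keeps reporting results while a shutdown path
// closes the channel it is using.
func main() {
	results := make(chan int)
	var wg sync.WaitGroup

	// Sender: reports results, unaware that shutdown may close the channel.
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer func() {
			// Recover only so this example exits cleanly; a real service
			// would crash here, as the datanode did.
			if r := recover(); r != nil {
				fmt.Println("sender panicked:", r) // "send on closed channel"
			}
		}()
		for i := 0; ; i++ {
			results <- i // panics once the channel is closed
			time.Sleep(5 * time.Millisecond)
		}
	}()

	// Consumer: drains results until the channel is closed.
	go func() {
		for range results {
		}
	}()

	// Shutdown path: closes the channel without waiting for in-flight
	// senders to stop first.
	time.Sleep(25 * time.Millisecond)
	close(results)

	wg.Wait()
}
```

The usual remedies for this pattern are to let only the sending side close the channel, or to gate every send behind a done signal checked in a `select`, so shutdown never closes a channel that an in-flight sender may still use.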


Server pods are in the `fouram` cluster, `qa-milvus` namespace. There are a lot of logs; please filter by the time of the client error log.

fouramf-6s79w-op-73-7527-etcd-0 1/1 Running 0 28h 10.104.9.65 4am-node14
fouramf-6s79w-op-73-7527-etcd-1 1/1 Running 0 28h 10.104.6.110 4am-node13
fouramf-6s79w-op-73-7527-etcd-2 1/1 Running 0 28h 10.104.4.98 4am-node11
fouramf-6s79w-op-73-7527-milvus-datacoord-7bc5cc6bc7-g6lnp 1/1 Running 0 44m 10.104.5.53 4am-node12
fouramf-6s79w-op-73-7527-milvus-datanode-6c5848c6b7-b62w9 1/1 Running 0 45m 10.104.5.51 4am-node12
fouramf-6s79w-op-73-7527-milvus-indexcoord-665b859bbf-w9shk 1/1 Running 0 44m 10.104.5.52 4am-node12
fouramf-6s79w-op-73-7527-milvus-indexnode-6bdbdcb97c-bdqsm 1/1 Running 0 45m 10.104.6.249 4am-node13
fouramf-6s79w-op-73-7527-milvus-proxy-66d8876d8d-7njnh 1/1 Running 0 45m 10.104.6.248 4am-node13
fouramf-6s79w-op-73-7527-milvus-querycoord-76c574d977-fxq9b 1/1 Running 0 44m 10.104.5.55 4am-node12
fouramf-6s79w-op-73-7527-milvus-querynode-7ff5dc66df-4g6wm 1/1 Running 0 45m 10.104.1.107 4am-node10
fouramf-6s79w-op-73-7527-milvus-rootcoord-67b74b694f-bgv7d 1/1 Running 0 44m 10.104.5.54 4am-node12
fouramf-6s79w-op-73-7527-minio-0 1/1 Running 0 28h 10.104.9.66 4am-node14
fouramf-6s79w-op-73-7527-minio-1 1/1 Running 0 28h 10.104.6.111 4am-node13
fouramf-6s79w-op-73-7527-minio-2 1/1 Running 0 28h 10.104.1.36 4am-node10
fouramf-6s79w-op-73-7527-minio-3 1/1 Running 0 28h 10.104.4.104 4am-node11
fouramf-6s79w-op-73-7527-pulsar-bookie-0 1/1 Running 0 28h 10.104.6.115 4am-node13
fouramf-6s79w-op-73-7527-pulsar-bookie-1 1/1 Running 0 28h 10.104.4.108 4am-node11
fouramf-6s79w-op-73-7527-pulsar-bookie-2 1/1 Running 0 28h 10.104.1.39 4am-node10
fouramf-6s79w-op-73-7527-pulsar-bookie-init-7hzxh 0/1 Completed 0 28h 10.104.6.104 4am-node13
fouramf-6s79w-op-73-7527-pulsar-broker-0 1/1 Running 0 28h 10.104.1.34 4am-node10
fouramf-6s79w-op-73-7527-pulsar-proxy-0 1/1 Running 0 28h 10.104.6.106 4am-node13
fouramf-6s79w-op-73-7527-pulsar-pulsar-init-g9wm6 0/1 Completed 0 28h 10.104.9.63 4am-node14
fouramf-6s79w-op-73-7527-pulsar-recovery-0 1/1 Running 0 28h 10.104.6.105 4am-node13
fouramf-6s79w-op-73-7527-pulsar-zookeeper-0 1/1 Running 0 28h 10.104.4.106 4am-node11
fouramf-6s79w-op-73-7527-pulsar-zookeeper-1 1/1 Running 0 28h 10.104.9.68 4am-node14
fouramf-6s79w-op-73-7527-pulsar-zookeeper-2 1/1 Running 0 28h 10.104.5.10 4am-node12



### Anything else?

_No response_
yanliang567 commented 1 year ago

/assign @bigsheeper
/unassign

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.
