milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: [Nightly] Streaming service is unstable and crashes during nightly tests #36378

Open NicoYuan1986 opened 2 hours ago

NicoYuan1986 commented 2 hours ago

Is there an existing issue for this?

Environment

- Milvus version: 89397d1
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): streaming service
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

The streaming service is unstable and crashes during nightly tests. Link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/122/pipeline/139/

The test run ended with failures:

[pytest : test] [gw2] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_repeated[2-3791-varchar_pk] 
[pytest : test] [gw3] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_repeated[16-1600-varchar_pk] 
[pytest : test] testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_not_repeated[1-1000-int64_pk] 
[pytest : test] testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_not_repeated[1-1000-varchar_pk] 
[pytest : test] [gw1] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_insert_before_loading[int64_pk-True] 
[pytest : test] [gw5] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_insert_before_loading[int64_pk-False] 
[pytest : test] testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_repeated[2-3791-int64_pk] 
[pytest : test] testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_repeated[16-1600-int64_pk] 
[pytest : test] [gw0] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_insert_before_loading[varchar_pk-True] 

The crash may be related to the bitmap index.
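To check whether the failures really cluster around the bitmap tests, the FAILED lines from the pytest output above can be tallied per test function. This is a minimal sketch in Python; the regex and the `summarize_failures` helper are illustrative and not part of the Milvus test suite.

```python
import re
from collections import Counter

# Excerpt of the pytest output quoted above (FAILED and non-FAILED lines mixed).
LOG = """\
[pytest : test] [gw2] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_repeated[2-3791-varchar_pk]
[pytest : test] [gw3] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_repeated[16-1600-varchar_pk]
[pytest : test] testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_not_repeated[1-1000-int64_pk]
[pytest : test] [gw1] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_insert_before_loading[int64_pk-True]
[pytest : test] [gw5] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_insert_before_loading[int64_pk-False]
[pytest : test] testcases/test_index.py::TestBitmapIndex::test_bitmap_primary_field_data_repeated[2-3791-int64_pk]
[pytest : test] [gw0] [ 23%] FAILED testcases/test_index.py::TestBitmapIndex::test_bitmap_insert_before_loading[varchar_pk-True]
"""

# Matches "FAILED <file>::<class>::<function>[<params>]"; the params part is optional.
FAILED_RE = re.compile(
    r"FAILED\s+(?P<file>[^:\s]+)::(?P<cls>\w+)::(?P<func>\w+)(?:\[(?P<params>[^\]]*)\])?"
)


def summarize_failures(log_text):
    """Count failed test cases per (class, function), ignoring parametrization."""
    counts = Counter()
    for match in FAILED_RE.finditer(log_text):
        counts[(match["cls"], match["func"])] += 1
    return counts


failures = summarize_failures(LOG)
for (cls, func), count in failures.most_common():
    print(f"{cls}::{func}: {count} failed")
```

Running the same helper over the full Jenkins console log would show whether any failures fall outside `TestBitmapIndex`.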

Expected Behavior

All tests pass.

Steps To Reproduce

No response

Milvus Log

panic log: artifacts-milvus-distributed-streaming-service-mdss-master-122-py-n-122-e2e-logs.tar.gz

2024-09-20T02:00:22.993914092+08:00 stderr F panic: failed to create etcd client: context deadline exceeded
2024-09-20T02:00:22.9939464+08:00 stderr F
2024-09-20T02:00:22.993952067+08:00 stderr F goroutine 1 [running]:
2024-09-20T02:00:22.993955784+08:00 stderr F panic({0x63aa5e0?, 0xc001cbd600?})
2024-09-20T02:00:22.993960204+08:00 stderr F    /usr/local/go/src/runtime/panic.go:1017 +0x3ac fp=0xc00158f550 sp=0xc00158f4a0 pc=0x2179d8c
2024-09-20T02:00:22.993963026+08:00 stderr F github.com/milvus-io/milvus/internal/util/dependency/kv.getEtcdAndPath()
2024-09-20T02:00:22.993966544+08:00 stderr F    /workspace/source/internal/util/dependency/kv/kv_client_handler.go:54 +0x1bf fp=0xc00158f5d8 sp=0xc00158f550 pc=0x46782df
2024-09-20T02:00:22.993969522+08:00 stderr F github.com/milvus-io/milvus/internal/util/dependency/kv.GetEtcdAndPath(...)
2024-09-20T02:00:22.993972456+08:00 stderr F    /workspace/source/internal/util/dependency/kv/kv_client_handler.go:26
2024-09-20T02:00:22.993975898+08:00 stderr F github.com/milvus-io/milvus/internal/distributed/streaming.Init()
2024-09-20T02:00:22.993979116+08:00 stderr F    /workspace/source/internal/distributed/streaming/streaming.go:18 +0x1b fp=0xc00158f5f0 sp=0xc00158f5d8 pc=0x5935e7b
2024-09-20T02:00:22.993982622+08:00 stderr F github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc00170d4f0)
2024-09-20T02:00:22.993986133+08:00 stderr F    /workspace/source/cmd/roles/roles.go:388 +0x5fc fp=0xc00158fb70 sp=0xc00158f5f0 pc=0x5e1a01c

grafana link: https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&from=1726771890361&to=1726778674656&var-datasource=P1809F7CD0C75ACF3&var-namespace=milvus-ci&var-instance=mdss-master-122-py-n&var-collection=All&var-app_name=milvus

Anything else?

No response

yanliang567 commented 2 hours ago

/assign @chyezh /unassign

chyezh commented 2 hours ago

The panic is reported during etcd initialization, while etcd is not yet ready, so it is not the cause of the test failures.
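For reference, the panic message and the Go frames can be pulled out of the container log quoted in the issue to see where the panic was raised. The sketch below assumes each line follows the `<timestamp> <stream> <tag> <message>` layout seen in that log; `parse_panic` is a hypothetical helper written for this thread, not a Milvus tool.

```python
# Excerpt of the panic log quoted in the issue description.
PANIC_LOG = """\
2024-09-20T02:00:22.993914092+08:00 stderr F panic: failed to create etcd client: context deadline exceeded
2024-09-20T02:00:22.9939464+08:00 stderr F
2024-09-20T02:00:22.993952067+08:00 stderr F goroutine 1 [running]:
2024-09-20T02:00:22.993963026+08:00 stderr F github.com/milvus-io/milvus/internal/util/dependency/kv.getEtcdAndPath()
2024-09-20T02:00:22.993966544+08:00 stderr F    /workspace/source/internal/util/dependency/kv/kv_client_handler.go:54 +0x1bf fp=0xc00158f5d8 sp=0xc00158f550 pc=0x46782df
2024-09-20T02:00:22.993975898+08:00 stderr F github.com/milvus-io/milvus/internal/distributed/streaming.Init()
2024-09-20T02:00:22.993979116+08:00 stderr F    /workspace/source/internal/distributed/streaming/streaming.go:18 +0x1b fp=0xc00158f5f0 sp=0xc00158f5d8 pc=0x5935e7b
2024-09-20T02:00:22.993982622+08:00 stderr F github.com/milvus-io/milvus/cmd/roles.(*MilvusRoles).Run(0xc00170d4f0)
"""


def parse_panic(log_text):
    """Return (panic message, list of Milvus function frames, outermost last)."""
    message = None
    frames = []
    for line in log_text.splitlines():
        parts = line.split(" ", 3)  # timestamp, stream, tag, message
        if len(parts) < 4:
            continue  # line carries no message payload
        payload = parts[3]
        if payload.startswith("panic: "):
            message = payload[len("panic: "):]
        elif payload.startswith("github.com/"):
            # Drop the trailing argument list, keep the qualified function name.
            frames.append(payload.rpartition("(")[0])
    return message, frames


message, frames = parse_panic(PANIC_LOG)
print(message)
for frame in frames:
    print("  at", frame)
```

On this excerpt the innermost Milvus frame is `kv.getEtcdAndPath`, called from `streaming.Init`, which matches the explanation that the panic happens while the etcd client is being created at startup.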

chyezh commented 1 hour ago

An operation on a Pulsar topic failed:

[2024/09/19 19:23:55.001 +00:00] [WARN] [timetick/timetick_sync_operator.go:84] ["send time tick sync message failed"] [pchannel="{\"Name\":\"by-dev-rootcoord-dml_7\",\"Term\":2}"] [error="append time tick msg to wal failed, timestamp: 452663381635105769, previous message counter: 100: message send timeout: TimeoutError"] [errorVerbose="append time tick msg to wal failed, timestamp: 452663381635105769, previous message counter: 100: message send timeout: TimeoutError\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick.(*timeTickSyncOperator).sendPersistentTsMsg\n  | \t/workspace/source/internal/streamingnode/server/wal/interceptors/timetick/timetick_sync_operator.go:218\n  | github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick.(*timeTickSyncOperator).sendTsMsg\n  | \t/workspace/source/internal/streamingnode/server/wal/interceptors/timetick/timetick_sync_operator.go:200\n  | github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick.(*timeTickSyncOperator).Sync\n  | \t/workspace/source/internal/streamingnode/server/wal/interceptors/timetick/timetick_sync_operator.go:76\n  | github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick/inspector.(*timeTickSyncInspectorImpl).background.func1\n  | \t/workspace/source/internal/streamingnode/server/wal/interceptors/timetick/inspector/impls.go:75\n  | github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick/inspector.(*timeTickSyncInspectorImpl).background.(*ConcurrentMap[...]).Range.func3\n  | \t/workspace/source/pkg/util/typeutil/map.go:54\n  | sync.(*Map).Range\n  | \t/usr/local/go/src/sync/map.go:476\n  | github.com/milvus-io/milvus/pkg/util/typeutil.(*ConcurrentMap[...]).Range\n  | \t/workspace/source/pkg/util/typeutil/map.go:51\n  | 
github.com/milvus-io/milvus/internal/streamingnode/server/wal/interceptors/timetick/inspector.(*timeTickSyncInspectorImpl).background\n  | \t/workspace/source/internal/streamingnode/server/wal/interceptors/timetick/inspector/impls.go:74\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1650\nWraps: (2) append time tick msg to wal failed, timestamp: 452663381635105769, previous message counter: 100\nWraps: (3) message send timeout: TimeoutError\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *pulsar.Error"]
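To relate this warning to the rest of the timeline, the `pchannel` field and the time tick timestamp can be extracted from the log line. The sketch below works on a trimmed copy of the line (without `errorVerbose`) and assumes the timestamp uses Milvus's hybrid layout, with the physical time in milliseconds stored above 18 logical bits; that layout is an assumption here, and `decode_tso` and `parse_pchannel` are hypothetical helper names.

```python
import json
import re
from datetime import datetime, timezone

# Trimmed copy of the WARN line above; errorVerbose is omitted for brevity.
LINE = r'[2024/09/19 19:23:55.001 +00:00] [WARN] [timetick/timetick_sync_operator.go:84] ["send time tick sync message failed"] [pchannel="{\"Name\":\"by-dev-rootcoord-dml_7\",\"Term\":2}"] [error="append time tick msg to wal failed, timestamp: 452663381635105769, previous message counter: 100: message send timeout: TimeoutError"]'

LOGICAL_BITS = 18  # assumption: low 18 bits hold the logical counter

LOGTIME_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\]")
PCHANNEL_RE = re.compile(r'\[pchannel="((?:[^"\\]|\\.)*)"\]')
TIMESTAMP_RE = re.compile(r"timestamp: (\d+)")


def parse_pchannel(line):
    """Decode the escaped JSON in the pchannel field into a dict."""
    escaped = PCHANNEL_RE.search(line).group(1)
    unescaped = json.loads('"' + escaped + '"')  # undo the \" escaping
    return json.loads(unescaped)


def decode_tso(ts):
    """Split a hybrid timestamp into (physical time in UTC, logical counter)."""
    physical_ms = ts >> LOGICAL_BITS
    logical = ts & ((1 << LOGICAL_BITS) - 1)
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc), logical


pchannel = parse_pchannel(LINE)
logged_at = datetime.strptime(LOGTIME_RE.search(LINE).group(1), "%Y/%m/%d %H:%M:%S.%f %z")
tso_time, logical = decode_tso(int(TIMESTAMP_RE.search(LINE).group(1)))
lag_seconds = (logged_at - tso_time).total_seconds()

print("pchannel:", pchannel["Name"], "term", pchannel["Term"])
print("time tick allocated at:", tso_time.isoformat())
print("warning logged at:     ", logged_at.isoformat())
print("gap in seconds:", lag_seconds)
```

If the layout assumption holds, the printed gap shows how far the time tick timestamp lags the moment the send timeout was logged.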