milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: When continuously inserting data into Milvus, the QueryNode panicked with error `fatal runtime error: failed to initiate` #37623

Closed. zhuwenxing closed this 1 week ago

zhuwenxing commented 2 weeks ago

Is there an existing issue for this?

Environment

- Milvus version:master-20241112-f5b06a3c-amd64
- Deployment mode(standalone or cluster):
- MQ type(rocksmq, pulsar or kafka):    
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024/11/12 11:40:11.174 +00:00] [INFO] [segments/segment.go:999] ["load field done"] [traceID=a23b4fdf27d950e470bd051a28e7c807] [collectionID=453877445987156654] [partitionID=453877445987156655] [segmentID=453877445989794312] [fieldID=102] [rowCount=2309000]
[2024/11/12 11:40:11.175 +00:00] [INFO] [segments/segment_loader.go:991] ["load field binlogs done for sealed segment"] [traceID=a23b4fdf27d950e470bd051a28e7c807] [collection=453877445987156654] [segment=453877445989794312] [len(field)=5] [segmentType=Sealed]
[2024/11/12 11:40:11.175 +00:00] [INFO] [segments/segment.go:1411] ["create text index for segment"] [traceID=a23b4fdf27d950e470bd051a28e7c807] [segmentID=453877445989794312] [fieldID=101]
thread '<unnamed>' panicked at src/index_writer_text.rs:38:64:
called `Result::unwrap()` on an `Err` value: IndexAlreadyExists
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
fatal runtime error: failed to initiate panic, error 5
SIGABRT: abort
PC=0x7fbfdcb949fc m=28 sigcode=18446744073709551610
signal arrived during cgo execution

goroutine 1494 gp=0xc001902c40 m=28 mp=0xc0017e1808 [syscall, locked to thread]:
I20241112 11:40:11.422313   242 ChunkedSegmentSealedImpl.cpp:287] [SERVER][LoadFieldData][milvus] segment 453877445995448936 loads field 101 mmap true done
I20241112 11:40:11.898198   163 ChunkedSegmentSealedImpl.cpp:287] [SERVER][LoadFieldData][milvus] segment 453877445993595587 loads field 101 mmap true done
non-Go function
        pc=0x7fbfdcb949fc
non-Go function
        pc=0x7fbfdcb40475
non-Go function
        pc=0x7fbfdcb267f2
_ZN3std3sys4unix14abort_internal17hb82186f9b9b64ef6E
        library/std/src/sys/unix/mod.rs:365 pc=0x7fbfdfd04626
rust_panic
        library/std/src/panicking.rs:758 pc=0x7fbfdfcf8a13
_ZN3std9panicking20rust_panic_with_hook17h57e78470c47c84deE
        library/std/src/panicking.rs:729 pc=0x7fbfdfcf8871
_ZN3std9panicking19begin_panic_handler28_$u7b$$u7b$closure$u7d$$u7d$17h3dfd2453cf356ecbE
        library/std/src/panicking.rs:599 pc=0x7fbfdfcf85b6
_ZN3std10sys_common9backtrace26__rust_end_short_backtrace17hdb177d43678e4d7eE
        library/std/src/sys_common/backtrace.rs:170 pc=0x7fbfdfcf58a5
rust_begin_unwind
        library/std/src/panicking.rs:595 pc=0x7fbfdfcf8301
_ZN4core9panicking9panic_fmt17hd1e971d8d7c78e0eE
        library/core/src/panicking.rs:67 pc=0x7fbfdee21b62
_ZN4core6result13unwrap_failed17hccb456d39e9c31fcE
        library/core/src/result.rs:1652 pc=0x7fbfdee22059
runtime.cgocall(0x5e03a20, 0xc00166d6c8)
        /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-amd64/src/runtime/cgocall.go:157 +0x4b fp=0xc00166d6a0 sp=0xc00166d668 pc=0x1f68e6b
github.com/milvus-io/milvus/internal/querynodev2/segments._Cfunc_CreateTextIndex(0x7fbd667a2300, 0x65)
        _cgo_gotypes.go:458 +0x54 fp=0xc00166d6c8 sp=0xc00166d6a0 pc=0x5b58cb4
github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).CreateTextIndex.func1.1(0xc00166d718?, 0x65)
        /workspace/source/internal/querynodev2/segments/segment.go:1414 +0x47 fp=0xc00166d700 sp=0xc00166d6c8 pc=0x5b7ad87
github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).CreateTextIndex.func1()
        /workspace/source/internal/querynodev2/segments/segment.go:1414 +0x25 fp=0xc00166d728 sp=0xc00166d700 pc=0x5b7ace5
github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1()
        /workspace/source/pkg/util/conc/pool.go:81 +0xb3 fp=0xc00166d788 sp=0xc00166d728 pc=0x4a53753
github.com/panjf2000/ants/v2.(*goWorker).run.func1()
        /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67 +0x8d fp=0xc00166d7e0 sp=0xc00166d788 pc=0x3d8d7ad
runtime.goexit({})
        /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-amd64/src/runtime/asm_amd64.s:1695 +0x1 fp=0xc00166d7e8 sp=0xc00166d7e0 pc=0x1fe0981
created by github.com/panjf2000/ants/v2.(*goWorker).run in goroutine 1334
        /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:48 +0x5c

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

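    from pymilvus import (
        Collection, CollectionSchema, DataType, FieldSchema, Function, FunctionType,
    )

    # Analyzer settings for the text field (standard tokenizer).
    analyzer_params = {
        "tokenizer": "standard"
    }
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True),
        # enable_match=True is what makes the QueryNode build a text index for this field.
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=25536,
                    enable_analyzer=True, analyzer_params=analyzer_params, enable_match=True),
        FieldSchema(name="sparse", dtype=DataType.SPARSE_FLOAT_VECTOR),
    ]
    schema = CollectionSchema(fields=fields, description="beir test collection")
    # BM25 function: derives the "sparse" field from the "text" field.
    bm25_function = Function(
        name="text_bm25_emb",
        function_type=FunctionType.BM25,
        input_field_names=["text"],
        output_field_names=["sparse"],
        params={},
    )
    schema.add_function(bm25_function)
    # collection_name is defined elsewhere in the test script.
    collection = Collection(collection_name, schema)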

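# Imports assumed from the decorators and calls used below; MilvusBaseUser
# is defined elsewhere in the test script.
import time

from faker import Faker
from locust import tag, task

faker = Faker()

class MilvusUser(MilvusBaseUser):
    """Main Milvus user class that defines the test tasks"""

    @tag('insert')
    @task(4)
    def insert(self):
        """Insert a batch of rows with random text"""
        batch_size = 1000
        data = [
            {
                # Microsecond timestamp used as the primary key.
                "id": int(time.time()*(10**6)),
                "text": faker.text(max_nb_chars=300),
            }
            for _ in range(batch_size)
        ]

        self.client.insert(data)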

cluster: 4am, ns: chaos-testing. Pod info:

❯ k get pod -o wide|grep fts-stable-test-3                          
fts-stable-test-3-etcd-0                                          1/1     Running            0                  116m    10.104.24.236   4am-node29   <none>           <none>
fts-stable-test-3-etcd-1                                          1/1     Running            0                  116m    10.104.30.75    4am-node38   <none>           <none>
fts-stable-test-3-etcd-2                                          1/1     Running            0                  116m    10.104.18.186   4am-node25   <none>           <none>
fts-stable-test-3-kafka-0                                         2/2     Running            0                  116m    10.104.24.239   4am-node29   <none>           <none>
fts-stable-test-3-kafka-1                                         2/2     Running            1 (116m ago)       116m    10.104.30.77    4am-node38   <none>           <none>
fts-stable-test-3-kafka-2                                         2/2     Running            1 (116m ago)       116m    10.104.15.150   4am-node20   <none>           <none>
fts-stable-test-3-kafka-exporter-68d78fd97b-tc4v8                 1/1     Running            3 (116m ago)       116m    10.104.24.229   4am-node29   <none>           <none>
fts-stable-test-3-milvus-datanode-7ccf44bc54-2j7hm                1/1     Running            2 (116m ago)       116m    10.104.24.227   4am-node29   <none>           <none>
fts-stable-test-3-milvus-datanode-7ccf44bc54-6whg7                1/1     Running            2 (116m ago)       116m    10.104.26.38    4am-node32   <none>           <none>
fts-stable-test-3-milvus-indexnode-85d4b94dd5-vlr7t               1/1     Running            2 (116m ago)       116m    10.104.4.198    4am-node11   <none>           <none>
fts-stable-test-3-milvus-indexnode-85d4b94dd5-xk22c               1/1     Running            2 (116m ago)       116m    10.104.24.233   4am-node29   <none>           <none>
fts-stable-test-3-milvus-mixcoord-7d8887c68f-fphwt                1/1     Running            1 (116m ago)       116m    10.104.24.226   4am-node29   <none>           <none>
fts-stable-test-3-milvus-proxy-6c5f54657f-zg9v2                   1/1     Running            1 (116m ago)       116m    10.104.24.225   4am-node29   <none>           <none>
fts-stable-test-3-milvus-querynode-6db856bb84-9n7pl               0/1     CrashLoopBackOff   16 (39s ago)       116m    10.104.26.39    4am-node32   <none>           <none>
fts-stable-test-3-milvus-querynode-6db856bb84-csd5m               0/1     CrashLoopBackOff   16 (78s ago)       116m    10.104.32.211   4am-node39   <none>           <none>
fts-stable-test-3-milvus-querynode-6db856bb84-msttl               0/1     Running            16 (5m38s ago)     116m    10.104.24.228   4am-node29   <none>           <none>
fts-stable-test-3-minio-0                                         1/1     Running            0                  116m    10.104.24.237   4am-node29   <none>           <none>
fts-stable-test-3-minio-1                                         1/1     Running            0                  116m    10.104.30.74    4am-node38   <none>           <none>
fts-stable-test-3-minio-2                                         1/1     Running            0                  116m    10.104.18.185   4am-node25   <none>           <none>
fts-stable-test-3-minio-3                                         1/1     Running            0                  116m    10.104.20.74    4am-node22   <none>           <none>
fts-stable-test-3-zookeeper-0                                     1/1     Running            0                  116m    10.104.24.238   4am-node29   <none>           <none>
fts-stable-test-3-zookeeper-1                                     1/1     Running            0                  116m    10.104.30.76    4am-node38   <none>           <none>
fts-stable-test-3-zookeeper-2                                     1/1     Running            0                  116m    10.104.18.189   4am-node25   <none>           <none>

log.log

Anything else?

No response

zhuwenxing commented 2 weeks ago

/assign @zhengbuqian /assign @aoiasd

PTAL

zhengbuqian commented 2 weeks ago

This looks more like a TextMatch issue than a BM25 one. The crashing frame is:

github.com/milvus-io/milvus/internal/querynodev2/segments.(*LocalSegment).CreateTextIndex.func1.1

zhengbuqian commented 2 weeks ago

/unassign

/assign @czs007

zhagnlu commented 1 week ago

The case above uses mmap and creates the inverted index with the same field name and path. Index builds for different segments then use the same index writer, and tantivy throws an exception for the duplicate index object (the `IndexAlreadyExists` error in the log above).

xiaocai2333 commented 1 week ago

> The case above uses mmap and creates the inverted index with the same field name and path. Index builds for different segments then use the same index writer, and tantivy throws an exception for the duplicate index object.

So, was it caused by querynode and indexnode building the text index for this segment at the same time?

zhagnlu commented 1 week ago

> > The case above uses mmap and creates the inverted index with the same field name and path. Index builds for different segments then use the same index writer, and tantivy throws an exception for the duplicate index object.
>
> So, was it caused by querynode and indexnode building the text index for this segment at the same time?

No, only the querynode is involved. The querynode builds the index itself, and the build is triggered during the load process.
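
For readers following along, the collision described above can be sketched with a toy model in Python. This is not Milvus or tantivy code. The `create_index` helper, the `meta.json` marker file, and both path-naming functions are hypothetical stand-ins chosen for illustration. The sketch only shows why a path derived from the field alone fails for the second segment, while a path that also includes the segment ID does not. It makes no claim about how the actual fix was implemented.

```python
import os
import tempfile


class IndexAlreadyExists(Exception):
    """Stand-in for the error reported when a directory already holds an index."""


def create_index(path):
    """Create a toy 'index' at path; fail if one was already created there."""
    os.makedirs(path, exist_ok=True)
    marker = os.path.join(path, "meta.json")  # hypothetical marker file
    try:
        # Mode "x" raises FileExistsError if the marker is already present.
        with open(marker, "x") as f:
            f.write("{}")
    except FileExistsError:
        raise IndexAlreadyExists(path) from None


def shared_path(root, field_id, segment_id):
    """Path built from the field only: every segment maps to the same directory."""
    return os.path.join(root, f"field_{field_id}")


def per_segment_path(root, field_id, segment_id):
    """Path that also includes the segment ID: one directory per segment."""
    return os.path.join(root, f"segment_{segment_id}", f"field_{field_id}")


def build_all(path_fn, root, field_id, segment_ids):
    """Try to create an index per segment; return the segment IDs that failed."""
    failed = []
    for segment_id in segment_ids:
        try:
            create_index(path_fn(root, field_id, segment_id))
        except IndexAlreadyExists:
            failed.append(segment_id)
    return failed


# Two segment IDs taken from the log above, both loading field 101.
segments = [453877445989794312, 453877445995448936]

with tempfile.TemporaryDirectory() as root_a, tempfile.TemporaryDirectory() as root_b:
    shared_failures = build_all(shared_path, root_a, 101, segments)
    unique_failures = build_all(per_segment_path, root_b, 101, segments)

print("shared path failures:", shared_failures)
print("per-segment path failures:", unique_failures)
```

In the real crash the error was not caught: the Rust side called `Result::unwrap()` on it, which is why the whole QueryNode process aborted instead of a single load failing.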

liliu-z commented 1 week ago

/assign @zhuwenxing

zhuwenxing commented 1 week ago

Verified and fixed in master-20241115-e4b6773d-amd64.