milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Search result is intermittently empty although the collection has data #36404

Open prrs opened 1 month ago

prrs commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 2.3.11
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2): java 2.3.3
- OS(Ubuntu or CentOS): Linux
- CPU/Memory: n/a
- GPU: n/a
- Others: n/a

Current Behavior

Intermittently, search returns 0 results on sealed collections with ~17 and ~113 rows. For the collection that I am sharing in the logs, there hasn't been any recent write operation. From the logs, it can be seen that we intermittently got 0 results.

1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C - 1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C.csv

Expected Behavior

It should always return some data.

Steps To Reproduce

I am not able to reproduce this locally; it looks like an intermittent issue. Below are the steps that were followed on the collection for which we see the issue.

1. Create a collection
2. Write a few rows (15 to 20)
3. Flush the collection
4. Release and load the collection
5. Run a few hundred queries
6. Repeat steps 4 and 5

Index details: HNSW (efConstruction = 500, M = 48), with ef = 5000 during query. A minimal sketch of these steps is included below.
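
For reference, a minimal pymilvus sketch of the steps above, assuming a simple two-field schema, L2 metric, and illustrative names (the original report used the Java SDK 2.3.3, so this is an approximation of the workflow, not the reporter's actual code):

```python
# Hypothetical reproduction sketch; schema, names, and dimension are assumptions.
import random

from pymilvus import (
    Collection, CollectionSchema, DataType, FieldSchema, connections,
)

DIM = 768  # assumed embedding dimension

connections.connect(host="localhost", port="19530")

# 1. Create a collection
fields = [
    FieldSchema(name="pk", dtype=DataType.INT64, is_primary=True, auto_id=False),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIM),
]
collection = Collection("repro_collection", CollectionSchema(fields))

# 2. Write a few rows (15 to 20)
num_rows = 17
collection.insert([
    list(range(num_rows)),
    [[random.random() for _ in range(DIM)] for _ in range(num_rows)],
])

# 3. Flush the collection so the data ends up in a sealed segment
collection.flush()

# Index from the report: HNSW with efConstruction = 500, M = 48
collection.create_index(
    "embedding",
    {"index_type": "HNSW", "metric_type": "L2",
     "params": {"M": 48, "efConstruction": 500}},
)
collection.load()

for _ in range(3):
    # 5. Run a few hundred queries with ef = 5000; each search should return hits
    for _ in range(300):
        res = collection.search(
            data=[[random.random() for _ in range(DIM)]],
            anns_field="embedding",
            param={"metric_type": "L2", "params": {"ef": 5000}},
            limit=5,
        )
        assert len(res[0]) > 0, "empty search result on a non-empty collection"

    # 4./6. Release and load the collection again, then repeat the queries
    collection.release()
    collection.load()
```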

Milvus Log

1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C - 1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C.csv

Anything else?

No response

yanliang567 commented 1 month ago

@prrs

  1. Sorry, I don't quite understand the attached .csv file; how can I tell from it that a search returned an empty result?
  2. Also, a quick question: does the empty result happen on the first search after release and reload, or does it happen in random order?
  3. Please attach the Milvus log files from when the issue reproduces; that would help us know what was happening at that moment.

/assign @prrs

prrs commented 1 month ago

Below is the log that we see when it doesn't return any result.

[2024/09/19 04:36:16.116 +00:00] [ACCESS] Search [status: Successful] [code: 0] [sdk: Unknown] [msg: search result is empty] [traceID: 1629e2b06855571fa13d8b46ec01cd53] [timeCost: 4.407916ms] [database: milvusdefault] [collection: a360_prod_99351b69f2904256899f1d1d1921e417_8d8bbb8c_fb97_4410_b890_51ab3b745cd2] [partitions: [Shipping_FAQ_HS_V1_index__dlm_0d743192_a3f2_4c85_9424_61390fd3c1c8]] [expr: ]

And, below is the log when we get the result:

[2024/09/19 17:54:33.768 +00:00] [ACCESS] Search [status: Successful] [code: 0] [sdk: Unknown] [msg: ] [traceID: 0f9b724500aeae935d040a31121f9966] [timeCost: 6.581592ms] [database: milvusdefault] [collection: a360_prod_99351b69f2904256899f1d1d1921e417_8d8bbb8c_fb97_4410_b890_51ab3b745cd2] [partitions: [Shipping_FAQ_HS_V1_index__dlm_0d743192_a3f2_4c85_9424_61390fd3c1c8]] [expr: ]

prrs commented 1 month ago

Also, a quick question: does the empty result happen on the first search after release and reload, or does it happen in random order?

Ans: It is intermittent. For the collection in the logs, most of the time it happened just after load, with a few exceptions.

I have attached the Milvus proxy logs, exported from Splunk. Please check the first column for the logs received from Milvus.

SBodenstedt commented 1 month ago

So I think I am having a similar issue. Two days ago, I created a collection, inserted about 150,000 data points into it, and performed a couple of query operations. Everything worked as expected. Today, I am re-running the same code for querying (to the best of my knowledge, no code modifying the collection has been run since) and I am getting empty results for some of the queries. Also, for at least some queries, the response seems to be different than it was two days ago (some metrics based on the responses that I calculated two days ago are different today).

I am using the standalone/Docker-based version (milvusdb/milvus:v2.4.5) and pymilvus 2.4.6.

In the Docker log I am seeing this warning; maybe it is related?

[2024/09/18 08:37:16.296 +00:00] [WARN] [proxy/search_reduce_util.go:361] ["Proxy Reduce Search Result"] [traceID=7c17ca14a79a9973cdaf1c1de329a7a2] [error="the length (topk) between all result of query is different"] [errorVerbose="the length (topk) between all result of query is different\n(1) attached stack trace\n  -- stack trace:\n  | github.com/milvus-io/milvus/internal/proxy.reduceSearchResultDataNoGroupBy\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/search_reduce_util.go:361\n  | github.com/milvus-io/milvus/internal/proxy.reduceSearchResult\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/search_reduce_util.go:63\n  | github.com/milvus-io/milvus/internal/proxy.(*searchTask).reduceResults\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:552\n  | github.com/milvus-io/milvus/internal/proxy.(*searchTask).PostExecute\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:640\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:474\n  | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n  | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:545\n  | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n  | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n  | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n  | \t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67\n  | runtime.goexit\n  | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) the length (topk) between all result of query is different\nError types: (1) *withstack.withStack (2) *errutil.leafError"]

xiaofan-luan commented 1 month ago

So I think I am having a similar issue. Two days ago, I created a collection, inserted about 150,000 data points into it, and performed a couple of query operations. Everything worked as expected. Today, I am re-running the same code for querying (to the best of my knowledge, no code modifying the collection has been run since) and I am getting empty results for some of the queries. Also, for at least some queries, the response seems to be different than it was two days ago.

I am using the standalone/Docker-based version (milvusdb/milvus:v2.4.5) and pymilvus 2.4.6.

In the Docker log I am seeing this warning; maybe it is related?

[2024/09/18 08:37:16.296 +00:00] [WARN] [proxy/search_reduce_util.go:361] ["Proxy Reduce Search Result"] [traceID=7c17ca14a79a9973cdaf1c1de329a7a2] [error="the length (topk) between all result of query is different"] ...

@yanliang567 is there a way we can reproduce the issue?

xiaofan-luan commented 1 month ago

So I think I am having a similar issue. Two days ago, I created a collection, inserted about 150,000 data points into it, and performed a couple of query operations. Everything worked as expected. Today, I am re-running the same code for querying and I am getting empty results for some of the queries. ...

Is there a way you can back up the data and give us a snapshot for reproduction?

yanliang567 commented 1 month ago

@weiliu1031 could you please double-check whether this is the same issue we fixed recently in 2.4?

/assign @weiliu1031 /unassign

prrs commented 1 month ago

@xiaofan-luan It's happening intermittently in prod, so I wouldn't be able to share the segment. I will share the segment if it reproduces in non-prod.

@weiliu1031 did you get a chance to check this?

xiaofan-luan commented 1 month ago

There are not enough details for investigation. My guess is that you have duplicated primary keys in your data and you are trying to retrieve some field.

The duplicated PKs mean the requery cannot find the right data.

  1. You need to check how many segments there are in your data.
  2. Are there primary key duplicates within each segment? Are there any primary key duplicates across segments? (A sketch of such a check follows.)
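
For reference, a minimal pymilvus sketch of such a check, assuming an INT64 primary key field named "pk" and the illustrative collection name from the sketch above. Per-segment duplicate detection is not directly exposed by the SDK, so this only prints the segment layout and looks for duplicates across the whole collection:

```python
# Hypothetical duplicate-PK check; field and collection names are assumptions.
from collections import Counter

from pymilvus import Collection, connections, utility

connections.connect(host="localhost", port="19530")
collection = Collection("repro_collection")
collection.load()

# 1. How many segments does the collection have, and how many rows are in each?
for seg_info in utility.get_query_segment_info(collection.name):
    print(seg_info)

# 2. Scan every primary key and count occurrences across the whole collection.
counts = Counter()
it = collection.query_iterator(batch_size=1000, expr="pk >= 0", output_fields=["pk"])
while True:
    batch = it.next()
    if not batch:
        it.close()
        break
    counts.update(row["pk"] for row in batch)

duplicates = {pk: n for pk, n in counts.items() if n > 1}
print(f"{len(duplicates)} duplicated primary keys:", duplicates)
```
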
prrs commented 3 weeks ago

@xiaofan-luan we have a repro for this issue in our dev environment. We have taken a dump of the segment. Please let me know what other data you need for the investigation. We can also do a live debugging session.

xiaofan-luan commented 3 weeks ago

@prrs let's discuss how to reproduce it in the next meeting

prrs commented 3 weeks ago

@xiaofan-luan We still don't have a way to reproduce the issue. Unlike last time, the issue is happening in our dev env, so we have full control to collect/dump whatever is needed for debugging purposes. The only caveat is that we have logic in place to release the collection within 48 hours, and we have observed that a release followed by a reload mitigates the issue.

Hence, I want to collect all the details before it goes away.
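
The release-followed-by-reload mitigation mentioned above is, in pymilvus terms, just the following (illustrative collection name; the actual deployment uses the Java SDK):

```python
# Workaround observed in this thread: release the collection, then load it again.
from pymilvus import Collection, connections

connections.connect(host="localhost", port="19530")
collection = Collection("repro_collection")
collection.release()
collection.load()
```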

xiaofan-luan commented 3 weeks ago

@xiaofan-luan We still don't have a way to reproduce the issue. Unlike last time, the issue is happening in our dev env, so we have full control to collect/dump whatever is needed for debugging purposes. The only caveat is that we have logic in place to release the collection within 48 hours, and we have observed that a release followed by a reload mitigates the issue.

Hence, I want to collect all the details before it goes away.

So you are saying that releasing the collection and reloading it solved the problem?

xiaofan-luan commented 3 weeks ago

Can we have the logs for debugging?

liliu-z commented 3 weeks ago

Is it possible for us to reproduce this in debug mode and collect all the logs it generates?

prrs commented 3 weeks ago

@xiaofan-luan @liliu-z I had a closer look. This issue is not similar to the one that we faced in prod. It seems there was some failure in the write path, and this partition does not have any data.

liliu-z commented 3 weeks ago

@xiaofan-luan @liliu-z I had a closer look. This issue is not similar to the one that we faced in prod. It seems there was some failure in the write path, and this partition does not have any data.

Can you help provide logs about the failure?

xixihahaliu commented 1 week ago

@xiaofan-luan, I also found the same phenomenon. I am using Milvus 2.4.13, starting the Milvus service with the docker .yaml method. The problem: after using it normally for a period of time, I restarted the container, and when I reloaded the existing collection an error occurred and I could not retrieve the existing data. Newly inserted data could be retrieved normally. When I checked the logs, I found the related WARN: "failed to load collection from rootcoord". The yaml file is as follows: [image]. Because the complete container logs are large (230 MB), I have extracted the last 20 MB of the log file: last_20mb.log

weiliu1031 commented 1 week ago

@xiaofan-luan, I also found the same phenomenon. I am using Milvus 2.4.13, starting the Milvus service with the docker .yaml method. The problem: after using it normally for a period of time, I restarted the container, and when I reloaded the existing collection an error occurred and I could not retrieve the existing data. Newly inserted data could be retrieved normally. When I checked the logs, I found the related WARN: "failed to load collection from rootcoord". The yaml file is as follows: [image]. Because the complete container logs are large (230 MB), I have extracted the last 20 MB of the log file: last_20mb.log

Milvus v2.4.13 has a critical bug with the same behavior as your case; you can get more details from https://github.com/milvus-io/milvus/releases/tag/v2.4.13-hotfix. It is recommended to upgrade to v2.4.13-hotfix or v2.4.14.
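
After upgrading, a quick way to confirm which server build is actually running is a version check from the client (a sketch using pymilvus; host and port are assumptions):

```python
# Check the running Milvus server version after the upgrade.
from pymilvus import connections, utility

connections.connect(host="localhost", port="19530")
print(utility.get_server_version())  # expect v2.4.13-hotfix or v2.4.14
```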