Open prrs opened 1 month ago
/assign @prrs
Below is the log that we see when it doesn't return any result.
[2024/09/19 04:36:16.116 +00:00] [ACCESS]
And, below is the log when we get the result
[2024/09/19 17:54:33.768 +00:00] [ACCESS]
Also, a quick question: does the empty result happen on the first search after a release and reload, or does it happen at random?
Ans: It is intermittent. For the collection in the logs, it happened most of the time just after load, with a few exceptions.
I have attached the Milvus proxy logs, exported from Splunk. Please check the first column for the logs received from Milvus.
So I think I am having a similar issue. Two days ago, I created a collection, inserted about 150,000 data points into it, and performed a couple of query operations. Everything worked as expected. Today, I am re-running the same code for querying (to the best of my knowledge, no code modifying the collection has been run since), and I am getting empty results for some of the queries. Also, for at least some queries, the response seems to be different than two days ago (some metrics I calculated from the responses two days ago are different today).
I am using the stand-alone/docker based version (milvusdb/milvus:v2.4.5) and pymilvus 2.4.6.
In the Docker log I am seeing this warning; maybe it is related?
[2024/09/18 08:37:16.296 +00:00] [WARN] [proxy/search_reduce_util.go:361] ["Proxy Reduce Search Result"] [traceID=7c17ca14a79a9973cdaf1c1de329a7a2] [error="the length (topk) between all result of query is different"] [errorVerbose="the length (topk) between all result of query is different\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/proxy.reduceSearchResultDataNoGroupBy\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/search_reduce_util.go:361\n | github.com/milvus-io/milvus/internal/proxy.reduceSearchResult\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/search_reduce_util.go:63\n | github.com/milvus-io/milvus/internal/proxy.(*searchTask).reduceResults\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:552\n | github.com/milvus-io/milvus/internal/proxy.(*searchTask).PostExecute\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:640\n | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:474\n | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:545\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/worker.go:67\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (2) the length (topk) between all result of query is different\nError types: (1) *withstack.withStack (2) *errutil.leafError"]
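For reference, a minimal pymilvus sketch (assuming a local standalone instance and a collection named "my_collection", both placeholders) that checks the load state and entity counts before blaming the search path; if count(*) is already 0 or loading is incomplete, the empty results have a different cause than the reduce warning above:

# Hedged sketch: sanity-check collection state before debugging the search path.
# "my_collection" and the connection parameters are placeholders.
from pymilvus import connections, Collection, utility

connections.connect("default", host="localhost", port="19530")
coll = Collection("my_collection")

# Is the collection actually loaded onto the query nodes?
print("load state:", utility.load_state("my_collection"))
print("loading progress:", utility.loading_progress("my_collection"))

# num_entities reflects flushed data; count(*) also accounts for deletes.
print("num_entities:", coll.num_entities)
print("count(*):", coll.query(expr="", output_fields=["count(*)"]))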
@yanliang567 is there a way we can reproduce the issue?
Is there a way you can back up and give us a snapshot for reproduction?
@weiliu1031 could you please double-check if this is the same issue we fixed recently in 2.4?
/assign @weiliu1031 /unassign
@xiaofan-luan It's happening intermittently in prod, so I wouldn't be able to share the segment. I will share the segment if we can reproduce it in non-prod.
@weiliu1031 did you get a chance to check this?
There are not enough details for investigation. My guess is that you have duplicated primary keys in your data and you are trying to retrieve some fields.
The duplicated PKs cause the requery to be unable to find the right data.
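A minimal sketch of the duplicate-primary-key check suggested here, assuming pymilvus, a collection named "my_collection", and a primary-key field named "pk" (all placeholders):

# Hedged sketch: scan the collection and count how many PK values repeat.
from collections import Counter
from pymilvus import connections, Collection

connections.connect("default", host="localhost", port="19530")
coll = Collection("my_collection")

counts = Counter()
it = coll.query_iterator(batch_size=1000, output_fields=["pk"])
while True:
    batch = it.next()          # returns a list of row dicts; empty when done
    if not batch:
        break
    counts.update(row["pk"] for row in batch)
it.close()

dupes = {k: v for k, v in counts.items() if v > 1}
print(f"{len(dupes)} duplicated primary keys" if dupes else "no duplicated primary keys")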
@xiaofan-luan we have a repro for this issue in our dev environment. We have taken a dump of the segment. Please let me know what other data you need for the investigation. We can also do a live debugging session.
@prrs let's discuss on how to reproduce in the next meeting
@xiaofan-luan We still don't have a way to reproduce the issue. Unlike last time, the issue is happening in our dev env, so we have full control to collect/dump whatever is needed for debugging purposes. The only caveat is that we have logic in place to release the collection within 48 hours, and we have observed that releasing and reloading the collection mitigates the issue.
Hence, I want to collect all the details before it goes away.
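For reference, a minimal pymilvus sketch of the release-then-reload mitigation described above; "my_collection" and the connection parameters are placeholders:

# Hedged sketch: release the collection, load it again, and wait for completion.
from pymilvus import connections, Collection, utility

connections.connect("default", host="localhost", port="19530")
coll = Collection("my_collection")

coll.release()
coll.load()
utility.wait_for_loading_complete("my_collection")
print("loading progress:", utility.loading_progress("my_collection"))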
So you are saying that releasing the collection and reloading it solved the problem?
Can we have logs for debugging?
Is it possible for us to reproduce this in debug mode and collect all the logs it generates?
@xiaofan-luan @liliu-z I had a close look. This issue is not similar to the one that we faced in prod. It seems there was some failure in the write path, and this partition does not have any data.
Can you help provide logs about the failure?
@xiaofan-luan, I also found the same phenomenon. I am using Milvus v2.4.13, starting the Milvus service with the docker.yaml method. The problem: after using it normally for a period of time, I restarted the container, and when I reloaded the existing collection an error occurred and I could not retrieve the existing data. Newly inserted data could be retrieved normally. When I checked the logs, I found the related WARN: "failed to load collection from rootcoord". The yaml file is as follows: . Because the complete container log is large (230MB), I have extracted the last 20MB of the log file: last_20mb.log
Milvus v2.4.13 has a critical bug with the same behavior as your case. You can get more details from https://github.com/milvus-io/milvus/releases/tag/v2.4.13-hotfix; it's recommended to upgrade to v2.4.13-hotfix or v2.4.14.
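A small pymilvus sketch (connection parameters are placeholders) to confirm which server build the client is actually talking to before and after the upgrade:

# Hedged sketch: print the Milvus server version seen by the client.
from pymilvus import connections, utility

connections.connect("default", host="localhost", port="19530")
print("server version:", utility.get_server_version())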
Is there an existing issue for this?
Environment
Current Behavior
Intermittently, search returns a 0-result response on a sealed collection where the number of rows is ~17 & 113. For the collection that I am sharing in the logs, there hasn't been any write operation recently. From the logs, it can be seen that we got a 0-result response intermittently.
1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C - 1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C.csv
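For illustration only (not the production code), a pymilvus sketch of the kind of repeated-search loop that surfaces the intermittent 0-result responses; the collection name, vector field, dimension, and search params are assumptions:

# Hedged sketch: repeat the same search and log any empty hit lists
# against a collection that is known to contain data.
import random
from pymilvus import connections, Collection

connections.connect("default", host="localhost", port="19530")
coll = Collection("my_collection")
coll.load()

dim = 128
query_vec = [[random.random() for _ in range(dim)]]

for i in range(50):
    res = coll.search(
        data=query_vec,
        anns_field="embedding",
        param={"metric_type": "L2", "params": {"nprobe": 16}},
        limit=10,
    )
    if len(res[0]) == 0:
        print(f"iteration {i}: empty result on a non-empty sealed collection")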
Expected Behavior
It should always return some data.
Steps To Reproduce
Milvus Log
1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C - 1726818012_225262_260F7408-99B9-4AA1-B572-8DB64BB0FE8C.csv
Anything else?
No response