Open yanliang567 opened 1 year ago
/assign @czs007 /unassign
pods on devops:
yanliang-sse4-2-182769-etcd-0 1/1 Running 0 15h 10.102.7.63 devops-node11 <none> <none>
yanliang-sse4-2-182769-milvus-standalone-ff74cc996-ph9fm 1/1 Running 0 15h 10.102.10.38 devops-node20 <none> <none>
yanliang-sse4-2-182769-minio-97cfc5445-9h4w6 1/1 Running 0 15h 10.102.9.87 devops-node13 <none> <none>
The process of index construction is asynchronous, and it is not possible to directly return this error through the "create_index" function. Additionally, it seems inappropriate to set the index status as "Failed" at the moment. In the upcoming version 2.3.3, a new feature will be added. If there are multiple retry attempts, the index status will be set as "disabled." After resolving the issue through subsequent operational means, the index construction can be triggered again using the "enable" command.
change the milestone to 2.4 as comments above
/assign @liliu-z could you take care about this?
/assign @liliu-z could you take care about this?
NP
The process of index construction is asynchronous, and it is not possible to directly return this error through the "create_index" function. Additionally, it seems inappropriate to set the index status as "Failed" at the moment. In the upcoming version 2.3.3, a new feature will be added. If there are multiple retry attempts, the index status will be set as "disabled." After resolving the issue through subsequent operational means, the index construction can be triggered again using the "enable" command.
But how can user know this. Is possible to store the error msg in meta data and throw it out during the following operations like load
and search
?
Looks like this is not a bug but a feature or enhancement, and it should not block the 2.4 I guess. cc @xiaofan-luan @yanliang567
@liliu-z was this fixed in 2.4?
Knowhere throws error out for this and Milvus keeps retrying. For short term: Milvus need to stop retrying on such error and alert user through prometheus. For mid term: Knowhere need to refine its error system to let Milvus know what is retryable and what is not. Created a issue for it https://github.com/zilliztech/knowhere/issues/482
For short term fix /assign @czs007
Knowhere throws error out for this and Milvus keeps retrying. For short term: Milvus need to stop retrying on such error and alert user through prometheus. For mid term: Knowhere need to refine its error system to let Milvus know what is retryable and what is not. Created a issue for it zilliztech/knowhere#482
For short term fix /assign @czs007
but how can this be fixed? if milvus stop trying what should we do with this index?
we can retry 10 times and stop, but could this collection still serving?
My suggestion is collection can drop exist indexes and stick on BF/interim indexes. And also Milvus need to alert user in some ways (like all aync tasks). We can record it in Prometheus metrics, also record this failure in somewhere like etcd so it can be return to users in the next call.
related pr : https://github.com/milvus-io/milvus/pull/31844
Is there an existing issue for this?
Environment
Current Behavior
build index hang if trying to build scann index on a milvus running with simd sse4_2
Expected Behavior
As sse4_2 does not support building scann index, milvus shall return a meaningful error immediately, instead of hang
Steps To Reproduce
Milvus Log
Milvus shall handel this error and return a meanful error to clients
[2023/10/24 02:46:53.103 +00:00] [ERROR] [indexnode/task.go:356] ["failed to build index"] [error="failed to create index, C Runtime Exception: => failed to build index, the current index is not supported on the current CPU model\n: internal code=2001: segcore error"] [errorVerbose="failed to create index, C Runtime Exception: => failed to build index, the current index is not supported on the current CPU model: internal code=2001: segcore error\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrSegcore\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:737\n | [...repeated from below...]\nWraps: (2) failed to create index, C Runtime Exception: => failed to build index, the current index is not supported on the current CPU model\nWraps: (3) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/util/merr.WrapErrSegcore\n
Anything else?
No response