milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: build index hang if trying to build scann index on a milvus running with simd sse4_2 #27875

Open yanliang567 opened 1 year ago

yanliang567 commented 1 year ago

Is there an existing issue for this?

Environment

- Milvus version:master-20231021-d2dbbbc1
- Deployment mode(standalone or cluster):

Current Behavior

Building an index hangs when trying to build a SCANN index on a Milvus instance running with SIMD type sse4_2.

Expected Behavior

As sse4_2 does not support building a SCANN index, Milvus shall return a meaningful error immediately instead of hanging.

Steps To Reproduce

1. run milvus with simd type sse4_2
2. create a collection and insert some data
3. try to build scann index
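
For step 1, the instruction set Milvus/Knowhere uses can be pinned via the `common.simdType` key in `milvus.yaml`. A sketch of the fragment (verify the exact path against your deployment's config or Helm values):

```yaml
# milvus.yaml (fragment) — force SSE4.2 instead of auto-detection
common:
  simdType: sse4_2   # other values: auto (default), avx, avx2, avx512
```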

Milvus Log

Milvus shall handle this error and return a meaningful error to clients:

```
[2023/10/24 02:46:53.103 +00:00] [ERROR] [indexnode/task.go:356] ["failed to build index"] [error="failed to create index, C Runtime Exception: => failed to build index, the current index is not supported on the current CPU model\n: internal code=2001: segcore error"] [errorVerbose="failed to create index, C Runtime Exception: => failed to build index, the current index is not supported on the current CPU model: internal code=2001: segcore error
(1) attached stack trace
  -- stack trace:
  | github.com/milvus-io/milvus/pkg/util/merr.WrapErrSegcore
  | 	/go/src/github.com/milvus-io/milvus/pkg/util/merr/utils.go:737
  | [...repeated from below...]
Wraps: (2) failed to create index, C Runtime Exception: => failed to build index, the current index is not supported on the current CPU model
Wraps: (3) attached stack trace
  -- stack trace:
  | github.com/milvus-io/milvus/pkg/util/merr.WrapErrSegcore
```

Anything else?

No response

yanliang567 commented 1 year ago

/assign @czs007 /unassign

yanliang567 commented 1 year ago

pods on devops:

```
yanliang-sse4-2-182769-etcd-0                                  1/1     Running       0                 15h     10.102.7.63     devops-node11   <none>           <none>
yanliang-sse4-2-182769-milvus-standalone-ff74cc996-ph9fm       1/1     Running       0                 15h     10.102.10.38    devops-node20   <none>           <none>
yanliang-sse4-2-182769-minio-97cfc5445-9h4w6                   1/1     Running       0                 15h     10.102.9.87     devops-node13   <none>           <none>
```

czs007 commented 1 year ago

The process of index construction is asynchronous, so this error cannot be returned directly through the `create_index` call. Additionally, it seems inappropriate to set the index status to "Failed" at the moment. In the upcoming version 2.3.3, a new feature will be added: if there are multiple failed retry attempts, the index status will be set to "disabled". After resolving the issue through operational means, index construction can be triggered again using the "enable" command.
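
The retry-then-disable policy described above can be sketched as a small state machine. This is an illustrative model only; names such as `IndexState` and `MAX_RETRIES` are hypothetical, not actual Milvus internals:

```python
# Illustrative model of the proposed policy: after repeated build
# failures the index moves to DISABLED and stays there until an
# operator explicitly re-enables it. All names are hypothetical.
from enum import Enum, auto

MAX_RETRIES = 3  # assumed threshold, not the real Milvus value

class IndexState(Enum):
    UNISSUED = auto()     # waiting to be (re)built
    IN_PROGRESS = auto()
    FINISHED = auto()
    DISABLED = auto()     # too many failures; needs a manual "enable"

class IndexTask:
    def __init__(self):
        self.state = IndexState.UNISSUED
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= MAX_RETRIES:
            self.state = IndexState.DISABLED  # stop retrying
        else:
            self.state = IndexState.UNISSUED  # eligible for another attempt

    def enable(self):
        # operator fixed the root cause (e.g. moved to an AVX2 node)
        self.failures = 0
        self.state = IndexState.UNISSUED

task = IndexTask()
for _ in range(MAX_RETRIES):
    task.record_failure()
print(task.state.name)  # DISABLED
```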

yanliang567 commented 1 year ago

Changing the milestone to 2.4 per the comments above.

xiaofan-luan commented 8 months ago

/assign @liliu-z could you take care of this?

liliu-z commented 8 months ago

> /assign @liliu-z could you take care about this?

NP

liliu-z commented 8 months ago

> The process of index construction is asynchronous, and it is not possible to directly return this error through the "create_index" function. Additionally, it seems inappropriate to set the index status as "Failed" at the moment. In the upcoming version 2.3.3, a new feature will be added. If there are multiple retry attempts, the index status will be set as "disabled." After resolving the issue through subsequent operational means, the index construction can be triggered again using the "enable" command.

But how can the user know this? Is it possible to store the error message in the metadata and surface it during subsequent operations like load and search?

liliu-z commented 8 months ago

Looks like this is not a bug but a feature/enhancement request, so I guess it should not block 2.4. cc @xiaofan-luan @yanliang567

yanliang567 commented 8 months ago

@liliu-z was this fixed in 2.4?

liliu-z commented 8 months ago

Knowhere throws an error for this, and Milvus keeps retrying. Short term: Milvus needs to stop retrying on such errors and alert the user through Prometheus. Mid term: Knowhere needs to refine its error system so Milvus can tell what is retryable and what is not. Created an issue for it: https://github.com/zilliztech/knowhere/issues/482

For short term fix /assign @czs007
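
The short-term fix described here amounts to classifying errors before the retry loop runs. A minimal sketch; the `BuildError.retryable` flag is an assumption for illustration, not Knowhere's actual error model:

```python
# Sketch: abort the retry loop immediately on non-retryable errors
# such as "index not supported on the current CPU model", instead of
# looping forever. The error type here is hypothetical.
class BuildError(Exception):
    def __init__(self, msg, retryable):
        super().__init__(msg)
        self.retryable = retryable

def build_with_retry(build, max_retries=10):
    last = None
    for _ in range(max_retries):
        try:
            return build()
        except BuildError as err:
            if not err.retryable:
                raise  # give up at once; retrying cannot succeed
            last = err
    raise last  # retryable error persisted through all attempts

def unsupported_cpu_build():
    # stands in for a SCANN build on an sse4_2-only node
    raise BuildError(
        "the current index is not supported on the current CPU model",
        retryable=False,
    )

try:
    build_with_retry(unsupported_cpu_build)
except BuildError as err:
    print(err)  # raised after a single attempt, no retries
```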

xiaofan-luan commented 8 months ago

> Knowhere throws error out for this and Milvus keeps retrying. For short term: Milvus need to stop retrying on such error and alert user through prometheus. For mid term: Knowhere need to refine its error system to let Milvus know what is retryable and what is not. Created a issue for it zilliztech/knowhere#482
>
> For short term fix /assign @czs007

But how can this be fixed? If Milvus stops retrying, what should we do with this index?

xiaofan-luan commented 8 months ago

We can retry 10 times and stop, but can this collection still serve queries?

liliu-z commented 8 months ago

My suggestion is that the collection can drop the existing indexes and stick with brute-force/interim indexes. Milvus also needs to alert the user in some way (as with all async tasks): we can record it in Prometheus metrics, and also record the failure somewhere like etcd so it can be returned to users in the next call.
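
Surfacing the stored failure could look roughly like this; the key layout and `RuntimeError` wrapping are illustrative only (a real implementation would write to etcd, not an in-process dict):

```python
# Sketch: persist the last index-build failure per collection and
# return it on the next user-facing call (load/search) instead of
# failing silently. The dict stands in for an etcd-like meta store.
meta_store = {}

def record_index_failure(collection, reason):
    meta_store[f"index-failure/{collection}"] = reason

def load_collection(collection):
    reason = meta_store.get(f"index-failure/{collection}")
    if reason is not None:
        # propagate the earlier async failure to the caller
        raise RuntimeError(f"index build failed: {reason}")
    return "loaded"

record_index_failure(
    "demo", "index not supported on the current CPU model"
)
```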

czs007 commented 7 months ago

Related PR: https://github.com/milvus-io/milvus/pull/31844