milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Upgrade Milvus 2.4.1 -> 2.4.9 Error: Milvus is not ready yet. on Attu #35756

Open weiZhenkun opened 2 months ago

weiZhenkun commented 2 months ago

Is there an existing issue for this?

Environment

- Milvus version: 2.4.1 -> 2.4.9
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2): Attu
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Expected Behavior

Steps To Reproduce

Upgrading Milvus from 2.4.1 to 2.4.9 should succeed.

Milvus-related values:
extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
      chanNamePrefix:
        cluster: test1

Milvus querynode Log

Aug 28, 2024 @ 11:49:32.096 [2024/08/28 03:49:32.096 +00:00] [WARN] [kafka/kafka_consumer.go:139] ["consume msg failed"] [topic=test1-rootcoord-dml_1] [groupID=querynode-100-test1-rootcoord-dml_1_451793181617364351v0-true] [error="Local: Timed out"]
Aug 28, 2024 @ 11:49:31.767 [2024/08/28 03:49:31.767 +00:00] [WARN] [kafka/kafka_consumer.go:139] ["consume msg failed"] [topic=test1-rootcoord-dml_0] [groupID=datanode-94-test1-rootcoord-dml_0_451792644505338701v0-true] [error="Local: Timed out"]
Aug 28, 2024 @ 11:49:31.700 [2024/08/28 03:49:31.700 +00:00] [WARN] [kafka/kafka_consumer.go:139] ["consume msg failed"] [topic=test1-rootcoord-dml_3] [groupID=datanode-94-test1-rootcoord-dml_3_451793181617565152v0-true] [error="Local: Timed out"]
Aug 28, 2024 @ 11:49:30.663 [2024/08/28 03:49:30.662 +00:00] [WARN] [kafka/kafka_consumer.go:139] ["consume msg failed"] [topic=test1-rootcoord-dml_3] [groupID=querynode-98-test1-rootcoord-dml_3_451793181617565152v0-true] [error="Local: Timed out"]
Aug 28, 2024 @ 11:49:30.378 [2024/08/28 03:49:30.378 +00:00] [WARN] [kafka/kafka_consumer.go:139] ["consume msg failed"] [topic=test1-rootcoord-dml_1] [groupID=datanode-94-test1-rootcoord-dml_1_451793181617364351v0-true] [error="Local: Timed out"]
Aug 28, 2024 @ 11:49:30.205 [2024/08/28 03:49:30.205 +00:00] [WARN] [kafka/kafka_consumer.go:139] ["consume msg failed"] [topic=test1-rootcoord-dml_0] [groupID=querynode-98-test1-rootcoord-dml_0_451792644505338701v0-true] [error="Local: Timed out"]

Anything else?

N/A

weiZhenkun commented 2 months ago

It may be caused by the following settings, which no longer take effect.

extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
      chanNamePrefix:
        cluster: test1

Actual: [channel=by-dev-rootcoord-dml_1_452146874032006318v0]

Expected: [channel=test1-rootcoord-dml_1_452146874032006318v0]

Please provide a workaround for the upgrade until a fix is available.

yanliang567 commented 2 months ago

/assign @congqixia /unassign

congqixia commented 2 months ago

@weiZhenkun

From the info you provided, the setting changes the Kafka topic name. IMO it is easy to keep the old prefix just by changing the configuration.

Please ignore my previous misunderstanding. The problem here is that the Milvus cluster failed to read the user.yaml setting. Could you please change the default milvus.yaml values?

congqixia commented 2 months ago

@weiZhenkun after some digging, we found that the config key you were using is a fallback key. Since we now fill all exported default values in milvus.yaml, the primary key always has a value, so the fallback key in user.yaml is not read.

    // must init cluster prefix first
    p.ClusterPrefix = ParamItem{
        Key:          "msgChannel.chanNamePrefix.cluster",
        Version:      "2.1.0",
        FallbackKeys: []string{"common.chanNamePrefix.cluster"},
        DefaultValue: "by-dev",
        Doc: `Root name prefix of the channel when a message channel is created.
It is recommended to change this parameter before starting Milvus for the first time.
To share a Pulsar instance among multiple Milvus instances, consider changing this to a name rather than the default one for each Milvus instance before you start them.`,
        PanicIfEmpty: true,
        Forbidden:    true,
        Export:       true,
    }
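
The lookup order described above can be sketched as follows. This is a simplified illustration, not the Milvus implementation: `paramItem`, `resolve`, and `merge` are hypothetical names, and the assumption that user.yaml is overlaid on milvus.yaml key by key is mine.

```go
package main

import "fmt"

// paramItem is a hypothetical, reduced stand-in for the ParamItem quoted above.
type paramItem struct {
	Key          string
	FallbackKeys []string
	DefaultValue string
}

// resolve returns the value of the primary key if present, otherwise the
// first fallback key that is present, otherwise the default value.
func resolve(p paramItem, cfg map[string]string) string {
	if v, ok := cfg[p.Key]; ok {
		return v
	}
	for _, k := range p.FallbackKeys {
		if v, ok := cfg[k]; ok {
			return v
		}
	}
	return p.DefaultValue
}

// merge overlays later sources on earlier ones (assumed behavior:
// user.yaml overrides milvus.yaml one key at a time).
func merge(sources ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, s := range sources {
		for k, v := range s {
			out[k] = v
		}
	}
	return out
}

func main() {
	prefix := paramItem{
		Key:          "msgChannel.chanNamePrefix.cluster",
		FallbackKeys: []string{"common.chanNamePrefix.cluster"},
		DefaultValue: "by-dev",
	}
	// milvus.yaml now ships the exported default under the primary key.
	milvusYAML := map[string]string{"msgChannel.chanNamePrefix.cluster": "by-dev"}
	oldUserYAML := map[string]string{"common.chanNamePrefix.cluster": "test1"}
	newUserYAML := map[string]string{"msgChannel.chanNamePrefix.cluster": "test1"}

	// Primary key is already set by milvus.yaml, so the fallback is never consulted.
	fmt.Println(resolve(prefix, merge(milvusYAML, oldUserYAML))) // by-dev
	// Overriding the primary key itself takes effect.
	fmt.Println(resolve(prefix, merge(milvusYAML, newUserYAML))) // test1
}
```

With a milvus.yaml that does not set the primary key, the same `resolve` call would pick up `test1` from the fallback key, which matches the behavior seen before the upgrade.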

The solution is to use the msgChannel.chanNamePrefix.cluster key, as in the revised config below.

extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
    msgChannel:
      chanNamePrefix:
        cluster: test1

weiZhenkun commented 2 months ago

extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
    msgChannel:
      chanNamePrefix:
        cluster: test1

or

extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
msgChannel:
  chanNamePrefix:
    cluster: test1

Which of these two is correct?

congqixia commented 2 months ago

@weiZhenkun msgChannel shall be at the root level of yaml file
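
To make "root level" concrete: the config key is a dotted path, so what matters is where msgChannel sits inside the body of user.yaml, not inside the Helm values file. The sketch below flattens an already-parsed YAML tree into dotted keys. `flatten` is a hypothetical helper, and the assumption that Milvus derives its keys this way is mine, inferred from the `Key` string quoted earlier.

```go
package main

import (
	"fmt"
	"sort"
)

// flatten turns a nested map (a parsed YAML document) into dotted keys.
func flatten(prefix string, node map[string]any) map[string]string {
	out := map[string]string{}
	for k, v := range node {
		key := k
		if prefix != "" {
			key = prefix + "." + k
		}
		if child, ok := v.(map[string]any); ok {
			// Recurse into nested mappings and copy their flattened keys.
			for ck, cv := range flatten(key, child) {
				out[ck] = cv
			}
		} else {
			out[key] = fmt.Sprint(v)
		}
	}
	return out
}

func main() {
	// Parsed form of a user.yaml body where msgChannel is a sibling of common.
	userYAML := map[string]any{
		"common": map[string]any{
			"security": map[string]any{"authorizationEnabled": true},
		},
		"msgChannel": map[string]any{
			"chanNamePrefix": map[string]any{"cluster": "test1"},
		},
	}
	flat := flatten("", userYAML)
	keys := make([]string, 0, len(flat))
	for k := range flat {
		keys = append(keys, k)
	}
	sort.Strings(keys) // map iteration order is random, so sort for stable output
	for _, k := range keys {
		fmt.Printf("%s = %s\n", k, flat[k])
	}
}
```

Placing msgChannel as a sibling of common inside the `user.yaml: |` block yields the key msgChannel.chanNamePrefix.cluster. Placing it outside that block makes it a Helm value rather than part of user.yaml.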

weiZhenkun commented 2 months ago

The following value in milvus-helm-value.yaml does not work when upgrading from Milvus 2.4.8 to 2.4.9.

extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
      chanNamePrefix:
        cluster: test1

The following works on Milvus 2.4.9, according to my testing.

extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
    msgChannel:
      chanNamePrefix:
        cluster: test1

weiZhenkun commented 2 months ago

> msgChannel shall be at the root level of yaml file

It does not work at the root level in 2.4.9.

congqixia commented 2 months ago

@weiZhenkun

> The FallbackKeys is not removed in Milvus 2.4.9, what happened here?

In v2.4.9, we put all exported default values in milvus.yaml. In this case, msgChannel.chanNamePrefix.cluster = by-dev. If the value is found, the fallback key will not be read.

> Is "milvus.yaml" the Milvus Helm chart values? If yes, why didn't I find this default configuration in the Helm values?

The default YAML values are in the milvus repo. What the Helm chart does is override these defaults.

> msgChannel shall be at the root level of yaml file, it does not work in the root level in 2.4.9

Can get what you mean. From what you said,

extraConfigFiles:
  user.yaml: |
    common:
      security:
        authorizationEnabled: true
    msgChannel:
      chanNamePrefix:
        cluster: test1

this should work.

stale[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.