milvus-io / milvus

A cloud-native vector database, storage for next generation AI applications
https://milvus.io
Apache License 2.0

[Bug]: Indexnode fails ( can't load collections) #34608

Closed sfisli closed 5 days ago

sfisli commented 1 month ago

Is there an existing issue for this?

Environment

- Milvus version: 2.2.13
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  kafka
- OS(Ubuntu or CentOS): kubernetes

Current Behavior

Expected Behavior

When I load a collection, it loads without issues.

Steps To Reproduce

No response

Milvus Log

No response

Anything else?

No response

yanliang567 commented 1 month ago

I guess you were not able to access the AWS S3 service successfully. Please check that the config is correct: https://milvus.io/docs/aws.md. I also suggest you retry on the latest Milvus 2.3.18 or 2.4.5, which fixed a lot of issues. @sfisli /assign @sfisli /unassign

sfisli commented 1 month ago

@yanliang567
Milvus policy:

`        {
            "Action": [
                "s3:*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::milvus-match-video-objects-bucket-prod",
                "arn:aws:s3:::milvus-match-video-objects-bucket-prod/*"
            ]
        }`

externalS3:

externalS3:
  enabled: true
  host: "s3.eu-west-3.amazonaws.com"
  port: "80"
  accessKey: "milvus-accesskey"
  secretKey: "milvus-secretkey"
  cloudProvider: "aws"
  useSSL: false
  bucketName: "milvus-match-video-objects-bucket-prod"
  rootPath: ""
  useIAM: false
  iamEndpoint: ""

Here is my config, and it used to work!

yanliang567 commented 1 month ago

/assign @LoveEachDay could you please help to take a look /unassign

xiaofan-luan commented 1 month ago

I recommend upgrading to 2.3.18 to see if it works.

xiaofan-luan commented 1 month ago

We fixed many S3-related compatibility issues in early 2.3. But if it used to work, I guess you need to check what changed in your S3 dependency, especially if it is not standard AWS S3.

sfisli commented 1 month ago

@xiaofan-luan I got this error on the querynode pod when I updated to 2.3.18:

goroutine 346 [select]:
runtime.gopark(0xc000111780?, 0x2?, 0xa?, 0x2a?, 0xc000111744?)
    /usr/local/go/src/runtime/proc.go:381 +0xd6 fp=0xc0001115c8 sp=0xc0001115a8 pc=0x1af3256
runtime.selectgo(0xc000111780, 0xc000111740, 0x0?, 0x0, 0x0?, 0x1)
    /usr/local/go/src/runtime/select.go:327 +0x7be fp=0xc000111708 sp=0xc0001115c8 pc=0x1b03d7e
github.com/panjf2000/ants/v2.(*Pool).ticktock(0xc000129680, {0x57c4e30, 0xc0001274a0})
    /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/pool.go:125 +0x145 fp=0xc0001117b8 sp=0xc000111708 pc=0x3516b05
github.com/panjf2000/ants/v2.(*Pool).goTicktock.func1()
    /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/pool.go:154 +0x2e fp=0xc0001117e0 sp=0xc0001117b8 pc=0x3516f0e
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1598 +0x1 fp=0xc0001117e8 sp=0xc0001117e0 pc=0x1b29b21
created by github.com/panjf2000/ants/v2.(*Pool).goTicktock
    /go/pkg/mod/github.com/panjf2000/ants/v2@v2.7.2/pool.go:154 +0x115

sfisli commented 1 month ago

Full error with 2.2.13:

[2024/07/23 12:48:36.132 +00:00] [WARN] [querynode/cgo_helper.go:56] ["LoadFieldData failed, C Runtime Exception: [UnexpectedError] Error:GetObjectSize[errcode:400, exception:, errmessage:No response body.]\n"]
[2024/07/23 12:48:36.135 +00:00] [INFO] [gc/gc_tuner.go:84] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=19] ["total memory"=77] ["next GC"=44] ["new GOGC"=200] [gc-pause=63.852µs] [gc-pause-end=1721738916134442985]
[2024/07/23 12:48:36.135 +00:00] [ERROR] [querynode/segment_loader.go:204] ["load segment failed when load data into memory"] [collectionID=451049109224496362] [segmentType=Growing] [partitionID=451049109224496363] [segmentID=451049109224696374] [error="[UnexpectedError] Error:GetObjectSize[errcode:400, exception:, errmessage:No response body.]"] [stack="github.com/milvus-io/milvus/internal/querynode.(*segmentLoader).LoadSegment.func3\n\t/go/src/github.com/milvus-io/milvus/internal/querynode/segment_loader.go:204\ngithub.com/milvus-io/milvus/internal/util/funcutil.ProcessFuncParallel.func3\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/parallel.go:83"]
[2024/07/23 12:48:36.135 +00:00] [ERROR] [funcutil/parallel.go:85] [loadSegmentFunc] [error="[UnexpectedError] Error:GetObjectSize[errcode:400, exception:, errmessage:No response body.]"] [idx=0] [stack="github.com/milvus-io/milvus/internal/util/funcutil.ProcessFuncParallel.func3\n\t/go/src/github.com/milvus-io/milvus/internal/util/funcutil/parallel.go:85"]
[2024/07/23 12:48:36.136 +00:00] [INFO] [querynode/segment.go:320] ["delete segment from memory"] [collectionID=451049109224496362] [partitionID=451049109224496363] [segmentID=451049109224696374] [segmentType=Growing]
[2024/07/23 12:48:36.138 +00:00] [WARN] [querynode/watch_dm_channels_task.go:249] ["failed to load segment"] [collection=451049109224496362] [error="[UnexpectedError] Error:GetObjectSize[errcode:400, exception:, errmessage:No response body.]"]
[2024/07/23 12:48:36.138 +00:00] [INFO] [gc/gc_tuner.go:84] ["GC Tune done"] ["previous GOGC"=200] ["heapuse "=19] ["total memory"=78] ["next GC"=44] ["new GOGC"=200] [gc-pause=151.781µs] [gc-pause-end=1721738916137859832]
[2024/07/23 12:48:36.138 +00:00] [INFO] [querynode/shard_cluster.go:185] ["Close shard cluster"] [collectionID=451049109224496362] [channel=by-dev-rootcoord-dml_0_451049109224496362v0] [replicaID=451343238706757633]
[2024/07/23 12:48:36.138 +00:00] [INFO] [querynode/shard_cluster.go:394] ["Shard Cluster update state"] [collectionID=451049109224496362] [channel=by-dev-rootcoord-dml_0_451049109224496362v0] [replicaID=451343238706757633] ["old state"=2] ["new state"=2] [caller=github.com/milvus-io/milvus/internal/querynode.(*ShardCluster).Close.func1]
[2024/07/23 12:48:36.138 +00:00] [INFO] [querynode/collection.go:153] ["remove vChannel from collection"] [collectionID=451049109224496362] [channel=by-dev-rootcoord-dml_0_451049109224496362v0]
[2024/07/23 12:48:36.138 +00:00] [WARN] [querynode/impl.go:378] ["failed to subscribe channel"] [collectionID=451049109224496362] [nodeID=610] [channels="[by-dev-rootcoord-dml_0_451049109224496362v0]"] [error="failed to load growing segments, err: [UnexpectedError] Error:GetObjectSize[errcode:400, exception:, errmessage:No response body.]"]
[2024/07/23 12:48:37.128 +00:00] [INFO] [querynode/impl.go:357] ["watchDmChannels init"] [collectionID=451049109224496362] [nodeID=610] [channels="[by-dev-rootcoord-dml_1_451049109224496362v1]"]
[2024/07/23 12:48:37.128 +00:00] [INFO] [querynode/impl.go:360] ["watchDmChannels start "] [collectionID=451049109224496362] [nodeID=610] [channels="[by-dev-rootcoord-dml_1_451049109224496362v1]"] [timeInQueue=39.149µs]
[2024/07/23 12:48:37.128 +00:00] [INFO] [querynode/watch_dm_channels_task.go:84] ["Starting WatchDmChannels ..."] [collectionID=451049109224496362] [vChannels="[by-dev-rootcoord-dml_1_451049109224496362v1]"] [replicaID=451343238706757633] [loadType=LoadCollection] [collectionName=PILLOW] [metricType=IP]
[2024/07/23 12:48:37.128 +00:00] [INFO] [querynode/shard_cluster_service.go:81] ["successfully add shard cluster"] [collectionID=451049109224496362] [replica=451343238706757633] [vchan=by-dev-rootcoord-dml_1_451049109224496362v1]
[2024/07/23 12:48:37.128 +00:00] [INFO] [querynode/watch_dm_channels_task.go:243] ["loading growing segments in WatchDmChannels..."] [collectionID=451049109224496362] [unFlushedSegmentIDs="[451049109224696375]"]
[2024/07/23 12:48:37.128 +00:00] [INFO] [querynode/segment_loader.go:124] ["segmentLoader start loading..."] [collectionID=451049109224496362] [segmentType=Growing] [segmentNum=1] [msgID=1801]
[2024/07/23 12:48:37.130 +00:00] [INFO] [querynode/segment_loader.go:879] ["predict memory and disk usage while loading (in MiB)"] [collectionID=451049109224496362] [concurrency=1] [maxSegmentSize=2] [memUsage=77] [freeMemory=63420] [totalMemory=63497] [predictMemUsage=77] [predictPeakMemUsage=80] [diskUsage=0] [predictDiskUsage=0] [totalDisk=102387]
[2024/07/23 12:48:37.130 +00:00] [INFO] [querynode/segment.go:268] ["create segment"] [collectionID=451049109224496362] [partitionID=451049109224496363] [segmentID=451049109224696375] [segmentType=Growing] [vchannel=by-dev-rootcoord-dml_1_451049109224496362v1]
[2024/07/23 12:48:37.130 +00:00] [INFO] [querynode/segment_loader.go:219] ["start to load segments in parallel"] [collectionID=451049109224496362] [segmentType=Growing] [segmentNum=1] [concurrencyLevel=1]
[2024/07/23 12:48:37.130 +00:00] [INFO] [querynode/segment_loader.go:259] ["start loading segment data into memory"] [collectionID=451049109224496362] [partitionID=451049109224496363] [segmentID=451049109224696375] [segmentType=Growing]
2024-07-23 12:48:37,140 | INFO | default | [SEGCORE][ProcessFormattedStatement][milvus] [AWS LOG] [ERROR] 2024-07-23 12:48:37.140 AWSClient [139859884279552] HTTP response code: 400
Resolved remote host IP address: 52.95.156.77
Request ID: 
Exception name: 
Error message: No response body.
6 response headers:
connection : close
content-type : application/xml
date : Tue, 23 Jul 2024 12:48:36 GMT
server : AmazonS3
x-amz-id-2 : 1/R7IiMoQexOA9a6UMaj+ncJ5oZncXS2+ZfcE4V69AXdMSOtG8ZL969bhHw4SWdMhZzFu+EG8Zw=
x-amz-request-id : H8A89DMXX83JF65N

xiaofan-luan commented 1 month ago

It seems that S3 is returning a 400 error:

https://stackoverflow.com/questions/41132683/aws-s3-400-bad-request

I guess this could be a URL or region issue?
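One way to test the region guess is to compare the region embedded in the endpoint host with the `region` value under `externalS3`. The following is a minimal sketch, assuming the host follows the `s3.<region>.amazonaws.com` pattern used in this thread. The helper names `region_from_host` and `check_region` are hypothetical and are not part of Milvus or the AWS SDK.

```python
import re
from typing import List, Optional

# Hypothetical helper, not part of Milvus: matches regional S3 endpoints
# such as "s3.eu-west-3.amazonaws.com" and captures the region part.
_REGIONAL_HOST = re.compile(r"^s3[.-]([a-z0-9-]+)\.amazonaws\.com$")


def region_from_host(host: str) -> Optional[str]:
    """Return the region embedded in a regional S3 host, or None."""
    match = _REGIONAL_HOST.match(host.strip().lower())
    return match.group(1) if match else None


def check_region(cfg: dict) -> List[str]:
    """Return warnings for an externalS3-style mapping (assumed keys: host, region)."""
    warnings = []
    host_region = region_from_host(cfg.get("host", ""))
    configured = (cfg.get("region") or "").strip()
    if host_region and not configured:
        warnings.append(f"host implies region {host_region!r} but region is empty")
    elif host_region and configured != host_region:
        warnings.append(
            f"region {configured!r} does not match host region {host_region!r}"
        )
    return warnings


if __name__ == "__main__":
    # Values taken from the config posted earlier in this thread (no region key).
    for line in check_region({"host": "s3.eu-west-3.amazonaws.com"}):
        print(line)
```

This only inspects the values file; it does not contact S3, so it cannot confirm which region the client actually signs requests for.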

sfisli commented 1 month ago

@xiaofan-luan I updated to 2.3.19 and got this error:

panic: [UnknownError] precheck chunk manager client failed, error:Error in ListObjects[errcode:400, exception:AuthorizationHeaderMalformed, errmessage:Unable to parse ExceptionName: AuthorizationHeaderMalformed Message: The authorization header is malformed; the region 'us-east-1' is wrong; expecting 'eu-west-3', params:params, bucket=milvus-match-video-objects-bucket-prod, prefix=justforconnectioncheck], configuration:[address=s3.eu-west-3.amazonaws.com:80, bucket_name=milvus-match-video-objects-bucket-prod, root_path=., storage_type=remote, cloud_provider=aws, iam_endpoint=, log_level=fatal, region=, useSSL=false, sslCACert=19, useIAM=false, useVirtualHost=false, requestTimeoutMs=10000]

S3 config:

externalS3:
  enabled: true
  region: "eu-west-3"
  host: "s3.eu-west-3.amazonaws.com"
  port: "80"
  accessKey: "AK.."
  secretKey: "6uJL/H9Y.."
  cloudProvider: "aws"
  useSSL: false
  bucketName: "milvus-match-video-objects-bucket-prod"
  rootPath: "/"
  useIAM: false
  iamEndpoint: ""
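Note that the configuration dump in the panic reports `region=` as empty even though the values above set `region: "eu-west-3"`, and the request was signed for `us-east-1`. When scanning logs for this pattern, the two regions can be pulled out of the `AuthorizationHeaderMalformed` message. The following is a minimal sketch, assuming the message wording matches the panic quoted in this comment. The function name `parse_region_mismatch` is hypothetical.

```python
import re
from typing import Dict, Optional

# Matches the wording of the AuthorizationHeaderMalformed message quoted above.
_MISMATCH = re.compile(r"the region '([^']+)' is wrong; expecting '([^']+)'")


def parse_region_mismatch(message: str) -> Optional[Dict[str, str]]:
    """Return the signed and expected regions, or None if the text does not match."""
    match = _MISMATCH.search(message)
    if match is None:
        return None
    return {"sent": match.group(1), "expected": match.group(2)}


if __name__ == "__main__":
    excerpt = (
        "The authorization header is malformed; "
        "the region 'us-east-1' is wrong; expecting 'eu-west-3'"
    )
    print(parse_region_mismatch(excerpt))
    # prints {'sent': 'us-east-1', 'expected': 'eu-west-3'}
```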

stale[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. Rotten issues close after 30d of inactivity. Reopen the issue with /reopen.