noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0

Upload with EnableMD5 set to true during HA (FOFB) fails with Error: mismatch part etag #8510

Open rkomandu opened 2 weeks ago

rkomandu commented 2 weeks ago

Environment info

noobaa d/s rpm = noobaa-core-5.17.1-20241104.el9 (standalone Noobaa)

Actual behavior

1. Uploaded an object with EnableMD5 set to true. During the upload, HA moved the CES IP from one node to the other. IO continued while the HA process was in progress, but at the end the upload failed with the error shown below:

"upload failed: ./file_50G to s3://newbucket-ha-reg/file_50G-obj An error occurred (InternalError) when calling the CompleteMultipartUpload operation (reached max retries: 4): We encountered an internal error. Please try again."

However, in the noobaa.logs on the node, after all the parts were uploaded, the following etag mismatch was logged:


Nov  5 03:31:57 node-gui0 [3516845]: [nsfs/3516845]    [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART newbucket-ha-reg file_50G-obj 5098
Nov  5 03:31:58 node-gui0 [3516845]: [nsfs/3516845]    [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART newbucket-ha-reg file_50G-obj 5108
Nov  5 03:33:07 node-gui0 [3516845]: [nsfs/3516845]    [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART newbucket-ha-reg file_50G-obj 6390
Nov  5 03:33:07 node-gui0 [3516845]: [nsfs/3516845]    [L0] core.endpoint.s3.ops.s3_put_object_uploadId:: PUT OBJECT PART newbucket-ha-reg file_50G-obj 6400
...
Nov  5 03:33:08 node-gui0 [3516845]: [nsfs/3516845] [ERROR] core.sdk.namespace_fs::  Error: mismatch part etag: {  num: 164,  etag: '96995b58d4cbf6aaa9041b4f00c7f6ae',  md_part_path: '/ibm/fvt_fs/s3user-17001-dir/newbucket-ha-reg/.noobaa-nsfs_6729d23a52a14216974196a6/multipart-uploads/27421b2a-f3ba-4755-9f8e-32cb11d85e85/part-164',  md_part_stat: { dev: 45, ino: 1273543, mode: 33200, nlink: 1, uid: 17001, gid: 17000, rdev: 0, size: 0, blksize: 4194304, blocks: 0, atimeMs: 1730795197969.724, ctimeMs: 1730795197969.724, mtimeMs: 1730795197969.724, birthtimeMs: 1730795197969.724, atime: 2024-11-05T08:26:37.970Z, mtime: 2024-11-05T08:26:37.970Z, ctime: 2024-11-05T08:26:37.970Z, birthtime: 2024-11-05T08:26:37.970Z, atimeNsBigint: 1730795197969723904n, ctimeNsBigint: 1730795197969723904n, mtimeNsBigint: 1730795197969723904n, xattr: { 'security.selinux': 'system_u:object_r:unlabeled_t:s0\x00' } },  params: {    obj_id: '27421b2a-f3ba-4755-9f8e-32cb11d85e85',    bucket: 'newbucket-ha-reg',    key: 'file_50G-obj',    md_conditions: undefined,    multiparts: [      { num: 1, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 2, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 3, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 4, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 5, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 6, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 7, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 8, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 9, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 10, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 11, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 12, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 13, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 14, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 15, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 16, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 17, 
etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 18, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 19, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 20, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 21, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 22, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 23, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 24, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 25, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 26, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 27, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 28, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 29, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 30, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 31, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 32, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 33, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 34, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 35, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 36, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 37, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 38, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 39, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 40, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 41, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 42, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 43, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 44, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 45, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 46, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 47, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 48, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 49, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 50, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 51, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, 
{ num: 52, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 53, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 54, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 55, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 56, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 57, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 58, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 59, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 60, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 61, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 62, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 63, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 64, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 65, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 66, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 67, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 68, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },
...
m: 84, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 85, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 86, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 87, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 88
Error</Code><Message>We encountered an internal error. Please try again.</Message><Resource>/newbucket-ha-reg/file_50G-obj?uploadId=27421b2a-f3ba-4755-9f8e-32cb11d85e85</Resource><RequestId>m347079p-2anugj-1c5s</RequestId></Error> POST /newbucket-ha-reg/file_50G-obj?uploadId=27421b2a-f3ba-4755-9f8e-32cb11d85e85 {"host":"gpfs-p10-s3-ces.rtp.raleigh.ibm.com:6443","accept-encoding":"identity","user-agent":"aws-cli/1.29.62 md/Botocore#1.31.62 ua/2.0 os/linux#5.14.0-427.42.1.el9_4.ppc64le md/arch#ppc64le lang/python#3.9.18 md/pyimpl#CPython cfg/retry-mode#legacy botocore/1.31.62","x-amz-date":"20241105T083308Z","x-amz-content-sha256":"b5dd75d3efc4e6519e5cb8f1964de4384b98e6c525d13f90b534b46c45420530","authorization":"AWS4-HMAC-SHA256 Credential=KCxP4AN9937kVqoCrNIs/20241105/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=0aea3ca2daf5c83d13dda43f3894be499a63e300117e67239cb291aa670dd79a","amz-sdk-invocation-id":"65e60f6d-eeb5-4ad9-87ab-999b44c7acd4","amz-sdk-request":"attempt=1","content-length":"568592"} Error: mismatch part etag:
{  num: 164,  etag: '96995b58d4cbf6aaa9041b4f00c7f6ae',  md_part_path: '/ibm/fvt_fs/s3user-17001-dir/newbucket-ha-reg/.noobaa-nsfs_6729d23a52a14216974196a6/multipart-uploads/27421b2a-f3ba-4755-9f8e-32cb11d85e85/part-164',  md_part_stat: { dev: 45, ino: 1273543, mode: 33200, nlink: 1, uid: 17001, gid: 17000, rdev: 0, size: 0, blksize: 4194304, blocks: 0, atimeMs: 1730795197969.724, ctimeMs: 1730795197969.724, mtimeMs: 1730795197969.724, birthtimeMs: 1730795197969.724, atime: 2024-11-05T08:26:37.970Z, mtime: 2024-11-05T08:26:37.970Z, ctime: 2024-11-05T08:26:37.970Z, birthtime: 2024-11-05T08:26:37.970Z, atimeNsBigint: 1730795197969723904n, ctimeNsBigint: 1730795197969723904n, mtimeNsBigint: 1730795197969723904n, xattr: { 'security.selinux': 'system_u:object_r:unlabeled_t:s0\x00' } },  params: {    obj_id: '27421b2a-f3ba-4755-9f8e-32cb11d85e85',    bucket: 'newbucket-ha-reg',    key: 'file_50G-obj',    md_conditions: undefined,    multiparts: [      { num: 1, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 2, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 3, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 4, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 5, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 6, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 7, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 8, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 9, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },  { num: 10, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 11, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 12, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 13, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 14, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 15, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 16, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 17, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 18, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 
19, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 20, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },      { num: 21, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' }, { num: 22, etag: '96995b58d4cbf6aaa9041b4f00c7f6ae' },
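
One detail in the error above may be worth checking: `md_part_stat` for part 164 reports `size: 0` and `blocks: 0`, while the client sent a regular 32-character hex etag for that part. The sketch below is a hypothetical helper (it is not part of NooBaa, and the function name `summarize_mismatch` is made up for illustration) that pulls these fields out of a `mismatch part etag` line and compares the client etag with the MD5 of empty data. `LOG_EXCERPT` is an abbreviated copy of the fields logged above, not the full line. This is only a way to read the log, not a confirmed root cause.

```python
import hashlib
import re

# Abbreviated copy of the fields logged for part 164 in the error above.
LOG_EXCERPT = (
    "Error: mismatch part etag: {  num: 164,  "
    "etag: '96995b58d4cbf6aaa9041b4f00c7f6ae',  "
    "md_part_stat: { dev: 45, ino: 1273543, size: 0, blksize: 4194304, blocks: 0 }"
)


def summarize_mismatch(line):
    """Return the first part number, etag and stat size found in a log line.

    Hypothetical helper for reading the log; returns None if a field is missing.
    """
    num = re.search(r"\bnum: (\d+)", line)
    etag = re.search(r"\betag: '([0-9a-f]{32})'", line)
    # \b keeps this from matching the "size" inside "blksize".
    size = re.search(r"\bsize: (\d+)", line)
    if not (num and etag and size):
        return None
    return {
        "num": int(num.group(1)),
        "etag": etag.group(1),
        "size": int(size.group(1)),
    }


info = summarize_mismatch(LOG_EXCERPT)
empty_md5 = hashlib.md5(b"").hexdigest()

print("part:", info["num"])
print("part file is empty on disk:", info["size"] == 0)
print("client etag equals MD5 of empty data:", info["etag"] == empty_md5)
```

If the part file really was empty when CompleteMultipartUpload ran, the etag sent by the client for the data it uploaded could not match anything derived from that file, which would be consistent with the mismatch reported for part 164.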

Expected behavior

1. What is the reason for the etag mismatch? The system uses a round-robin (RR) DNS setup, so IO continues through the HA failover mechanism and the upload should not get this error.

Steps to reproduce

1. Upload a large object and, while the upload is running, generate an assert for the gpfs daemon (this stops all services, starts the gpfs daemon, and starts the services back). The upload should be successful.

More information - Screenshots / Logs / Other output

I am posting the logs of noobaa on the protocol nodes (2 of them) and gpfs logs as well.

https://ibm.ent.box.com/folder/292622193321

naveenpaul1 commented 1 day ago

Tried to reproduce the issue on both a local system and an HA Scale system and could not reproduce it; will try a few more times. When the issue was last reproduced, complete logging was not enabled, so we need to reproduce it with all logs enabled. cc: @rkomandu @romayalon