noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0
269 stars 78 forks source link

NSFS | S3 | Versioning: CopyObject after multipart upload of a versioned object leads to invalid data content #8315

Closed hseipp closed 1 month ago

hseipp commented 1 month ago

Environment info

Actual behavior

Content of the copied object whose source was uploaded to a version-enabled bucket using multi-part upload is not identical to the source object:

E       AssertionError: assert 'bIMcUvjvnUme...cjoUdmaMHmNZD' == 'avLZIzDyrPsQ...cjoUdmaMHmNZD'
E         
E         - avLZIzDyrPsQTmrmiOZHrVdSZCtRFVsosTSLGjluRPBZGbXZNTKAKSWyhuegjyevNmYTqVxjrYlMllwCdCjefarPnMGBgGnnQebJjAdaIVSqUbhSkwakohDylIGMTJKgpjxJlmHEzycMbQdOvGdimZxcOFJnkBYBSvQEXYifdmBKfrgJOZXdVcvzbjsZFBOxvOQNqsAKqfUYXUBEBWzfQdGMgFBnXfiJWUrWOwLWsBaAPIAmkjzFVtHXETyLiLgZkKvjtbBkfDcFLxGBgnIeMjQBJYNzCVkzidUdaPqzaVYawKkIbRKqZsyjPSrpXrqIIECWMzhPQNkjyuEHfCSFeJPIqfBfxpPBooDkhsmDzmIMllnrYugjdonucmiOXipomGLWIshDfqhTWCZbjBXNusYadCgZWxyBpyloWHbDHGBGKFzqiFTvRemtGPKJSYwyDEFhjiJPTVeBGfbIaUcHoMDoIpSPFSkGgmOiNhHAluJAvymOtuOhvpobXaUbGoQCgLLAXBJvpBumNZJNBGsEyIoqnSxzoQlgqnZxGwZMvsGxoKkIOXlzLoAsscwzDB...

Expected behavior

Content of the copied object should be identical to the source object, assertion should pass.

Steps to reproduce

Run Ceph s3-tests test_object_copy_versioning_multipart_upload():

S3TEST_CONF=s3tests.conf tox -- s3tests_boto3/functional/test_s3.py::test_object_copy_versioning_multipart_upload

More information - Screenshots / Logs / Other output

See "all" logging level noobaa log attached - look at log entries at 15:34:00 and later:

noobaa_20240828_1748.log.gz

romayalon commented 1 month ago

@nadavMiz can you please take a look?

romayalon commented 1 month ago

@hseipp do you have libgpfs in this deployment? I saw this event in the logs -

"code":"noobaa_gpfslib_missing","message":"Noobaa GPFS library file is missing","description":"Noobaa GPFS library file is missing","entity_type":"NODE","event_type":"STATE_CHANGE","scope":"NODE","severity":"ERROR","state":"DEGRADED","arguments":{},"pid":1440}

Also adding here an error I saw in the logs, not sure it's causing the bug -

core.sdk.object_sdk:: Got an error while trying to omitBy param.xattr: { [Symbol(XATTR_SORT_SYMBOL)]: true } error: TypeError: name.startsWith is not a function    at /usr/local/noobaa-core/src/sdk/object_sdk.js:645:89    at /usr/local/noobaa-core/node_modules/lodash/lodash.js:10660:37    at /usr/local/noobaa-core/node_modules/lodash/lodash.js:13658:16    at basePickBy (/usr/local/noobaa-core/node_modules/lodash/lodash.js:3825:13)    at pickBy (/usr/local/noobaa-core/node_modules/lodash/lodash.js:13657:14)    at Function.omitBy (/usr/local/noobaa-core/node_modules/lodash/lodash.js:13607:14)    at NsfsObjectSDK.fix_copy_source_params (/usr/local/noobaa-core/src/sdk/object_sdk.js:645:48)    at async op_func (/usr/local/noobaa-core/src/sdk/object_sdk.js:730:41)    at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)    at async Object.put_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_put_object.js:39:19)
Aug 28 15:34:47 m1.example.com node[9119]: Aug-28 15:34:47.626 [nsfs/9119]    [L3] core.sdk.object_sdk:: Got an error while trying to omitBy param.xattr: { [Symbol(XATTR_SORT_SYMBOL)]: true } error: TypeError: name.startsWith is not a function
Aug 28 15:34:47 m1.example.com node[9119]:     at /usr/local/noobaa-core/src/sdk/object_sdk.js:645:89
Aug 28 15:34:47 m1.example.com node[9119]:     at /usr/local/noobaa-core/node_modules/lodash/lodash.js:10660:37
Aug 28 15:34:47 m1.example.com node[9119]:     at /usr/local/noobaa-core/node_modules/lodash/lodash.js:13658:16
Aug 28 15:34:47 m1.example.com node[9119]:     at basePickBy (/usr/local/noobaa-core/node_modules/lodash/lodash.js:3825:13)
Aug 28 15:34:47 m1.example.com node[9119]:     at pickBy (/usr/local/noobaa-core/node_modules/lodash/lodash.js:13657:14)
Aug 28 15:34:47 m1.example.com node[9119]:     at Function.omitBy (/usr/local/noobaa-core/node_modules/lodash/lodash.js:13607:14)
Aug 28 15:34:47 m1.example.com node[9119]:     at NsfsObjectSDK.fix_copy_source_params (/usr/local/noobaa-core/src/sdk/object_sdk.js:645:48)
Aug 28 15:34:47 m1.example.com node[9119]:     at async op_func (/usr/local/noobaa-core/src/sdk/object_sdk.js:730:41)
Aug 28 15:34:47 m1.example.com node[9119]:     at async NsfsObjectSDK._call_op_and_update_stats (/usr/local/noobaa-core/src/sdk/object_sdk.js:543:27)
Aug 28 15:34:47 m1.example.com node[9119]:     at async Object.put_object [as handler] (/usr/local/noobaa-core/src/endpoint/s3/ops/s3_put_object.js:39:19)
Aug 28 15:34:47 m1.example.com node[9119]: Aug-28 15:34:47.626 [nsfs/9119]    [L2] core.sdk.namespace_fs:: NamespaceFS: is_server_side_copy: true
hseipp commented 1 month ago

Interesting. I am using a Storage Scale DME 5.2.1 GA installation.

$ sudo find / -name "libgpfs.so"
/usr/lib/libgpfs.so
/usr/lib64/libgpfs.so
/usr/lpp/mmfs/lib/i386/libgpfs.so
/usr/lpp/mmfs/lib/libgpfs.so
nadavMiz commented 1 month ago

we have an error in the code while trying to read the :

Aug 28 15:35:06 m1.example.com [9119]: [nsfs/9119]    [L0] core.sdk.namespace_fs:: NamespaceFS: read_object_stream { file_path: '/ibm/fs1/examples/buckets/s3tests-j2s4w9wx4zqvvi9wevnqs-1/dstmultipart', st
art: 0, end: Infinity, size: 31457280 }
Aug 28 15:35:06 m1.example.com [9119]: [nsfs/9119]    [L1] core.util.buffer_utils:: BufferPool.get_buffer: sem value 1342177280 waiting_value 0 buffers length 2
Aug 28 15:35:06 m1.example.com [9119]: [nsfs/9119]    [L1] core.util.buffer_utils:: BufferPool.get_buffer: sem value 1333788672 waiting_value 0 buffers length 1
Aug 28 15:35:06 m1.example.com [9119]: [nsfs/9119]    [L0] core.sdk.object_sdk:: request aborted /s3tests-j2s4w9wx4zqvvi9wevnqs-1/srcmultipart
Aug 28 15:35:06 m1.example.com node[9119]: Aug-28 15:35:06.573 [nsfs/9119]    [L0] core.sdk.object_sdk:: request aborted /s3tests-j2s4w9wx4zqvvi9wevnqs-1/srcmultipart
Aug 28 15:35:06 m1.example.com node[9119]: Aug-28 15:35:06.574 [nsfs/9119]    [L0] core.sdk.object_sdk:: request error: Error: aborted
Aug 28 15:35:06 m1.example.com node[9119]:     at connResetException (node:internal/errors:787:14)
Aug 28 15:35:06 m1.example.com node[9119]:     at abortIncoming (node:_http_server:793:17)
Aug 28 15:35:06 m1.example.com node[9119]:     at socketOnClose (node:_http_server:787:3)
Aug 28 15:35:06 m1.example.com node[9119]:     at TLSSocket.emit (node:events:530:35)
Aug 28 15:35:06 m1.example.com node[9119]:     at TLSSocket.emit (node:domain:488:12)
Aug 28 15:35:06 m1.example.com node[9119]:     at node:net:337:12
Aug 28 15:35:06 m1.example.com node[9119]:     at Socket.done (node:_tls_wrap:657:7)
Aug 28 15:35:06 m1.example.com node[9119]:     at Object.onceWrapper (node:events:633:26)
Aug 28 15:35:06 m1.example.com node[9119]:     at Socket.emit (node:events:530:35)
Aug 28 15:35:06 m1.example.com node[9119]:     at Socket.emit (node:domain:488:12) {
Aug 28 15:35:06 m1.example.com node[9119]:   code: 'ECONNRESET'
Aug 28 15:35:06 m1.example.com node[9119]: } /s3tests-j2s4w9wx4zqvvi9wevnqs-1/srcmultipart
Aug 28 15:35:06 m1.example.com [9119]: [nsfs/9119]    [L0] core.sdk.object_sdk:: request error: Error: aborted    at connResetException (node:internal/errors:787:14)    at abortIncoming (node:_http_server
:793:17)    at socketOnClose (node:_http_server:787:3)    at TLSSocket.emit (node:events:530:35)    at TLSSocket.emit (node:domain:488:12)    at node:net:337:12    at Socket.done (node:_tls_wrap:657:7)   
 at Object.onceWrapper (node:events:633:26)    at Socket.emit (node:events:530:35)    at Socket.emit (node:domain:488:12) {  code: 'ECONNRESET'} /s3tests-j2s4w9wx4zqvvi9wevnqs-1/srcmultipart
Aug 28 15:35:06 m1.example.com node[9119]: Aug-28 15:35:06.574 [nsfs/9119]    [L0] core.sdk.namespace_fs:: NamespaceFS: read_object_stream error file /ibm/fs1/examples/buckets/s3tests-j2s4w9wx4zqvvi9wevnqs-1/srcmultipart AbortError: The operation was aborted
Aug 28 15:35:06 m1.example.com node[9119]:     at EventTarget.abortListener (node:events:1006:14)
Aug 28 15:35:06 m1.example.com node[9119]:     at [nodejs.internal.kHybridDispatch] (node:internal/event_target:826:20)
Aug 28 15:35:06 m1.example.com node[9119]:     at EventTarget.dispatchEvent (node:internal/event_target:761:26)
Aug 28 15:35:06 m1.example.com node[9119]:     at abortSignal (node:internal/abort_controller:371:10)
Aug 28 15:35:06 m1.example.com node[9119]:     at AbortController.abort (node:internal/abort_controller:393:5)
Aug 28 15:35:06 m1.example.com node[9119]:     at IncomingMessage.<anonymous> (/usr/local/noobaa-core/src/sdk/object_sdk.js:135:35)
Aug 28 15:35:06 m1.example.com node[9119]:     at Object.onceWrapper (node:events:632:28)
Aug 28 15:35:06 m1.example.com node[9119]:     ... 2 lines matching cause stack trace ...
Aug 28 15:35:06 m1.example.com node[9119]:     at IncomingMessage._destroy (node:_http_incoming:224:10) {
Aug 28 15:35:06 m1.example.com node[9119]:   code: 'ABORT_ERR',
Aug 28 15:35:06 m1.example.com node[9119]:   [cause]: Error: request aborted /s3tests-j2s4w9wx4zqvvi9wevnqs-1/srcmultipart
Aug 28 15:35:06 m1.example.com node[9119]:       at IncomingMessage.<anonymous> (/usr/local/noobaa-core/src/sdk/object_sdk.js:135:41)
Aug 28 15:35:06 m1.example.com node[9119]:       at Object.onceWrapper (node:events:632:28)
Aug 28 15:35:06 m1.example.com node[9119]:       at IncomingMessage.emit (node:events:518:28)
Aug 28 15:35:06 m1.example.com node[9119]:       at IncomingMessage.emit (node:domain:488:12)
Aug 28 15:35:06 m1.example.com node[9119]:       at IncomingMessage._destroy (node:_http_incoming:224:10)
Aug 28 15:35:06 m1.example.com node[9119]:       at _destroy (node:internal/streams/destroy:121:10)
Aug 28 15:35:06 m1.example.com node[9119]:       at IncomingMessage.destroy (node:internal/streams/destroy:83:5)
Aug 28 15:35:06 m1.example.com node[9119]:       at abortIncoming (node:_http_server:793:9)
Aug 28 15:35:06 m1.example.com node[9119]:       at socketOnClose (node:_http_server:787:3)
Aug 28 15:35:06 m1.example.com node[9119]:       at TLSSocket.emit (node:events:530:35)
Aug 28 15:35:06 m1.example.com node[9119]: }

there are few things I observed when replicating in my environment:

  1. this issue doesn't always happens. it happens to me around 1 in 5 times.
  2. the issue doesn't happen if I change the test to not use versioning this likely happens because in non-versioned environment. this is likely because in non versioned scenario we usually set the copy mode to SAME_INODE or LINKED while in versioned scenario we always use FALLBACK and copy the file entirely. when changing the code to always use FALLBACK, the issue happens even when versioning is disabled. so it seems that the problem is with FALLBACK and more specifically in the read_object_stream function. looks we we get an abort signal while trying to read the file
  3. it happens only when trying to copy large files, when trying to copy files smaller than 25 MB the issue doesn't seem to happen
hseipp commented 1 month ago

@romayalon Regarding your observation on libgpfs not being loaded https://github.com/noobaa/noobaa-core/issues/8315#issuecomment-2325896368 it looks like an issue on my particular VM upon bootup. There seems to be a race condition where noobaa is trying to access a file on the cesShared Scale Filesystem before it is mounted. mmces service stop s3 && mmces service start s3 resolves the situation.

hseipp commented 1 month ago

Note that I can also produce the issue with Ceph s3-tests test_versioning_obj_create_overwrite_multipart(). This time I double-checked that libgpfs.so got properly loaded.