noobaa / noobaa-core

High-performance S3 application gateway to any backend - file / s3-compatible / multi-clouds / caching / replication ...
https://www.noobaa.io
Apache License 2.0
273 stars 80 forks source link

Enable support for Memory in HPA view #6737

Open akmithal opened 3 years ago

akmithal commented 3 years ago

Environment info

Actual behavior

  1. Download of large no. of files failed with error

Expected behavior

  1. Download shouldn't fail

Steps to reproduce

  1. In order to attempt verification of defect 6707, found this new set of errors.

From app-server, got error:

download failed: s3://bucket-3/obj_112983 to ./obj_112983 An error occurred (504) when calling the GetObject operation: Gateway Timeout

From endpoint logs:

Sep-30 6:09:18.035 [Endpoint/13] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>SlowDown</Code><Message>Reduce your request rate.</Message><Resource>
/bucket-3/obj_112711</Resource><RequestId>ku6jccni-1vil04-1bef</RequestId></Error> GET /bucket-3/obj_112711 {"accept-encoding":"identity","user-agent":"aws-cli/2.1.29 Python/3.8.8 Linux/4.18.0-305.
19.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.sync","x-amz-date":"20210930T060718Z","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","authorizat
ion":"AWS4-HMAC-SHA256 Credential=BrVzxXwyhi470cqA3s0c/20210930/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=e49d840a75a56c9a982db4f536a8e48fdd3997c44b0e
4742a8c23e6de867abfd","host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-port":"443","x-forwarded-proto":"https","for
warded":"for=10.17.66.193;host=s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com;proto=https","x-forwarded-for":"10.17.66.193"} Error: Semaphore Timeout
    at Semaphore._on_timeout (/root/node_modules/noobaa-core/src/util/semaphore.js:233:25)
    at Timeout.<anonymous> (/root/node_modules/noobaa-core/src/util/semaphore.js:229:53)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)
Sep-30 6:09:18.060 [Endpoint/13]    [L0] core.endpoint.s3.ops.s3_get_object:: request aborted: undefined
Sep-30 6:09:18.068 [Endpoint/13]    [L0] core.sdk.object_sdk:: validate_non_nsfs_bucket:  { name: SENSITIVE-7fa8f881394f8a6d, email: SENSITIVE-3d288e97f8a66631, access_keys: [ { access_key: SENSITI
VE-d22de9a58fe9d068, secret_key: SENSITIVE-ebf6e1ecaa99829a } ], has_login: false, has_s3_access: true, allowed_buckets: { full_permission: true }, default_resource: 'nsfs-nsr-1', can_create_bucket
s: true, nsfs_account_config: { gid: 800, uid: 5000, nsfs_only: true, new_buckets_path: '/', backend: 'GPFS', warn_threshold_ms: 100 }, systems: [ { name: 'noobaa', roles: [ 'admin' ] } ], external
_connections: { count: 0, connections: [] }, preferences: { ui_theme: 'DARK' } } { fs_root_path: '/nsfs/nsfs-nsr-1', fs_backend: 'GPFS', id: '6151b1d55fd0290029f00859', name: 'nsfs-nsr-1' }
Sep-30 6:09:18.068 [Endpoint/13]    [L1] core.endpoint.s3.s3_rest:: S3 REQUEST GET /bucket-3/obj_113071 op get_object request_id ku6jex9w-2hugkd-14z7 { 'accept-encoding': 'identity', 'user-agent': 
'aws-cli/2.1.29 Python/3.8.8 Linux/4.18.0-305.19.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.sync', 'x-amz-date': '20210930T060918Z', 'x-amz-content-sha256': 'e3b0c44298fc1c149afbf4c8996
fb92427ae41e4649b934ca495991b7852b855', authorization: 'AWS4-HMAC-SHA256 Credential=BrVzxXwyhi470cqA3s0c/20210930/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Sign
ature=e3c108699b6e59bcfa59fe7a0f987bce0fc6093e6e4bd342a9b5e6befac0a90d', host: 's3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com', 'x-forwarded-host': 's3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com', 'x-f
orwarded-port': '443', 'x-forwarded-proto': 'https', forwarded: 'for=10.17.66.193;host=s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com;proto=https', 'x-forwarded-for': '10.17.66.193' }
Sep-30 6:09:18.070 [Endpoint/13]    [L1] core.endpoint.s3.ops.s3_get_object:: reading object undefined 512
Sep-30 6:09:18.070 [Endpoint/13]    [L0] core.sdk.namespace_fs:: NamespaceFS: read_object_stream { file_path: '/nsfs/nsfs-nsr-1/bucket-3/obj_113071', start: 0, end: Infinity }
Sep-30 6:09:18.071 [Endpoint/13]    [L1] core.util.buffer_utils:: BufferPool.get_buffer BuffersPool {
  buf_size: 8388608,
  buffers: [],
  sem: Semaphore {
    _initial: 536870912,
    _value: 0,
    _waiting_value: 3347054592,
    _wq: WaitQueue {
      _name: '_wq_',
      _q: <ref *2> LinkedList {
        next: '_lln_',
        prev: '_llp_',
        head: '_llh_',
        length: 399,
        _lln_: <ref *7> {
          count: 8388608,
          time: 1632982038135,
          _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
          _lln_: <ref *6> {
            count: 8388608,
            time: 1632982057848,
            _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
            _lln_: <ref *5> {
              count: 8388608,
              time: 1632982058049,
              _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
              _lln_: <ref *4> {
                count: 8388608,
                time: 1632982058158,
                _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
                _lln_: <ref *3> { count: 8388608, time: 1632982058163, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } }, _lln_: <ref *1> { count: 8388608, time: 1632982058180, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: [Promise] }, _lln_: { count: 8388608, time: 1632982058528, _wq_: [Defer], _lln_: [Object], _llp_: [Circular *1], _llh_: [Circular *2] }, _llp_: [Circular *3], _llh_: [Circular *2] }, _llp_: [Circular *4], _llh_: [Circular *2] },
                _llp_: [Circular *5],
                _llh_: [Circular *2]
              },
              _llp_: [Circular *6],
              _llh_: [Circular *2]
            },
            _llp_: [Circular *7],
            _llh_: [Circular *2]
          },
          _llp_: [Circular *2],
          _llh_: [Circular *2]
        },
        _llp_: <ref *8> {
          count: 8388608,
          time: 1632982157888,
          _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
          _lln_: [Circular *2],
          _llp_: <ref *9> {
            count: 8388608,
            time: 1632982157784,
            _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
            _lln_: [Circular *8],
            _llp_: <ref *10> {
              count: 8388608,
              time: 1632982157778,
              _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
              _lln_: [Circular *9],
              _llp_: <ref *11> {
                count: 8388608,
                time: 1632982157631,
                _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } },
                _lln_: [Circular *10],
                _llp_: <ref *12> { count: 8388608, time: 1632982157630, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: Promise { <pending> } }, _lln_: [Circular *11], _llp_: <ref *13> { count: 8388608, time: 1632982157629, _wq_: Defer { isPending: true, isResolved: false, isRejected: false, _promise_resolve: [Function (anonymous)], _promise_reject: [Function (anonymous)], promise: [Promise] }, _lln_: [Circular *12], _llp_: { count: 8388608, time: 1632982157618, _wq_: [Defer], _lln_: [Circular *13], _llp_: [Object], _llh_: [Circular *2] }, _llh_: [Circular *2] }, _llh_: [Circular *2] },
                _llh_: [Circular *2]
              },
              _llh_: [Circular *2]
            },
            _llh_: [Circular *2]
          },
          _llh_: [Circular *2]
        },
        _llh_: [Circular *2]
      }
    },
    _verbose: false,
    _timeout: 120000,
    _timeout_error_code: 'IO_STREAM_ITEM_TIMEOUT',
    _work_timeout_error_code: 'SEMAPHORE_WORKER_TIMEOUT',
    _timer: Timeout { _idleTimeout: 101, _idlePrev: [TimersList], _idleNext: [TimersList], _idleStart: 233845656, _onTimeout: [Function (anonymous)], _timerArgs: undefined, _repeat: null, _destroyed: false, [Symbol(refed)]: true, [Symbol(kHasPrimitive)]: false, [Symbol(asyncId)]: 495467329, [Symbol(triggerId)]: 495467301 }
  }
}
Sep-30 6:09:18.136 [Endpoint/13] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>SlowDown</Code><Message>Reduce your request rate.</Message><Resource>/bucket-3/obj_112712</Resource><RequestId>ku6jccqc-3k264q-4eu</RequestId></Error> GET /bucket-3/obj_112712 {"accept-encoding":"identity","user-agent":"aws-cli/2.1.29 Python/3.8.8 Linux/4.18.0-305.19.1.el8_4.x86_64 exe/x86_64.rhel.8 prompt/off command/s3.sync","x-amz-date":"20210930T060718Z","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","authorization":"AWS4-HMAC-SHA256 Credential=BrVzxXwyhi470cqA3s0c/20210930/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=42087996203a733867660ef587b9e8a5dbe321f355c3c6bcdde2c0b5d9ba9ac8","host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-host":"s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com","x-forwarded-port":"443","x-forwarded-proto":"https","forwarded":"for=10.17.66.193;host=s3-noobaa.apps.ocp-akshat-1.cp.fyre.ibm.com;proto=https","x-forwarded-for":"10.17.66.193"} Error: Semaphore Timeout
    at Semaphore._on_timeout (/root/node_modules/noobaa-core/src/util/semaphore.js:233:25)
    at Timeout.<anonymous> (/root/node_modules/noobaa-core/src/util/semaphore.js:229:53)
    at listOnTimeout (internal/timers.js:557:17)
    at processTimers (internal/timers.js:500:7)

More information - Screenshots / Logs / Other output

akmithal commented 3 years ago

must-gather-3009_2.tar.gz

Hi @jeniawhite , @nimrod-becker - This is something new found just now.

akmithal commented 3 years ago

Instead of this failure, shouldn't there be delay introduced for serving this IO request ?

Please see from logs that memory and CPU utilization wasn't high when these errors got triggered.

jeniawhite commented 3 years ago

@akmithal Works as designed, we return the slow-down message to the client and expect it to reduce the rate of requests. This is returned due to a bottleneck on memory utilization for the NSFS flows. The next step would be to add memory utilization to the HPA and fine-tune the amount of memory that we allocate for IO in the NSFS flows.

akmithal commented 3 years ago

Hi @jeniawhite , are we going to trim down the error logs for this error. If yes, you can use this bug to track this work. If it is going to be like this only / PR already open, will close this bug then.

akmithal commented 2 years ago

Hi @nimrod-becker , Is there a PR/workitem created to track this work "add memory utilization to HPA"? Else, we can use this defect for it. FYI @romayalon

nimrod-becker commented 2 years ago

We can keep this issue until we have a task tracking this, basically, we wanted to go with something that eventually did not make it to openshift so we need to wait until the new way is available.

rkomandu commented 2 years ago

@romayalon , could you link the JIRA ticket here for reference ?

nimrod-becker commented 2 years ago

No Jira ticket, this needs to be scheduled for a certain release

github-actions[bot] commented 5 months ago

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.

naveenpaul1 commented 5 months ago

dont close

github-actions[bot] commented 2 months ago

This issue had no activity for too long - it will now be labeled stale. Update it to prevent it from getting closed.