globaldothealth / list

Repository for Global.health: a data science initiative to enable rapid sharing of trusted and open public health data to advance the response to infectious diseases.
MIT License

Fix 60-second request timeout / socket hang up in dev/prod #961

Closed: axmb closed this issue 3 years ago

axmb commented 4 years ago

Describe the bug
Requests in dev/prod time out after 60 seconds.

To Reproduce
Send an API request, say to batch upsert, with a large amount of data (>10k rows).

Expected behavior
Requests, at least batch upsert, should allow more time to complete.

Environment (please complete the following information):
Only occurs in dev/prod -- can't repro locally. I've seen the issue both for bulk upload and ADI, so the issue isn't specific to the UI/browser.

I've previously tweaked our nginx config to avoid 504s, but in these cases, the API client gets 500: socket hang up.
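
For reference, a minimal sketch of the kind of client call that hits this, assuming an axios client configured roughly like the one in the error dumps below (no client-side timeout, large JSON body). The host is taken from the nginx logs further down; the case fields, case count, and missing auth are placeholders, not real data:

```typescript
import axios from 'axios';

// Hypothetical reproduction: POST a large batch of cases with no client-side
// timeout. The payload shape mirrors the batchUpsert/batchValidate bodies in
// the error dumps below; the field values are placeholders.
async function reproduce(): Promise<void> {
  const cases = Array.from({ length: 10000 }, (_, i) => ({
    caseReference: {
      sourceId: 'example-source-id',
      sourceUrl: 'https://example.com/data.csv',
    },
    location: { country: 'Example', geoResolution: 'Country', name: `Case ${i}` },
    events: [
      { name: 'confirmed', dateRange: { start: '04/01/2020Z', end: '04/01/2020Z' } },
    ],
  }));
  try {
    // timeout: 0 means axios never aborts the request itself; the ~60s cutoff
    // comes from a proxy or the server side, so the client just sees
    // "socket hang up" / ECONNRESET.
    const res = await axios.post(
      'https://dev-curator.ghdsi.org/api/cases/batchUpsert',
      { cases },
      { timeout: 0, maxContentLength: Infinity },
    );
    console.log('status:', res.status);
  } catch (err) {
    console.error('request failed:', (err as Error).message);
  }
}

reproduce();
```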

axmb commented 3 years ago

I've loaded #962 into dev, but it's tough to say whether it had an impact. I'm still seeing hang-ups after 1.1 min -- specifically, the connection appears to be closed on the data service side mid-batchValidate, during a request from the curator service.

Error: socket hang up
    at connResetException (internal/errors.js:612:14)
    at Socket.socketOnEnd (_http_client.js:493:23)
    at Socket.emit (events.js:326:22)
    at Socket.EventEmitter.emit (domain.js:486:12)
    at endReadableNT (_stream_readable.js:1244:12)
    at processTicksAndRejections (internal/process/task_queues.js:80:21) {
  code: 'ECONNRESET',
  config: {
    url: 'http://data-dev/api/cases/batchValidate',
    method: 'post',
    data: '{"cases":[{
        [...]
  headers: {
      Accept: 'application/json, text/plain, */*',
      'Content-Type': 'application/json;charset=utf-8',
      'User-Agent': 'axios/0.19.2',
      'Content-Length': 6478905
    },
    transformRequest: [ [Function: transformRequest] ],
    transformResponse: [ [Function: transformResponse] ],
    timeout: 0,
    adapter: [Function: httpAdapter],
    xsrfCookieName: 'XSRF-TOKEN',
    xsrfHeaderName: 'X-XSRF-TOKEN',
    maxContentLength: Infinity,
    validateStatus: [Function: validateStatus]
  },
  request: <ref *1> Writable {
    _writableState: WritableState {
      objectMode: false,
      highWaterMark: 16384,
      finalCalled: false,
      needDrain: false,
      ending: false,
      ended: false,
      finished: false,
      destroyed: false,
      decodeStrings: true,
      defaultEncoding: 'utf8',
      length: 0,
      writing: false,
      corked: 0,
      sync: true,
      bufferProcessing: false,
      onwrite: [Function: bound onwrite],
      writecb: null,
      writelen: 0,
      afterWriteTickInfo: null,
      buffered: [],
      bufferedIndex: 0,
      allBuffers: true,
      allNoop: true,
      pendingcb: 0,
      prefinished: false,
      errorEmitted: false,
      emitClose: true,
      autoDestroy: true,
      errored: false,
      closed: false
    },
    _events: [Object: null prototype] {
      response: [Function: handleResponse],
      error: [Function: handleRequestError]
    },
    _eventsCount: 2,
    _maxListeners: undefined,
    _options: {
      protocol: 'http:',
      maxRedirects: 21,
      maxBodyLength: Infinity,
      path: '/api/cases/batchValidate',
      method: 'POST',
      headers: [Object],
      agent: undefined,
      agents: [Object],
      auth: undefined,
      hostname: 'data-dev',
      port: null,
      nativeProtocols: [Object],
      pathname: '/api/cases/batchValidate'
    },
    _redirectCount: 0,
    _redirects: [],
    _requestBodyLength: 6478905,
    _requestBodyBuffers: [ [Object] ],
    _onNativeResponse: [Function (anonymous)],
    _currentRequest: ClientRequest {
      _events: [Object: null prototype],
      _eventsCount: 6,
      _maxListeners: undefined,
      outputData: [],
      outputSize: 0,
      writable: true,
      destroyed: false,
      _last: true,
      chunkedEncoding: false,
      shouldKeepAlive: false,
      useChunkedEncodingByDefault: true,
      sendDate: false,
      _removedConnection: false,
      _removedContLen: false,
      _removedTE: false,
      _contentLength: null,
      _hasBody: true,
      _trailer: '',
      finished: true,
      _headerSent: true,
      socket: [Socket],
      _header: 'POST /api/cases/batchValidate HTTP/1.1\r\n' +
        'Accept: application/json, text/plain, */*\r\n' +
        'Content-Type: application/json;charset=utf-8\r\n' +
        'User-Agent: axios/0.19.2\r\n' +
        'Content-Length: 6478905\r\n' +
        'Host: data-dev\r\n' +
        'Connection: close\r\n' +
        '\r\n',
      _keepAliveTimeout: 0,
      _onPendingData: [Function: noopPendingOutput],
      agent: [Agent],
      socketPath: undefined,
      method: 'POST',
      maxHeaderSize: undefined,
      insecureHTTPParser: undefined,
      path: '/api/cases/batchValidate',
      _ended: false,
      res: null,
      aborted: false,
      timeoutCb: null,
      upgradeOrConnect: false,
      parser: null,
      maxHeadersCount: null,
      reusedSocket: false,
      host: 'data-dev',
      protocol: 'http:',
      _redirectable: [Circular *1],
      [Symbol(kCapture)]: false,
      [Symbol(kNeedDrain)]: false,
      [Symbol(corked)]: 0,
      [Symbol(kOutHeaders)]: [Object: null prototype]
    },
    _currentUrl: 'http://data-dev/api/cases/batchValidate',
    [Symbol(kCapture)]: false
  },
  response: undefined,
  isAxiosError: true,
  toJSON: [Function (anonymous)]
}

Not clear why we'd be terminating the connection on the data service side; I'll take a closer look.

attwad commented 3 years ago

Perhaps it's Mongo-related, since it's happening in the data service:
https://mongoosejs.com/docs/connections.html#keepAlive

https://mongoosejs.com/docs/api.html#query_Query-maxTimeMS

It could also be nginx-related: https://stackoverflow.com/questions/54366082/nginx-ingress-controller-timing-out-request-after-60s
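
If it does turn out to be Mongo-side, a minimal sketch of the two knobs linked above, assuming a mongoose 5-era setup; the model import path, option values, and example query are illustrative, not what the services currently do:

```typescript
import mongoose from 'mongoose';
import { Case } from './model/case'; // hypothetical path to the Case model

// Keep the driver's TCP sockets alive so long-running requests don't have
// their Mongo connection reaped underneath them.
async function connectWithKeepAlive(uri: string): Promise<void> {
  await mongoose.connect(uri, {
    useNewUrlParser: true,
    useUnifiedTopology: true,
    keepAlive: true,
    keepAliveInitialDelay: 300000, // start keep-alive probes after 5 minutes
  });
}

// Cap how long a single query may run server-side, so a slow query fails
// fast instead of holding the HTTP request open past the 60s cutoff.
async function findCasesForSource(sourceId: string) {
  return Case.find({ 'caseReference.sourceId': sourceId })
    .maxTimeMS(30000)
    .exec();
}
```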

axmb commented 3 years ago

Interestingly, I'm now seeing our parser Lambdas execute for about 4-5 minutes -- which makes me think the above change may have fixed the first issue, but that another issue specific to our frontend (at 1.1 min rather than 1 min) has emerged. The Lambdas still fail after these 4-5 minutes, but with a 502 from nginx.

The relevant nginx log is:

2020/09/08 20:58:18 [error] 2595#2595: *52950638 upstream prematurely closed connection while reading response header from upstream, client: 73.215.126.245, server: dev-curator.ghdsi.org, request: "POST /api/cases/batchUpsert HTTP/1.1", upstream: "http://192.168.6.214:3001/api/cases/batchUpsert", host: "dev-curator.ghdsi.org"

I've previously set proxy_read_timeout to 3600, but there are several layers of configuration, and it's possible another setting is impacting this.

Good thought on Mongo -- I'll take a look at that next.

axmb commented 3 years ago

Ah, interesting -- the failed Lambda invocations (resulting in nginx 502s) are real 502s. It looks like the curator service is OOM'ing. I hadn't noticed this before, but it shows up when viewing the --previous logs:

➜  kubectl logs curator-dev-685c655455-grf9s --previous

[...]

<--- Last few GCs --->

[1:0x561b6718a400]  3014141 ms: Mark-sweep (reduce) 282.1 (294.3) -> 282.0 (293.3) MB, 1603.9 / 0.0 ms  (average mu = 0.471, current mu = 0.000) last resort GC in old space requested
[1:0x561b6718a400]  3015156 ms: Mark-sweep (reduce) 282.0 (292.3) -> 281.8 (293.3) MB, 1015.1 / 0.0 ms  (average mu = 0.338, current mu = 0.000) last resort GC in old space requested

<--- JS stacktrace --->

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory

Similarly, the bulk upload failures also appear to be OOMs, though those are in the data service rather than the curator service.

➜  kubectl logs data-dev-75ccb45dc9-zw9zc --previous

Data service listening on port 3000
  Press CTRL-C to stop

<--- Last few GCs --->

[1:0x55edbb1dff00]  1576083 ms: Mark-sweep 255.3 (258.0) -> 254.6 (258.0) MB, 1117.4 / 0.0 ms  (average mu = 0.099, current mu = 0.006) allocation failure scavenge might not succeed
[1:0x55edbb1dff00]  1577462 ms: Mark-sweep 255.5 (258.2) -> 254.9 (258.2) MB, 1001.1 / 0.0 ms  (average mu = 0.182, current mu = 0.274) allocation failure scavenge might not succeed

<--- JS stacktrace --->

==== JS stack trace =========================================

    0: ExitFrame [pc: 0x55edb950a939]
Security context: 0x280fb03808d1 <JSObject>
    1: markModified [0x2b3662557689] [/usr/src/app/node_modules/mongoose/lib/document.js:~1636] [pc=0x2201287d30b7](this=0x1c7b4aa72289 <model map = 0x1365ef4313f9>,0x1be7160aea49 <String[#8]: location>,0x2857fce004b1 <undefined>)
    2: $__set [0x2b3662557549] [/usr/src/app/node_modules/mongoose/lib/document.js:~1421] [pc=0x2201287d0839](this=0x1c7b4aa72289...

FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory

This is for processing all of Japan's data, for reference. It's 74k cases.

axmb commented 3 years ago

One more observation here:

attwad commented 3 years ago

Thanks for the investigation, Alex. I think holding a connection open for 5 min, even if we had enough memory to do so, isn't really a great approach. What we could do to improve the situation is to:

This way we get a bit more latency, but memory usage is easily predictable/configurable via the batch size.

WDYT?

attwad commented 3 years ago

Another thing we should try is reducing the number of cases sent per batch. I arbitrarily set it to 1K, but going down to a more reasonable size like just a few hundred wouldn't hurt latency too much and might make things go smoother.
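
For illustration, a minimal sketch of what a smaller, configurable batch size looks like on the sending side; the constant, helper name, and error handling are hypothetical, not the actual parser code:

```typescript
import axios from 'axios';

const BATCH_SIZE = 250; // illustrative; tune down from 1000 if requests still hang

// Send cases in fixed-size batches so each request carries a bounded payload,
// and the receiving service's memory use stays roughly proportional to
// BATCH_SIZE rather than to the size of the whole upload.
async function sendInBatches(cases: unknown[], url: string): Promise<void> {
  for (let i = 0; i < cases.length; i += BATCH_SIZE) {
    const batch = cases.slice(i, i + BATCH_SIZE);
    console.log(`Sending ${batch.length} cases, total so far: ${i}`);
    await axios.post(url, { cases: batch });
  }
}
```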

attwad commented 3 years ago

FYI, I'm also seeing these errors in dev, which are not socket hang ups but happen at the same time:

Error: connect ECONNREFUSED 10.100.204.152:80
    at TCPConnectWrap.afterConnect [as oncomplete] (net.js:1145:16) {
  errno: -111,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '10.100.204.152',
  port: 80,
  config: {
    url: 'http://data-dev/api/cases/batchValidate',
    method: 'post',
    data: '{"cases":[{"caseReference":{"sourceId":"5f57ff7dc48afa4542ff2004","sourceUrl":"https://raw.githubusercontent.com/openZH/covid_19/master/fallzahlen_kanton_alter_geschlecht_csv/COVID19_Fallzahlen_Kanton_ZH_alter_geschlecht.csv","uploadIds":["5f60bf17a93deeb2a4d19510"]},"location":{"country":"Switzerland","administrativeAreaLevel1":"Zurich","geoResolution":"Admin1","name":"Zurich canton","geometry":{"longitude":8.651071,"latitude":47.42568}},"events":[{"name":"confirmed","dateRange":{"start":"04/01/2020Z","end":"04/01/2020Z"}}],"demographics":{"gender":"Female","ageRange":{"start":42,"end":42}}},{"caseReference":{"sourceId":"5f57ff7dc48afa4542ff2004","sourceUrl":"https://raw.githubusercontent.com/openZH/covid_19/master/fallzahlen_kanton_alter_geschlecht_csv/COVID19_Fallzahlen_Kanton_ZH_al
...
},
    _currentUrl: 'http://data-dev/api/cases/batchValidate',
    [Symbol(kCapture)]: false
  },
  response: undefined,
  isAxiosError: true,
  toJSON: [Function (anonymous)]
}

attwad commented 3 years ago

250 items seems to have helped: Estonia was failing in dev, but now it succeeded...

attwad commented 3 years ago

For the OOM: the curator and data services both request 256Mi of RAM and are capped at 512Mi. We could increase the prod requests and limits a bit; we have two nodes with 2Gi of RAM each, so we have some capacity left.

axmb commented 3 years ago

So I've confirmed that the remaining issues are all OOMs. It's tough to see this in dev, but I was able to repro the most common case locally, which is the data service OOM'ing on batchValidate. This is roughly what I described above, though in that comment I couldn't definitively associate the data service OOMs with the socket hang ups.

data_1 FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
[...]
curator_1 NET 40: _final: not yet connected
curator_1 Error: socket hang up
[...]

The threshold for this type of issue seems to be lower than for the curator OOM on batchUpsert mentioned in the same comment -- which should be more or less fixed by Tim's changes to batch size and job resourcing. Tomorrow I'll do a quick scan to see if the data service handler could be more efficient, but I suspect the issue will be mostly resolved by these changes.

axmb commented 3 years ago

I've tested some changes locally that alleviate the memory pressure from batchValidate (and, with enough cases, shift the OOM to batchUpsert -- which I imagine we could also make less memory-sensitive). With the increase in prod memory, though, I think we'll be set here. There's also a large change to the controller structure in the works from Tim.
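
The exact changes aren't reproduced here, but as an illustration of the general idea -- validating in slices so the service never holds a full batch of hydrated mongoose documents at once -- a hypothetical sketch (chunk size, helper name, and model import are all placeholders):

```typescript
import { Case } from './model/case'; // hypothetical path to the Case model

const VALIDATION_CHUNK_SIZE = 100; // hypothetical; bounds peak document count

interface ValidationError {
  index: number;
  message: string;
}

// Validate cases a slice at a time instead of constructing a mongoose
// document for every case in the request body up front; only the errors are
// accumulated, so peak memory scales with the chunk size, not the batch size.
async function batchValidate(
  cases: Record<string, unknown>[],
): Promise<ValidationError[]> {
  const errors: ValidationError[] = [];
  for (let i = 0; i < cases.length; i += VALIDATION_CHUNK_SIZE) {
    const chunk = cases.slice(i, i + VALIDATION_CHUNK_SIZE);
    await Promise.all(
      chunk.map(async (c, offset) => {
        try {
          await new Case(c).validate();
        } catch (err) {
          errors.push({ index: i + offset, message: (err as Error).message });
        }
      }),
    );
  }
  return errors;
}
```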

I think it's safe to consider this fixed. To be doubly sure, I'll retest after Tim's change, but I'll close this now and reopen if new issues appear in prod.

attwad commented 3 years ago

From today's Peru parsing logs:

2020-09-23T01:26:03.127+02:00 | Raw file retrieved at /tmp/tmp4njrd0ny
2020-09-23T01:26:03.347+02:00 | Sending 250 cases, total so far: 0
2020-09-23T01:27:03.192+02:00 | Error sending cases to server, status=500, response=socket hang up

That's exactly 60 seconds and a socket hang up, which is basically what this bug is about.

attwad commented 3 years ago

I've bumped the RAM and CPU of the prod data service; I'm trying an import of Peru now and tailing the prod logs to see how it goes...

attwad commented 3 years ago

So the function ran for 900s and got killed, but it only imported 1500 cases in the first 30 seconds and then got stuck; no more logs appear until the task gets killed:

  | 2020-09-23T15:42:28.953+02:00 | Raw file retrieved at /tmp/tmpvj6ro2u3
  | 2020-09-23T15:42:29.212+02:00 | Sending 250 cases, total so far: 0
  | 2020-09-23T15:42:46.014+02:00 | Current speed: 14 cases/sec
  | 2020-09-23T15:42:46.014+02:00 | Max estimate: 12472 in the remaining 873 sec
  | 2020-09-23T15:42:46.253+02:00 | Sending 250 cases, total so far: 250
  | 2020-09-23T15:42:52.013+02:00 | Current speed: 21 cases/sec
  | 2020-09-23T15:42:52.013+02:00 | Max estimate: 18707 in the remaining 867 sec
  | 2020-09-23T15:42:52.234+02:00 | Sending 250 cases, total so far: 500
  | 2020-09-23T15:42:57.366+02:00 | Current speed: 26 cases/sec
  | 2020-09-23T15:42:57.366+02:00 | Max estimate: 23136 in the remaining 861 sec
  | 2020-09-23T15:42:57.573+02:00 | Sending 250 cases, total so far: 750
  | 2020-09-23T15:43:00.619+02:00 | Current speed: 31 cases/sec
  | 2020-09-23T15:43:00.619+02:00 | Max estimate: 27598 in the remaining 858 sec
  | 2020-09-23T15:43:00.872+02:00 | Sending 250 cases, total so far: 1000
  | 2020-09-23T15:43:06.450+02:00 | Current speed: 33 cases/sec
  | 2020-09-23T15:43:06.450+02:00 | Max estimate: 29366 in the remaining 852 sec
  | 2020-09-23T15:43:06.672+02:00 | Sending 250 cases, total so far: 1250
  | 2020-09-23T15:57:19.418+02:00 | END RequestId: e0d90e9f-25e7-4b37-831c-0abddb072b27

The prod server hasn't been killed due to OOM or anything:

kubectl get pods
NAME                            READY   STATUS    RESTARTS   AGE
curator-prod-55b54dc866-jc7p5   1/1     Running   0          6h39m
data-prod-9fd49f9d7-2l4kv       1/1     Running   0          6h34m
<dev omitted>

Prod data logs show two interesting errors though:

(node:1) UnhandledPromiseRejectionWarning: BulkWriteError: E11000 duplicate key error collection: covid19.caserevisions index: case._id_1_case.revisionMetadata.revisionNumber_-1 dup key: { case._id: ObjectId('5f62a0b90142a772cbaf9beb'), case.revisionMetadata.revisionNumber: 0 }
    at UnorderedBulkOperation.handleWriteError (/usr/src/app/node_modules/mongodb/lib/bulk/common.js:1257:9)
    at UnorderedBulkOperation.handleWriteError (/usr/src/app/node_modules/mongodb/lib/bulk/unordered.js:117:18)
    at resultHandler (/usr/src/app/node_modules/mongodb/lib/bulk/common.js:521:23)
    at handler (/usr/src/app/node_modules/mongodb/lib/core/sdam/topology.js:942:24)
    at /usr/src/app/node_modules/mongodb/lib/cmap/connection_pool.js:350:13
    at handleOperationResult (/usr/src/app/node_modules/mongodb/lib/core/sdam/server.js:558:5)
    at MessageStream.messageHandler (/usr/src/app/node_modules/mongodb/lib/cmap/connection.js:277:5)
    at MessageStream.emit (events.js:315:20)
    at processIncomingData (/usr/src/app/node_modules/mongodb/lib/cmap/message_stream.js:144:12)
    at MessageStream._write (/usr/src/app/node_modules/mongodb/lib/cmap/message_stream.js:42:5)
    at doWrite (_stream_writable.js:403:12)
    at writeOrBuffer (_stream_writable.js:387:5)
    at MessageStream.Writable.write (_stream_writable.js:318:11)
    at TLSSocket.ondata (_stream_readable.js:716:22)
    at TLSSocket.emit (events.js:315:20)
    at addChunk (_stream_readable.js:295:12)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:1) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.
(node:1) UnhandledPromiseRejectionWarning: BulkWriteError: E11000 duplicate key error collection: covid19.caserevisions index: case._id_1_case.revisionMetadata.revisionNumber_-1 dup key: { case._id: ObjectId('5f62a0fe0142a772cbafb86e'), case.revisionMetadata.revisionNumber: 0 }
    at UnorderedBulkOperation.handleWriteError (/usr/src/app/node_modules/mongodb/lib/bulk/common.js:1257:9)
    at UnorderedBulkOperation.handleWriteError (/usr/src/app/node_modules/mongodb/lib/bulk/unordered.js:117:18)
    at resultHandler (/usr/src/app/node_modules/mongodb/lib/bulk/common.js:521:23)
    at handler (/usr/src/app/node_modules/mongodb/lib/core/sdam/topology.js:942:24)
    at /usr/src/app/node_modules/mongodb/lib/cmap/connection_pool.js:350:13
    at handleOperationResult (/usr/src/app/node_modules/mongodb/lib/core/sdam/server.js:558:5)
    at MessageStream.messageHandler (/usr/src/app/node_modules/mongodb/lib/cmap/connection.js:277:5)
    at MessageStream.emit (events.js:315:20)
    at processIncomingData (/usr/src/app/node_modules/mongodb/lib/cmap/message_stream.js:144:12)
    at MessageStream._write (/usr/src/app/node_modules/mongodb/lib/cmap/message_stream.js:42:5)
    at doWrite (_stream_writable.js:403:12)
    at writeOrBuffer (_stream_writable.js:387:5)
    at MessageStream.Writable.write (_stream_writable.js:318:11)
    at TLSSocket.ondata (_stream_readable.js:716:22)
    at TLSSocket.emit (events.js:315:20)
    at addChunk (_stream_readable.js:295:12)
(node:1) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 2)

Cases in question are https://curator.ghdsi.org/cases/view/5f62a0b90142a772cbaf9beb and https://curator.ghdsi.org/cases/view/5f62a0fe0142a772cbafb86e

Nothing seems off about them.

Nothing in the curator service logs.
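
As an aside, the UnhandledPromiseRejectionWarning above suggests the revision bulk write isn't awaited/caught where it's kicked off. A minimal sketch of tolerating duplicate-key errors on an unordered insert, assuming the revisions are written with something like insertMany (the model import and helper are hypothetical; the actual data-service code may differ):

```typescript
import { CaseRevision } from './model/case-revision'; // hypothetical model import

// Write revisions with an unordered insert and swallow duplicate-key errors
// (E11000, code 11000) so a re-submitted batch doesn't surface as an
// unhandled promise rejection; anything else is rethrown.
async function saveRevisions(revisions: Record<string, unknown>[]): Promise<void> {
  try {
    await CaseRevision.insertMany(revisions, { ordered: false });
  } catch (err) {
    const code = (err as { code?: number }).code;
    if (code !== 11000) {
      throw err;
    }
    console.warn('Ignoring duplicate case revisions in bulk insert');
  }
}
```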

attwad commented 3 years ago

Since yesterday's memory and CPU bump, the data service in prod hasn't restarted once; calling this done.