External Host Error causing renovate to fail for monorepo

dbasilio commented 2 years ago

How are you running Renovate?

Mend Renovate hosted app on github.com

If you're self-hosting Renovate, tell us what version of Renovate you run.

No response

Please select which platform you are using if self-hosting.

github.com

If you're self-hosting Renovate, tell us what version of the platform you run.

No response

Was this something which used to work for you, and then stopped?

It used to work, and then stopped

Describe the bug

We maintain a monorepo with 138 packages in them. We used to get renovate PRs for the repo, but it stopped working a while ago (not exactly sure when). I finally got around to checking the logs and it's failing with the error below. The package itself is not that important as it fails on a different package pretty frequently (usually it fails on the first one listed in devDependencies on the global package.json).

The error is not consistent as we've had three successful runs in the last 3 days, but even on successful runs we haven't been getting new PRs. I'm going to guess we're hitting some sort of limit on requests to npm and being throttled, but if that's the case, are there config options we can make to increase the timeout or paginate more or something?

Relevant debug logs

Logs

``` WARN: Host error { "hostType": "npm", "packageName": "react-phone-number-input", "err": { "name": "TimeoutError", "code": "ETIMEDOUT", "timings": { "start": 1661790262126, "socket": 1661790262127, "lookup": 1661790262130, "connect": 1661790267339, "secureConnect": 1661790267369, "upload": 1661790267369, "response": 1661790267420, "error": 1661790323264, "abort": 1661790341747, "phases": { "wait": 1, "dns": 3, "tcp": 5209, "tls": 30, "request": 0, "firstByte": 51, "total": 79621 } }, "event": "request", "message": "Timeout awaiting 'request' for 60000ms", "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n at ClientRequest. (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n at Object.onceWrapper (events.js:520:26)\n at ClientRequest.emit (events.js:412:35)\n at ClientRequest.emit (domain.js:475:12)\n at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n at TLSSocket.socketErrorListener (_http_client.js:475:9)\n at TLSSocket.emit (events.js:400:28)\n at TLSSocket.emit (domain.js:475:12)\n at emitErrorNT (internal/streams/destroy.js:106:8)\n at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n at processTicksAndRejections (internal/process/task_queues.js:82:21)\n at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n at listOnTimeout (internal/timers.js:559:11)\n at processTimers (internal/timers.js:500:7)", "options": { "headers": { "user-agent": "Renovate Bot (GitHub App 2740)", "accept": "application/json", "accept-encoding": "gzip, deflate, br" }, "url": "https://registry.npmjs.org/react-phone-number-input", "hostType": "npm", "username": "", "password": "", "method": "GET", "http2": false }, "response": { "statusCode": 200, "statusMessage": "OK", "headers": { "date": "Mon, 29 Aug 2022 16:24:27 GMT", "content-type": "application/json", "transfer-encoding": "chunked", "connection": "close", "cf-ray": "7426a313295ec378-SEA", "access-control-allow-origin": "*", "age": "173", "cache-control": "public, max-age=300", "etag": "W/\"baa4594aa8274c326bcf05b7e8f487db\"", "last-modified": "Sat, 27 Aug 2022 08:55:56 GMT", "vary": "accept-encoding, accept", "cf-cache-status": "HIT", "expect-ct": "max-age=604800, report-uri=\"https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct\"", "x-amz-replication-status": "PENDING", "server": "cloudflare", "content-encoding": "gzip" }, "httpVersion": "1.1" } } } ```

Have you created a minimal reproduction repository?

No reproduction repository

rarkins commented 2 years ago

We should analyse these timestamps above to work out if this timeout should be reduced. I'm not sure if that 60s is a timeout for the full file transfer or a timeout failing to connect, for no data received, etc.

Then we should check if we can retry such failures at least once.

viceice commented 2 years ago

We've a similar issue on our helm charts repo

WARN: Host error
{
  "hostType": "github-releases",
  "packageName": "norwoodj/helm-docs",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1661933000729,
      "socket": 1661933000945,
      "lookup": 1661933108895,
      "error": 1661933108935,
      "phases": {
        "wait": 216,
        "dns": 107950,
        "total": 108206
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (events.js:520:26)\n    at ClientRequest.emit (events.js:412:35)\n    at ClientRequest.emit (domain.js:475:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (_http_client.js:475:9)\n    at TLSSocket.emit (events.js:400:28)\n    at TLSSocket.emit (domain.js:475:12)\n    at emitErrorNT (internal/streams/destroy.js:106:8)\n    at emitErrorCloseNT (internal/streams/destroy.js:74:3)\n    at processTicksAndRejections (internal/process/task_queues.js:82:21)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (internal/timers.js:559:11)\n    at processTimers (internal/timers.js:500:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/vnd.github.machine-man-preview+json",
        "authorization": "***********",
        "content-type": "application/json",
        "content-length": "621",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://api.github.com/graphql",
      "hostType": "github-releases",
      "username": "",
      "password": "",
      "method": "POST",
      "http2": false
    }
  }
}

viceice commented 2 years ago

created a fork at https://github.com/renovate-reproductions/17483-helm-charts-timeout

viceice commented 2 years ago

@dbasilio can you try to set a higher timeout^1 or limit concurrency^2 for npm registry?

dbasilio commented 2 years ago

@viceice We've tried a timeout of 120s, and a concurrency limit of 2 individually with no change. Going to try both together

dbasilio commented 2 years ago

Both settings together don't seem to help, we're still getting the external-host-error more often than note. Anecdotally, it does seem to fail on "later" packages alphabetically

{
    "hostType": "npm",
    "matchHost": "https://registry.npmjs.org/",
    "timeout": 120000,
    "concurrentRequestLimit": 2
}

rarkins commented 2 years ago

I'm going to use this issue as a general performance tuning issue. It seems some of the issues are environmental/app-related, because we can't replicate them when running locally yet they occur consistently in the hosted app. This may be because of:

Docker itself
The Ubuntu base image
AWS infra e.g. DNS
Memory or CPU limitations

rarkins commented 2 years ago

Some useful links:

rarkins commented 2 years ago

This tried so far:

[x] Custom DNS cache
[x] Custom DNS settings in env: RES_OPTIONS: 'ndots:3 retrans:1000 retry:3 rotate'
[x] Enable http2 for api.github.com
[x] Enable http2 for public npm registry

dbasilio commented 2 years ago

Since it's been running for a few days now with

{
    "hostType": "npm",
    "matchHost": "https://registry.npmjs.org/",
    "timeout": 120000,
    "concurrentRequestLimit": 2
}

I've noticed that it started hitting a timeout more than the external host error. We are still getting the occasional success but most often it fails.

rarkins commented 2 years ago

Looking at the last "done" run, here are the starts:

  "hostStats": {
    "api.github.com": {
      "requestCount": 36,
      "requestAvgMs": 460,
      "queueAvgMs": 0
    },
    "registry.npmjs.org": {
      "requestCount": 103,
      "requestAvgMs": 19024,
      "queueAvgMs": 0
    },
    "xxx.jfrog.io": {
      "requestCount": 4,
      "requestAvgMs": 23116,
      "queueAvgMs": 0
    }
  },

Even though concurrentRequestLimit was configured, we see no queueAvgMs for registry.npmjs.org plus very high requestAvgMs for both npm and jfrog.

rarkins commented 2 years ago

I found the queue problem, see #17713. Please remove hostType from your host rule

dbasilio commented 2 years ago

Updated my hostrule to:

{
    "matchHost": "registry.npmjs.org/",
    "concurrentRequestLimit": 2,
    "timeout": 120000
,

(and then the same for our jfrog registry as well in a separate host rule.

We're still getting mostly external host errors

rarkins commented 2 years ago

Try removing all limits again

dbasilio commented 2 years ago

No change

rarkins commented 2 years ago

What's the details for the errors you get now? Eg ETIMEDOUT, ECONNRESET, etc. pasting details here would be helpful

dbasilio commented 2 years ago

External Host Error

{
  "hostType": "npm",
  "packageName": "@datadog/browser-logs",
  "err": {
    "name": "TimeoutError",
    "code": "ETIMEDOUT",
    "timings": {
      "start": 1663081515075,
      "socket": 1663081515076,
      "lookup": 1663081519436,
      "connect": 1663081519436,
      "secureConnect": 1663081522929,
      "upload": 1663081522931,
      "response": 1663081523051,
      "error": 1663081595475,
      "abort": 1663081635691,
      "phases": {
        "wait": 1,
        "dns": 4360,
        "tcp": 0,
        "tls": 3493,
        "request": 2,
        "firstByte": 120,
        "total": 120616
      }
    },
    "event": "request",
    "message": "Timeout awaiting 'request' for 60000ms",
    "stack": "RequestError: Timeout awaiting 'request' for 60000ms\n    at ClientRequest.<anonymous> (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/index.js:970:65)\n    at Object.onceWrapper (node:events:628:26)\n    at ClientRequest.emit (node:events:525:35)\n    at ClientRequest.emit (node:domain:489:12)\n    at ClientRequest.origin.emit (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)\n    at TLSSocket.socketErrorListener (node:_http_client:481:9)\n    at TLSSocket.emit (node:events:513:28)\n    at TLSSocket.emit (node:domain:489:12)\n    at emitErrorNT (node:internal/streams/destroy:157:8)\n    at emitErrorCloseNT (node:internal/streams/destroy:122:3)\n    at processTicksAndRejections (node:internal/process/task_queues:83:21)\n    at runNextTicks (node:internal/process/task_queues:65:3)\n    at listOnTimeout (node:internal/timers:528:9)\n    at Timeout.timeoutHandler [as _onTimeout] (/home/ubuntu/renovateapp/node_modules/got/dist/source/core/utils/timed-out.js:36:25)\n    at listOnTimeout (node:internal/timers:561:11)\n    at processTimers (node:internal/timers:502:7)",
    "options": {
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate, br"
      },
      "url": "https://registry.npmjs.org/@datadog%2Fbrowser-logs",
      "hostType": "npm",
      "username": "",
      "password": "",
      "method": "GET",
      "http2": false
    },
    "response": {
      "statusCode": 200,
      "statusMessage": "OK",
      "headers": {
        "date": "Tue, 13 Sep 2022 15:05:22 GMT",
        "content-type": "application/json",
        "transfer-encoding": "chunked",
        "connection": "close",
        "cf-ray": "74a1c7de2db7680a-SEA",
        "access-control-allow-origin": "*",
        "age": "77",
        "cache-control": "public, max-age=300",
        "etag": "W/\"3786e68d0270801401b11f9b4ac6d624\"",
        "last-modified": "Tue, 13 Sep 2022 11:35:42 GMT",
        "vary": "accept-encoding, accept",
        "cf-cache-status": "HIT",
        "x-amz-replication-status": "COMPLETED",
        "server": "cloudflare",
        "content-encoding": "gzip"
      },
      "httpVersion": "1.1"
    }
  }
}

And this is all the INFO level logs for the timeout I posted above

INFO: Repository started
{
  "renovateVersion": "32.194.5"
}
INFO: Dependency extraction complete
{
  "baseBranch": "master",
  "stats": {
    "managers": {
      "dockerfile": {
        "fileCount": 1,
        "depCount": 1
      },
      "npm": {
        "fileCount": 140,
        "depCount": 4456
      },
      "nvm": {
        "fileCount": 1,
        "depCount": 1
      }
    },
    "total": {
      "fileCount": 142,
      "depCount": 4458
    }
  }
}
WARN: Job timeout

rarkins commented 2 years ago

Have you configured a 60s timeout?

dbasilio commented 2 years ago

We do not have a timeout configured anywhere in our config

rarkins commented 2 years ago

We've had a 4 minute timeout configured for a number of days. Strange that your log shows 60s - when was it captured?

Note: the file in question shouldn't take so long either way

dbasilio commented 2 years ago

These are the logs for the last day, with no change to our config:

The errors were respectively:

Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
Fetching @axe-core/react - Timeout awaiting 'request' for 60000ms
Fetching @sentry/types - Client network socket disconnected before secure TLS connection was established
Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
Fetching @datadog/browser-logs - Timeout awaiting 'request' for 60000ms
Fetching @sentry/types - Timeout awaiting 'request' for 60000ms

All runs used renovate version 32.194.5. I also double confirmed that we are not specifying a timeout anywhere in our shared configs.

rarkins commented 2 years ago

Strange, it seems to be inheriting the default 60s hostRules.timeout instead of the 240s rule we've set in the app. Any change if you then try to apply a longer timeout in your own config? e.g.

"hostRules": [{
  "timeout": 240000
}]

dbasilio commented 2 years ago

We've had it running all weekend with a 4m timeout and this is what it looks like

The failures were (in order):

@types/svg-sprite-loader - Client network socket disconnected before secure TLS connection was established
@yarnpkg/shell - socket hang up
@types/event-emitter - Client network socket disconnected before secure TLS connection was established
@datadog/browser-logs - socket hang up
@juggle/resize-observer - socket hang up
monodeploy - Client network socket disconnected before secure TLS connection was established
@types/react-tooltip - Client network socket disconnected before secure TLS connection was established
@svgr/plugin-svgo - socket hang up
the last two errors were before the 4m change

dbasilio commented 2 years ago

We've been running with the 4m change for a week now, we're still seeing intermittent failures but it's now completing often enough that we're getting PRs through so most of my concern has been alleviated.

Is there a problem with increasing the timeout further? Like 5 or 6 minutes? Is there a max on the timeout? I don't want to further increase the timeout if it's going to negatively impact the running of the app.

rarkins commented 2 years ago

No problem with you trying a higher number. Something is clearly going wrong somewhere in the chain and causing rate limiting, but it's good if it eventually completes.

rarkins commented 2 years ago

@dbasilio I'm seeing only occasional failures for your repo now, can you confirm the same from your side?

Only ~1 failure in the last day, for the record it was still ECONNRESET, and I'm surprised to see that Renovate did retry 2 times:

{"level":20,"msg":"GET https://registry.npmjs.org/@yarnpkg%2Fcore = (code=ECONNRESET, statusCode=-1 retryCount=2, duration=17249)","time":"2022-10-22T14:06:04.321Z"}
{"level":20,"msg":"GET https://registry.npmjs.org/@yarnpkg%2Ffslib = (code=ECONNRESET, statusCode=-1 retryCount=2, duration=17446)","time":"2022-10-22T14:06:04.515Z"}

dbasilio commented 2 years ago

We're still seeing a decent amount of failures, but it's completing frequently enough that we're at least getting PRs put up

dbasilio commented 1 year ago

We're seeing success often enough now that I think we can close this issue out Thanks for all the help!

renovatebot / renovate