sindresorhus / got

🌐 Human-friendly and powerful HTTP request library for Node.js

Help debugging ECONNRESET problems #1062

Closed rarkins closed 4 years ago

rarkins commented 4 years ago

What would you like to discuss?

My tool uses got@9.6.0 and regularly gets these types of errors from npmjs:

{
    "name": "RequestError",
    "code": "ECONNRESET",
    "host": "registry.npmjs.org",
    "hostname": "registry.npmjs.org",
    "method": "GET",
    "path": "/@typescript-eslint%2Feslint-plugin",
    "protocol": "https:",
    "url": "https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin",
    "gotOptions": {
      "path": "/@typescript-eslint%2Feslint-plugin",
      "protocol": "https:",
      "slashes": true,
      "auth": null,
      "host": "registry.npmjs.org",
      "port": null,
      "hostname": "registry.npmjs.org",
      "hash": null,
      "search": null,
      "query": null,
      "pathname": "/@typescript-eslint%2Feslint-plugin",
      "href": "https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin",
      "headers": {
        "user-agent": "Renovate Bot (GitHub App 2740)",
        "authorization": "** redacted **",
        "cache-control": "no-cache",
        "accept": "application/json",
        "accept-encoding": "gzip, deflate"
      },
      "hooks": {
        "beforeError": [],
        "init": [],
        "beforeRequest": [],
        "beforeRedirect": [],
        "beforeRetry": [],
        "afterResponse": []
      },
      "retry": {
        "methods": {},
        "statusCodes": {},
        "errorCodes": {}
      },
      "decompress": true,
      "throwHttpErrors": true,
      "followRedirect": true,
      "stream": false,
      "form": false,
      "json": true,
      "cache": false,
      "useElectronNet": false,
      "method": "GET"
    },
    "message": "read ECONNRESET",
    "stack": "RequestError: read ECONNRESET\n    at ClientRequest.request.once.error (/home/ubuntu/renovateapp/node_modules/got/source/request-as-event-emitter.js:178:14)\n    at Object.onceWrapper (events.js:286:20)\n    at ClientRequest.emit (events.js:203:15)\n    at ClientRequest.EventEmitter.emit (domain.js:448:20)\n    at ClientRequest.origin.emit.args (/home/ubuntu/renovateapp/node_modules/@szmarczak/http-timer/source/index.js:37:11)\n    at TLSSocket.socketErrorListener (_http_client.js:392:9)\n    at TLSSocket.emit (events.js:198:13)\n    at TLSSocket.EventEmitter.emit (domain.js:448:20)\n    at emitErrorNT (internal/streams/destroy.js:91:8)\n    at emitErrorAndCloseNT (internal/streams/destroy.js:59:3)"
  },

I'm wondering, does the retry part above indicate that there's actually no retrying? Or could it indicate an incomplete stringifying of the object, etc?

Additionally, is there any additional debugging of the err object I can do once such errors are thrown?


szmarczak commented 4 years ago

I'm wondering, does the retry part above indicate that there's actually no retrying? Or could it indicate an incomplete stringifying of the object, etc?

Could be both. Try

console.log('methods', [...error.gotOptions.retry.methods]);
console.log('statusCodes', [...error.gotOptions.retry.statusCodes]);
console.log('errorCodes', [...error.gotOptions.retry.errorCodes]);

Also the stack trace seems to be incomplete. Can you try executing Error.stackTraceLimit = Infinity; (or just use longjohn)?

Additionally, is there any additional debugging of the err object I can do once such errors are thrown?

Unfortunately no.

I've got some questions anyway:

  1. Can you reproduce the problem locally?
  2. Are you sure there are no small network outages?
  3. Are you using an HTTP proxy?
  4. Does it happen only for specific requests? If so, which one? Please provide the most accurate information you can (request path, etc.)
  5. What Node.js version are you running? Have you tried the latest one?

Please try the following:

  1. Set your DNS server to 1.1.1.1 or 8.8.8.8 or
  2. Disable caching if present or
  3. Pass agent: false option.

If none of the above helps, you may want to try Got 10. There are some breaking changes compared to Got 9, but it's got some nice features: fixed retry functionality (twice), a DNS cache, Brotli support, Node.js 13 support, better async stack traces, an experimental pagination API and other bug fixes. I'm willing to send a PR with the upgrade if necessary :)

szmarczak commented 4 years ago

Got 10 has some bugs, but I don't think they would affect Renovate. Anyway, half of the issues are going to be fixed in an upcoming rewrite of Got, because the current design has some limitations and the performance can be improved.

szmarczak commented 4 years ago

Also Got 10 provides native TypeScript typings :)

rarkins commented 4 years ago

Hi @szmarczak, I appreciate the very quick response!

FYI this is only happening in production, so I am a little constrained as to how "debuggy" I can get or how much risk I can take, because there are well over 100k installed repos and any mistake can have large consequences (ask me how..)

I am adding specific logging for errorCodes and statusCodes like you suggest, and will wait for some results.

Also the stack trace seems to be incomplete. Can you try executing Error.stackTraceLimit = Infinity; (or just use longjohn)?

I need to check that this is OK to do in production. I notice longjohn says not to, but do you think they're being overly cautious? Do you expect got to throw a more detailed error?

  • Can you reproduce the problem locally?

No. I've run Renovate locally with some of the commonly failing ones, but they download fine.

  • Are you sure there are no small network outages?

This would seem unlikely. The problem occurs hour after hour, and has "clusters" of particular package failures that indicate it's not random.

  • Are you using an HTTP proxy?

No

  • Does it happen only for specific requests? If so, which one? Please provide the most accurate information you can (request path, etc.)

There seems to be a correlation in that the ones that fail are mostly large, e.g. https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin is 14MB. However, there are other, even larger ones that have had zero failures.

  • What Node.js version are you running? Have you tried the latest one?

Running Node 12. Node 13 is not really an option considering it's unstable. I am a few releases behind on Node 12 though, so I will upgrade that in production today.

  • Set your DNS server to 1.1.1.1 or 8.8.8.8 or

I'd prefer not to, unless we have a reason to think it's the problem.

  • Disable caching if present or

No got caching present.

  • Pass agent: false option.

Can you give an example of this in a got@9 command? e.g. is it got(url, { agent: false })?

rarkins commented 4 years ago

Unfortunately my logger doesn't want to share errorCodes and statusCodes:

    "errorCodes": "[Circular]",
    "statusCodes": "[Circular]",

Any trick for this? Here is the code in case that helps: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L226-L235

We use the bunyan logger but have a custom serializer for err, largely for customizing got output: https://github.com/renovatebot/renovate/blob/master/lib/logger/err-serializer.ts

rarkins commented 4 years ago

Something else is surprising to me now. I have increased retries to 5: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L111
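
For reference, the option is roughly equivalent to this sketch (the real configuration lives on a custom got instance in the linked code):

got(url, { retry: 5 });
// got@9 also accepts an object form, e.g. { retry: { retries: 5 } }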

How likely is it that we still end up with the same number of ECONNRESET errors? i.e. that 5 times in total it connects, starts the transfer, and then gets terminated by the remote end? This is why I am wondering if it's really retrying or not.

szmarczak commented 4 years ago

I need to check that this is OK to do in production. I notice longjohn say not to, but do you think they're being overly cautious? Do you expect got to throw a more detailed error?

Yup. If you believe that unlimited stack trace is bad, you can do Error.stackTraceLimit = 20; which should be enough. (All it does is increase the maximum number of frames in the stack trace.)

I've run Renovate locally with some of the commonly failing ones, but they download fine.

This could be related to the network speed and/or the highWaterMark option. Try setting the highWaterMark option (it may be missing in the types) to 104857600 (100MB) and see if that resolves the problem.

There seems to be a correlation that the ones that fail are mostly large.

Do they look like @scope/package? Or are there some global packages too?

Running Node 12. Node 13 is not really an option consider it's unstable. I am a few releases behind on Node 12 though so will upgrade that in production today.

Well, I know from experience that the http modules are the most stable in the newest Node.js versions...

I'd prefer not to, unless we have a reason to think it's the problem.

My point is that your current DNS provider could be pointing to overloaded servers or have outdated entries. Try providing a host option (npm's IP).

is it got(url, { agent: false })?

Yep, it is.

Any trick for this?

Normalized errorCodes is a Set in Got 9. Convert it to an array:

      logger.warn(
        {
          err,
          errorCodes: [...err.gotOptions?.retry?.errorCodes],
          statusCodes: [...err.gotOptions?.retry?.statusCodes],
          regUrl,
          depName: name,
        },
        'npm registry failure'
      );

How likely is it that we still end up with the same number of ECONNRESET errors?

Assuming the NPM uptime is ~100%, the chance is almost 0%.
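
(If a single attempt fails with probability p and the retries are independent, all five attempts failing has probability p⁵; even p = 0.1 gives 0.001%.)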

This is why I am wondering if it's really retrying or not.

Add a beforeRetry hook and log something in there.
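
For example, something along these lines (a minimal sketch against got@9, where a beforeRetry hook receives the normalized options, the error and the retry count):

got(url, {
  json: true,
  hooks: {
    beforeRetry: [
      (options, error, retryCount) => {
        // Purely for visibility: confirm that got actually schedules a retry
        console.log(`retry #${retryCount}`, error && error.code, options.href);
      }
    ]
  }
});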

I'll send you a PR to upgrade to Got 10 today. That should fix the problem with retries.

rarkins commented 4 years ago

Yup. If you believe that unlimited stack trace is bad, you can do Error.stackTraceLimit = 20;

Thanks, I'll do that.

This could be related to the network speed and/or the highWaterMark option.

I'm not using got.stream - do you know if it's still applicable?

Do they look like @scope/package? Or there are some global packages too?

It's mostly scoped packages and mostly large ones, but there are still a few counterexamples too. I think scoped packages have a different "backend" within npmjs, so scoped vs non-scoped can definitely explain some differences at times. I saw the next package failing frequently too, for example (another large one).

Well, I know from the experience that http modules are the most stable in the newest Node.js versions...

FYI updating to Node v12.15.0 did not appear to have any impact on the stats.

I worked something else out that may be making the problem worse. It appears that the custom got cache I added a long time ago blocked my own attempts at ECONNRESET retries. It's possible that I inserted it at a point in got's flow where it is also blocking got's own "built-in" retries. FYI the point of that caching is to catch when multiple queries to the same URL are all "in flight" at once - the goal/effect was to limit them to one by caching the promise.

rarkins commented 4 years ago

After adding in "manual" retries (catching got errors and recursing) I was finally able to lower the error rate. Anecdotally it seems like most are recovering first-time, as I think we had both thought should happen.
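
Roughly what I mean by catching and recursing (a simplified sketch, not the actual Renovate code):

async function getWithRetry(url, options, attemptsLeft = 5) {
  try {
    return await got(url, options);
  } catch (err) {
    if (attemptsLeft > 1 && err.code === 'ECONNRESET') {
      // Recurse with one fewer attempt remaining
      return getWithRetry(url, options, attemptsLeft - 1);
    }
    throw err;
  }
}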

My best guess is that Renovate's custom got caching behaviour prevents got's own retry mechanism from succeeding. I wonder if it's possible to add a catch here and delete the cache in case of failure?

szmarczak commented 4 years ago

do you know if it's still applicable?

Yes, because http is based on streams.

After adding in "manual" retries (catching got errors and recursing) I was finally able to lower the error rate. Anecdotally it seems like most are recovering first-time, as I think we had both thought should happen.

So it seems that retries in Got 9 are buggy. I'd prefer to switch to Got 10.

My best guess is that Renovate's custom got caching behaviour prevents got's own retry mechanism from succeeding.

Dunno. Possibly.

I wonder if it's possible to add a catch here and delete the cache in case of failure?

Should be possible. Just remember to rethrow the error.
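
Roughly like this (a sketch with a hypothetical in-flight promise cache, not Renovate's actual code):

const inFlight = new Map();

function cachedGot(url, options) {
  if (!inFlight.has(url)) {
    const promise = got(url, options).catch((err) => {
      // Drop the failed entry so the next caller starts a fresh request
      inFlight.delete(url);
      // Rethrow so the caller still sees the original error
      throw err;
    });
    inFlight.set(url, promise);
  }
  return inFlight.get(url);
}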

I'll send you a PR with Got 10 soon.

szmarczak commented 4 years ago

I did some basic work here: https://github.com/renovatebot/renovate/pull/5469 but you still need to fix the types and run the tests. If something fails please let me know straight away :)

rarkins commented 4 years ago

I discovered that unfortunately our ECONNRESET errors, once retried, then often become parseError errors on the next attempt. I would like to try increasing the highWaterMark as you suggested but am not sure how to do that as I can't find any code samples. Could you advise?

szmarczak commented 4 years ago

I discovered that unfortunately our ECONNRESET errors, once retried, then often become parseError errors on the next attempt.

What does the ParseError look like?

I would like to try increasing the highWaterMark as you suggested but am not sure how to do that as I can't find any code samples. Could you advise?

Note that it may be missing in the TS types:

// 10MB
got(url, {...options, readableHighWaterMark: 1024 * 1024 * 10})

It just stores (max) 10MB of the upcoming response data instead of the default 16KB.

rarkins commented 4 years ago

What does the ParseError look like?

Unfortunately, it looks like a multi-megabyte dump in my log files whenever it happens! I checked a few and they appear to just be truncated JSON responses from npmjs, as if receiving data finished early.

It just stores (max) 10MB of the upcoming response data instead of the default 16KB.

Thanks. I did not find many references to this in my searches. Do you know of any references to people making this kind of increase in a client without it breaking something else?

szmarczak commented 4 years ago

Do you know if there are any references for people implementing this type of increase in a client and it not causing something else to break?

I did the research on my own.

  1. agent.addRequest(request, options)
  2. agent.createSocket(request, options, handler)
  3. agent.createConnection(options, handler)
  4. net.createConnection(options, handler)
  5. All sockets are Duplex streams
  6. Duplex streams accept readableHighWaterMark option

As you can see, the options are passed all along, so got(url, {...options, readableHighWaterMark: 1024 * 1024 * 10}) should work.

You don't need to set it manually on the IncomingMessage (response) instance, because Node.js already does that for you.

Unfortunately, it looks like a multi-megabyte dump in my log files whenever it happens! I checked a few and they appear to just be truncated JSON responses from npmjs, as if receiving data finished early.

That's what I expected. Please let me know if the readableHighWaterMark option does the trick.

rarkins commented 4 years ago

Assuming I inserted the code correctly, it unfortunately has had zero effect on the number of errors observed in production: https://github.com/renovatebot/renovate/commit/8598c5edead7f643a16516ef34d17599cc8e72b6

The errors are still about 75% ECONNRESET and 25% ParseError. Retrying solves the problem in each case though.

szmarczak commented 4 years ago

Weird. I'll try to reproduce the issue locally.

szmarczak commented 4 years ago

I think you should upgrade to Got 10 and see if it's gone.

rarkins commented 4 years ago

Got 10 doesn’t seem stable/battle-tested enough yet to deploy to something the size of Renovate’s production (1m+ requests per hour). I think I’d be jumping out of the frying pan and into the fire..

szmarczak commented 4 years ago

Well, if you have many instances of Renovate you can just deploy to one and roll back if something doesn't work as it should. If that's still a no, I'll try to reproduce the issue locally :)

rarkins commented 4 years ago

I would like to soon, but I need to think about how I can measure errors before and after, to try to detect the things that could go wrong.

viceice commented 4 years ago

got 10 does not accept readableHighWaterMark. Where should we put it now?

rarkins commented 4 years ago

I think we can remove it, as it seems to have no effect (assuming I configured it correctly)

szmarczak commented 4 years ago

as it seems to have no effect (assuming I configured it correctly)

It has no effect because it didn't work. Seems like net sockets are immutable :(

One more question:

Do the requests fail in bulk? Like if request A fails, then request B fails even 1s later?

rarkins commented 4 years ago

There is some clustering in time, yes. Certainly the problems are more likely to occur together than separately, including when one repository depends on multiple of the problem libraries. I'll try to sort through a snapshot of logs and see how pronounced the clustering is.

rarkins commented 4 years ago

Is there any way to determine if we have too many got instances at once, or too many sockets, or other resource problems like that?

The past two days have seen a big problem in our production app that I've tentatively traced down to got never returning from await calls, even though we're supplying a timeout. I'm wondering if it's some type of "silent" problem such as resource exhaustion. Referring to got@9 still.

szmarczak commented 4 years ago

Is there any way to determine if we have too many got instances at once

By looking at your code I can say that it's impossible unless you're using https://github.com/sindresorhus/import-fresh

or too many sockets

If you're using a custom instance of Agent you can use https://runkit.com/szmarczak/5e5e64d2c38f7e0013896198 or if you use the global one you can do https://runkit.com/szmarczak/5e5e6a5010e7d500137cb9ae
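
For the global agent, the idea is roughly this (a sketch of what the linked RunKit example does; the helper names are mine):

const https = require('https');

// Count sockets currently tracked by the global HTTPS agent
function logAgentStats() {
  const count = (group) =>
    Object.values(group).reduce((sum, sockets) => sum + sockets.length, 0);

  console.log('open sockets:', count(https.globalAgent.sockets));
  console.log('free sockets:', count(https.globalAgent.freeSockets));
  console.log('queued requests:', count(https.globalAgent.requests));
}

setInterval(logAgentStats, 10000).unref();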

or other resource problems like that

Well, if there was a memory leak then the Node.js app would have crashed. If your CPU spins up to 100% then I'm pretty sure there's an infinite loop. It's best to use some resource monitoring tools like https://prometheus.io/

The past two days has seen a big problem in our production app that I've tentatively traced down to got never returning from await calls, even though we're supplying a timeout.

That's definitely a bug. As of now I cannot say whether it's in the Got core or whether the Got instance was configured improperly. When did this start to occur? Was it before https://github.com/renovatebot/renovate/commit/18d2c52327b3a275401f5beec2bd9d3e619ccb2f or after?

viceice commented 4 years ago

Definitely before that commit; it started without any change. Maybe we can try the custom agent pattern to log sockets.

rarkins commented 4 years ago

CPU sits at 0%. Adding copious logging messages appears to show that the last thing that happened is a got() call. This is part of a larger Promise.all, so if one doesn't return/resolve, that seems to explain the hanging.

Renovate does support the global proxy agent, although it's not in use for the hosted app as it has direct internet connectivity. I'd need to do some code tracing to work out if it's ever called/initialized or not.

szmarczak commented 4 years ago

Renovate does support the global proxy agent, although it's not in use for the hosted app as it has direct internet connectivity. I'd need to do some code tracing to work out if it's ever called/initialized or not.

What are the Got options for these requests? Are they the same as in the first post?

rarkins commented 4 years ago

FYI I was finally able to verify got v11 in production and the npm errors immediately stopped. Thank you for the assistance.

szmarczak commented 4 years ago

@rarkins By the way, did you upgrade Node.js also when switching to Got v11?

viceice commented 4 years ago

Nope, we switched to v12 a long time ago.

kirillgroshkov commented 2 years ago

I know posting comments on closed issues is not very helpful, but we're seeing many read ECONNRESET errors from the latest got@11 on Node 16.10. Stack trace:

RequestError: read ECONNRESET
    at ClientRequest.<anonymous> (node_modules/got/dist/source/core/index.js:962:111)
    at Object.onceWrapper (node:events:510:26)
    at ClientRequest.emit (node:events:402:35)
    at ClientRequest.emit (node:domain:475:12)
    at ClientRequest.origin.emit (node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)
    at TLSSocket.socketErrorListener (node:_http_client:447:9)
    at TLSSocket.emit (node:events:390:28)
    at TLSSocket.emit (node:domain:475:12)
    at emitErrorNT (node:internal/streams/destroy:157:8)
    at emitErrorCloseNT (node:internal/streams/destroy:122:3)
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:220:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  code: 'ECONNRESET',
  timings: {
    start: 1645706450837,
    socket: 1645706450839,
    lookup: 1645706450840,
    connect: 1645706450843,
    secureConnect: 1645706450858,
    upload: 1645706450858,
    response: 1645706451042,
    end: 1645706451088,
    error: 1645706451113,
    abort: undefined,
    phases: {
      wait: 2,
      dns: 1,
      tcp: 3,
      tls: 15,
      request: 0,
      firstByte: 184,
      download: 46,
      total: 276
    }
  }
}

szmarczak commented 2 years ago

@kirillgroshkov It comes directly from the TCP socket, meaning either your network was disrupted or the end server is having issues.