Closed: rarkins closed this issue 4 years ago
I'm wondering, does the retry part above indicate that there's actually no retrying? Or could it indicate an incomplete stringifying of the object, etc?
Could be both. Try
```js
console.log('methods', [...error.gotOptions.retry.methods]);
console.log('statusCodes', [...error.gotOptions.retry.statusCodes]);
console.log('errorCodes', [...error.gotOptions.retry.errorCodes]);
```
Also, the stack trace seems to be incomplete. Can you try executing `Error.stackTraceLimit = Infinity;` (or just use longjohn)?
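To illustrate what that setting does, here is a small self-contained sketch (plain Node.js, nothing got-specific): `Error.stackTraceLimit` caps how many frames V8 captures when an error is created, so raising it reveals more of a deep call chain.

```javascript
// Error.stackTraceLimit caps how many frames V8 captures per error.
function recurse(n) {
  if (n === 0) throw new Error('boom');
  recurse(n - 1);
}

Error.stackTraceLimit = 5;
let shortFrames;
try { recurse(30); } catch (e) {
  shortFrames = e.stack.split('\n').length - 1; // subtract the message line
}

Error.stackTraceLimit = Infinity;
let longFrames;
try { recurse(30); } catch (e) {
  longFrames = e.stack.split('\n').length - 1;
}

console.log(shortFrames, longFrames); // 5 frames vs. the full chain (30+)
```

The setting only affects errors created after it is changed, which is why it must run before the failing request.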
Additionally, is there any additional debugging of the err object I can do once such errors are thrown?
Unfortunately no.
I've got some questions anyway:

- Can you reproduce the problem locally?
- Are you sure there are no small network outages?
- Are you using an HTTP proxy?
- Does it happen only for specific requests? If so, which one? Please provide the most accurate information you can (request path, etc.)
- What Node.js version are you running? Have you tried the latest one?

Please try the following:

- Set your DNS server to 1.1.1.1 or 8.8.8.8.
- Disable caching if present.
- Pass the `agent: false` option.

If none of the above helps, you may want to try Got 10. There are some breaking changes compared to Got 9, but it's got some nice features: fixed retry functionality (twice), DNS cache, Brotli support, Node.js 13 support, better async stacktraces, experimental pagination API and other bug fixes. I'm willing to send a PR with the upgrade if necessary :)
Got 10 has some bugs but I don't think they would affect Renovate. Anyway, half of the issues are going to be fixed in an upcoming rewrite of Got, because the current design has some limitations and the performance can be improved.
Also Got 10 provides native TypeScript typings :)
Hi @szmarczak, I appreciate the very quick response!
FYI this is only happening in production, so I am a little constrained as to how "debuggy" I can get or how much risk I can take, because there are well over 100k installed repos and any mistake can have large consequences (ask me how..)
I am adding specific logging for `errorCodes` and `statusCodes` like you suggest, and will wait for some results.
Also, the stack trace seems to be incomplete. Can you try executing `Error.stackTraceLimit = Infinity;` (or just use longjohn)?
I need to check that this is OK to do in production. I notice longjohn says not to, but do you think they're being overly cautious? Do you expect got to throw a more detailed error?
- Can you reproduce the problem locally?
No. I've run Renovate locally with some of the commonly failing ones, but they download fine.
- Are you sure there are no small network outages?
This would seem unlikely. The problem occurs hour after hour, and has "clusters" of particular package failures that indicate it's not random.
- Are you using an HTTP proxy?
No
- Does it happen only for specific requests? If so, which one? Please provide the most accurate information you can (request path, etc.)
There seems to be a correlation that the ones that fail are mostly large. e.g. https://registry.npmjs.org/@typescript-eslint%2Feslint-plugin is 14MB. However there are other even larger ones that have had zero failures.
- What Node.js version are you running? Have you tried the latest one?
Running Node 12. Node 13 is not really an option considering it's unstable. I am a few releases behind on Node 12, though, so will upgrade that in production today.
- Set your DNS server to 1.1.1.1 or 8.8.8.8.
I'd prefer not to, unless we have a reason to think it's the problem.
- Disable caching if present.

No `got` caching present.
- Pass the `agent: false` option.

Can you give an example of this in a got@9 command? e.g. is it `got(url, { agent: false })`?
Unfortunately my logger doesn't want to share errorCodes and statusCodes:

```json
"errorCodes": "[Circular]",
"statusCodes": "[Circular]",
```
Any trick for this? Here is the code in case that helps: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L226-L235
We use the bunyan logger but have a custom serializer for `err`, largely for customizing `got` output: https://github.com/renovatebot/renovate/blob/master/lib/logger/err-serializer.ts
Something else surprising to me now. I have increased retries to 5: https://github.com/renovatebot/renovate/blob/237eeffe6fa74335c817d308431ee9551b431c3b/lib/datasource/npm/get.ts#L111
How likely is it that we still end up with the same number of `ECONNRESET` errors? i.e. 5 times in total it connects, starts the transfer, and then gets terminated by the remote end? This is why I am wondering if it's really retrying or not.
I need to check that this is OK to do in production. I notice longjohn say not to, but do you think they're being overly cautious? Do you expect got to throw a more detailed error?
Yup. If you believe that an unlimited stack trace is bad, you can do `Error.stackTraceLimit = 20;`, which should be enough. (All it does is increase the maximum number of frames kept in the stack trace.)
I've run Renovate locally with some of the commonly failing ones, but they download fine.
This could be related to the network speed and/or the `highWaterMark` option. Try setting the `highWaterMark` option (it may be missing in the types) to `104857600` (100MB) and see if that resolves the problem.
There seems to be a correlation that the ones that fail are mostly large.
Do they look like `@scope/package`? Or are there some global packages too?
Running Node 12. Node 13 is not really an option considering it's unstable. I am a few releases behind on Node 12, though, so will upgrade that in production today.
Well, I know from experience that the http modules are most stable in the newest Node.js versions...
I'd prefer not to, unless we have a reason to think it's the problem.
My point is that your current DNS provider may point to overloaded servers or have outdated entries. Try providing a `host` option (npm's IP).
is it got(url, { agent: false })?
Yep, it is.
Any trick for this?
The normalized `errorCodes` is a Set in Got 9. Convert it to an array:

```js
logger.warn(
  {
    err,
    errorCodes: [...err.gotOptions?.retry?.errorCodes],
    statusCodes: [...err.gotOptions?.retry?.statusCodes],
    regUrl,
    depName: name,
  },
  'npm registry failure'
);
```
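The spread matters because of how `JSON.stringify` treats Sets, which is one reason a serializer can end up showing nothing useful. A minimal stdlib-only demonstration (the error codes below are just example values):

```javascript
// JSON.stringify renders a Set as "{}", so a Set logged as-is loses its
// contents; spreading into an array makes the values serializable.
const errorCodes = new Set(['ETIMEDOUT', 'ECONNRESET', 'EADDRINUSE']);

console.log(JSON.stringify(errorCodes));      // {}
console.log(JSON.stringify([...errorCodes])); // ["ETIMEDOUT","ECONNRESET","EADDRINUSE"]
```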
How likely is it that we still end up with the same number of ECONNRESET errors?
Assuming the npm registry's uptime is ~100%, the chance is almost 0%.
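As a back-of-the-envelope check of that claim: if each attempt fails independently, the probability that every attempt fails shrinks geometrically. The failure rate below is a hypothetical number for illustration, not a measurement.

```javascript
// If a single attempt fails independently with probability p, the chance
// that all (1 initial try + 5 retries) attempts fail is p^6.
const p = 0.01;      // assumed per-attempt failure rate (hypothetical)
const attempts = 6;  // 1 initial try + 5 retries
const allFail = p ** attempts;
console.log(allFail); // ~1e-12, i.e. effectively never
```

Which is why seeing the same error 5 times in a row strongly suggests either correlated failures or retries not happening at all.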
This is why I am wondering if it's really retrying or not.
Add a `beforeRetry` hook and log something in there.
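To show the idea, here is a minimal retry loop that invokes a `beforeRetry`-style hook before each new attempt. It only mimics the shape of got's hook; it is not got's implementation, and it is synchronous for brevity.

```javascript
// Sketch: retry loop that calls a beforeRetry-style hook before each
// new attempt, so logging inside the hook proves retries are happening.
function withRetries(fn, { retries, beforeRetry }) {
  for (let attempt = 0; ; attempt++) {
    try {
      return fn();
    } catch (error) {
      if (attempt >= retries) throw error; // retries exhausted
      beforeRetry(error, attempt + 1);     // log here
    }
  }
}

// Usage: a task that fails twice with ECONNRESET, then succeeds.
let calls = 0;
const task = () => {
  calls += 1;
  if (calls < 3) {
    const err = new Error('read ECONNRESET');
    err.code = 'ECONNRESET';
    throw err;
  }
  return 'ok';
};

const retryLog = [];
const result = withRetries(task, {
  retries: 5,
  beforeRetry: (error, retryCount) => retryLog.push(`retry #${retryCount}: ${error.code}`),
});
console.log(result, retryLog); // 'ok' [ 'retry #1: ECONNRESET', 'retry #2: ECONNRESET' ]
```

If a hook like this never fires in production while errors keep arriving, the retry layer is being bypassed.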
I'll send you a PR to upgrade to Got 10 today. That should fix the problem with retries.
Yup. If you believe that an unlimited stack trace is bad, you can do `Error.stackTraceLimit = 20;`
Thanks, I'll do that.
This could be related to the network speed and/or the `highWaterMark` option.
I'm not using `got.stream` - do you know if it's still applicable?
Do they look like `@scope/package`? Or are there some global packages too?
It's mostly scoped packages and mostly large ones, but still a few counterexamples too. I think scoped packages have a different "backend" within npmjs, so scoped vs non-scoped can definitely explain some differences at times. I saw the `next` package failing frequently too, for example (another large one).
Well, I know from experience that the http modules are most stable in the newest Node.js versions...
FYI updating to Node v12.15.0 did not appear to have any impact on the stats.
I worked something else out that may be making the problem worse. It appears that the custom got cache I added a long time ago blocked my own attempts at `ECONNRESET` retries. It's possible that I inserted it at a point in got's flow where it is also blocking got's own "built in" retries. FYI the point of that caching is to catch when multiple queries to the same URL are all "in flight" at once - the goal/effect was to limit it to one by caching the promise.
After adding in "manual" retries (catching got errors and recursing) I was finally able to lower the error rate. Anecdotally it seems like most are recovering first-time, as I think we had both thought should happen.
My best guess is that Renovate's custom got caching behaviour prevents got's own retry mechanism from succeeding. I wonder if it's possible to add a `catch` here and delete the cache in case of failure?
do you know if it's still applicable?
Yes, because `http` is based on streams.
After adding in "manual" retries (catching got errors and recursing) I was finally able to lower the error rate. Anecdotally it seems like most are recovering first-time, as I think we had both thought should happen.
So it seems that retries in Got 9 are buggy. I'd prefer to switch to Got 10.
My best guess is that Renovate's custom got caching behaviour prevents got's own retry mechanism from succeeding.
Dunno. Possibly.
I wonder if it's possible to add a catch here and delete the cache in case of failure?
Should be possible. Just remember to rethrow the error.
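A sketch of that pattern, assuming a plain `Map` keyed by URL (illustrative only, not Renovate's actual cache code): cache the in-flight promise so concurrent callers share one request, but evict the entry and rethrow on failure so the next caller retries fresh instead of receiving the cached rejection.

```javascript
// In-flight promise cache: concurrent requests for the same key share
// one promise; on failure the entry is deleted and the error rethrown,
// so a later call can retry instead of hitting a cached rejection.
const inFlight = new Map();

function cachedFetch(key, fetcher) {
  if (!inFlight.has(key)) {
    const promise = fetcher(key).catch((error) => {
      inFlight.delete(key); // don't keep a rejected promise cached...
      throw error;          // ...and remember to rethrow
    });
    inFlight.set(key, promise);
  }
  return inFlight.get(key);
}

// Usage: two concurrent calls share the same promise object.
const failingFetcher = (key) => Promise.reject(new Error(`ECONNRESET for ${key}`));
const a = cachedFetch('https://registry.npmjs.org/next', failingFetcher);
const b = cachedFetch('https://registry.npmjs.org/next', failingFetcher);
console.log(a === b, inFlight.size); // true 1
a.catch(() => {}); // swallow for the demo; real callers would retry here
```

Without the `.catch` eviction, every later call for the same URL would keep receiving the original rejection, which would look exactly like "retries never succeed".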
I'll send you a PR with Got 10 soon.
I did some basic work here: https://github.com/renovatebot/renovate/pull/5469 but you still need to fix the types and run the tests. If something fails please let me know straight away :)
I discovered that unfortunately our `ECONNRESET` errors, once retried, then often become `parseError` errors on the next attempt. I would like to try increasing the `highWaterMark` as you suggested but am not sure how to do that as I can't find any code samples. Could you advise?
I discovered that unfortunately our ECONNRESET errors, once retried, then often become parseError errors on the next attempt.
What does the `ParseError` look like?
I would like to try increasing the highWaterMark as you suggested but am not sure how to do that as I can't find any code samples. Could you advise?
Note that it may be missing in the TS types:

```js
// 10MB
got(url, {...options, readableHighWaterMark: 1024 * 1024 * 10})
```

It just stores (max) 10MB of the incoming response data instead of the default 16KB.
What does the `ParseError` look like?
Unfortunately, it looks like a multi-megabyte dump in my log files whenever it happens! I checked a few and they appear to just be truncated JSON responses from npmjs, as if receiving data finished early.
It just stores (max) 10MB of the incoming response data instead of the default 16KB.
Thanks. I did not find many references to this in my searches. Do you know if there are any references for people implementing this type of increase in a client and it not causing something else to break?
Do you know if there are any references for people implementing this type of increase in a client and it not causing something else to break?
I did the research on my own.
- `agent.addRequest(request, options)`
- `agent.createSocket(request, options, handler)`
- `agent.createConnection(options, handler)`
- `net.createConnection(options, handler)`
- All sockets are Duplex streams
- `readableHighWaterMark` option

As you can see, the options are passed all along, so `got(url, {...options, readableHighWaterMark: 1024 * 1024 * 10})` should work.
You don't need to set it manually on the `IncomingMessage` (response) instance, because Node.js already does that for you.
Unfortunately, it looks like a multi-megabyte dump in my log files whenever it happens! I checked a few and they appear to just be truncated JSON responses from npmjs, as if receiving data finished early.
That's what I expected. Please let me know if the `readableHighWaterMark` option does the trick.
Assuming I inserted the code correctly, it unfortunately has had zero effect on the number of errors observed in production: https://github.com/renovatebot/renovate/commit/8598c5edead7f643a16516ef34d17599cc8e72b6
The errors are still about 75% ECONNRESET and 25% ParseError. Retrying solves the problem in each case though.
Weird. I'll try to reproduce the issue locally.
I think you should upgrade to Got 10 and see if it's gone.
Got 10 doesn't seem stable/battle-tested enough yet to deploy to something the size of Renovate's production (1m+ requests per hour). I think I'd be jumping out of the frying pan and into the fire..
Well, if you have many instances of Renovate you can just deploy to one and roll back if something doesn't work as it should. If that's still a no, I'll try to reproduce the issue locally :)
I would like to soon, but I need to think how I can measure errors before and after to try to detect the things that would go wrong
got 10 does not accept `readableHighWaterMark`. Where to put it now?
I think we can remove it, as it seems to have no effect (assuming I configured it correctly)
as it seems to have no effect (assuming I configured it correctly)
It has no effect because it didn't work. Seems like `net` sockets are immutable :(
One more question:
Do the requests fail in bulk? Like if request A fails, then request B fails even 1s later?
There is some clustering in time, yes. Certainly the problems tend to be more likely to be together than separate, including when one repository depends on multiple of the problem libraries. I'll try to sort a snapshot of logs and see how pronounced the clustering is.
Is there any way to determine if we have too many `got` instances at once, or too many sockets, or other resource problems like that?
The past two days have seen a big problem in our production app that I've tentatively traced down to `got` never returning from `await` calls, even though we're supplying a `timeout`. I'm wondering if it's some type of "silent" problem such as resource exhaustion. Referring to got@9 still.
Is there any way to determine if we have too many got instances at once
By looking at your code I can say that it's impossible unless you're using https://github.com/sindresorhus/import-fresh
or too many sockets
If you're using a custom instance of Agent you can use https://runkit.com/szmarczak/5e5e64d2c38f7e0013896198 or if you use the global one you can do https://runkit.com/szmarczak/5e5e6a5010e7d500137cb9ae
or other resource problems like that
Well, if there was a memory leak then the Node.js app would have crashed. If your CPU spins up to 100% then I'm pretty sure there's an infinite loop. It's best to use some resource monitoring tools like https://prometheus.io/
The past two days has seen a big problem in our production app that I've tentatively traced down to got never returning from await calls, even though we're supplying a timeout.
That's definitely a bug. As of now I cannot say whether it's in the Got core or the Got instance was configured improperly. When did this start to occur? Was it before https://github.com/renovatebot/renovate/commit/18d2c52327b3a275401f5beec2bd9d3e619ccb2f or after?
Definitely before that commit; it started without any change. Maybe we can try the custom agent pattern to log sockets.
CPU sits at 0%. Adding copious logging messages appears to show that the last thing that happened is a `got()` call. This is part of a larger `Promise.all`, so if one doesn't return/resolve, then that seems to explain the hanging.
Renovate does support the global proxy agent, although it's not in use for the hosted app as it has direct internet connectivity. I'd need to do some code tracing to work out if it's ever called/initialized or not.
Renovate does support the global proxy agent, although it's not in use for the hosted app as it has direct internet connectivity. I'd need to do some code tracing to work out if it's ever called/initialized or not.
What are the Got options for these requests? Are they the same as in the first post?
FYI I was finally able to verify got v11 in production and the npm errors immediately stopped. Thank you for the assistance.
@rarkins By the way, did you upgrade Node.js also when switching to Got v11?
Nope, we switched to v12 a long time ago.
I know posting comments on closed issues is not very helpful, but we're seeing many read `ECONNRESET` errors from the latest got@11 on Node 16.10.
Stacktrace:

```text
RequestError: read ECONNRESET
    at ClientRequest.<anonymous> (node_modules/got/dist/source/core/index.js:962:111)
    at Object.onceWrapper (node:events:510:26)
    at ClientRequest.emit (node:events:402:35)
    at ClientRequest.emit (node:domain:475:12)
    at ClientRequest.origin.emit (node_modules/@szmarczak/http-timer/dist/source/index.js:43:20)
    at TLSSocket.socketErrorListener (node:_http_client:447:9)
    at TLSSocket.emit (node:events:390:28)
    at TLSSocket.emit (node:domain:475:12)
    at emitErrorNT (node:internal/streams/destroy:157:8)
    at emitErrorCloseNT (node:internal/streams/destroy:122:3)
    at TLSWrap.onStreamRead (node:internal/stream_base_commons:220:20)
    at TLSWrap.callbackTrampoline (node:internal/async_hooks:130:17) {
  code: 'ECONNRESET',
  timings: {
    start: 1645706450837,
    socket: 1645706450839,
    lookup: 1645706450840,
    connect: 1645706450843,
    secureConnect: 1645706450858,
    upload: 1645706450858,
    response: 1645706451042,
    end: 1645706451088,
    error: 1645706451113,
    abort: undefined,
    phases: {
      wait: 2,
      dns: 1,
      tcp: 3,
      tls: 15,
      request: 0,
      firstByte: 184,
      download: 46,
      total: 276
    }
  }
}
```
@kirillgroshkov It comes directly from the TCP socket, meaning either your network was disrupted or the end server is having issues.
What would you like to discuss?
My tool uses got@9.6.0 and regularly gets these types of error from npmjs:

I'm wondering, does the `retry` part above indicate that there's actually no retrying? Or could it indicate an incomplete stringifying of the object, etc?

Additionally, is there any additional debugging of the `err` object I can do once such errors are thrown?

Checklist