jandppw opened this issue 7 months ago
Can you try to use the undici package instead of the bundled version? To see if the problem still occurs in newer versions of undici.
Because the issues are intermittent and do not occur locally, we cannot really test for them. Realistically, we can only gather log information from failed invocations in production over a number of hours or days. We just deployed a new version that should log the stack trace of the cause of the errors, to get a bit more information.
I understand that rolling out a version that uses the latest version of undici directly would help in isolating the issues. Yet, we believe it would be unwise to expose our users to more experiments than necessary for several days without addressing the issue they have.
Because this is an annoying production issue (~3.5% of interactions failing), we must focus on stability first. Our plan is to revert to using `node-fetch` (on Node 20) asap. `fetch` is still experimental in Node 20, after all, and maybe we jumped the gun.
In the meantime, we will report further information as the errors appear with more log information.
We understand the report is not very concrete, and regret we cannot be of any more help. Yet we believe the observations we made are important enough to share here, if only for reference if other people encounter similar issues (e.g., https://github.com/nodejs/undici/issues/2964).
@jandppw Have you considered simply using `undici.request`, which is our recommended API? `fetch` should only be used if you require compatibility with runtimes other than Node.js.
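For readers unfamiliar with that API, a minimal sketch of what a call through `undici.request` could look like (hypothetical function name and URL; assumes `undici` is installed as a direct dependency):

```js
// Sketch only: assumes `npm install undici`; callRemoteApi and the URL are
// illustrative, not code from this project.
import { request } from 'undici';

async function callRemoteApi(url) {
  const { statusCode, body } = await request(url, { method: 'GET' });
  if (statusCode !== 200) {
    await body.dump(); // the body must still be consumed so the connection can be reused
    throw new Error(`unexpected status ${statusCode}`);
  }
  return body.json(); // undici bodies also expose text(), arrayBuffer(), …
}
```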
@ronag No, we haven't. We weren't aware of the fact that `undici.request` is the recommended API, nor that `fetch` should only be used if you require compatibility with runtimes other than Node.js. Is there any documentation on that?
We were under the impression that `fetch` is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn't it?
Is there any documentation on that?
Not sure. Where would you expect that to be documented?
We were under the impression that fetch is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn’t it?
fetch was added due to popular demand, not due to technical merits. It's never going to be as fast as `undici.request`. I would stay away from fetch unless you are writing isomorphic code, or if your team is mostly comfortable with fetch and performance is not critical.
But that would require adding `undici` as a direct dependency in the project, no? In that situation, `undici` is a contender next to `node-fetch`, `axios`, `got`, `superagent`, …, no?
If that is the case, in the current project, at this time, we'd rather revert to proven safe ground (i.e., `node-fetch`).
Before the above occurred, we were not familiar with `undici`. Until your comment, to us `undici` was the thing used inside Node for `fetch`. Doing some web searches, blogs anno 2024 still refer to the usual suspects, and do not mention `undici`.
For future work, no dependency trumps any dependency, unless there are very good reasons (such as working with promises instead of consuming events ourselves). In that space, now that we have learned about `undici`, we will certainly take it into account in future work.
But, to answer the question "Where would you expect that to be documented?": well, first of all in blogs that I hit with a Google search about doing API calls inside a Node application. This would point me to undici @ npm. None of this is mentioned there. Next, we would have to choose between the contenders. For us, this is probably a bit late. We've been using `axios` and `node-fetch` for years now, and would need good reasons to change away from things that are proven in usage.
Now, your remark that we should stay away from using `fetch` in a Node-only application, such as a Lambda, is of a different order. That should be made clear first, because no dependency trumps any dependency unless there are very good reasons. I have not encountered that point before. So documenting that, and getting the message out, should be the first step. If this is the official Node point of view, maybe the Node `fetch` documentation should mention it. But this is probably not the forum for that :-).
I would really love for the AWS Lambda team to help out on this one. We lack visibility into the Node.js distribution they are running. Essentially, they are likely doing certain things to the Node.js runtime that are causing this.
This makes us believe these cases are bugs in undici, or in the AWS Lambda x86_64 nodejs20.x OS.
Given you are using Node.js for free but paying for AWS Lambda, I recommend you contact AWS support.
In all samples we examined, the error occurs in the first call to fetch in the invocation. We assume these calls never left the Lambda environment, and they might occur on the first call to fetch after spawning a Lambda instance.
I suspect it's due to keep-alive connections. Lambda suspends the process and all its state after a response is sent. Undici keeps the connections alive, but when Lambda resumes the process we have no way to know whether we need to reset all the connections. There is no event for us to hook into on resume or suspend.
Try adding a `connection: 'close'` header to your requests. Alternatively: create a new `undici.Client` for each request and pass it as a `dispatcher` property to `undici.fetch()`, then `.close()` the client.
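A rough sketch of both workarounds (illustrative URL, no error handling; `dispatcher` is the undici-specific fetch option mentioned above):

```js
// Sketch of the two suggested workarounds (ESM, top-level await).
import { fetch, Client } from 'undici';

// Option 1: ask the server to close the connection after the response,
// so no kept-alive socket survives a Lambda freeze/resume.
const res1 = await fetch('https://example.com/api', {
  headers: { connection: 'close' },
});
await res1.text();

// Option 2: a fresh Client per request, passed as the dispatcher, then closed.
const client = new Client('https://example.com');
try {
  const res2 = await fetch('https://example.com/api', { dispatcher: client });
  await res2.text();
} finally {
  await client.close();
}
```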
@mcollina You are absolutely right that this might very well be an AWS issue, or an interop issue (as noted in the report). We just thought that sharing our observations publicly might help other people that might encounter the same issue (as in “no, I’m not crazy”).
AWS support will be made aware of the issue.
Your hint about a Lambda resuming seems plausible.
@mcollina Actually, it turns out: no. There seems to be no easy and 0€ way to make AWS aware of the issue for this client.
Our main focus is getting stability back to our users.
I share your frustration about AWS not being involved, but there seems to be no easy way for us to get AWS on the case without paying extra €. The only way to get AWS involved seems to be public side channels, such as social media. This public report might be a first effort.
Any response from AWS now (if I were in their place, anyway) would be that native `fetch` on Node 20 is still experimental, so go away.
Apart from all that, there are no ill feelings here regarding `undici` or `node`. In any case, by the time AWS offers Node 22, there will have to be some resolution, since native `fetch` is no longer experimental as of Node 21.
In the meantime, this report might help other people realize that it is probably not a good idea to use native `fetch` in AWS Lambdas for the time being.
Out of curiosity, I attempted to replicate this in us-east-1 and couldn't.
I set up a CloudFront distribution with a Lambda named PrivateLambda (invoked over HTTPS via a function URL) as the origin, then set up a second Lambda named PublicLambda (with a function URL) that calls the CloudFront distribution 25 times in a loop.
I called PublicLambda repeatedly using k6, ramping concurrency up and down to suss out cold/warm start issues. Every request succeeded:
% k6 run test.js
execution: local
script: test.js
output: -
scenarios: (100.00%) 1 scenario, 50 max VUs, 35m30s max duration (incl. graceful stop):
* default: Up to 50 looping VUs for 35m0s over 7 stages (gracefulRampDown: 30s, gracefulStop: 30s)
✓ is status 200
✓ is body 25
checks.........................: 100.00% ✓ 99290 ✗ 0
data_received..................: 17 MB 8.1 kB/s
data_sent......................: 7.4 MB 3.5 kB/s
http_req_blocked...............: avg=149.91µs min=1µs med=7µs max=170.28ms p(90)=9µs p(95)=9µs
http_req_connecting............: avg=59.64µs min=0s med=0s max=60.26ms p(90)=0s p(95)=0s
http_req_duration..............: avg=668.06ms min=465.52ms med=625.45ms max=5.7s p(90)=825.18ms p(95)=900.9ms
{ expected_response:true }...: avg=668.06ms min=465.52ms med=625.45ms max=5.7s p(90)=825.18ms p(95)=900.9ms
http_req_failed................: 0.00% ✓ 0 ✗ 49645
http_req_receiving.............: avg=136.27µs min=16µs med=117µs max=204.25ms p(90)=167µs p(95)=199µs
http_req_sending...............: avg=36.73µs min=6µs med=34µs max=10.91ms p(90)=48µs p(95)=53µs
http_req_tls_handshaking.......: avg=81.24µs min=0s med=0s max=130.21ms p(90)=0s p(95)=0s
http_req_waiting...............: avg=667.89ms min=465.38ms med=625.3ms max=5.7s p(90)=824.94ms p(95)=900.73ms
http_reqs......................: 49645 23.638233/s
iteration_duration.............: avg=668.46ms min=465.76ms med=625.79ms max=5.76s p(90)=825.95ms p(95)=901.25ms
iterations.....................: 49645 23.638233/s
vus............................: 1 min=0 max=50
vus_max........................: 50 min=50 max=50
Over 35 minutes this made 49,645 requests to PublicLambda, which means we successfully made 1,241,125 fetch calls.
Hopefully I did my math right and this little experiment only cost me a few dollars 😅
Attached: the CloudFormation template for the set-up and the k6 test file: replicate-undici-3133.zip
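Not the attached code, but roughly the shape such a test handler can take (hypothetical names and environment variable):

```js
// Hypothetical sketch of a PublicLambda-style handler: 25 sequential fetch calls
// per invocation against a URL taken from an (assumed) environment variable.
const TARGET_URL = process.env.TARGET_URL;

export const handler = async () => {
  let ok = 0;
  for (let i = 0; i < 25; i++) {
    const res = await fetch(TARGET_URL);
    await res.text(); // drain the body so the connection can be reused
    if (res.ok) ok++;
  }
  // k6 checks the body, so return the number of successful calls.
  return { statusCode: 200, body: String(ok) };
};
```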
The issues shared seem to follow a pattern of constant connections and abrupt disconnections; I believe I'm seeing an issue similar to what @mcollina describes. I'd suggest giving his suggestion a try and starting from there.
The Lambda runtime might be doing constant cleanups while resuming, causing intermittent connect issues.
Yesterday we deployed a version that logs the stack trace of the `cause`, while working on a revert to `node-fetch`. So far, 2 errors of the `ETIMEDOUT` variant have been logged, both with the same stack trace:
TypeError: fetch failed
at node:internal/deps/undici/undici:12345:11
at async getImplementation (/var/task/lib/dal/get.js:75:20)
{ cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } }
Error: write ETIMEDOUT
at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
at handleWriteReq (node:internal/stream_base_commons:58:21)
at writeGeneric (node:internal/stream_base_commons:149:15)
at Socket._writeGeneric (node:net:952:11)
at Socket._write (node:net:964:8)
at writeOrBuffer (node:internal/streams/writable:564:12)
at _write (node:internal/streams/writable:493:10)
at Writable.write (node:internal/streams/writable:502:10)
at write (node:internal/deps/undici/undici:9269:18)
at _resume (node:internal/deps/undici/undici:9172:33)
at resume (node:internal/deps/undici/undici:9082:7)
at [dispatch] (node:internal/deps/undici/undici:8327:11)
at Client.Intercept (node:internal/deps/undici/undici:8000:20)
at Client.dispatch (node:internal/deps/undici/undici:6606:44)
at [dispatch] (node:internal/deps/undici/undici:6837:32)
at Pool.dispatch (node:internal/deps/undici/undici:6606:44)
at [dispatch] (node:internal/deps/undici/undici:9957:27)
at Agent.Intercept (node:internal/deps/undici/undici:8000:20)
at Agent.dispatch (node:internal/deps/undici/undici:6606:44)
at node:internal/deps/undici/undici:10929:55
at new Promise (<anonymous>)
at dispatch (node:internal/deps/undici/undici:10929:16)
at httpNetworkFetch (node:internal/deps/undici/undici:10824:73)
at httpNetworkOrCacheFetch (node:internal/deps/undici/undici:10712:39)
at httpFetch (node:internal/deps/undici/undici:10545:43)
at schemeFetch (node:internal/deps/undici/undici:10469:18)
at node:internal/deps/undici/undici:10342:26
at mainFetch (node:internal/deps/undici/undici:10361:11)
at fetching (node:internal/deps/undici/undici:10309:7)
at fetch (node:internal/deps/undici/undici:10173:20)
at fetch (node:internal/deps/undici/undici:12344:10)
at fetch (node:internal/bootstrap/web/exposed-window-or-worker:79:16)
at getImplementation (/var/task/lib/dal/get.js:75:26)
…
These again were the first `fetch` calls during the invocation. The time between the `fetch` call and the logging of the error was <4 ms.
We are using the `amazon-cognito-identity-js` package inside of the AWS Lambda function to talk to AWS Cognito. The AWS Lambda function makes multiple AWS Cognito calls. We see this error randomly for some of the calls.
{
"cause": {
"name": "ConnectTimeoutError",
"code": "UND_ERR_CONNECT_TIMEOUT",
"message": "Connect Timeout Error"
}
}
Error: Network error
at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:117:15
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
code: 'NetworkError'
}
The AWS Lambda function is configured to use the Node.js 18.x runtime. The earliest occurrence of this error seems to be around six months ago, but we need to investigate the AWS CloudWatch logs to see whether it occurred earlier too.
@ajinkyashukla29dj `amazon-cognito-identity-js` uses `isomorphic-unfetch`, which uses the globally defined fetch if available. Can you disable experimental fetch by setting the `NODE_OPTIONS` [1] environment variable in your Lambdas to `--no-experimental-fetch` [2] and see if your issue persists?
[1] https://docs.aws.amazon.com/lambda/latest/dg/runtimes-modify.html [2] https://nodejs.org/api/cli.html#--no-experimental-fetch
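A quick way to check that the flag took effect is to log, at cold start, whether the global exists (sketch; with the flag set, `isomorphic-unfetch` should then fall back to its bundled implementation):

```js
// Sketch: log once at module load (i.e. on cold start) whether global fetch
// was disabled; with NODE_OPTIONS=--no-experimental-fetch it should be undefined.
console.log('global fetch available:', typeof globalThis.fetch === 'function');
```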
One more error came in, this time of the `UND_ERR_SOCKET` variant:
TypeError: fetch failed
at node:internal/deps/undici/undici:12345:11
at async getImplementation (/var/task/lib/dal/get.js:75:20)
{
cause: {
name: 'SocketError',
code: 'UND_ERR_SOCKET',
socket: {
localAddress: '169.NNN.NN.N',
localPort: NNNNN,
remoteAddress: undefined,
remotePort: undefined,
remoteFamily: undefined,
timeout: undefined,
bytesWritten: 1185,
bytesRead: 69979
}
}
}
The `cause` stack:
SocketError: other side closed
at TLSSocket.onSocketEnd (node:internal/deps/undici/undici:8903:26)
at TLSSocket.emit (node:events:530:35)
at endReadableNT (node:internal/streams/readable:1696:12)
at process.processTicksAndRejections (node:internal/process/task_queues:82:21)
This again was the first `fetch` call during the invocation. The time between the `fetch` call and the logging of the error was <3 ms.
@jandppw I think your best course of action is to swap out native fetch for `node-fetch`, keeping everything else the same. If the errors go away, it might be an undici issue. If the issues persist, it is either a remote-server issue, or something at a lower level than undici.
btw: not an undici code contributor, just curious at this point.
I use Node.js 21.7.2 for a simple scraper project that uses `await fetch` inside try/catch, but I get really bad stability: internal errors of undici somehow cause ENETUNREACH, my try/catch never triggers, and instead Node.js just dies. It's super unstable and I need another implementation to get this working.
Edit: I simply ported it to Deno and now it is stable.
3 more errors of the `ETIMEDOUT` variant came in this morning, with the same stack trace. A new version that no longer uses native `fetch` is now rolled out to production, so this is the last observation we can share.
@mweberxyz Thank you for your inputs. I don't see the `ConnectTimeoutError` error after setting the `NODE_OPTIONS` env variable to `--no-experimental-fetch`. However, I am seeing the following response even though the username and password are correct:
{
"size": 0,
"timeout": 0
}
ERROR NotAuthorizedException: Incorrect username or password.
at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:136:19
at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
code: 'NotAuthorizedException'
}
@ajinkyashukla29dj Sorry, I can't help with that error; it's coming from the `amazon-cognito-identity-js` library. It looks like the HTTP request succeeded, though!
Can you provide some more details about your Lambda so I can try to replicate the `UND_ERR_CONNECT_TIMEOUT` you encountered when using native fetch? Such as:
* When does `UND_ERR_CONNECT_TIMEOUT` occur? Every request or just intermittently?
For me this is not AWS related; it happens on macOS "all the time" (very often) for a very simple scraper project that does about 500 fetch calls or something. To confirm, I ported it to another implementation of fetch and it worked there without stability issues.
The issue I have is that ETIMEDOUT and similar errors are not being propagated to my try/catch, but instead just crash Node.js somewhere inside undici. It can be reproduced with `await fetch` inside try/catch leading to a crashed Node.js process.
@uNetworkingAB Can you give a code sample? Knowing which remote servers you are having trouble with would help with replication.
Please verify if undici v6.14.1 fixes this.
@jandppw my name is Luca, I work for AWS as Severless Specialist, I'd love to connect with you to review your challenge with the service team. Would you mind ping me in private so we can arrange a chat?
@lucamezzalira Sure, Luca. Sending you an email now.
It might be helpful to collect logs with `NODE_DEBUG=net` set.
Chiming in to say I get "cause EPIPE" on macOS when issuing fetch requests in a tight loop. So this appears to be a buffering issue:
// `reader`, `overrideNow`, `replayClientCommand`, and `replayServerMessage` are
// project-specific helpers; the replay* calls are what issue the fetch requests.
(async () => {
  let packet = reader.next(); // initial read
  while (packet != null) {
    await new Promise(resolve => setTimeout(resolve, 0.2));
    overrideNow(packet.date);
    if (packet.type === 'client') {
      replayClientCommand(packet.command); // not awaited: requests go out in parallel
    } else {
      replayServerMessage(packet.message);
    }
    packet = reader.next();
  }
})();
Basically, the above is some logic I have to replay sessions from captured session files. If I remove the timeout/await then I run into `EPIPE` issues 100% of the time, seemingly immediately. With the 200ms wait between requests, I'm fine. Worth noting I'm not awaiting individual requests, so this code will indeed spam fetch requests in parallel.
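If the 200ms sleep is only masking the problem, a cheaper test of the buffering hypothesis is to serialize the requests instead (a minimal sketch with a hypothetical URL, not the replay code above):

```js
// Minimal sketch (ESM, top-level await): issue the requests strictly one at a time.
// If EPIPE disappears here but returns once the await is removed, the parallel
// spam, rather than any individual request, is the trigger.
const target = 'https://example.com/replay';

for (let i = 0; i < 500; i++) {
  const res = await fetch(target);
  await res.arrayBuffer(); // fully drain each response before the next request
}
```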
We observe consistent intermittent errors when using native `fetch` in AWS Lambda x86_64 nodejs20.x. We were unable to find the particular place in `undici` code that is the culprit of this behavior.
The intention of this submission is to report the issues and to record our observations as accurately as possible, to enable people with more detailed knowledge to isolate and possibly fix the issues, if they are not already fixed (`fetch` is still experimental in Node 20).
Stable precursor
Versions of this service have been in production since 2022-05-19 on AWS Lambda x86_64 nodejs16.x using `node-fetch@2.6.7` without many issues. Notably, in the last 6 months of 2023 exactly 1 communication error calling the remote API was logged. It handles ~700 invocations per day.
Where are the issues observed?
The functionality of the service requires it to do ±25 remote API calls per invocation. The remote API is hosted on a Windows VM through IIS 8, and is accessed through a CloudFront distribution set up to use `http2and3`.
We observe this issue, or these issues, only in AWS Lambda x86_64 nodejs20.x, and have not observed them during development on macOS. (This does not guarantee that the issues are limited to AWS Lambda x86_64 nodejs20.x, but it is possible that they are.)
A new version of the code was released on 2024-03-25, but this did not have any impact on the issues. It seems AWS Lambda x86_64 nodejs20.x changed Node version on 2024-04-03. We cannot report on the exact Node version (and thus `undici` version) because sadly we do not log that. Given that the latest LTS version of Node uses `undici@5.28.4`, this reports on versions `undici@<=5.28.4`. The issues might have been fixed already in later releases, although we don't see any relevant mention of it in the release notes since v5.25.1. Between 2024-02-06 and 2024-04-03 the errors are reported on `node:internal/deps/undici/undici:12344:11`. Between 2024-04-03 and 2024-04-15 the errors are reported on `node:internal/deps/undici/undici:12345:11`.
~1% of invocations fail, with 4 different types of failures
The version of this service that shows the issues has been in production since 2024-02-06. It is deployed on AWS Lambda x86_64 nodejs20.x, and replaced `node-fetch` with native `fetch`. Between 2024-02-06 and 2024-04-15 about 49000 Lambda invocations have been made and logged, of which 338 failed. That is about 0.7%. When an invocation does not fail, it calls `fetch` ~25 times. This means ~0.03% of `fetch` calls fail, but, as you will see, the impact might be greater, because most failures appear to happen on the first `fetch` call of an invocation of a fresh Lambda instance.
All failures are logged as 1 of 4 "fetch failed" `cause` variants:
* `{ cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } }` (163)
* `UND_ERR_SOCKET` (152)
* `{ cause: { name: 'ConnectTimeoutError', code: 'UND_ERR_CONNECT_TIMEOUT', message: 'Connect Timeout Error' } }` (16)
* `{ cause: { errno: -32, code: 'EPIPE', syscall: 'write' } }` (7)
The `UND_ERR_SOCKET` causes are of the general form shown earlier in this thread. Sometimes the remote address and remote port are filled out, sometimes not.
Sadly, we cannot pinpoint where exactly the errors occur, since our logs do not store the stack trace of `cause`. Finding out through code inspection did not turn up anything; the usage of `processResponse` is a bit opaque.
ETIMEDOUT and EPIPE
All samples of the `ETIMEDOUT` and `EPIPE` `cause` we examined show < 15 ms between the `FETCHING…` log and the `ERROR` log, and these calls are not registered in the remote API access log. We see the `FETCHING…` log, but never the `RESPONSE_RECEIVED` log in these cases (see below).
In all samples we examined, the error occurs in the first call to `fetch` in the invocation. We assume these calls never left the Lambda environment, and they might occur on the first call to `fetch` after spawning a Lambda instance.
This makes us believe these cases are bugs in `undici`, or in the AWS Lambda x86_64 nodejs20.x OS.
UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET
The `UND_ERR_CONNECT_TIMEOUT` and `UND_ERR_SOCKET` errors do not follow this pattern. Multiple calls of the remote API with `fetch` are successful before the error occurs. We assume these calls do go out to the remote API.
Yet, such errors rarely occurred when we used this code on AWS Lambda x86_64 nodejs16.x with `node-fetch@2.6.7` (1 such error in the last 6 months of 2023).
Code
`fetch` is used twice in this code base, but all errors spawn from the same use.
The `log.info` call produces the `FETCHING…` log. Note that `now` in the logged JSON is determined synchronously, immediately before the call to `fetch`.
The `try`-`catch` around the call to `fetch` is several levels higher in the call stack. The `log.error` call produces the `ERROR` log. Note that `now` in the logged JSON is determined there.
From the logged `now` in the `FETCHING…` log and the logged `now` in the `ERROR` log we can determine an upper bound of the duration between our `fetch` call and the occurrence of the error.
Possibly related issues