Observations regarding instability of native fetch in AWS Lambda x86_64 nodejs20.x

jandppw commented 7 months ago

We observe consistent intermittent errors when using native fetch in an _AWS Lambda x86_64 nodejs20.x_. We were unable to find the particular place in undici code that is the culprit of this behavior.

The intention of this submission is to report the issues and to record our observations as accurately as possible, to enable people with better detailed knowledge to isolate and possibly fix the issues, if they are not already fixed (fetch is still experimental in Node 20).

Stable precursor

Versions of this service have been in production since 2022-05-19 on _AWS Lambda x8664 nodejs16.x using node-fetch@2.6.7 without many issues. Notably, in the last 6 months of 2023 exactly 1 communication error calling the remote API was logged. It handles ~700 invocations per day.

Where are the issues observed?

The functionality of the service requires it to do ±25 remote API calls per invocation. The remote API is hosted on a Windows VM through IIS 8, and is accessed through a Cloudfront distribution, setup to use http2and3.

We observe this issue, or these issues, only in _AWS Lambda x8664 nodejs20.x, and have not observed the issue during development on macOS (This does not guarantee that the issues are limited to _AWS Lambda x8664 nodejs20.x, but that it is possible that the issues are limited to _AWS Lambda x8664 nodejs20.x).

A new version of the code was released at 2024-03-25, but this did not have any impact on the issues. It seems _AWS Lambda x8664 nodejs20.x changed Node version at 2024-04-03. We cannot report on the exact Node version (and thus undici version) because sadly we do not log that. Given that the latest LTS version of node uses undici@5.28.4, this reports on versions undici@<=5.28.4. The issues might have been fixed already in later releases, although we don’t see any relevant mentioning of it in the release notes since v5.25.1. Between 2024-02-06 and 2024-04-03 the errors are reported on node:internal/deps/undici/undici:12344:11. Between 2024-04-03 and 2024-04-15 the errors are reported on node:internal/deps/undici/undici:12345:11.

~1% of invocations fail with 4 ≠ types of failures

The version of this service that shows the issues has been in production since 2024-02-06. It is deployed on _AWS Lambda x8664 nodejs20.x, and replaced node-fetch with native fetch. Between 2024-02-06 and 2024-04-15 about 49000 Lambda invocations have been made and logged, of which 338 failed. That is about 0.7%. When an invocation does not fail, it calls fetch ~25 times. This means ~0,03% of fetch calls fail, but, as you will see, the impact might be greater, because most failures appear to happen on the first fetch call of an invocation of a fresh Lambda instance.

All failures are logged as 1 of 4 “fetch failed” cause variants:

{ cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } } (163)
UND_ERR_SOCKET (152)
{ cause: { name: 'ConnectTimeoutError', code: 'UND_ERR_CONNECT_TIMEOUT', message: 'Connect Timeout Error' } } (16)
{ cause: { errno: -32, code: 'EPIPE', syscall: 'write' } } (7)

The UND_ERR_SOCKET causes are of the general form

{
  cause: {
    name: 'SocketError',
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: 'NNN.NNN.NN.N',
      localPort: NNNNN,
      remoteAddress: undefined,
      remotePort: undefined,
      remoteFamily: undefined,
      timeout: undefined,
      bytesWritten: 580,
      bytesRead: 1915
    }
  }
}

Sometimes the remote address and remote port is filled out, sometimes not.

Sadly, we cannot pinpoint where the exact errors occur, since our logs do not store the stack trace of cause. Finding out through code inspection did not turn up anything — the usage of processResponse is a bit opaque.

`ETIMEDOUT` and `EPIPE`

All samples of the ETIMEDOUT and EPIPE cause we examined show < 15ms between the FETCHING… log and the ERROR log, and these calls are not registered in the remote API access log. We see the FETCHING… log, but never the RESPONSE_RECEIVED log in these cases (see below).

In all samples we examined, the error occurs in the first call to fetch in the invocation. We assume these calls never left the Lambda environment, and they might occur on the first call to fetch after spawning a Lambda instance.

This makes us believe these cases are bugs in undici, or in the _AWS Lambda x8664 nodejs20.x OS.

`UND_ERR_CONNECT_TIMEOUT` and `UND_ERR_SOCKET`

The UND_ERR_CONNECT_TIMEOUT and UND_ERR_SOCKET do not follow this pattern. Multiple successful calls of the remote API with fetch are successful before the error occurs. We assume these calls do go out to the remote API.

Yet, such errors rarely occured when we used this code on _AWS Lambda x8664 nodejs16.x with node-fetch@2.6.7 (1 such error in the last 6 months of 2023).

Code

fetch is used twice in this code base, but all errors spawn from the same use:

  …
  log.info(module, 'get', flowId, 'FETCHING …', { uri })
  const response = await fetch(uri, {
    headers: { accept: 'application/json', authorization }
  })
  log.info(module, 'get', flowId, 'RESPONSE_RECEIVED', { uri, statusCode: response.status })
  …

The log.info code is:

/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {string} phase
 * @param {object} [details]
 */
function info(callingModule, functionName, flowId, phase, details) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'INFO',
      module: callingModule.filename,
      function: functionName,
      phase,
      details: masked(details, [])
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, phase)
  }
}

Note that now in the logged JSON is determined immediately before the call to fetch synchronously.

The try - catch around the call to fetch is several levels higher in the call stack:

…
} catch (err) {
  …
  log.error(module, 'handler', responseFlowId, err)
  return {
    statusCode: 500,
    …
  }
}

The log.error code is:

/**
 * @param {NodeModule} callingModule
 * @param {string} functionName
 * @param {UUID} flowId
 * @param {*} err
 */
function error(callingModule, functionName, flowId, err) {
  strictEqual(typeof flowId, 'string')
  try {
    const now = Date.now()
    const message = {
      lambda: staticLambdaData(),
      flowId,
      now,
      level: 'ERROR',
      module: callingModule.filename,
      function: functionName,
      message: err && err.message,
      stack: err && err.stack,
      details: util.inspect(masked(err, []), { depth: 9, maxArrayLength: 5 })
    }
    console.log(JSON.stringify(message))
  } catch (ignore) /* istanbul ignore next */ {
    errorLogging(callingModule, functionName, flowId, 'ERROR')
  }
}

Note that now in the logged JSON is determined here.

From the logged now in the FETCHING… log and the logged now in the ERROR log we can determine an upper bound of the duration between our fetch call and the occurrence of the error.

Possibly related issues

ronag commented 7 months ago

Can you try to use the undici package instead of the bundled version? To see if the problem still occurs in newer versions of undici.

jandppw commented 7 months ago

Because the issues are intermittent, and do not occur locally, we cannot really test for them. We can only get log information from failed invocations from production realistically, over a number of hours or days. We just deployed a new version that should log the stack trace of the cause of the errors, to get a bit more information.

I understand that rolling out a version that uses the latest version of undici directly would help in isolating the issues. Yet, we believe it would be unwise to expose our users to more experiments than necessary for several days, without addressing the issue they have.

Because this is an annoying production issue (~3,5% of interactions failing), we must focus on stability first. Our plan is to revert to using node-fetch (on Node 20) asap. fetch is still experimental in Node 20, after all, and maybe we jumped the gun.

We will report further information when the errors appear with more log information in the mean time.

We understand the report is not very concrete, and regret we cannot be of anymore help. Yet we believed the observations we made are important enough to share here, if only for reference if other people encounter similar issues (e.g., https://github.com/nodejs/undici/issues/2964).

ronag commented 7 months ago

@jandppw Have you considered simply using undici.request which is our recommended API? fetch should only be used if you require compatibility with other runtimes than NodeJS.

jandppw commented 7 months ago

@ronag No, we haven’t. We weren’t aware of the fact that undici.request is the recommended API, nor that fetch should only be used if you require compatibility with other runtimes than NodeJS. Is there any documentation on that?

We were under the impression that fetch is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn’t it?

ronag commented 7 months ago

Is there any documentation on that?

Not sure. Where would you expect that to be documented?

We were under the impression that fetch is the generic way to go in the future in JavaScript. One common API. The whole idea is to minimize dependencies, isn’t it?

fetch was added due to popular demand not due to technical merits. It's never going to be as fast as undici.request. I would stay away from fetch unless writing a isomorphic code or if your team is mostly comfortable with fetch and performance is not critical.

jandppw commented 7 months ago

But that would require adding undici as a direct dependency in the project, no? In that situation, undici is a contender next to node-fetch, axios, got, superagent, …, no?

If that is the case, in the current project, at this time, we’d rather revert to proven safe ground (i.e., node-fetch).

Before the above occurred, we were not familiar with undici. Until your comment, to us undici was the thing used inside Node for fetch. Doing some web searches, blogs anno 2024 still refer to the usual suspects, and do not mention undici.

For future work, no dependency trumps any dependency, unless there are very good reasons (such was working with promises instead of consuming events ourselves). In that space, now that we have learned about undici, we will certainly take it into account in future work.

But, to answer the question “Where would you expect that to be documented?”: well, first of all in blogs that I hit with a Google search about doing API calls inside a Node application. This would point me to undici @ npm. None of this is mentioned there. Next, we would have to choose between the contenders. For us, this is probably a bit late. We've been using axios and node-fetch for years now, and would need good reasons to change away from things that are proven in usage.

Now, you mentioning that we should stay away from using fetch in a Node-only application, such as a Lambda, is of a different order. That should be clear first, because, no dependency trumps any dependency, unless there are very good reasons. I have not encountered that point before. So documenting that, and getting the message out, should then be the first step. If this is the official Node point-of-view, maybe the Node fetch documentation should mention it. But this is probably not the forum for that :-).

mcollina commented 7 months ago

I would really love that the AWS Lambda team helped out on this one. We lack visibility on the Node.js distribution they are running. Essentially they are likely doing certain things to the Node.js runtime that are causing this.

This makes us believe these cases are bugs in undici, or in the AWS Lambda x86_64 nodejs20.x OS.

Given you are using Node.js for free but paying for AWS Lambda, I recommend you to contact AWS support.

In all samples we examined, the error occurs in the first call to fetch in the invocation. We assume these calls never left the Lambda environment, and they might occur on the first call to fetch after spawning a Lambda instance.

I suspect it's due to keep alive connections. Lambda suspends the process and all its state after a response is sent. Undici keep alive the connections, but when does Lambda resume the process we do not know that if we need to reset all the connections. There is no event for us to hook into on resume or suspend.

Try adding a connection: 'close' inside your headers. Alternatives: create a new undici.Client for each request and pass it as a dispatcher property to undici.fetch(). Then .close() the client.

jandppw commented 7 months ago

@mcollina You are absolutely right that this might very well be an AWS issue, or an interop issue (as noted in the report). We just thought that sharing our observations publicly might help other people that might encounter the same issue (as in “no, I’m not crazy”).

AWS support will be made aware of the issue.

Your hint about a Lambda resuming seems plausible.

jandppw commented 7 months ago

@mcollina Actually, it turns out: no. There seems to be no easy and 0€ way to make AWS aware of the issue for this client.

Our main focus is getting stability back to our users.

I share your frustration about AWS not being involved, but there seems to be no easy way for us to get AWS on the case without paying extra €. The only way to get AWS involved seems to be be public side channels, such as social media. This public report might be a first effort.

Any response from AWS now (if I were in their place anyway) would be that native fetch on Node 20 is still experimental, so go away.

Apart from all that, there are no ill feelings here regarding undici or node. In any case, by the time AWS will offer Node 22, there will have to be some resolution, since native fetch is no longer experimental since Node 21.

In the mean time, this report might help other people in realizing that it probably is not a good idea to use native fetch in AWS Lambda’s for the time being.

mweberxyz commented 7 months ago

Out of curiosity, I attempted to replicate this in us-east-1 and couldn't.

I set up a CloudFront distribution with a Lambda named PrivateLambda (invoked via HTTPS via a function URL) as the origin. Then set up a second lambda named PublicLambda (with a function URL) that calls the CloudFront distribution 25 times in a loop.

I called PublicLambda repeatedly using k6, ramping concurrency up and down to suss out out cold/warm start issues. Every request succeeded:

% k6 run test.js

          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: test.js
        output: -

     scenarios: (100.00%) 1 scenario, 50 max VUs, 35m30s max duration (incl. graceful stop):
              * default: Up to 50 looping VUs for 35m0s over 7 stages (gracefulRampDown: 30s, gracefulStop: 30s)

     ✓ is status 200
     ✓ is body 25

     checks.........................: 100.00% ✓ 99290     ✗ 0
     data_received..................: 17 MB   8.1 kB/s
     data_sent......................: 7.4 MB  3.5 kB/s
     http_req_blocked...............: avg=149.91µs min=1µs      med=7µs      max=170.28ms p(90)=9µs      p(95)=9µs
     http_req_connecting............: avg=59.64µs  min=0s       med=0s       max=60.26ms  p(90)=0s       p(95)=0s
     http_req_duration..............: avg=668.06ms min=465.52ms med=625.45ms max=5.7s     p(90)=825.18ms p(95)=900.9ms
       { expected_response:true }...: avg=668.06ms min=465.52ms med=625.45ms max=5.7s     p(90)=825.18ms p(95)=900.9ms
     http_req_failed................: 0.00%   ✓ 0         ✗ 49645
     http_req_receiving.............: avg=136.27µs min=16µs     med=117µs    max=204.25ms p(90)=167µs    p(95)=199µs
     http_req_sending...............: avg=36.73µs  min=6µs      med=34µs     max=10.91ms  p(90)=48µs     p(95)=53µs
     http_req_tls_handshaking.......: avg=81.24µs  min=0s       med=0s       max=130.21ms p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=667.89ms min=465.38ms med=625.3ms  max=5.7s     p(90)=824.94ms p(95)=900.73ms
     http_reqs......................: 49645   23.638233/s
     iteration_duration.............: avg=668.46ms min=465.76ms med=625.79ms max=5.76s    p(90)=825.95ms p(95)=901.25ms
     iterations.....................: 49645   23.638233/s
     vus............................: 1       min=0       max=50
     vus_max........................: 50      min=50      max=50

Over 35 minutes this made 49,645 requests to to PublicLambda, which means we made successfully made 1,241,125 fetch calls.

Hopefully I did my math right and this little experiment only cost me a few dollars 😅

Attached cloudformation template for set-up and k6 test file: replicate-undici-3133.zip

metcoder95 commented 7 months ago

The issues shared seem to follow a pattern of constant connection and abrupt disconnections, I believe I'm seeing a similar issue as @mcollina states. I'd suggest to give it a try to his suggestion and start from there.

The lambda runtime might be doing constant cleanups while resuming causing intermittent connect issues.

jandppw commented 7 months ago

We yesterday deployed a version that logs the stack trace of the cause, while working on a revert to node-fetch. Up until now 2 errors are logged of the ETIMEDOUT variant, both with the same stack trace:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at async getImplementation (/var/task/lib/dal/get.js:75:20)

{ cause: { errno: -110, code: 'ETIMEDOUT', syscall: 'write' } }

Error: write ETIMEDOUT
    at WriteWrap.onWriteComplete [as oncomplete] (node:internal/stream_base_commons:94:16)
    at handleWriteReq (node:internal/stream_base_commons:58:21)
    at writeGeneric (node:internal/stream_base_commons:149:15)
    at Socket._writeGeneric (node:net:952:11)
    at Socket._write (node:net:964:8)
    at writeOrBuffer (node:internal/streams/writable:564:12)
    at _write (node:internal/streams/writable:493:10)
    at Writable.write (node:internal/streams/writable:502:10)
    at write (node:internal/deps/undici/undici:9269:18)
    at _resume (node:internal/deps/undici/undici:9172:33)
    at resume (node:internal/deps/undici/undici:9082:7)
    at [dispatch] (node:internal/deps/undici/undici:8327:11)
    at Client.Intercept (node:internal/deps/undici/undici:8000:20)
    at Client.dispatch (node:internal/deps/undici/undici:6606:44)
    at [dispatch] (node:internal/deps/undici/undici:6837:32)
    at Pool.dispatch (node:internal/deps/undici/undici:6606:44)
    at [dispatch] (node:internal/deps/undici/undici:9957:27)
    at Agent.Intercept (node:internal/deps/undici/undici:8000:20)
    at Agent.dispatch (node:internal/deps/undici/undici:6606:44)
    at node:internal/deps/undici/undici:10929:55
    at new Promise (<anonymous>)
    at dispatch (node:internal/deps/undici/undici:10929:16)
    at httpNetworkFetch (node:internal/deps/undici/undici:10824:73)
    at httpNetworkOrCacheFetch (node:internal/deps/undici/undici:10712:39)
    at httpFetch (node:internal/deps/undici/undici:10545:43)
    at schemeFetch (node:internal/deps/undici/undici:10469:18)
    at node:internal/deps/undici/undici:10342:26
    at mainFetch (node:internal/deps/undici/undici:10361:11)
    at fetching (node:internal/deps/undici/undici:10309:7)
    at fetch (node:internal/deps/undici/undici:10173:20)
    at fetch (node:internal/deps/undici/undici:12344:10)
    at fetch (node:internal/bootstrap/web/exposed-window-or-worker:79:16)
    at getImplementation (/var/task/lib/dal/get.js:75:26)
    …

These again were the first fetch calls during the invocation. The time between the fetch call and the logging of the error was <4ms.

ajinkyashukla29dj commented 7 months ago

We are using amazon-cognito-identity-js package inside of the AWS Lambda function to talk to AWS Cognito. The AWS lambda function makes multiple AWS cognito calls. We see this error randomly for some of the calls.

{
    "cause": {
        "name": "ConnectTimeoutError",
        "code": "UND_ERR_CONNECT_TIMEOUT",
        "message": "Connect Timeout Error"
    }
}

Error: Network error
    at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:117:15
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'NetworkError'
}

The AWS lambda function is configured to use Node.js 18.x runtime. The earliest occurrence of this error seems to be around six months ago but we need to investigate the AWS Cloudwatch logs if it occurred earlier too.

mweberxyz commented 7 months ago

@ajinkyashukla29dj amazon-cognito-identity-js uses isomorphic-unfetch which uses the globally defined fetch if available. Can you disable experimental fetch by setting NODE_OPTIONS [1] environment variable in your Lambdas to --no-experimental-fetch [2] and see if your issue persists?

[1] https://docs.aws.amazon.com/lambda/latest/dg/runtimes-modify.html [2] https://nodejs.org/api/cli.html#--no-experimental-fetch

jandppw commented 7 months ago

One more error came in, this time of the of the UND_ERR_SOCKET variant:

TypeError: fetch failed
    at node:internal/deps/undici/undici:12345:11
    at async getImplementation (/var/task/lib/dal/get.js:75:20)

{
  cause: {
    name: 'SocketError',
    code: 'UND_ERR_SOCKET',
    socket: {
      localAddress: '169.NNN.NN.N',
      localPort: NNNNN,
      remoteAddress: undefined,
      remotePort: undefined,
      remoteFamily: undefined,
      timeout: undefined,
      bytesWritten: 1185,
      bytesRead: 69979
    }
  }
}

cause stack:

SocketError: other side closed
    at TLSSocket.onSocketEnd (node:internal/deps/undici/undici:8903:26)
    at TLSSocket.emit (node:events:530:35)
    at endReadableNT (node:internal/streams/readable:1696:12)
    at process.processTicksAndRejections (node:internal/process/task_queues:82:21)

ThIs again was the first fetch call during the invocation. The time between the fetch call and the logging of the error was <3ms.

mweberxyz commented 7 months ago

@jandppw I think your best course of action is to swap out native fetch for node-fetch, keeping everything else the same. If the errors go away, it might be an undici issue. If the issues persist, it is either a remote server issue, or at a lower level than undici.

btw: not an undici code contributor, just curious at this point.

uNetworkingAB commented 7 months ago

I use Node.js 21.7.2 for a simple scraper project that use await fetch inside try catch but I get really bad stability where internal errors of undici somehow cause ENETUNREACH but my try catch never trigger but instead Node.js just dies. It's super unstable and I need another implementation to get this working

Edit: I simply ported it to Deno and now it is stable.

jandppw commented 7 months ago

3 more errors of the ETIMEDOUT variant came in this morning, with the same stack trace.

A new version is now rolled out on production that no longer uses native fetch, so this is the last observation we can share.

ajinkyashukla29dj commented 7 months ago

@mweberxyz Thank you for your inputs. I don't see the ConnectTimeoutError error after setting the NODE_OPTIONS env variable to --no-experimental-fetch. I am seeing the following response even though the username and password is correct

{
    "size": 0,
    "timeout": 0
}


ERROR   NotAuthorizedException: Incorrect username or password.
    at /var/task/node_modules/amazon-cognito-identity-js/lib/Client.js:136:19
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5) {
  code: 'NotAuthorizedException'
}

mweberxyz commented 7 months ago

@ajinkyashukla29dj sorry can't help with that error it's coming from the amazon-cognito-identity-js library, looks like the HTTP request succeeded though!

Can you provide some more details about your lambda so I can try to replicate UND_ERR_CONNECT_TIMEOUT you encountered when using native fetch? Such as:

How often did the UND_ERR_CONNECT_TIMEOUT occur? Every request or just intermittently?
What AWS region?
Is the lambda running in a VPC? If so, public or private subnet? and is there any privatelink to AWS services defined in the vpc?
How is the lambda invoked (api gateway, manually from console, api gateway, manual test from console, etc)?

uNetworkingAB commented 7 months ago

For me this is not AWS related, it happens on macOS "all the time" (very often) for a very simple scraper project that does about 500 fetch calls or something. To confirm I ported it to another impl. of fetch and it worked without stability issues there.

The issue I have is that ETIMEDOUT and similar is not being propagated to my try/catch but instead just crashes Node.js somewhere inside undici.

It can be reproduced with await fetch inside try/catch leading to crashed Node.js process.

mweberxyz commented 7 months ago

@uNetworkingAB Can you give a code sample? Knowing which remote servers you are having trouble with would help with replication.

mcollina commented 7 months ago

Please verify if undici v6.14.1 fixes this.

lucamezzalira commented 6 months ago

@jandppw my name is Luca, I work for AWS as Severless Specialist, I'd love to connect with you to review your challenge with the service team. Would you mind ping me in private so we can arrange a chat?

jandppw commented 6 months ago

@lucamezzalira Sure, Luca. Sending you an email now.

krk commented 5 months ago

It might be helpful to collect logs with NODE_DEBUG=net set.

Benjamin-Dobell commented 6 days ago

Chiming in to say I get "cause EPIPE" on macOS when issuing fetch requests in a tight loop. So this appears to be a buffering issue:

(async () => {
  while (packet != null) {
    await new Promise(resolve => setTimeout(resolve, 0.2));

    overrideNow(packet.date);

    if (packet.type === 'client') {
      replayClientCommand(packet.command);
    } else {
      replayServerMessage(packet.message);
    }

    packet = reader.next();
  }
})();

Basically the above is some logic I have to replay sessions from captured session files. If I remove the timeout/await then I run into EPIPE issues 100% of the time, seemingly immediately. With the 200ms wait between requests, I'm fine. Worth noting I'm not awaiting individual requests, so this code will indeed spam fetch requests in parallel.

nodejs / undici