oven-sh / bun

Incredibly fast JavaScript runtime, bundler, test runner, and package manager – all in one
https://bun.sh
Other
74.38k stars 2.78k forks source link

Wildly bad networking performance with fetch in bun vs nodejs when switching between similar endpoints #9034

Open iamacup opened 9 months ago

iamacup commented 9 months ago

What version of Bun is running?

bun --revision 1.0.28+705638470

What platform is your computer?

Darwin 23.2.0 arm64 arm

What steps can reproduce the bug?

Here is some code:

var executorRest = async (url) => {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error("Network response was not ok");
  }
  return await response.json();
};

var executorGraphQL = async (query, variables) => {
  const endpoint = "https://example.com/api/graphql";
  const response = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query,
      variables,
    }),
  });
  const res = await response.json();
  if (res.errors && res.errors.length > 0) {
    throw new Error(
      `GraphQL errors in executor: \r\n\r\n ${JSON.stringify(
        res.errors
      )}\r\n\r\n${JSON.stringify(query)}`
    );
  }
  return res.data;
};

var runGraphQLQuery = async () => {
  const query = `
  query GetDeployments(\$id: String!, \$locale: LocaleInputType!, \$fallbackLocale: FallbackLocaleInputType!) {
    Deployments(
      locale: \$locale,
      fallbackLocale: \$fallbackLocale,
      where: { publicIdentifier: { equals: \$id } }) {
      docs {
        id,
        organisation {
          id
        }
      }
    }
  }`;
  const startTime = performance.now();
  const data1 = await executorGraphQL(query, {
    id: "local-dev-deployment-id",
    locale: "en",
    fallbackLocale: "none",
  });
  console.log(`GRAPHQL: ${performance.now() - startTime}ms`);
};

var runRestQuery = async () => {
  const startTime = performance.now();
  const data2 = await executorRest(
    `https://example.com/api/deployments/65cd3ea793b4b8e35739641a?locale=en&fallback-locale=none&depth=0`
  );
  console.log(`REST: ${performance.now() - startTime}ms`);
};

var combinedQueries = async () => {
  console.log(``);
  console.log(`STARTING COMBINED`);
  const startTime = performance.now();
  await runGraphQLQuery();
  await runRestQuery();
  await runGraphQLQuery();
  await runRestQuery();
  console.log(`COMBINED: ${performance.now() - startTime}ms`);
};

var graphQLOnly = async () => {
  console.log(``);
  console.log(`STARTING GRAPHQL ONLY`);
  const startTime = performance.now();
  await runGraphQLQuery();
  await runGraphQLQuery();
  await runGraphQLQuery();
  console.log(`GRAPHQL ONLY: ${performance.now() - startTime}ms`);
};

var restOnly = async () => {
  console.log(``);
  console.log(`STARTING REST ONLY`);
  const startTime = performance.now();
  await runRestQuery();
  await runRestQuery();
  await runRestQuery();
  console.log(`REST ONLY: ${performance.now() - startTime}ms`);
};

(async () => {
  const startTime = performance.now();

  await restOnly();
  await graphQLOnly();
  await combinedQueries();

  console.log(``);
  console.log(
    `TOTAL TIME TO RUN ALL QUERIES: ${performance.now() - startTime}ms`
  );
  console.log(`run on: ${process.version}`);
})();

I have run several tests on local machines as well as AWS boxes - all the same variance.

The problem comes when you switch between endpoints and has something to do with how the graphQL is being accessed - if you run 50 concurrent rest requests there is no problem, sprinkle a graphQL in there and it grinds to a complete halt.

i.e.

STARTING REST ONLY
REST: 56.523458999999995ms
REST: 12.175667000000004ms
REST: 14.924125000000004ms
REST ONLY: 84.060209ms

is fine

but mix it with some graphQL and (note the lines below for rest are the same endpoints as in the above test):

STARTING COMBINED
GRAPHQL: 12003.096000000001ms
REST: 11998.207834ms
GRAPHQL: 45.02695899999526ms
REST: 11949.27375ms
COMBINED: 35996.477375ms

You can test the graphQL endpoint seperately, its responsive (although on our dev environment so not optimised or reliable) but you can see the vastly different times in execution if you run this on node vs bun.

summary of execution

Machine 1: 
Node: 0.5s
Bun: 36s

Machine 2: 
Node: 0.35s
Bun: 36s

Machine 3: 
Node: 0.5s
Bun: 36s

bun (run machine 1).txt node (run machine 1).txt bun (run machine 2).txt node (run machine 2).txt node (run machine 3).txt bun (run machine 3).txt

I do apologise in advance if somehow this is a networking issue - but i have tested it on multiple independant machines - I don't think its the API.

What is the expected behavior?

No response

What do you see instead?

No response

Additional information

I can't replicate this on HTTP connections - when running this fully locally (i.e. the server is on the same machine with not https) there is no problem - it may be something to do with HTTPS

iamacup commented 9 months ago

@Jarred-Sumner when you have managed to identify this as an issuse or not can you let me know as i would like to remove the endpoint in the example code routing to the internet :)

cirospaciari commented 9 months ago

@iamacup I did a quick test using verbose: true and debug version and I see the log bellow showing that we are trying to reuse the connection in /api/graphql but the server did not send any response, we timeout retry and reconnect and we get the response. If you turn keepalive: false in graphQL we get better results (I replaced in this comment your domain by domain.com).

verbose + debug logs:

GRAPHQL: 12020.494139ms
[MEM] malloc(49) = 49
[MEM] malloc(128) = 177
[MEM] malloc(28) = 205
[MEM] malloc(800) = 1005
[MEM] malloc(2688) = 3693
[Loop] ref
[MEM] report(3693)
[fetch] + Keep-Alive reuse domain.com:443
[fetch] Connected https://domain.com/api/graphql 
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052e80, 662) = 662
Request: POST /api/graphql
        content-type: application/json
        Connection: keep-alive
        User-Agent: Bun/1.0.28-debug
        Accept: */*
        Host: domain.com
        Accept-Encoding: gzip, deflate, br
        Content-Length: 439

[fetch] onStart: 12.712s
[fetch] Processed 1 tasks
[fetch] Closed  https://domain.com/api/graphql
[uws] connect(domain.com, 443)
[fetch] Connected https://domain.com/api/graphql 
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052c40, 662) = 0
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052c40, 662) = 0
[fetch] onHandshake(0x00007FAEBC004F00) authorized: true error: 
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052c40, 662) = 662
Request: POST /api/graphql
        content-type: application/json
        Connection: keep-alive
        User-Agent: Bun/1.0.28-debug
        Accept: */*
        Host: domain.com
        Accept-Encoding: gzip, deflate, br
        Content-Length: 439

[fetch] onData 792
Response: < 200 OK
<       Date: Wed, 21 Feb 2024 17:42:05 GMT
<       Content-Type: application/json; charset=utf-8
<       Content-Length: 106
<       Connection: keep-alive
<       set-cookie: lng=en; Path=/; Expires=Fri, 21 Feb 2025 17:42:05 GMT; SameSite=Strict
<       Content-Language: en
<       Vary: X-HTTP-Method-Override, Accept-Encoding
<       Access-Control-Allow-Methods: PUT, PATCH, POST, GET, DELETE, OPTIONS
<       Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization, Content-Encoding, x-apollo-tracing
<       Content-Encoding: gzip
<       X-Powered-By: Express
<       X-RateLimit-Limit: 
<       X-RateLimit-Remaining: 
<       X-RateLimit-Reset: 

[fetch] handleResponseMetadata: content_length is 106 and transfer_encoding src.http.Encoding.identity
[MEM] malloc(590) = 4283
[MEM] malloc(448) = 4731
[fetch] Decompressing 106 bytes
[fetch] progressUpdate true
[fetch] releaseSocket(0x00007FAEBC004F00)
[fetch] Keep-Alive release domain.com:443 (0x140388455173888)
[fetch] onAsyncHTTPCallback: 11.894s
[FetchTasklet] callback success true has_more false bytes 118
[FetchTasklet] added callback metadata
[MEM] malloc(165) = 4896
[FetchTasklet] onProgressUpdate
[FetchTasklet] onResolve
[FetchTasklet] toResponse
[MEM] discard(165) = 4731
[FetchTasklet] onProgressUpdate: promise_value is not null
[Loop] sub 1 - 1 = 0
[FetchTasklet] clearData
[MEM] free(49) = 4682
[MEM] free(128) = 4554
[MEM] free(28) = 4526
[MEM] free(590) = 3936
[MEM] free(448) = 3488
[FetchTasklet] deinit
[MEM] free(2688) = 800
[MEM] free(800) = 0

example of keepalive: false

var executorGraphQL = async (query, variables) => {
  const endpoint = "https://domain.com/api/graphql";
  const response = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query,
      variables,
    }),
    keepalive: false,
  });
  const res = await response.json();
  if (res.errors && res.errors.length > 0) {
    throw new Error(
      `GraphQL errors in executor: \r\n\r\n ${JSON.stringify(
        res.errors
      )}\r\n\r\n${JSON.stringify(query)}`
    );
  }
  return res.data;
};

Results on my machine using keepalive: false only in the graphQL endpoint (linux Debian testing):

STARTING REST ONLY
REST: 639.719363ms
REST: 152.13388499999996ms
REST: 154.44626000000005ms
REST ONLY: 946.564288ms

STARTING GRAPHQL ONLY
GRAPHQL: 616.221718ms
GRAPHQL: 606.956001ms
GRAPHQL: 611.792657ms
GRAPHQL ONLY: 1835.3458449999998ms

STARTING COMBINED
GRAPHQL: 606.5677440000004ms
REST: 149.94005100000004ms
GRAPHQL: 626.6172189999997ms
REST: 152.2027710000002ms
COMBINED: 1535.4897110000006ms

TOTAL TIME TO RUN ALL QUERIES: 4318.191898ms
run on: v21.6.0

So the problem is that we are trying to reuse a socket that is not responding anymore after /api/graphql because we still receive from the server Connection: keep-alive we need to investigate what is causing the socket to not be responsive after the first request.

Probably nodejs is using h2 here (upgrading sending the h2 ALPN) and this is making it more reliable than using http/1.1 in this server, using curl --verbose we can see that we can auto upgrade to h2 using ALPN.

iamacup commented 9 months ago

thanks @cirospaciari looks like keepalive does stop this from happening. I will leave the longer term fix in your capeable hands and have edited my initial to example.com so if anyone else comes along FYI this won't function exactly as reported because the original domain will not be accessible.

re: the stuff about keep-alive - the server itself is express but i am wondering if something in the AWS stack is making this more pronounced / problematic but i am not an expect here (PayloadCMS (which wraps express AFIK) running on Fargate behind ALB)

for note - the server is running on bun.