Open iamacup opened 8 months ago
@Jarred-Sumner once you have determined whether or not this is an issue, can you let me know? I would like to remove the endpoint in the example code that routes to the internet :)
@iamacup
I did a quick test using verbose: true with a debug build, and the log below shows that we try to reuse the connection for /api/graphql, but the server does not send any response; we time out, retry, reconnect, and then get the response.
If you set keepalive: false in the graphQL fetch we get better results (in this comment I replaced your domain with domain.com).
verbose + debug logs:
GRAPHQL: 12020.494139ms
[MEM] malloc(49) = 49
[MEM] malloc(128) = 177
[MEM] malloc(28) = 205
[MEM] malloc(800) = 1005
[MEM] malloc(2688) = 3693
[Loop] ref
[MEM] report(3693)
[fetch] + Keep-Alive reuse domain.com:443
[fetch] Connected https://domain.com/api/graphql
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052e80, 662) = 662
Request: POST /api/graphql
content-type: application/json
Connection: keep-alive
User-Agent: Bun/1.0.28-debug
Accept: */*
Host: domain.com
Accept-Encoding: gzip, deflate, br
Content-Length: 439
[fetch] onStart: 12.712s
[fetch] Processed 1 tasks
[fetch] Closed https://domain.com/api/graphql
[uws] connect(domain.com, 443)
[fetch] Connected https://domain.com/api/graphql
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052c40, 662) = 0
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052c40, 662) = 0
[fetch] onHandshake(0x00007FAEBC004F00) authorized: true error:
[uws] us_socket_write(src.deps.boringssl.translated.SSL@20002052c40, 662) = 662
Request: POST /api/graphql
content-type: application/json
Connection: keep-alive
User-Agent: Bun/1.0.28-debug
Accept: */*
Host: domain.com
Accept-Encoding: gzip, deflate, br
Content-Length: 439
[fetch] onData 792
Response: < 200 OK
< Date: Wed, 21 Feb 2024 17:42:05 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 106
< Connection: keep-alive
< set-cookie: lng=en; Path=/; Expires=Fri, 21 Feb 2025 17:42:05 GMT; SameSite=Strict
< Content-Language: en
< Vary: X-HTTP-Method-Override, Accept-Encoding
< Access-Control-Allow-Methods: PUT, PATCH, POST, GET, DELETE, OPTIONS
< Access-Control-Allow-Headers: Origin, X-Requested-With, Content-Type, Accept, Authorization, Content-Encoding, x-apollo-tracing
< Content-Encoding: gzip
< X-Powered-By: Express
< X-RateLimit-Limit:
< X-RateLimit-Remaining:
< X-RateLimit-Reset:
[fetch] handleResponseMetadata: content_length is 106 and transfer_encoding src.http.Encoding.identity
[MEM] malloc(590) = 4283
[MEM] malloc(448) = 4731
[fetch] Decompressing 106 bytes
[fetch] progressUpdate true
[fetch] releaseSocket(0x00007FAEBC004F00)
[fetch] Keep-Alive release domain.com:443 (0x140388455173888)
[fetch] onAsyncHTTPCallback: 11.894s
[FetchTasklet] callback success true has_more false bytes 118
[FetchTasklet] added callback metadata
[MEM] malloc(165) = 4896
[FetchTasklet] onProgressUpdate
[FetchTasklet] onResolve
[FetchTasklet] toResponse
[MEM] discard(165) = 4731
[FetchTasklet] onProgressUpdate: promise_value is not null
[Loop] sub 1 - 1 = 0
[FetchTasklet] clearData
[MEM] free(49) = 4682
[MEM] free(128) = 4554
[MEM] free(28) = 4526
[MEM] free(590) = 3936
[MEM] free(448) = 3488
[FetchTasklet] deinit
[MEM] free(2688) = 800
[MEM] free(800) = 0
example of keepalive: false
var executorGraphQL = async (query, variables) => {
  const endpoint = "https://domain.com/api/graphql";
  const response = await fetch(endpoint, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      query,
      variables,
    }),
    keepalive: false,
  });
  const res = await response.json();
  if (res.errors && res.errors.length > 0) {
    throw new Error(
      `GraphQL errors in executor: \r\n\r\n ${JSON.stringify(
        res.errors
      )}\r\n\r\n${JSON.stringify(query)}`
    );
  }
  return res.data;
};
Results on my machine using keepalive: false only for the graphQL endpoint (Linux, Debian testing):
STARTING REST ONLY
REST: 639.719363ms
REST: 152.13388499999996ms
REST: 154.44626000000005ms
REST ONLY: 946.564288ms
STARTING GRAPHQL ONLY
GRAPHQL: 616.221718ms
GRAPHQL: 606.956001ms
GRAPHQL: 611.792657ms
GRAPHQL ONLY: 1835.3458449999998ms
STARTING COMBINED
GRAPHQL: 606.5677440000004ms
REST: 149.94005100000004ms
GRAPHQL: 626.6172189999997ms
REST: 152.2027710000002ms
COMBINED: 1535.4897110000006ms
TOTAL TIME TO RUN ALL QUERIES: 4318.191898ms
run on: v21.6.0
So the problem is that we try to reuse a socket that stops responding after the first /api/graphql request. Because the server still sends Connection: keep-alive, we treat the socket as reusable. We need to investigate what makes the socket unresponsive after the first request.
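Until the root cause is found, the timeout-then-reconnect behaviour described above can be approximated at the application level. This is only a minimal sketch: `fetchWithFreshRetry`, `timeoutMs`, and `fetchImpl` are hypothetical names introduced for illustration, and it assumes the runtime honours `keepalive: false` the way the test above suggests.

```javascript
// Hypothetical helper, not part of Bun or Node: if the first attempt does not
// answer within `timeoutMs`, retry once while asking for a non-reused socket.
// Assumes `init.body` (if any) is a string or other re-sendable value.
async function fetchWithFreshRetry(
  url,
  init = {},
  { timeoutMs = 2000, fetchImpl = fetch } = {}
) {
  try {
    // First attempt: default behaviour, so a pooled connection may be reused.
    return await fetchImpl(url, {
      ...init,
      signal: AbortSignal.timeout(timeoutMs),
    });
  } catch (err) {
    // A timed-out signal normally rejects with a TimeoutError; some runtimes
    // report AbortError instead, so both are treated as "no response in time".
    const timedOut = err?.name === "TimeoutError" || err?.name === "AbortError";
    if (!timedOut) throw err;
    // Second attempt: opt out of keep-alive so a new connection is opened.
    return await fetchImpl(url, { ...init, keepalive: false });
  }
}
```

The retry only hides the stall; it does not explain why the reused socket goes quiet, so it is a workaround rather than a fix.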
Node.js is probably using h2 here (offering h2 via ALPN), which would make it more reliable than HTTP/1.1 against this server. Using curl --verbose we can see that the connection is automatically upgraded to h2 via ALPN.
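The same ALPN check can be done from a script instead of curl. A minimal sketch using the `node:tls` module follows; `negotiatedProtocol` is a hypothetical helper name, and the host passed to it is whatever server is under test.

```javascript
const tls = require("node:tls");

// Hypothetical helper: open a TLS connection offering h2 and http/1.1 and
// resolve with the protocol the server selects ("h2", "http/1.1"), or false
// if the server did not select one.
function negotiatedProtocol(host, port = 443, timeoutMs = 5000) {
  return new Promise((resolve, reject) => {
    const socket = tls.connect(
      { host, port, servername: host, ALPNProtocols: ["h2", "http/1.1"] },
      () => {
        // The callback runs once the TLS handshake has completed.
        resolve(socket.alpnProtocol);
        socket.end();
      }
    );
    // Fail instead of hanging if the handshake never completes.
    socket.setTimeout(timeoutMs, () =>
      socket.destroy(new Error("TLS connect timed out"))
    );
    socket.once("error", reject);
  });
}
```

Calling `negotiatedProtocol("domain.com").then(console.log)` against the server in question should report which protocol it picks. This only shows what the server is willing to negotiate, not which protocol a given client actually uses for fetch.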
Thanks @cirospaciari, it looks like setting keepalive: false does stop this from happening. I will leave the longer-term fix in your capable hands. I have edited my initial post to use example.com, so FYI for anyone else who comes along: this won't function exactly as reported because the original domain will not be accessible.
Re: the stuff about keep-alive: the server itself is Express, but I am wondering if something in the AWS stack is making this more pronounced or problematic. I am not an expert here (PayloadCMS, which wraps Express AFAIK, running on Fargate behind an ALB).
for note - the server is running on bun.
What version of Bun is running?
bun --revision 1.0.28+705638470
What platform is your computer?
Darwin 23.2.0 arm64 arm
What steps can reproduce the bug?
Here is some code:
I have run several tests on local machines as well as AWS boxes - all the same variance.
The problem appears when you switch between endpoints and has something to do with how the graphQL endpoint is accessed: if you run 50 concurrent REST requests there is no problem, but sprinkle a graphQL request in there and it grinds to a complete halt.
i.e.
is fine
but mix it with some graphQL and (note the lines below for rest are the same endpoints as in the above test):
You can test the graphQL endpoint separately; it's responsive (although it is on our dev environment, so not optimised or reliable), but you can see the vastly different execution times if you run this on node vs bun.
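For anyone trying to reproduce the mixed-load pattern described above against their own server, a timing harness along these lines may help. It is a sketch, not the script from this report: `timeAll` and the `{ label, run }` request shape are hypothetical, and `run` would typically wrap a fetch call to a REST or graphQL endpoint.

```javascript
// Hypothetical harness: start every request at once, log how long each one
// takes, then log the wall-clock time for the whole batch.
async function timeAll(name, requests) {
  const started = performance.now();
  const timings = await Promise.all(
    requests.map(async ({ label, run }) => {
      const t0 = performance.now();
      await run(); // run() must return a promise, e.g. a fetch call
      const ms = performance.now() - t0;
      console.log(`${label}: ${ms}ms`);
      return { label, ms };
    })
  );
  console.log(`${name}: ${performance.now() - started}ms`);
  return timings;
}
```

Running one batch of REST-only requests and a second batch with a graphQL request mixed in, on both node and bun, gives numbers that can be compared directly.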
summary of execution
bun (run machine 1).txt node (run machine 1).txt bun (run machine 2).txt node (run machine 2).txt node (run machine 3).txt bun (run machine 3).txt
I do apologise in advance if this somehow turns out to be a networking issue, but I have tested it on multiple independent machines and I don't think it's the API.
What is the expected behavior?
No response
What do you see instead?
No response
Additional information
I can't replicate this over HTTP connections: when running this fully locally (i.e. the server is on the same machine without HTTPS) there is no problem, so it may be something to do with HTTPS.