jupyterhub / configurable-http-proxy

node-http-proxy plus a REST API
BSD 3-Clause "New" or "Revised" License

Introduce keep alive and fix AWS Load Balancer 502 errors #491

Closed a3626a closed 1 year ago

a3626a commented 1 year ago

I'm running a Z2JH-based service with about 1,000 DAU. It is deployed on AWS EKS behind an AWS ALB.

As DAU grew, users started to get 502 responses from the LB.

This is a well-known problem related to the keep-alive setting (see the AWS article). Unfortunately, configurable-http-proxy does not support keep-alive, so I implemented it and tested it in our production environment.

After the deployment, the number of 502 errors decreased.

Technical/Implementation detail

1) It is very important to allow keep-alive on both the client side and the server side. That's why both Agent and keepAliveTimeout are needed.

2) JupyterHub and the Jupyter server support keep-alive by default, because they are Tornado servers.

3) chp is given the following parameters. The values are AWS-specific.

"--server-keep-alive-timeout=61000"
"--agent-free-socket-timeout=62000"
welcome[bot] commented 1 year ago

Thanks for submitting your first pull request! You are awesome! :hugs:
If you haven't done so already, check out Jupyter's Code of Conduct. Also, please make sure you followed the pull request template, as this will help us review your contribution more quickly. You can meet the other Jovyans by joining our Discourse forum. There is also an intro thread there where you can stop by and say Hi! :wave:
Welcome to the Jupyter community! :tada:

minrk commented 1 year ago

Thanks! I think enabling keep-alive makes sense, and exposing the timeout as an option is sensible as well. What I'm trying to understand is the addition of the agentkeepalive package instead of using the standard-library server.keepAliveTimeout. Can you speak to why it's needed beyond setting keepAlive: true, keepAliveTimeout: 62000?

minrk commented 1 year ago

#481 enables keep-alive, which I think makes sense, and combining it with this to expose the keep-alive timeout as an option seems like the right way to go, unless agentkeepalive solves a problem I'm not quite seeing.

a3626a commented 1 year ago

I have done some experiments and concluded that keep-alive should be supported in both directions (client side: the load balancer; server side: JupyterHub or the Jupyter server).

But my experiment was not organized well enough to share. I used curl to check keep-alive support, and nc (Netcat) to verify the timeout. I found that without the agent, keep-alive connections are closed after 5 seconds, even though I had set server.keepAliveTimeout to 60 seconds.

I will run the experiment again and share it here.

a3626a commented 1 year ago

I have run a simple experiment again.

I opened a shell inside the proxy pod which is deployed by Z2JH. Then I executed curl -v localhost:8000.

CASE 1) --server-keep-alive-timeout=15000 && --agent-free-socket-timeout=16000

Check the arguments using ps

/srv/configurable-http-proxy $ ps | grep node
    1 nobody    0:03 node /srv/configurable-http-proxy/bin/configurable-http-proxy --ip= --api-ip= --api-port=8001 --default-target=http://jupyterhub1-hub:8081 --error-target=http://jupyterhub1-hub:8081/hub/error --port=8000 --log-level=debug --metrics-port=8080 --server-keep-alive-timeout=15000 --agent-free-socket-timeout=16000
/srv/configurable-http-proxy $ curl -v localhost:8000
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.1.2
> Accept: */*
> 
< HTTP/1.1 302 Found
< server: TornadoServer/6.2
< content-type: text/html
< date: Thu, 17 Aug 2023 09:24:46 GMT
< access-control-allow-origin: *
< access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
< content-security-policy: frame-ancestors self codle.io dev.codle.io
< x-jupyterhub-version: 3.0.0
< access-control-allow-headers: accept, content-type, authorization
< location: /hub/
< content-length: 0
< connection: keep-alive
< 
* Connection #0 to host localhost left intact

-> "left intact" means keep-alive works.

CASE 2) --server-keep-alive-timeout=15000

Check the arguments, too.

/srv/configurable-http-proxy $ ps | grep node
    1 nobody    0:04 node /srv/configurable-http-proxy/bin/configurable-http-proxy --ip= --api-ip= --api-port=8001 --default-target=http://jupyterhub2-hub:8081 --error-target=http://jupyterhub2-hub:8081/hub/error --port=8000 --log-level=debug --metrics-port=8080 --server-keep-alive-timeout=15000
/srv/configurable-http-proxy $ curl -v localhost:8000
*   Trying 127.0.0.1:8000...
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET / HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.1.2
> Accept: */*
> 
< HTTP/1.1 302 Found
< server: TornadoServer/6.2
< content-type: text/html
< date: Thu, 17 Aug 2023 09:24:38 GMT
< access-control-allow-origin: *
< access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
< content-security-policy: frame-ancestors self codle.io dev.codle.io
< x-jupyterhub-version: 3.0.0
< access-control-allow-headers: accept, content-type, authorization
< location: /hub/
< content-length: 0
< connection: close
< 
* Closing connection 0

No keep-alive.

minrk commented 1 year ago

Can you test with #492? It seems to enable keep-alive all the way through from proxied requests from tornado.

minrk commented 1 year ago

Actually, there seems to be something weird where we can't use a single agent for keep-alive on both http or https with the standard library (bizarre), so I think maybe this PR is the way to go.

a3626a commented 1 year ago

> Actually, there seems to be something weird where we can't use a single agent for keep-alive on both http or https with the standard library (bizarre), so I think maybe this PR is the way to go.

For the agentkeepalive library, I followed this example. But there is no particular reason or case that requires this library. http.Agent could work; I am not sure.

> Can you test with https://github.com/jupyterhub/configurable-http-proxy/pull/492? It seems to enable keep-alive all the way through from proxied requests from tornado.

OK. I will post the curl result and also the nc result. I think #492 will keep connections alive for just 5 seconds and won't respect the given timeout argument, because the timeout is not passed to the agent.

Also, I set up TLS termination on the LB, so all my tests are done over HTTP.

a3626a commented 1 year ago

Fix

https://github.com/jupyterhub/configurable-http-proxy/pull/492 has an issue.

01:45:21.756 [ConfigProxy] info: Adding route / -> http://jupyterhub2-hub:8081
node:internal/validators:96
      throw new ERR_INVALID_ARG_TYPE(name, 'number', value);
      ^

TypeError [ERR_INVALID_ARG_TYPE]: The "keepAliveTimeout" argument must be of type number. Received type string ('15000')
    at Server.storeHTTPOptions (node:_http_server:464:5)
    at new Server (node:_http_server:507:20)
    at Object.createServer (node:http:61:10)
    at new ConfigurableProxy (/srv/configurable-http-proxy/lib/configproxy.js:234:31)
    at Object.<anonymous> (/srv/configurable-http-proxy/bin/configurable-http-proxy:320:13)
    at Module._compile (node:internal/modules/cjs/loader:1254:14)
    at Module._extensions..js (node:internal/modules/cjs/loader:1308:10)
    at Module.load (node:internal/modules/cjs/loader:1117:32)
    at Module._load (node:internal/modules/cjs/loader:958:12)
    at Function.executeUserEntryPoint [as runMain] (node:internal/modules/run_main:81:12) {
  code: 'ERR_INVALID_ARG_TYPE'
}

Node.js v18.16.0
Stream closed EOF for jupyter-hub/jupyterhub2-proxy-58b65b6b87-drnfn (chp)

I added parseInt and tested again. (I replaced the 5000 with a parseInt call.)
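
For illustration, the kind of conversion meant here could look like the sketch below. The helper name `parseTimeoutMs` is hypothetical and this is not the actual patch to #492; it only shows a command-line string such as '15000' being turned into a number before it reaches `http.createServer`, which is the argument the stack trace above complains about.

```javascript
const http = require("http");

// Hypothetical helper: convert a CLI value like "15000" to milliseconds.
function parseTimeoutMs(value) {
  const ms = Number.parseInt(value, 10);
  if (!Number.isFinite(ms) || ms < 0) {
    throw new TypeError(`invalid timeout value: ${value}`);
  }
  return ms;
}

// Passing the raw string here is what raised ERR_INVALID_ARG_TYPE above;
// the parsed number is accepted.
const server = http.createServer({ keepAliveTimeout: parseTimeoutMs("15000") });
console.log(typeof server.keepAliveTimeout, server.keepAliveTimeout);
```

Rejecting non-numeric input early is a design choice of this sketch; a bare parseInt call would return NaN instead.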

Curl Test

ps result

/srv/configurable-http-proxy $ ps | grep node
    1 nobody    0:01 node /srv/configurable-http-proxy/bin/configurable-http-proxy --ip= --api-ip= --api-port=8001 --default-target=http://jupyterhub2-hub:8081 --error-target=http://jupyterhub2-hub:8081/hub/error --port=8000 --log-level=debug --metrics-port=8080 --keep-alive-timeout=15000
/srv/configurable-http-proxy $ curl -v localhost:8000
* processing: localhost:8000
*   Trying [::1]:8000...
* Connected to localhost (::1) port 8000
> GET / HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/8.2.1
> Accept: */*
> 
< HTTP/1.1 302 Found
< server: TornadoServer/6.2
< content-type: text/html
< date: Fri, 18 Aug 2023 02:20:21 GMT
< access-control-allow-origin: *
< access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
< content-security-policy: frame-ancestors self codle.io dev.codle.io
< x-jupyterhub-version: 3.0.0
< access-control-allow-headers: accept, content-type, authorization
< location: /hub/
< content-length: 0
< connection: keep-alive
< 
* Connection #0 to host localhost left intact

Keep alive works.

Netcat Test

The nc test was done manually and is very naive.

CASE 1 : Timeout=15000, Request, Wait 10 seconds, Request again

ps result

/srv/configurable-http-proxy $ ps | grep node
    1 nobody    0:01 node /srv/configurable-http-proxy/bin/configurable-http-proxy --ip= --api-ip= --api-port=8001 --default-target=http://jupyterhub2-hub:8081 --error-target=http://jupyterhub2-hub:8081/hub/error --port=8000 --log-level=debug --metrics-port=8080 --keep-alive-timeout=15000
/srv/configurable-http-proxy $ nc localhost 8000
GET / HTTP/1.1

HTTP/1.1 302 Found
server: TornadoServer/6.2
content-type: text/html
date: Fri, 18 Aug 2023 02:24:23 GMT
access-control-allow-origin: *
access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
content-security-policy: frame-ancestors self codle.io dev.codle.io
x-jupyterhub-version: 3.0.0
access-control-allow-headers: accept, content-type, authorization
location: /hub/
content-length: 0
connection: keep-alive

< Wait 10 Seconds >

GET / HTTP/1.1

HTTP/1.1 302 Found
server: TornadoServer/6.2
content-type: text/html
date: Fri, 18 Aug 2023 02:24:34 GMT
access-control-allow-origin: *
access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
content-security-policy: frame-ancestors self codle.io dev.codle.io
x-jupyterhub-version: 3.0.0
access-control-allow-headers: accept, content-type, authorization
location: /hub/
content-length: 0
connection: keep-alive

The connection should still be alive after 10 seconds, and it is.

CASE 2 : Timeout=15000, Request, Wait 20 seconds, Request again

ps result

/srv/configurable-http-proxy $ ps | grep node
    1 nobody    0:03 node /srv/configurable-http-proxy/bin/configurable-http-proxy --ip= --api-ip= --api-port=8001 --default-target=http://jupyterhub2-hub:8081 --error-target=http://jupyterhub2-hub:8081/hub/error --port=8000 --log-level=debug --metrics-port=8080 --keep-alive-timeout=15000
/srv/configurable-http-proxy $ nc localhost 8000
GET / HTTP/1.1

HTTP/1.1 302 Found
server: TornadoServer/6.2
content-type: text/html
date: Fri, 18 Aug 2023 02:25:55 GMT
access-control-allow-origin: *
access-control-allow-methods: GET, POST, PUT, DELETE, OPTIONS
content-security-policy: frame-ancestors self codle.io dev.codle.io
x-jupyterhub-version: 3.0.0
access-control-allow-headers: accept, content-type, authorization
location: /hub/
content-length: 0
connection: keep-alive

< Wait 20 Seconds >

GET / HTTP/1.1

< Connection Closed >

The connection should be closed after 20 seconds, and it is.
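
The two manual nc cases above could be scripted roughly as follows. This is only a sketch under assumptions: the function names `requestOnce` and `probeKeepAlive` are made up for illustration, the target is assumed to listen on 127.0.0.1, and any response bytes are treated as success without parsing the HTTP reply.

```javascript
const net = require("net");
const { once } = require("events");

// Send one bare HTTP/1.1 request on an already-open socket.
// Resolves true if any response bytes arrive, false if the socket is closed.
function requestOnce(socket, waitMs = 1000) {
  return new Promise((resolve) => {
    const done = (ok) => {
      clearTimeout(timer);
      socket.off("data", onData);
      socket.off("close", onClose);
      resolve(ok);
    };
    const onData = () => done(true);
    const onClose = () => done(false);
    const timer = setTimeout(() => done(false), waitMs);
    socket.on("data", onData);
    socket.on("close", onClose);
    if (socket.destroyed) {
      done(false); // the peer already closed the idle connection
      return;
    }
    socket.write("GET / HTTP/1.1\r\nHost: localhost\r\n\r\n");
  });
}

// Request, stay idle for idleMs, then request again on the same connection.
async function probeKeepAlive(port, idleMs) {
  const socket = net.connect(port, "127.0.0.1");
  socket.on("error", () => {}); // a reset then shows up as a "close" event
  await once(socket, "connect");
  const first = await requestOnce(socket);
  await new Promise((resolve) => setTimeout(resolve, idleMs));
  const second = await requestOnce(socket);
  socket.destroy();
  return { first, second };
}
```

With the proxy started as in the ps output above, `probeKeepAlive(8000, 10000)` would correspond to CASE 1 and `probeKeepAlive(8000, 20000)` to CASE 2.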

Conclusion

I think #492 works with a parseInt fix.

I thought #492 would not respect the given timeout. However, the standard library http.Agent seems to close connections only when the number of connections exceeds its limit; it does not close idle connections. So it behaves like an infinite timeout as long as the number of active connections is below the limit.
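
As a small sketch of the standard-library behaviour discussed here (not code from #492), the snippet below sends requests through an `http.Agent` created with `keepAlive: true` and reports `request.reusedSocket`, which Node sets when a pooled connection is reused. The helper name `getWithAgent` and the target address 127.0.0.1 are assumptions for illustration.

```javascript
const http = require("http");

// Standard-library agent that keeps idle sockets in a pool for reuse.
const keepAliveAgent = new http.Agent({ keepAlive: true });

// Hypothetical helper: GET / through the given agent and report
// whether the request went out on a reused (kept-alive) socket.
function getWithAgent(agent, port) {
  return new Promise((resolve, reject) => {
    const req = http.get({ host: "127.0.0.1", port, agent }, (res) => {
      res.resume(); // drain the body so the socket can return to the pool
      res.on("end", () => {
        resolve({ status: res.statusCode, reused: req.reusedSocket });
      });
    });
    req.on("error", reject);
  });
}
```

Calling `getWithAgent(keepAliveAgent, 8000)` twice in a row against a keep-alive capable server should report a reused socket on the second call, provided the server has not closed the idle connection in between.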