varnish / hitch

A scalable TLS proxy by Varnish Software.
https://www.varnish-software.com/

Memory not released #344

Open Zabrane opened 3 years ago

Zabrane commented 3 years ago

Hi guys,

I'm facing the same issue with Hitch 1.7.0 on Ubuntu 20.04 LTS. While stress testing (with vegeta) our backend app, which sits behind Hitch, we noticed that Hitch's memory never gets released back to the system.

This is Hitch's memory usage before starting the benchmark (using ps_mem.py to track memory usage):

 Private  +   Shared  =  RAM used       Program
 5.2 MiB +   1.7 MiB =   6.9 MiB       hitch (10)

And this is Hitch's memory usage after the benchmark finished:

 Private  +   Shared    =  RAM used       Program
 2.51 GiB +  192.1 MiB  =   2.7 GiB       hitch (10)

The memory has still not been released (24h later).

My config:

  1. Ubuntu 20.04 LTS
  2. Hitch 1.7.0
  3. OpenSSL 1.1.1f
  4. GCC 9.3.0
  5. Only one SSL certificate
daghf commented 3 years ago

Hi @Zabrane

Thanks for the report, I will take a look.

Could you share some details of the benchmark you ran? Is this a handshake-oriented or a throughput-oriented test? HTTP keep-alive? Number of clients/request rate?

Also, is there anything else special about your config? Could you perhaps share your hitch command line and hitch.conf?

Zabrane commented 3 years ago

Hi @daghf

Thanks for taking the time to look at this. Here are the steps to reproduce the issue:

  1. Install Express to run the Node.js backend sample server (file srv.js.zip):

    $ unzip -a srv.js.zip
    $ npm install express
    $ node srv.js
    ::: listening on http://localhost:7200/
  2. Use the latest Hitch 1.7.0 with the following hitch.conf (point pem-file to yours). We were able to reproduce this memory issue from version 1.5.0 to 1.7.0.

## Listening
frontend   = "[0.0.0.0]:8443"
## https://ssl-config.mozilla.org/
ciphers    = "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384"

tls-protos = TLSv1.2

## TLS for HTTP/2 traffic
alpn-protos = "http/1.1"

## Send traffic to the backend without the PROXY protocol
backend        = "[127.0.0.1]:7200"
write-proxy-v1 = off
write-proxy-v2 = off
write-ip       = off

## List of PEM files, each with key, certificates and dhparams
pem-file = "hitch.pem"

## set it to number of cores
workers = 10
backlog = 1024
keepalive = 30

## Logging / Verbosity
quiet = on
log-filename = "/dev/null"

## Automatic OCSP staple retrieval
ocsp-verify-staple = off
ocsp-dir = ""

Then, run it:

$ hitch -V
hitch 1.7.0
$ hitch --config=./hitch.conf 
  1. Check that the pieces are connected correctly:

    $ curl -k -D- -q -sS "https://localhost:8443/" --output /dev/null
    HTTP/1.1 200 OK
    X-Powered-By: Express
    Content-Type: application/json; charset=utf-8
    Content-Length: 6604
    Date: Tue, 22 Dec 2020 12:01:33 GMT
    Connection: keep-alive
  2. Get the vegeta binary for your distribution. No need to compile it; releases are available here.

Finally, run it like this:

$ echo "GET https://localhost:8443/" | vegeta attack -insecure -header 'Connection: keep-alive' -timeout=2s -rate=1000 -duration=1m | vegeta encode | vegeta report
Requests      [total, rate, throughput]         60000, 1000.02, 1000.02
Duration      [total, attack, wait]             59.999s, 59.999s, 219.979µs
Latencies     [min, mean, 50, 90, 95, 99, max]  165.935µs, 262.688µs, 230.6µs, 333.352µs, 375.975µs, 502.351µs, 16.373ms
Bytes In      [total, mean]                     396240000, 6604.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      200:60000
Error Set:

During the stress test with vegeta, check hitch memory usage (top, htop or ps_mem):

$ sudo su
root$ ps_mem.py -p `pgrep -d, hitch | sed -e 's|,$||'`
root$ watch -n 3 "ps_mem.py -p `pgrep -d, hitch | sed -e 's|,$||'`"

You can set vegeta's -duration option to a larger value (e.g. 15m) to see the effect on Hitch's memory.
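If you prefer a longer-running record instead of watch, something along these lines will log hitch's summed RSS over time so you can compare before/during/after the attack (a rough sketch, not part of our setup; adjust the interval and log path):

    # Rough sketch: append a timestamp and the summed RSS (in KiB) of all
    # hitch processes to a log file every 5 seconds.
    while true; do
        rss_kb=$(ps -o rss= -C hitch | awk '{sum += $1} END {print sum}')
        echo "$(date +%s) ${rss_kb:-0}" >> /tmp/hitch-rss.log
        sleep 5
    done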

Please let me know if you need anything else.

NOTE: on macOS, top shows that hitch 1.7.0 uses only 2 workers, despite workers being set to 10.

Zabrane commented 3 years ago

Hi @daghf and Happy New Year.

Any update on this :-) ?

daghf commented 3 years ago

Hi @Zabrane

I haven't had any luck in reproducing this.

Even after setting up something identical to yours (Ubuntu 20.04, GCC 9.3, OpenSSL 1.1.1f) and running vegeta against your Express backend, I still did not see memory usage creep much above 50M.

I did find a few small, inconsequential memory leaks related to config file updates, which I fixed in a commit I just pushed. However, these are not the kind of leaks that would cause memory usage to grow with traffic or running time.
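If you want to dig further on your side, one thing you could try (just a sketch, I have not run this against your exact setup) is running hitch in the foreground under valgrind's massif to see where heap usage accumulates:

    # Sketch: heap-profile hitch and its worker processes with massif.
    # Assumes hitch stays in the foreground (daemon = off in hitch.conf).
    valgrind --tool=massif --trace-children=yes hitch --config=./hitch.conf

    # After stopping hitch, inspect the generated output files:
    ms_print massif.out.<pid> | less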

Zabrane commented 3 years ago

@daghf thanks for your time looking at this issue.

We are still seeing this behaviour in 2 different products behind hitch. It's a bit sad you weren't able to reproduce it.

One last question before I close this issue, if you don't mind: if the backend server decides to close the connection after servicing some requests, will hitch reopen it immediately? Or will it wait until a new client connection is established?

Thanks

robinbohnen commented 3 years ago

We have the same problem here; hitch was using up to 24GB of RAM until it was killed (Out of memory: Kill process # (hitch) score 111 or sacrifice child). It only seems to have started happening after our latest update to 1.7.0. Not sure what version we were running before.

Zabrane commented 3 years ago

@robinbohnen thanks for confirming the issue. We still suffer from the memory problem, and the current workaround is to manually kill/restart hitch (yes, a hack, with the unfortunate side effect of dropping connections).
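The hack amounts to something like the following check (a sketch; the 4 GiB threshold and the systemd unit name are illustrative, not our exact values):

    # Sketch: restart hitch when its summed RSS exceeds a threshold.
    # Threshold (4 GiB here) and unit name are illustrative only.
    rss_kb=$(ps -o rss= -C hitch | awk '{sum += $1} END {print sum}')
    if [ "${rss_kb:-0}" -gt $((4 * 1024 * 1024)) ]; then
        systemctl restart hitch
    fi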

We are considering switching to stunnel 5.58, HAProxy 2.3 or Envoy 1.17.

Caveat: the stunnel link is an old blog post against stud (hitch's ancestor). But we were able to reproduce those numbers (even better ones) as of today:

[Screenshot, 2021-02-25: reproduced benchmark numbers]
gquintard commented 3 years ago

@Zabrane, since we are having trouble reproducing the issue, could you share a docker-compose or Vagrant file so we can look at it locally? Is there anything special about your certificates (a large number of them, lots of intermediate CAs, uncommon options, etc.)?

Zabrane commented 3 years ago

@gquintard we use 1 certificate and 1 CA, as explained above. Unfortunately, we don't rely on Docker for our services. It took us 6 weeks to be able to report the issue here (getting approval from the business; we work for a private bank).

@robinbohnen could you please shed more light on your config?

robinbohnen commented 3 years ago

We have about 3500 Let's Encrypt certificates served by Hitch, and we don't use Docker either.

dridi commented 3 years ago

I think what @gquintard was asking is rather: can you reproduce this behavior in a Docker or Vagrant (or other) setup that we could duplicate on our end to try to observe it as well?

vcabbage commented 11 months ago

FWIW, we observed something similar. In our case we had 300-500K concurrent connections; when the connection count dropped, RSS continued to increase until stabilizing around 90GB.

After trying a variety of adjustments, we ended up loading jemalloc via LD_PRELOAD. With that change, RSS became much more closely correlated with the number of connections (26-44GB).
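For anyone who wants to try the same experiment, it amounts to something like this (a sketch; the jemalloc path below is what Ubuntu/Debian's libjemalloc2 package typically installs, so verify it on your system first):

    # Sketch: preload jemalloc when starting hitch in the foreground.
    # The library path is distribution-specific; verify it first.
    LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 hitch --config=./hitch.conf

    # When hitch runs under systemd, the same can be done with a drop-in
    # (systemctl edit hitch) containing:
    #   [Service]
    #   Environment=LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2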

I don't have a firm explanation, but it does remind me a bit of this post, where it is theorized that the excess memory usage of libc malloc came down to fragmentation caused by multithreading. I'm not sure whether that would apply to hitch.
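If glibc arena fragmentation really is the cause, a cheap counter-experiment (hypothetical on my side, I have not tried it with hitch) would be to cap the number of malloc arenas and see whether RSS then tracks the connection count more closely:

    # Sketch: cap glibc malloc arenas to test the fragmentation theory.
    # The value 2 is arbitrary; glibc's default scales with the CPU count.
    MALLOC_ARENA_MAX=2 hitch --config=./hitch.conf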