jiangwenyuan / nuster

A high performance HTTP proxy cache server and RESTful NoSQL cache server based on HAProxy

Proxy fe reached process FD limit #80

Closed igorescobar closed 4 years ago

igorescobar commented 4 years ago

Hi,

When the load increases I'm seeing these errors in my logs, which result in 5xx HTTP errors on our load balancers:

2020-05-21T09:05:35.678+01:00 <128>May 21 08:05:35 nuster[6]: Proxy be reached process FD limit (maxsock=4096). Please check 'ulimit-n' and restart.
2020-05-21T09:05:35.678+01:00 <128>May 21 08:05:35 nuster[6]: Proxy be reached process FD limit (maxsock=4096). Please check 'ulimit-n' and restart.
2020-05-21T09:05:35.678+01:00 <128>May 21 08:05:35 nuster[6]: Proxy fe reached process FD limit (maxsock=4096). Please check 'ulimit-n' and restart.
2020-05-21T09:05:35.678+01:00 <128>May 21 08:05:35 nuster[6]: Proxy fe reached process FD limit (maxsock=4096). Please check 'ulimit-n' and restart.
2020-05-21T09:05:35.678+01:00 Cannot get a server socket.
2020-05-21T09:05:35.678+01:00 Cannot get a server socket.

Is there any documentation about how I can work around this issue? I tried to set things like:

global
    ulimit-n 20514
    maxconn 10240

Locally it works fine but when I deploy it to ECS I see these errors:

[WARNING] 141/083807 (1) : [nuster.main()] Cannot raise FD limit to 20514, limit is 4096. This will fail in >= v2.3
[ALERT] 141/084544 (1) : [nuster.main()] FD limit (4096) too low for maxconn=10240/maxsock=20508. Please raise 'ulimit-n' to 20508 or more to avoid any trouble.This will fail in >= v2.3

Any idea how to configure this properly?

jiangwenyuan commented 4 years ago

@igorescobar You probably need to modify the limits. Maybe this: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Ulimit.html
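For reference, the maxsock in those alerts looks like roughly 2 x maxconn (one fd for the client side and one for the server side of each connection) plus a few internal/listener fds, so your global section itself seems fine; it's the container's hard nofile limit that needs raising. The same settings with the arithmetic spelled out (the comments are my assumptions):

global
    ulimit-n 20514   # must be >= maxsock; ~2 x 10240 + a few internal fds = 20508
    maxconn 10240    # each proxied connection needs ~2 fds (client + server side)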

igorescobar commented 4 years ago

I had no idea about this limitation!

I wonder why Nuster had 4k+ files open/locked. It's very unusual given the number of req/min. 🤔

This is what I had to do to fix the ulimit on my ECS Task Definition:

"ulimits": [
      {
        "name": "nofile",
        "softLimit": 1024,
        "hardLimit": 20508
      }
    ],
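(For anyone else: a quick way to confirm the new limit actually took effect, assuming you can exec into the running task, is to read it straight from /proc:)

# inside the container: the limits the nuster master process (pid 1) is running with
cat /proc/1/limits | grep 'open files'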

Thanks!

igorescobar commented 4 years ago

I have increased the limit and I hit it again 😄

<128>May 21 15:40:46 nuster[7]: Proxy fe reached process FD limit (maxsock=20028). Please check 'ulimit-n' and restart.

Do you have any idea why nuster keeps all those files locked for so long? The number of requests per minute is usually around 1-2k, so this number seems pretty odd to me unless there is a reason for it.

The problem is that when that happens I get this error: Cannot get a server socket. That basically shuts everything down, the container needs to be replaced, and users get a bunch of 5xx errors.

jiangwenyuan commented 4 years ago

@igorescobar Are you able to do lsof -p ${nuster-pid} when that happens? Seems there's an fd leak.
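Something like this should show whether the count keeps climbing and what kind of fds they are; the pidof lookup and the lsof column used for the type are assumptions about your image:

# count open fds of the nuster process via /proc (same info as lsof)
PID=$(pidof nuster | awk '{print $1}')
ls /proc/$PID/fd | wc -l

# break the open fds down by type (sockets vs regular files) to spot a leak
lsof -p $PID | awk '{print $5}' | sort | uniq -c | sort -rn | head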

igorescobar commented 4 years ago

I can't connect to the production containers. I increased the ulimits again, along with the defaults for disk-loader and disk-saver, and so far the problem has not happened. Do you think it might be related somehow?

jiangwenyuan commented 4 years ago

There's no need to increase disk-loader to a very big number, as there's a max time limit for one iteration. And since you are not using disk sync, there's no need to set disk-saver either.
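For reference, a rough sketch of where those knobs sit, assuming nuster 5.x syntax where disk-loader and disk-saver are per-round limits on the global cache line (the values are illustrative only; check the README for the exact keywords and defaults):

global
    # disk-loader / disk-saver cap how much work each round does, and each
    # round is also time-limited, so the defaults are usually fine
    nuster cache on data-size 512m dict-size 512m dir /tmp/nuster disk-loader 100 disk-saver 100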

Since the number of requests per minute is usually around 1-2k and you still hit the fd limit error, it smells like an fd leak. I have to do more investigation.

igorescobar commented 4 years ago

Thanks @jiangwenyuan let me know if there is anything else I can support you with.

jiangwenyuan commented 4 years ago

@igorescobar Fix fd leak when failed to move file. https://github.com/jiangwenyuan/nuster/commit/5a2dcaa559c9b437876ed2b8f56ef2017fc45be1

Can you try H2.1 branch and see if this error happens again?

igorescobar commented 4 years ago

Is it available on dockerhub?


jiangwenyuan commented 4 years ago

Not yet, as I have not released it yet. Can you build it yourself or try nuster/nuster:latest?
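Roughly, either option looks like this; the build flags are assumptions following the usual HAProxy-style make (adjust TARGET/USE_* for your platform):

# option 1: pull the prebuilt dev image
docker pull nuster/nuster:latest

# option 2: build the H2.1 branch from source
git clone https://github.com/jiangwenyuan/nuster.git
cd nuster
git checkout H2.1
make TARGET=linux-glibc USE_OPENSSL=1 USE_PCRE=1 USE_ZLIB=1
sudo make install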

igorescobar commented 4 years ago

I will need more time to test it again. I had to revert the infra to the old one because that was the production environment. I might need to re-organise my terraform scripts to accommodate this A/B environment a bit better.


igorescobar commented 4 years ago

Test in progress. I will keep an eye on it for about 2-3 hours and let's see.

igorescobar commented 4 years ago

1) Regarding the FD leak: since I had to recreate everything again, the disk was empty, so the auto scaling is pretty sensitive right now. As soon as it starts to have a decent cache, the CPU will start to drop and the number of containers will become more steady. I will probably have a better result tomorrow.

2) All of a sudden this happened:

fields @timestamp, @message
| filter @message not like /"*":/
| sort by @timestamp desc

@timestamp @message
2020-05-23 16:39:01.574 [NOTICE] 143/163901 (1) : haproxy version is 2.2-dev8
2020-05-23 16:39:01.574 [NOTICE] 143/163901 (1) : path to executable is /usr/local/sbin/nuster
2020-05-23 16:39:01.574 [ALERT] 143/163901 (1) : Current worker #1 (8) exited with code 143 (Terminated)
2020-05-23 16:39:01.574 [WARNING] 143/163901 (1) : All workers exited. Exiting... (0)
2020-05-23 16:39:01.546 [WARNING] 143/163901 (1) : Exiting Master process...
2020-05-23 16:34:52.128 [NOTICE] 143/163452 (1) : haproxy version is 2.2-dev8
2020-05-23 16:34:52.128 [NOTICE] 143/163452 (1) : path to executable is /usr/local/sbin/nuster
2020-05-23 16:34:52.128 [ALERT] 143/163452 (1) : Current worker #1 (9) exited with code 134 (Aborted)
2020-05-23 16:34:52.128 [ALERT] 143/163452 (1) : exit-on-failure: killing every processes with SIGTERM
2020-05-23 16:34:52.128 [WARNING] 143/163452 (1) : All workers exited. Exiting... (134)
2020-05-23 16:34:41.759 [NOTICE] 143/163441 (9) : haproxy version is 2.2-dev8
2020-05-23 16:34:41.759 <128>May 23 16:34:41 nuster[9]: A bogus STREAM [0x55cbcfd90d30] is spinning at 111086 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x55cbcfd90d30 src=172.31.26.44 fe=fe be=be dst=s1 rqf=69c42000 rqa=48000 rpf=e0000020 rpa=24000000 sif=EST,10008 sib=CLO,280110 af=(nil),0 csf=0x7f4438045650,10c000 ab=(nil),0 csb=0x7f443c0715c0,0 cof=0x7f443c0bc0f0,80001300:H1(0x7f4440077490)/RAW((nil))/tcpv4(60) cob=0x55cbcfd926b0,c2300:H1(0x7f4438036d80)/RAW((nil))/tcpv4(103) filters={0x7f4440023970="cache filter id"}]
2020-05-23 16:34:41.759 [NOTICE] 143/163441 (9) : path to executable is /usr/local/sbin/nuster
2020-05-23 16:34:41.759 [ALERT] 143/163441 (9) : A bogus STREAM [0x55cbcfd90d30] is spinning at 111086 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x55cbcfd90d30 src=172.31.26.44 fe=fe be=be dst=s1 rqf=69c42000 rqa=48000 rpf=e0000020 rpa=24000000 sif=EST,10008 sib=CLO,280110 af=(nil),0 csf=0x7f4438045650,10c000 ab=(nil),0 csb=0x7f443c0715c0,0 cof=0x7f443c0bc0f0,80001300:H1(0x7f4440077490)/RAW((nil))/tcpv4(60) cob=0x55cbcfd926b0,c2300:H1(0x7f4438036d80)/RAW((nil))/tcpv4(103) filters={0x7f4440023970="cache filter id"}]
2020-05-23 16:31:06.480 [NOTICE] 143/163105 (1) : [nuster][cache] on, dict_size=524288000, data_size=524288000
2020-05-23 16:31:06.480 <133>May 23 16:31:05 nuster[1]: Proxy fe started.
2020-05-23 16:31:06.480 <133>May 23 16:31:05 nuster[1]: Proxy be started.
2020-05-23 16:31:06.480 [NOTICE] 143/163105 (1) : New worker #1 (8) forked

It resulted in a bunch of 5xx errors. This error is new to me; I had a different error on the stable image.

jiangwenyuan commented 4 years ago

Another issue reports this error too; the latest image is based on haproxy 2.2-dev. Let me release a new H2.1 version.

igorescobar commented 4 years ago

Will do!

The issue seems to be solved! It's been pretty steady since yesterday and the cache is performing pretty well!

So this also solved the other issue xD

Thanks a lot for your help!


igorescobar commented 4 years ago

FYI, this is how the disk and the cache behaved since the experiment.

Current state of the cache (1h period): [screenshot]
Overall disk I/O limit (last day): [screenshot]
Read/writes (last day): [screenshot]

You can clearly see the writes going down as the cache gets filled up. The big drop between 16:00 and 17:00 was when the bug happened, and as you can see it hasn't happened again.

igorescobar commented 4 years ago

Any idea when those fixes (the H2.1 update and the FD leak fix) will land in the stable tag on Docker Hub?

jiangwenyuan commented 4 years ago

@igorescobar I'm investigating another problem and will release once it's solved.

jiangwenyuan commented 4 years ago

Released v5.0.10.21. Closing now; feel free to reopen if the issue persists.