@igorescobar You probably need to modify the limits. Maybe this: https://docs.aws.amazon.com/AmazonECS/latest/APIReference/API_Ulimit.html
I had no idea about this limitation!
I wonder why Nuster had 4k+ files open/locked. It's very unusual for the number of req/min. 🤔
This is what I had to do to fix the ulimit on my ECS Task Definition:
"ulimits": [
{
"name": "nofile",
"softLimit": 1024,
"hardLimit": 20508
}
],
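For anyone following along, a quick way to double-check that the new limit is actually applied inside the running container (a sketch only: the container ID is a placeholder, and on Fargate you would go through ECS Exec rather than docker exec):
# soft and hard nofile limits as seen from inside the container
docker exec <container-id> sh -c 'ulimit -S -n'
docker exec <container-id> sh -c 'ulimit -H -n'
# or, from inside the container, if nuster is PID 1
grep 'open files' /proc/1/limits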
Thanks!
I have increased the limit and I hit it again 😄
<128>May 21 15:40:46 nuster[7]: Proxy fe reached process FD limit (maxsock=20028). Please check 'ulimit-n' and restart.
Do you have any idea why nuster keeps all those files locked for so long? The number of requests per minute is usually around 1-2k, so this number seems pretty odd to me unless there is a reason for it.
The problem is that when that happens I get this error:
Cannot get a server socket.
and this basically shuts everything down: the container needs to be replaced and users get a bunch of 5xx errors.
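The message points at the global 'ulimit-n' keyword that nuster inherits from haproxy. A hedged sketch of setting it explicitly; the value is illustrative and has to stay within the hard limit granted by the ECS task definition:
global
    # illustrative value; must not exceed the container's hard nofile limit
    ulimit-n 40000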
@igorescobar Are you able to run lsof -p ${nuster-pid} when that happens? It seems there's an fd leak.
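In case it helps, a rough sketch of what could be captured when the error shows up (the pidof/awk usage is illustrative; any way of getting the worker PID works):
# count open fds for the nuster worker
lsof -p $(pidof nuster | awk '{print $1}') | wc -l
ls /proc/$(pidof nuster | awk '{print $1}')/fd | wc -l
# break the fds down by type (REG, IPv4, ...) to see what is piling up
lsof -p $(pidof nuster | awk '{print $1}') | awk '{print $5}' | sort | uniq -c | sort -rn | head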
I can't connect to the production containers. I increased the ulimits again and raised the defaults for disk-loader and disk-saver, and so far the problem has not happened again. Do you think it might be related somehow?
There's no need to increase disk-loader to a very big number, as there's a maximum time limit for one iteration. And since you are not using disk sync, there's no need to set disk-saver.
Since the number of requests per minute is usually around 1-2k and you still hit the fd limit error, it smells like an fd leak. I have to do more investigation.
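For anyone tuning these knobs later, a hedged sketch of where they live, assuming nuster's documented disk-loader/disk-saver options on the 'nuster cache on' line; the values are purely illustrative and, as noted above, the defaults are normally fine:
global
    master-worker
    # data-size/dict-size match the startup log further down;
    # disk-loader/disk-saver bound how many cache entries are handled per iteration
    nuster cache on data-size 500m dict-size 500m dir /data/nuster disk-loader 100 disk-saver 100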
Thanks @jiangwenyuan let me know if there is anything else I can support you with.
@igorescobar Fix fd leak when failed to move file: https://github.com/jiangwenyuan/nuster/commit/5a2dcaa559c9b437876ed2b8f56ef2017fc45be1
Can you try the H2.1 branch and see if this error happens again?
Is it available on dockerhub?
Not yet, as I have not released it yet. Can you build it yourself or try nuster/nuster:latest?
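For reference, a hedged sketch of both options; the image tag is the one mentioned above, and the build flags mirror a typical haproxy-style build, so treat them as illustrative:
# option 1: pull the development image
docker pull nuster/nuster:latest
# option 2: build the H2.1 branch from source
git clone --branch H2.1 https://github.com/jiangwenyuan/nuster.git
cd nuster
make TARGET=linux-glibc USE_OPENSSL=1 USE_PCRE=1 USE_ZLIB=1
make install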
I will need more time to test it again. I had to revert the infra to the old one because that was the production environment. I might need to re-organise my terraform scripts to accommodate this A/B environment a bit better.
Test in progress. I will keep an eye on it for about 2-3 hours and let's see.
1) Regarding the FD leak: since I had to recreate everything, the disk was empty, so the auto scaling is pretty sensitive right now. As soon as it builds up a decent cache, the CPU will start to drop and the number of containers will become more steady. I will probably have a better result tomorrow.
2) All of a sudden this happened:
fields @timestamp, @message
| filter @message not like /"*":/
| sort by @timestamp desc
@timestamp | @message |
---|---|
2020-05-23 16:39:01.574 | [NOTICE] 143/163901 (1) : haproxy version is 2.2-dev8 |
2020-05-23 16:39:01.574 | [NOTICE] 143/163901 (1) : path to executable is /usr/local/sbin/nuster |
2020-05-23 16:39:01.574 | [ALERT] 143/163901 (1) : Current worker #1 (8) exited with code 143 (Terminated) |
2020-05-23 16:39:01.574 | [WARNING] 143/163901 (1) : All workers exited. Exiting... (0) |
2020-05-23 16:39:01.546 | [WARNING] 143/163901 (1) : Exiting Master process... |
2020-05-23 16:34:52.128 | [NOTICE] 143/163452 (1) : haproxy version is 2.2-dev8 |
2020-05-23 16:34:52.128 | [NOTICE] 143/163452 (1) : path to executable is /usr/local/sbin/nuster |
2020-05-23 16:34:52.128 | [ALERT] 143/163452 (1) : Current worker #1 (9) exited with code 134 (Aborted) |
2020-05-23 16:34:52.128 | [ALERT] 143/163452 (1) : exit-on-failure: killing every processes with SIGTERM |
2020-05-23 16:34:52.128 | [WARNING] 143/163452 (1) : All workers exited. Exiting... (134) |
2020-05-23 16:34:41.759 | [NOTICE] 143/163441 (9) : haproxy version is 2.2-dev8 |
2020-05-23 16:34:41.759 | <128>May 23 16:34:41 nuster[9]: A bogus STREAM [0x55cbcfd90d30] is spinning at 111086 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x55cbcfd90d30 src=172.31.26.44 fe=fe be=be dst=s1 rqf=69c42000 rqa=48000 rpf=e0000020 rpa=24000000 sif=EST,10008 sib=CLO,280110 af=(nil),0 csf=0x7f4438045650,10c000 ab=(nil),0 csb=0x7f443c0715c0,0 cof=0x7f443c0bc0f0,80001300:H1(0x7f4440077490)/RAW((nil))/tcpv4(60) cob=0x55cbcfd926b0,c2300:H1(0x7f4438036d80)/RAW((nil))/tcpv4(103) filters={0x7f4440023970="cache filter id"}] |
2020-05-23 16:34:41.759 | [NOTICE] 143/163441 (9) : path to executable is /usr/local/sbin/nuster |
2020-05-23 16:34:41.759 | [ALERT] 143/163441 (9) : A bogus STREAM [0x55cbcfd90d30] is spinning at 111086 calls per second and refuses to die, aborting now! Please report this error to developers [strm=0x55cbcfd90d30 src=172.31.26.44 fe=fe be=be dst=s1 rqf=69c42000 rqa=48000 rpf=e0000020 rpa=24000000 sif=EST,10008 sib=CLO,280110 af=(nil),0 csf=0x7f4438045650,10c000 ab=(nil),0 csb=0x7f443c0715c0,0 cof=0x7f443c0bc0f0,80001300:H1(0x7f4440077490)/RAW((nil))/tcpv4(60) cob=0x55cbcfd926b0,c2300:H1(0x7f4438036d80)/RAW((nil))/tcpv4(103) filters={0x7f4440023970="cache filter id"}] |
2020-05-23 16:31:06.480 | [NOTICE] 143/163105 (1) : [nuster][cache] on, dict_size=524288000, data_size=524288000 |
2020-05-23 16:31:06.480 | <133>May 23 16:31:05 nuster[1]: Proxy fe started. |
2020-05-23 16:31:06.480 | <133>May 23 16:31:05 nuster[1]: Proxy be started. |
2020-05-23 16:31:06.480 | [NOTICE] 143/163105 (1) : New worker #1 (8) forked |
It resulted in a bunch of 5XX errors. This error is new to me; I had a different error on the stable image.
Another issue reports this error too; the latest image is based on haproxy 2.2-dev. Let me release a new H2.1 version.
Will do!
The issue seems to be solved! It's been pretty steady since yesterday and the cache is performing pretty well!
So this also solved the other issue xD
Thanks a lot for your help!
FYI, this is how the disk and the cache have behaved since the experiment started.
Current state of the cache (1h period):
Overall Disk I/O Limit (last day):
Read/Writes (Last day):
You can clearly see the writes going down as the cache fills up. The big drop between 16:00 and 17:00 was when the bug happened, and as you can see it hasn't happened again.
Any idea when those fixes (the H2.1 update and the FD leak fix) will land in the stable tag on Docker Hub?
@igorescobar I'm investigating another problem and will release once solved
Released v5.0.10.21. Closing now; feel free to reopen if the issue persists.
Hi,
When the load increases I'm seeing these errors in my logs, which result in 5xx HTTP errors on our load balancers:
Is there any documentation about how I can work around this issue? I tried to set things like:
Locally it works fine but when I deploy it to ECS I see these errors:
Any idea how to configure this properly?