logdna / logdna-agent

LogDNA Agent streams from log files to your LogDNA account. Works with Linux, Windows, and macOS Servers
https://logdna.com
MIT License

Need deb file with below changes #46

Closed apurandare-lifesize closed 3 years ago

apurandare-lifesize commented 6 years ago

Hi,

LogDNA agent stopped because of the below error:

[180227 19:16:33] Streaming /var/log: 2 new file(s), 104 total file(s)
[180227 19:18:52] Sent 10001 lines queued from earlier disconnection
[180227 19:18:54] Sent 10001 lines queued from earlier disconnection
[180227 19:18:55] Sent 7146 lines queued from earlier disconnection
[180227 19:18:58] Sent 5000 lines queued from earlier disconnection
[180227 19:18:59] Sent 10001 lines queued from earlier disconnection
[180227 19:19:19] Sent 10001 lines queued from earlier disconnection
[180227 19:19:30] Sent 5012 lines queued from earlier disconnection
[180227 19:19:33] Sent 11922 lines queued from earlier disconnection
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

<--- Last few GCs --->

1208485 ms: Scavenge 1398.9 (1446.7) -> 1398.9 (1446.7) MB, 10.8 / 0 ms (+ 0.8 ms in 1 steps since last GC) [allocation failure] [incremental marking delaying mark-sweep].
1209002 ms: Mark-sweep 1398.9 (1446.7) -> 1393.5 (1441.6) MB, 517.0 / 0 ms (+ 72.6 ms in 1762 steps since start of marking, biggest step 2.0 ms) [last resort gc].
1209510 ms: Mark-sweep 1393.5 (1441.6) -> 1392.9 (1447.6) MB, 507.6 / 0 ms [last resort gc].

We speculated that adding the --max-old-space-size=1024 param and running the LogDNA agent via the js entrypoint script, as in https://github.com/logdna/logdna-agent/blob/master/logdna-agent, would help.

Can you please provide a deb file with the below change in the logdna-agent file?

#!/bin/sh

node --max-old-space-size=1024 index.js "$@"

Thanks, Amrut

leeliu commented 6 years ago

We were discussing this earlier internally. I'm not sure adding more heap would necessarily solve the problem; it'll just delay it a bit and then crash at 2.5GB instead of 1.5GB. Basically, memory issues come from two areas: not flushing items out to our endpoint fast enough (due to bandwidth limitations between the server serving regular traffic vs sending logs to us) and use of compression.

1) Can I ask how many servers you are running the agent on and how many of them are crashing like this? Is it like 1 or 2 or 5 that behave like this or all of them? Where (geographically) are they located?

2) For the ones crashing, would you say there's a lot of traffic served to clients? Basically I want to understand how much bandwidth is available for sending logs.

3) If bandwidth is NOT an issue and there's plenty available for additional upload, I would try to disable compression by adding COMPRESS=0 into /etc/logdna.conf and restarting the agent. That'll drastically reduce the amount of CPU and memory used to buffer lines. We're also improving things in this area by switching away from gzip compression.
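For reference, a minimal sketch of that config change might look like this (assuming the agent was installed from the deb package and runs as the logdna-agent service; adjust the restart command for your init system):

```sh
# append COMPRESS=0 to the agent config mentioned above
echo "COMPRESS=0" | sudo tee -a /etc/logdna.conf

# restart the agent so it picks up the new setting
sudo service logdna-agent restart
```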

v1.5.0 would also help if you're not already on it, as we now buffer items onto disk under load.

apurandare-lifesize commented 6 years ago

I observed this memory issue when I was doing a load test to check LogDNA CPU consumption under load on our server, which has 2Gig bandwidth. Below is the free -m output, which shows 14GB of memory sitting in cache while LogDNA stopped with process out of memory based on the free memory (190MB):

root@ams1-css-34-stg:/usr/local/lifesize/clearsea/ClearSea/log# free -m
             total       used       free     shared    buffers     cached
Mem:         32140      31949        190          0         80      13896
-/+ buffers/cache:      17972      14167
Swap:         6143        134       6009

CPU usage during load:

top - 14:41:46 up 6 days, 23:15, 2 users, load average: 17.11, 16.14, 10.31
Tasks: 656 total, 2 running, 654 sleeping, 0 stopped, 0 zombie
Cpu(s): 38.9%us, 4.6%sy, 0.0%ni, 55.0%id, 0.1%wa, 0.0%hi, 1.3%si, 0.0%st
Mem: 32911700k total, 31902368k used, 1009332k free, 159392k buffers
Swap: 6291452k total, 122524k used, 6168928k free, 14663428k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM     TIME+ COMMAND
 3901 root      RT   0 26.9g 5.1g 8164 S 40.8 16.3 295:00.74 psems
  331 root      20   0 1396m 260m  13m R  1.8  0.8   2:28.67 logdna-agent
15893 root      20   0 26.3g 2.4g  18m S  0.9  7.6 307:19.85 java

ERROR during load run:

[180227 19:16:33] Streaming /var/log: 2 new file(s), 104 total file(s)
[180227 19:18:52] Sent 10001 lines queued from earlier disconnection
[180227 19:18:54] Sent 10001 lines queued from earlier disconnection
[180227 19:18:55] Sent 7146 lines queued from earlier disconnection
[180227 19:18:58] Sent 5000 lines queued from earlier disconnection
[180227 19:18:59] Sent 10001 lines queued from earlier disconnection
[180227 19:19:19] Sent 10001 lines queued from earlier disconnection
[180227 19:19:30] Sent 5012 lines queued from earlier disconnection
[180227 19:19:33] Sent 11922 lines queued from earlier disconnection
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - process out of memory

<--- Last few GCs --->

1208485 ms: Scavenge 1398.9 (1446.7) -> 1398.9 (1446.7) MB, 10.8 / 0 ms (+ 0.8 ms in 1 steps since last GC) [allocation failure] [incremental marking delaying mark-sweep].
1209002 ms: Mark-sweep 1398.9 (1446.7) -> 1393.5 (1441.6) MB, 517.0 / 0 ms (+ 72.6 ms in 1762 steps since start of marking, biggest step 2.0 ms) [last resort gc].
1209510 ms: Mark-sweep 1393.5 (1441.6) -> 1392.9 (1447.6) MB, 507.6 / 0 ms [last resort gc].

<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x38290e0d81b1
1: inflight [nexe.js:~26953] [pc=0x15b31544785f] (this=0x38290e0f4641, key=0x38290e0af9d9 <String[108]: readdir\x00/usr/local/lifesize/clearsea/ClearSea/log/conf/Conf_WVfj_C_2018022323222829535670_STANDARD.html\x00true>, cb=0x38290e0afa01 <JS Function (SharedFunctionInfo 0x27c0e4be5061)>)
3: wrapper [nexe.js:~85350] [pc=0x15b3154365d8] (this=0x38290e0f4641 <JS Global O...

My manager came up with the max-old-space-size param to overcome the memory issue: http://www.codingdefined.com/2015/10/how-to-solve-process-out-of-memory.html
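If it helps, a quick way to see what the flag does is to print V8's heap limit with and without it. This is just a hypothetical check on a reasonably recent node, not something the agent itself exposes:

```sh
# default heap limit (MB)
node -e 'console.log(require("v8").getHeapStatistics().heap_size_limit / 1024 / 1024)'

# with the larger old space; should report roughly 1024 MB plus some overhead
node --max-old-space-size=1024 -e 'console.log(require("v8").getHeapStatistics().heap_size_limit / 1024 / 1024)'
```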

If you could provide me a deb file with the max-old-space-size param, I can install it on our server and run the load again to check if the problem is resolved. Please let me know if you need anything.

leeliu commented 6 years ago

Yes, I'm aware of what --max-old-space-size does. What I'm saying is that it'll likely just delay the problem, not solve it. I'd rather fix the underlying issue.

Do you mind answering my questions 1), 2) and 3)? Have you tried some of the things I've suggested? Namely v1.5.0 and COMPRESS=0? Also where are your servers located? For bandwidth, I'm not referring to what is available to the server but how fast it can send to ours, which is why geography matters.

apurandare-lifesize commented 6 years ago
  1. We are running the agent on approximately 300-odd servers, and none of them have crashed with this error.
  2. This crash was observed while doing a load test on one server to measure LogDNA CPU/memory consumption. We don't have any bandwidth limit for sending logs, and during peak load there was sufficient bandwidth, around 40% free.
  3. We can try adding COMPRESS=0 to the LogDNA config file.

ISSUE: We are concerned about why LogDNA is not using the cached memory (14GB cached).

leeliu commented 6 years ago

Oh...so this is only happening during a load test. It's not actually happening on any live servers. Am I understanding that correctly?

How are you performing the load test? What parameters are you using, etc? Please attach scripts or other tools you're using so we can try to reproduce it on our end.

Please give the COMPRESS=0 a try.

Regarding cached memory, that is memory used by existing running apps that the system deems can be freed when necessary. The 14GB of cached memory that you see is used by other processes. If our agent needed additional memory, Linux would automatically force the other apps to free their memory and our process would use the newly freed memory. We don't need to do anything to 'use' it...I'm not sure I understand the question.
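As an aside, on kernels that report it, MemAvailable in /proc/meminfo already folds the reclaimable page cache into the figure, so it's usually a better number to watch than plain free memory (shown here only as an illustration):

```sh
# MemAvailable accounts for cache that can be reclaimed under pressure
grep -E 'MemFree|MemAvailable|^Cached' /proc/meminfo
```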

apurandare-lifesize commented 6 years ago

Yes, it crashed only during the load test run. There is no problem with the live servers. We are planning to install it on our main processing server, which handles almost everything (registration, point-to-point, VMR calls, etc.). This server will be processing around 500 calls/second during peak time, and we have 20 such servers clustered, so logging will be huge.

Before we install the LogDNA agent on these servers, we are testing LogDNA CPU/memory usage while peak load is running on the server. That is when we observed this process-out-of-memory issue.

leeliu commented 6 years ago

Got it. That helps.

How are you performing the load test? What parameters are you using, etc? Please attach scripts or other tools you're using so we can try to reproduce it on our end.

apurandare-lifesize commented 6 years ago

We have our proprietary tools for registering clients to our server and the PSEMS tool for making P2P calls. Our server processes audio/video, hence more CPU and memory is used during peak.

We won't change anything in the etc folder. You just have to push load on any of your test VMs until 50% or 60% CPU, and it should use more memory (log flow should be huge). During this time LogDNA has to process the huge volume of logs and push it to the ingestion endpoint, which is when the memory issue appears.
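We can't share the proprietary tools, but a rough stand-in for the log volume they generate might be a loop like this (the file path and line counts are made up; any file under a watched directory should do):

```sh
#!/bin/sh
# hypothetical log-volume generator for the load test
LOGFILE=/var/log/agent-load-test.log
while true; do
  i=1
  while [ $i -le 10000 ]; do
    # write timestamped lines with some padding to approximate real log lines
    echo "$(date '+%y%m%d %H:%M:%S') load-test line $i with some padding text" >> "$LOGFILE"
    i=$((i + 1))
  done
  sleep 1
done
```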

leeliu commented 6 years ago

Anyway, continuing the thread here from your other issues.

Here's what you need to do:

1) You don't need to run grunt build. That's only used for packaging. The same goes for fpm, etc. The reason we can't add the --max-old-space-size=1024 flag to logdna-agent is that logdna-agent is normally overridden during the compile process through grunt build, so adding the flag there would just get overridden anyway. The reason logdna-agent exists as an entrypoint file is mainly for running from source or for IoT devices that don't have compiled versions.

2) If you need to get the --max-old-space-size=1024 flag working, you need to run the agent from source. So ensure node 5.9.0 is installed as the default on the server and you're good to go. I don't believe it'll solve the issue, but since it's not a test script but an actual process on your end that I can't replicate, you can give this a try.
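For completeness, running from source with the flag would look roughly like this (assuming node 5.9.0 is on the PATH and the ingestion key is already set in /etc/logdna.conf):

```sh
git clone https://github.com/logdna/logdna-agent.git
cd logdna-agent
npm install

# run the entrypoint directly with the larger heap
sudo node --max-old-space-size=1024 index.js
```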

sgwilson2674 commented 3 years ago

Closing the issue due to inactivity. If additional questions remain, please open another issue and we’ll be sure to respond.