rcbops / rpc-maas

Ansible playbooks for deploying Rackspace Monitoring-as-a-Service within Openstack Environments
Apache License 2.0

horizon_check.py error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc #110

Closed BjoernT closed 9 years ago

BjoernT commented 9 years ago

RPCv9 checks are failing in repo version 9.0.1

Fri Oct 24 21:33:23 2014 INF: (plugin=horizon_check.py, id=chlmWtuPvL, iid=idHjiWAC9S) -> agent.plugin (details=args="172.29.237.162",file="horizon_check.py",id="chlmWtuPvL",period=60) scheduled for 60s
Fri Oct 24 21:34:27 2014 ERR: Connection: nil (50.57.61.13:443) -> 139741911230336:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:159:

Fri Oct 24 21:34:27 2014 ERR: Connection: nil (50.57.61.13:443) -> 139741911230336:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:159:

mattt416 commented 9 years ago

Hi there @BjoernT,

These errors appear to be related to the monitoring agent itself -- I will need to pass this information onto the MaaS team for assistance.

I will keep you posted w/ what I find.

Thanks!

--Matt

mattt416 commented 9 years ago

Hi there @BjoernT,

There is a new cloud monitoring agent available (1.1.0-5). Can you please upgrade the affected nodes and let us know if the problem persists?

Kind Regards, Matt

b3rn4rd0s commented 9 years ago

@BjoernT Can you please provide feedback per @mattt416 comments?

BjoernT commented 9 years ago

@b3rnard0 We have no way of reproducing it. I'll check this with our next RPC 9.0.2 build and rpc-maas 9.0.2.

mattt416 commented 9 years ago

Hi @BjoernT, I'm going to go ahead and close this issue. If the problem persists, please let us know and we'll reach out to the MaaS team.

Thanks!

--Matt

claco commented 9 years ago

I have this issue on a fresh 10.1.2rc1 installation, and this is the version that gets dropped:

ii  rackspace-monitoring-agent          1.1.0-41                         amd64        Rackspace Cloud Monitoring Agent

apt-get update && apt-get upgrade rackspace-monitoring-agent
rackspace-monitoring-agent is already the newest version.

In my case, the agent just sits there, saying it should retry in xxxxms, and never does, even after minutes. The process logs nothing further, and does not die. The site shows the agent as not connected until I manually restart the service.
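The stall described above (a "Retrying connection in ...ms" message followed by silence) could be detected from the agent log alone. The following is a minimal sketch, not part of rpc-maas or the agent: the function names, the 120-second grace period, and the assumption that timestamps are in an English locale are all hypothetical choices for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches agent log lines of the form shown in this thread, e.g.
#   Mon Feb 16 19:45:33 2015 INF: SRV:... -> Retrying connection in 71685ms
LINE_RE = re.compile(r"^(\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}) (\w+): (.*)$")
RETRY_RE = re.compile(r"Retrying connection in (\d+)ms")


def parse_ts(text):
    # Collapse the double space used for single-digit days ("Mar  6").
    return datetime.strptime(" ".join(text.split()), "%a %b %d %H:%M:%S %Y")


def agent_looks_stalled(log_lines, now, grace_seconds=120):
    """Return True if the last log entry promised a retry and the
    retry deadline plus a grace period has already passed."""
    last = None
    for line in log_lines:
        match = LINE_RE.match(line.strip())
        if match:
            last = match
    if last is None:
        return False
    retry = RETRY_RE.search(last.group(3))
    if retry is None:
        # The agent logged something other than a retry notice last.
        return False
    deadline = parse_ts(last.group(1)) + timedelta(milliseconds=int(retry.group(1)))
    return now > deadline + timedelta(seconds=grace_seconds)
```

A cron job or wrapper could call `agent_looks_stalled` with the tail of the agent log and `datetime.now()`, then restart the service when it returns True. That only automates the manual restart; it does not address the underlying malloc failure.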

claco commented 9 years ago

I have a few versions of the previous debs that we've managed to dig out from installations. I'm going to give the -5 version a try to see if it still works, or if this is some underlying system change (openssl updates).

claco commented 9 years ago

Tried the "fixed in 1.1.0-5 version". Same problem. This leads me to believe it's either something external transient in nature (openssl), or something unexpected on certain pool members on the endpoints.

Mon Feb 16 19:44:46 2015 INF: (plugin=swift-recon.py, id=ch6YOZCfbZ, iid=idYnpN3g30) -> agent.plugin (details=args="quarantine",file="swift-recon.py",id="ch6YOZCfbZ",period=60) scheduled for 60s
Mon Feb 16 19:45:33 2015 ERR: Connection: 2001:4801:7902:1:0:a:4323:52:443 (2001:4801:7902:1:0:a:4323:52:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:33 2015 ERR: Connection: 2001:4801:7902:1:0:a:4323:52:443 (2001:4801:7902:1:0:a:4323:52:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:33 2015 ERR: Connection: 2001:4801:7902:1:0:a:4323:52:443 (2001:4801:7902:1:0:a:4323:52:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:33 2015 ERR: Connection: 2001:4801:7902:1:0:a:4323:52:443 (2001:4801:7902:1:0:a:4323:52:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:33 2015 INF: SRV:_monitoringagent._tcp.ord1.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 71685ms
Mon Feb 16 19:45:43 2015 ERR: Connection: 2a00:1a48:7902:1:0:a:432:388:443 (2a00:1a48:7902:1:0:a:432:388:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:43 2015 ERR: Connection: 2a00:1a48:7902:1:0:a:432:388:443 (2a00:1a48:7902:1:0:a:432:388:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:43 2015 ERR: Connection: 2a00:1a48:7902:1:0:a:432:388:443 (2a00:1a48:7902:1:0:a:432:388:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:43 2015 ERR: Connection: 2a00:1a48:7902:1:0:a:432:388:443 (2a00:1a48:7902:1:0:a:432:388:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:43 2015 ERR: Connection: 2a00:1a48:7902:1:0:a:432:388:443 (2a00:1a48:7902:1:0:a:432:388:443) -> 140073835747200:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Mon Feb 16 19:45:43 2015 INF: SRV:_monitoringagent._tcp.lon3.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 68384ms
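One way to test the "certain pool members" theory is to tally the malloc failures per endpoint address across a longer stretch of log. This is a small sketch under the assumption that every error line follows the `ERR: Connection: ... (address) -> ...malloc failure` shape seen above; the function name is hypothetical.

```python
import re
from collections import Counter

# Captures the endpoint inside the parentheses of lines such as:
#   ... ERR: Connection: nil (50.57.61.13:443) -> ...:malloc failure:...
ERR_RE = re.compile(
    r"ERR: Connection: .*? \((?P<endpoint>[^)]+)\) -> .*malloc failure"
)


def malloc_failures_by_endpoint(log_lines):
    """Count BUF_MEM_grow_clean malloc failures per endpoint address."""
    counts = Counter()
    for line in log_lines:
        match = ERR_RE.search(line)
        if match:
            counts[match.group("endpoint")] += 1
    return counts
```

If the counts concentrate on a handful of addresses, that would point at specific pool members; an even spread across endpoints would suggest the fault is on the agent side.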

BjoernT commented 9 years ago

Yes, I had similar trouble pinning it down. It appeared only during a POC install and disappeared once I removed all agents and reinstalled everything fresh from rpc-maas with working OMSA checks (from the 9.0 OMSA hot fix branch).

Bjoern

wolfdancer commented 9 years ago

Hi there, this is to let you know that our agent engineer (Ryan) has confirmed the memory issue you are experiencing is the same issue that he is looking at based on the errors here. He is actively working on it.

claco commented 9 years ago

@wolfdancer Thank you!

rphillips commented 9 years ago

@claco could I get access to a machine that is showing this issue?

claco commented 9 years ago

@rphillips Sorry. I was out for a few days. I noticed a new agent version drop. Do you still need a stack to test against?

rphillips commented 9 years ago

Please try out the new version. If you still see the issue, let me know ASAP.

claco commented 9 years ago

@rphillips Just stood up a full stack with the new -53, and I still have the same issue:

Fri Mar  6 15:54:18 2015 ERR: Connection: 2001:4800:7902:1:0:a:4323:46:443 (2001:4800:7902:1:0:a:4323:46:443) -> 140404185462656:error:07069041:memory buffer routines:BUF_MEM_grow_clean:malloc failure:../base/deps/luvit/deps/openssl/openssl/crypto/buffer/buffer.c:169:

Fri Mar  6 15:54:18 2015 INF: SRV:_monitoringagent._tcp.dfw1.prod.monitoring.api.rackspacecloud.com -> Retrying connection in 75480ms

If someone can reach out internally, I can get them login creds.

rphillips commented 9 years ago

MaaS on internal irc... thanks

mancdaz commented 9 years ago

A later version of the agent has better handling of memory errors and better logging of tracebacks.

Closing for now. Please re-open if we see hangs again.