rochaporto / collectd-ceph

collectd plugins and dashboards for ceph
GNU General Public License v2.0

CentOS 7 errors on collectd start using ceph_pool_plugin #28

Open steve--d opened 9 years ago

steve--d commented 9 years ago

After starting collectd on CentOS 7 (Ceph Giant, now upgraded to Hammer), I'm getting the following log errors when using the ceph_pool_plugin.

-- Unit collectd.service has begun starting up.
Apr 15 15:04:18 ceph1.domain systemd[1]: Started Collectd statistics daemon.
-- Subject: Unit collectd.service has finished start-up
-- Defined-By: systemd
-- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel
-- 
-- Unit collectd.service has finished starting up.
-- 
-- The start-up result is done.
Apr 15 15:04:18 ceph1.domain collectd[22862]: Initialization complete, entering read-loop.
Apr 15 15:04:18 ceph1.domain python[22874]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Package 'ceph-common' isn't signed with proper key
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: 'post-create' on '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874' exited with 1
Apr 15 15:04:18 ceph1.domain abrt-server[22881]: Deleting problem directory '/var/tmp/abrt/Python-2015-04-15-15:04:18-22874'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path  = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain python[22884]: detected unhandled Python exception in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain abrt-server[22891]: Not saving repeating crash in '/usr/bin/ceph'
Apr 15 15:04:18 ceph1.domain collectd[22862]: Traceback (most recent call last):
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 896, in <module>
Apr 15 15:04:18 ceph1.domain collectd[22862]: retval = main()
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/bin/ceph", line 647, in main
Apr 15 15:04:18 ceph1.domain collectd[22862]: conffile=conffile)
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
Apr 15 15:04:18 ceph1.domain collectd[22862]: library_path  = find_library('rados')
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
Apr 15 15:04:18 ceph1.domain collectd[22862]: return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
Apr 15 15:04:18 ceph1.domain collectd[22862]: File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
Apr 15 15:04:18 ceph1.domain collectd[22862]: f.close()
Apr 15 15:04:18 ceph1.domain collectd[22862]: IOError: [Errno 10] No child processes
Apr 15 15:04:18 ceph1.domain collectd[22862]: ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
                                                      File "/usr/lib64/collectd/base.py", line 114, in read_callback
                                                        stats = self.get_stats()
                                                      File "/usr/lib64/collectd/ceph_pool_plugin.py", line 67, in get_stats
                                                        json_stats_data = json.loads(stats_output)
                                                      File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
                                                        return _default_decoder.decode(s)
                                                      File "/usr/lib64/python2.7/json/decoder.py", line 365, in decode
                                                        obj, end = self.raw_decode(s, idx=_w(s, 0).end())
                                                      File "/usr/lib64/python2.7/json/decoder.py", line 383, in raw_decode
                                                        raise ValueError("No JSON object could be decoded")
                                                    ValueError: No JSON object could be decoded
Apr 15 15:04:18 ceph1.domain collectd[22862]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Apr 15 15:04:18 ceph1.domain collectd[22862]: read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 120.000 seconds.

collectd.conf:

<LoadPlugin python>
  Globals true
</LoadPlugin>

<Plugin "python">
    ModulePath "/usr/lib64/collectd"

    Import "ceph_pool_plugin"

    <Module "ceph_pool_plugin">
        Verbose "True"
        Cluster "ceph"
        Interval "60"
        TestPool "rbd"
    </Module>
</Plugin>
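
The final `UnboundLocalError` in the log above looks like a secondary symptom: the JSON parsing fails first, and `stats` is then used even though it was never assigned. The sketch below reproduces that pattern in isolation. It is an assumption about the shape of the read callback, not the actual code in `base.py`; the function names `get_stats`, `read_callback_buggy`, and `read_callback_guarded` are hypothetical.

```python
import json


def get_stats(raw_output):
    """Parse command output; raises ValueError on empty or invalid JSON."""
    return json.loads(raw_output)


def read_callback_buggy(raw_output):
    # Hypothetical reconstruction of the failure pattern seen in the log.
    try:
        stats = get_stats(raw_output)
    except ValueError:
        pass  # the error is swallowed, but 'stats' was never assigned
    return stats  # raises UnboundLocalError when parsing failed


def read_callback_guarded(raw_output):
    # Same flow, but bail out early when no stats could be parsed.
    try:
        stats = get_stats(raw_output)
    except ValueError:
        return None
    return stats
```

With empty output (which is what a crashed `ceph` command would produce), the first variant raises `UnboundLocalError`, while the guarded variant returns `None` and leaves only the original parsing error to investigate.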

brynmathias commented 9 years ago

I also see this error on RHEL 7.1

[root@tapir2 python]# service collectd status
Redirecting to /bin/systemctl status collectd.service
collectd.service - Collectd statistics daemon
   Loaded: loaded (/usr/lib/systemd/system/collectd.service; enabled)
   Active: active (running) since Fri 2015-05-08 13:08:01 BST; 2min 42s ago
     Docs: man:collectd(1)
           man:collectd.conf(5)
 Main PID: 18995 (collectd)
   CGroup: /system.slice/collectd.service
           └─18995 /usr/sbin/collectd -C /etc/collectd.conf -f

May 08 13:08:01 tapir2.eng.velocix.com systemd[1]: Started Collectd statistics daemon.
May 08 13:08:01 tapir2.eng.velocix.com collectd[18995]: Initialization complete, entering read-loop.
May 08 13:08:01 tapir2.eng.velocix.com collectd[18995]: Unhandled python exception in read callback: TypeError: Dataset mutex-JOS::ApplyManager::apply_lock not found
May 08 13:08:01 tapir2.eng.velocix.com collectd[18995]: read-function of plugin `python.ceph' failed. Will suspend it for 20.000 seconds.
May 08 13:08:21 tapir2.eng.velocix.com collectd[18995]: Unhandled python exception in read callback: TypeError: Dataset mutex-JOS::ApplyManager::apply_lock not found
May 08 13:08:21 tapir2.eng.velocix.com collectd[18995]: read-function of plugin `python.ceph' failed. Will suspend it for 40.000 seconds.
May 08 13:09:01 tapir2.eng.velocix.com collectd[18995]: Unhandled python exception in read callback: TypeError: Dataset mutex-JOS::ApplyManager::apply_lock not found

My collectd.conf:

Globals true
ModulePath "/usr/lib64/collectd/python"
Import "ceph"
AdminSocket "/var/run/ceph/ceph-*.asok"

TypesDB "/usr/share/collectd/types.db" "/usr/lib64/collectd/python/ceph.types.db"

solune commented 9 years ago

I have the same problem. Have you found a workaround?

ozhanka commented 9 years ago

Hi, I have the same problem on RHEL 7.1 with the Ceph Hammer release. Does anyone have a fix or workaround for this problem?

rochaporto commented 9 years ago

I should be able to have a look next week.

ksingh7 commented 9 years ago

I am facing exactly the same issue:

[error] Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment

Collectd Logs

[2015-07-20 11:30:29] [info] ceph: collectd new data from service :: took 0 seconds
[2015-07-20 11:30:30] [error] ceph: failed to get stats :: Expecting object: line 2 column 124 (char 124) :: Traceback (most recent call last):
  File "/etc/collectd/plugins/ceph/base.py", line 114, in read_callback
    stats = self.get_stats()
  File "/etc/collectd/plugins/ceph/ceph_pool_plugin.py", line 72, in get_stats
    json_stats_data = json.loads(stats_output)
  File "/usr/lib64/python2.6/json/__init__.py", line 307, in loads
    return _default_decoder.decode(s)
  File "/usr/lib64/python2.6/json/decoder.py", line 319, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib64/python2.6/json/decoder.py", line 336, in raw_decode
    obj, end = self._scanner.iterscan(s, **kw).next()
  File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = action(m, context)
  File "/usr/lib64/python2.6/json/decoder.py", line 217, in JSONArray
    value, end = iterscan(s, idx=end, context=context).next()
  File "/usr/lib64/python2.6/json/scanner.py", line 55, in iterscan
    rval, next_pos = ac
[2015-07-20 11:30:30] [error] Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
[2015-07-20 11:30:30] [notice] read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 240.000 seconds.
[2015-07-20 11:30:41] [info] ceph: collectd new data from service :: took 13 seconds

Did anyone manage to fix this?

@rochaporto Do you have time to check this? I'd appreciate your help.

gcmalloc commented 9 years ago

I'm having the same issue here. It seems the origin is here:

Traceback (most recent call last):
  File "/usr/bin/ceph", line 896, in <module>
    retval = main()
  File "/usr/bin/ceph", line 647, in main
    conffile=conffile)
  File "/usr/lib/python2.7/site-packages/rados.py", line 212, in __init__
    library_path  = find_library('rados')
  File "/usr/lib64/python2.7/ctypes/util.py", line 244, in find_library
    return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
  File "/usr/lib64/python2.7/ctypes/util.py", line 237, in _findSoname_ldconfig
    f.close()
IOError: [Errno 10] No child processes
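
A likely explanation for `[Errno 10] No child processes` is that the process has `SIGCHLD` set to be ignored: on Linux the kernel then reaps children automatically, so a later wait on the child (which is what closing an `os.popen` pipe does in Python 2) finds nothing to wait for. The sketch below reproduces that behaviour in isolation, assuming a POSIX system; it is an illustration, not the code path inside `ctypes/util.py`, and the function name `wait_with_sigchld_ignored` is hypothetical.

```python
import os
import signal


def wait_with_sigchld_ignored():
    """Fork a child while SIGCHLD is ignored and return the errno from waitpid."""
    previous = signal.signal(signal.SIGCHLD, signal.SIG_IGN)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately
        try:
            os.waitpid(pid, 0)  # parent: try to collect the child's status
        except ChildProcessError as exc:
            return exc.errno  # ECHILD: the kernel already reaped the child
        return None
    finally:
        # Put the previous handler back so the demo has no side effects.
        signal.signal(signal.SIGCHLD, previous)
```

On Linux the returned value is `errno.ECHILD`, which is the same error number reported in the traceback.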

roadracer commented 9 years ago

Any news? I'm having the same issue on Ubuntu 14.04 with the Ceph Hammer release:

Aug 21 00:07:54 collectd collectd[17115]: ceph: failed to get stats :: No JSON object could be decoded :: Traceback (most recent call last):
  File "/usr/lib/collectd/plugins/ceph/base.py", line 108, in read_callback
    stats = self.get_stats()
  File "/usr/lib/collectd/plugins/ceph/ceph_pool_plugin.py", line 67, in get_stats
    json_stats_data = json.loads(stats_output)
  File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 384, in raw_decode
    raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
Aug 21 00:07:54 collectd collectd[17115]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Aug 21 00:07:54 collectd collectd[17115]: read-function of plugin `python.ceph_pool_plugin' failed. Will suspend it for 20.000 seconds.

solune commented 8 years ago

Has any of you managed to make it work? Or is there another Ceph collectd plugin?

yashumitsu commented 8 years ago

Hello!

This is described in a note in a man page:

You can put getsigchld.py in the scripts folder and add this line to the configuration:

<Plugin "python">
  ModulePath [..]
  Import "getsigchld"
</Plugin>
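
For readers wondering what such a module does, the idea can be sketched as below: an init callback resets `SIGCHLD` to its default disposition so that Python code in the plugin can wait on child processes again. This is a sketch of the idea, not a copy of the `getsigchld.py` shipped with collectd; the function name `restore_default_sigchld` is hypothetical, and the `ImportError` guard is only there so the file can also be loaded outside collectd.

```python
import signal


def restore_default_sigchld():
    """Reset SIGCHLD to the default disposition so children can be waited on."""
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)


try:
    import collectd  # only importable inside collectd's embedded interpreter
except ImportError:
    collectd = None

if collectd is not None:
    # Run the reset once, when collectd initialises its Python plugins.
    collectd.register_init(restore_default_sigchld)
```

Because the handler is process-wide, importing one module like this is enough for every Python plugin loaded by the same collectd instance.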

solune commented 8 years ago

It works better, yashumitsu!

But now there is a new error:

Nov 30 20:33:05 cephrr1n4 collectd[19331]: ceph: failed to get stats :: list index out of range :: Traceback (most recent call last):
  File "/opt/collectd-ceph/git/collectd-ceph/plugins/base.py", line 114, in read_callback
    stats = self.get_stats()
  File "/opt/collectd-ceph/git/collectd-ceph/plugins/ceph_latency_plugin.py", line 67, in get_stats
    data[ceph_cluster]['cluster']['stddev_latency'] = results[1]
IndexError: list index out of range
Nov 30 20:33:05 cephrr1n4 collectd[19331]: Unhandled python exception in read callback: UnboundLocalError: local variable 'stats' referenced before assignment
Nov 30 20:33:05 cephrr1n4 collectd[19331]: read-function of plugin `python.ceph_latency_plugin' failed. Will suspend it for 120.000 seconds.

yashumitsu commented 8 years ago

No thanks necessary!

The easiest way to get it working is to change the default pool name (data) to another pool that exists:

solune commented 8 years ago

It works! Thanks

mourgaya commented 8 years ago

with strace we can see that getsigchld.py
so try to copy getsigchld.py:

cp collectd-5.5.0/contrib/python/getsigchld.py /usr/lib64/python2.7/site-packages/

benh57 commented 8 years ago

Thanks for posting this fix.