threefoldtecharchive / 0-robot

Distributed life cycle management system
Apache License 2.0

Node fails to register on capacity portal, zrobot logs show error #69

Closed by siddiquagig 5 years ago

siddiquagig commented 5 years ago

One of the nodes failed to register on the capacity portal automatically. When we register it manually using ncl.capacity.register(), it shows up as online on the portal but goes back offline after 10-15 minutes.

From the zrobot container logs we found this error:

In [182]: job = zrobot.client.subscribe("zrobot")

In [183]: job.stream()
[Mon06 08:36] - .startup.py       :110 :j.zrobot_statup      - INFO     - detect if zdb data repository is configured
[Mon06 08:36] - .startup.py       :114 :j.zrobot_statup      - INFO     - no zdb data repository configuration found
[Mon06 08:36] - .startup.py       :153 :j.zrobot_statup      - INFO     - starting node robot: zrobot server start --mode node --admin-organization mazraa --god --template-repo https://github.com/threefoldtech/0-templates#master
/opt/code/github/threefoldtech/jumpscale_core/Jumpscale/data/serializers/SerializerYAML.py:48: YAMLLoadWarning: calling yaml.load() without Loader=... is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
  return yaml.load(s)
[Mon06 08:36] - robot.py          :112 :j.zerorobot          - INFO     - data directory: /opt/var/data/zrobot/zrobot_data
[Mon06 08:36] - robot.py          :113 :j.zerorobot          - INFO     - config directory: /opt/code/local/stdorg/config
[Mon06 08:36] - robot.py          :117 :j.zerorobot          - INFO     - sshkey used: /root/.ssh/hcqaqdnx
[Mon06 08:36] - robot.py          :292 :j.zerorobot          - INFO     - admin JWT authentication enabled for organization: mazraa
Traceback (most recent call last):
  File "usr/local/bin/zrobot", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/opt/code/github/threefoldtech/0-robot/cmd/zrobot", line 26, in <module>
    entry_point()
  File "/opt/code/github/threefoldtech/0-robot/zerorobot/cli/server.py", line 127, in start
    god=god)
  File "/opt/code/github/threefoldtech/0-robot/zerorobot/robot/robot.py", line 139, in start
    loader.load_services(config)
  File "/opt/code/github/threefoldtech/0-robot/zerorobot/robot/loader.py", line 16, in load_services
    tmpl_uid = TemplateUID.parse(service_details['service']['template'])
TypeError: 'NoneType' object is not subscriptable
Out[183]: 9

Node is on: ZeroOS master b1a1a737352fce69fd71de5f8cf1ae175f4bdcab
Zerotier IP: 10.102.79.248


zaibon commented 5 years ago
   tmpl_uid = TemplateUID.parse(service_details['service']['template'])
TypeError: 'NoneType' object is not subscriptable

From these lines I would say the problem is that one of the services has an empty data file on disk. An empty file loads as None, so the lookup on service_details fails with this TypeError.

The solution would be to log in to the node and check that all the service data files are intact.
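The check suggested above could be scripted roughly as follows. This is only a sketch under assumptions: it takes the data directory from the robot.py log line in the report, and it assumes the service files are YAML files matching `*.yaml`, which the thread does not confirm. The helper name `find_empty_files` is made up for illustration. It only flags files that are empty or contain nothing but whitespace, since an empty YAML document parses to None.

```python
from pathlib import Path


def find_empty_files(root, pattern="*.yaml"):
    """Return files under `root` matching `pattern` that are empty or whitespace-only."""
    root = Path(root)
    if not root.is_dir():
        # Nothing to scan if the directory does not exist
        return []
    suspicious = []
    for path in sorted(root.rglob(pattern)):
        if not path.is_file():
            continue
        # An empty YAML document parses to None, which would explain the TypeError
        if not path.read_text(errors="replace").strip():
            suspicious.append(path)
    return suspicious


if __name__ == "__main__":
    # Data directory as printed in the robot.py log; the *.yaml pattern is an assumption
    for path in find_empty_files("/opt/var/data/zrobot/zrobot_data"):
        print("empty service file:", path)
```

Any file it reports would be a candidate for the service that the loader cannot parse.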

abdulgig commented 5 years ago

Calling ncl.capacity.update_reality() fails with the error below. Could this be a problem with the mounts?

---> 23         storage = _parse_storage(disks, storage_pools)
     24         self._ressources['sru'] = storage['sru']
     25         self._ressources['hru'] = storage['hru']

/opt/code/github/threefoldtech/jumpscale_lib/JumpscaleLib/tools/capacity/reality_parser.py in _parse_storage(disks, storage_pools)
     77
     78         disk_type = sp.type
---> 79         size = sp.fsinfo['data']['used']
     80
     81         if disk_type in [StorageType.HDD, StorageType.ARCHIVE]:

/opt/code/github/threefoldtech/jumpscale_lib/JumpscaleLib/sal_zos/storage/StoragePool.py in fsinfo(self)
    153     def fsinfo(self):
    154         if self.mountpoint is None:
--> 155             raise ValueError("can't get fsinfo if storagepool is not mounted")
    156         return self.client.btrfs.info(self.mountpoint)
    157

ValueError: can't get fsinfo if storagepool is not mounted
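
The traceback shows that fsinfo raises as soon as a storage pool has no mountpoint. One way to see which pools are affected is to split them by mount state before asking for fsinfo. The snippet below is a self-contained sketch: `PoolStub` is a hypothetical stand-in for the real StoragePool class, and only the `mountpoint` attribute is taken from the traceback; the `name` field and the sample values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolStub:
    """Hypothetical stand-in for a storage pool; only `mountpoint` comes from the traceback."""
    name: str
    mountpoint: Optional[str]


def split_by_mount_state(pools):
    """Return (mounted, unmounted) lists, based on whether mountpoint is None."""
    mounted = [p for p in pools if p.mountpoint is not None]
    unmounted = [p for p in pools if p.mountpoint is None]
    return mounted, unmounted


# Sample data, invented for this sketch
pools = [PoolStub("sp_a", "/mnt/storagepools/sp_a"), PoolStub("sp_b", None)]
mounted, unmounted = split_by_mount_state(pools)
for pool in unmounted:
    print("not mounted:", pool.name)
```

On a real node the same filter would be applied to the actual storage pool objects instead of the stubs.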

abdulgig commented 5 years ago

We were able to resolve this by clearing the contents of all the robot data directories and rebooting the node.