vmware-archive / photon-controller

Photon Controller

unable to install Photon Agent #103

Open vChrisR opened 7 years ago

vChrisR commented 7 years ago

Photon Platform seems to install fine. VIB installation on the ESXi host works OK, but when the agent starts it just logs errors. This is the log:

INFO [2017-04-01 12:52:13,667] [47537:47536:Thread-1] [agent.py:start:68] main: Startup config: {'datastores': None, 'host_service_threads': 20, 'logging_file_backup_count': 10, 'stats_store_port': 0, 'bootstrap_poll_frequency': 5, 'port': 8835, 'log_level': u'debug', 'config_path': '/etc/vmware/photon/controller', 'workers': 32, 'hostname': None, 'utilization_transfer_ratio': 9, 'no_syslog': False, 'thrift_timeout_sec': 3, 'memory_overcommit': 1.0, 'logging_file_size': 10485760, 'image_datastores': None, 'heartbeat_interval_sec': 10, 'auth_enabled': False, 'cpu_overcommit': 16.0, 'stats_store_endpoint': None, 'stats_enabled': False, 'control_service_threads': 1, 'heartbeat_timeout_factor': 6, 'wait_timeout': 10, 'host_id': None, 'logging_file': '/scratch/log/photon-controller-agent.log', 'console_log': False, 'deployment_id': None, 'management_only': False, 'stats_host_tags': None}
INFO [2017-04-01 12:52:13,795] [47537:47536:Thread-1] [plugin.py:load_plugins:112] common.plugin: Plugins found: [<agent.plugin.AgentControlPlugin object at 0x7e2f9e8c>, <host.plugin.HostPlugin object at 0x7e44a8ec>, <stats.plugin.StatsPlugin object at 0x7e46f4cc>]
INFO [2017-04-01 12:52:13,796] [47537:47536:Thread-1] [plugin.py:load_plugins:118] common.plugin: Plugin AgentControl initialized
INFO [2017-04-01 12:52:13,984] [47537:47536:Thread-1] [attache_client.py:init:105] host.hypervisor.esx.attache_client: AttacheClient init
INFO [2017-04-01 12:52:13,989] [47537:47536:Thread-1] [attache_client.py:nested:79] host.hypervisor.esx.attache_client: Enter AttacheClient.connect_local
INFO [2017-04-01 12:52:14,005] [47537:47536:Thread-1] [attache_client.py:_start_syncing_cache:135] host.hypervisor.esx.attache_client: Start attache sync vm cache thread
INFO [2017-04-01 12:52:14,016] [47537:47536:Thread-1] [attache_client.py:nested:98] host.hypervisor.esx.attache_client: Leave AttacheClient.connect_local
INFO [2017-04-01 12:52:14,016] [47537:47536:Thread-1] [attache_client.py:nested:79] host.hypervisor.esx.attache_client: Enter AttacheClient.get_all_datastores
INFO [2017-04-01 12:52:14,026] [47537:47536:Thread-1] [attache_client.py:nested:98] host.hypervisor.esx.attache_client: Leave AttacheClient.get_all_datastores
CRITICAL [2017-04-01 12:52:14,026] [47537:47536:Thread-1] [datastore_manager.py:_initialize_datastores:93] host.hypervisor.datastore_manager: Image datastore(s) [] not found in set([Datastore(tags=frozenset(['LOCAL_VMFS']), type=0, id='58df7c60-b42f5c4b-3e21-0050568a0915', name='datastore1'), Datastore(tags=frozenset(['LOCAL_VMFS']), type=0, id='58df8e54-31abb344-e93c-0050568a0915', name='local02')])
INFO [2017-04-01 12:52:14,026] [47537:47536:Thread-1] [attache_client.py:nested:79] host.hypervisor.esx.attache_client: Enter AttacheClient.get_all_datastores
INFO [2017-04-01 12:52:14,038] [47537:47536:Thread-1] [attache_client.py:nested:98] host.hypervisor.esx.attache_client: Leave AttacheClient.get_all_datastores
CRITICAL [2017-04-01 12:52:14,038] [47537:47536:Thread-1] [datastore_manager.py:_initialize_datastores:93] host.hypervisor.datastore_manager: Image datastore(s) [] not found in set([Datastore(tags=frozenset(['LOCAL_VMFS']), type=0, id='58df7c60-b42f5c4b-3e21-0050568a0915', name='datastore1'), Datastore(tags=frozenset(['LOCAL_VMFS']), type=0, id='58df8e54-31abb344-e93c-0050568a0915', name='local02')])
INFO [2017-04-01 12:52:14,039] [47537:47536:Thread-1] [image_monitor.py:init:36] host.image.image_monitor: IMAGE SCANNER: adding datastore: 58df7c60-b42f5c4b-3e21-0050568a0915
INFO [2017-04-01 12:52:14,040] [47537:47536:Thread-1] [image_monitor.py:init:36] host.image.image_monitor: IMAGE SCANNER: adding datastore: 58df8e54-31abb344-e93c-0050568a0915
INFO [2017-04-01 12:52:14,041] [47537:47536:Thread-1] [plugin.py:load_plugins:118] common.plugin: Plugin Host initialized
INFO [2017-04-01 12:52:14,041] [47537:47536:Thread-1] [stats.py:init:38] stats.stats: Stats not configured, Stats plugin will be in silent mode
INFO [2017-04-01 12:52:14,042] [47537:47536:Thread-1] [plugin.py:load_plugins:118] common.plugin: Plugin Stats initialized
INFO [2017-04-01 12:52:14,042] [47537:47536:Thread-1] [plugin.py:load_plugins:128] common.plugin: Plugin AgentControl started
INFO [2017-04-01 12:52:14,042] [47537:47536:Thread-1] [plugin.py:load_plugins:128] common.plugin: Plugin Host started
INFO [2017-04-01 12:52:14,042] [47537:47536:Thread-1] [plugin.py:load_plugins:128] common.plugin: Plugin Stats started
INFO [2017-04-01 12:52:14,043] [47537:47536:Thread-1] [agent.py:_initialize_thrift_service:136] main: Load thrift services StatsService (num_threads: 2)
INFO [2017-04-01 12:52:14,043] [47537:47536:Thread-1] [agent.py:_initialize_thrift_service:136] main: Load thrift services Host (num_threads: 20)
INFO [2017-04-01 12:52:14,043] [47537:47536:Thread-1] [agent.py:_initialize_thrift_service:136] main: Load thrift services AgentControl (num_threads: 1)
INFO [2017-04-01 12:52:14,044] [47537:47536:Thread-1] [agent.py:_initialize_thrift_service:142] main: Initialize SSLSocket using certfile=/etc/vmware/ssl/rui.crt, keyfile=/etc/vmware/ssl/rui.key, capath=/etc/vmware/ssl/
INFO [2017-04-01 12:52:14,044] [47537:47536:Thread-1] [agent.py:start:80] main: Starting the bootstrap config poll thread
INFO [2017-04-01 12:52:14,045] [47537:47536:Thread-1] [agent.py:_start_thrift_service:155] main: Listening on port 8835...
ERROR [2017-04-01 12:52:45,231] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: error while accepting
ERROR [2017-04-01 12:52:45,232] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: Traceback (most recent call last):
ERROR [2017-04-01 12:52:45,232] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: File "/var/lib/jenkins/workspace/photon-controller-python/agent/create_vib.yFQcg/vib/payloads/agent/opt/vmware/photon/controller/2.7/site-packages/thrift/server/TNonblockingServer.py", line 314, in handle
ERROR [2017-04-01 12:52:45,232] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: AttributeError: 'NoneType' object has no attribute 'handle'
ERROR [2017-04-01 12:52:46,254] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: error while accepting
ERROR [2017-04-01 12:52:46,254] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: Traceback (most recent call last):
ERROR [2017-04-01 12:52:46,254] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: File "/var/lib/jenkins/workspace/photon-controller-python/agent/create_vib.yFQcg/vib/payloads/agent/opt/vmware/photon/controller/2.7/site-packages/thrift/server/TNonblockingServer.py", line 314, in handle
ERROR [2017-04-01 12:52:46,255] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: AttributeError: 'NoneType' object has no attribute 'handle'
ERROR [2017-04-01 12:52:47,283] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: error while accepting
ERROR [2017-04-01 12:52:47,284] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: Traceback (most recent call last):
ERROR [2017-04-01 12:52:47,284] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: File "/var/lib/jenkins/workspace/photon-controller-python/agent/create_vib.yFQcg/vib/payloads/agent/opt/vmware/photon/controller/2.7/site-packages/thrift/server/TNonblockingServer.py", line 314, in handle
ERROR [2017-04-01 12:52:47,284] [47537:47536:Thread-1] [__init__.py:exception:1193] thrift.server.TNonblockingServer: AttributeError: 'NoneType' object has no attribute 'handle'

The last few log entries (the thrift "error while accepting" traceback) are repeated many times.
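
For what it's worth, the repeated AttributeError points at a weakness in thrift's TNonblockingServer: its accept path does roughly client = self.socket.accept().handle, and a TSSLSocket whose TLS handshake fails can return None from accept(), so every non-TLS probe or failed handshake blows up the same way. A minimal sketch of that failure mode (an illustration assuming that thrift behavior, not the actual agent code):

class FakeSSLServerSocket:
    # Stands in for thrift's SSL server socket: on a failed TLS handshake,
    # some thrift versions swallow the error and return None instead of a
    # connected client transport.
    def accept(self):
        return None

sock = FakeSSLServerSocket()
try:
    client = sock.accept().handle  # what TNonblockingServer's accept path does
except AttributeError as exc:
    print(exc)  # -> 'NoneType' object has no attribute 'handle'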

What stands out in the log are the two CRITICAL entries: somehow the agent came up with an empty image datastore list, even though 'local02' was configured as the image store in the photon YAML.
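
To make the two CRITICAL entries concrete: the startup config at the top of the log shows 'image_datastores': None, so the datastore manager ends up comparing an empty list of image datastores against the datastores the host actually reports. A minimal sketch of that check (an illustration inferred from the log message, not the photon-controller source):

# What the host reports, per the CRITICAL message.
available = {"datastore1", "local02"}

# What the agent was provisioned with ('image_datastores': None).
configured_image_datastores = []

missing = [ds for ds in configured_image_datastores if ds not in available]
if not configured_image_datastores or missing:
    print("CRITICAL: Image datastore(s) %s not found in %s"
          % (configured_image_datastores, sorted(available)))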

Below is my photon yaml:

compute:
  hypervisors:
    vesxi60:
      hostname: "vesxi60"
      ipaddress: "192.168.192.6"
      dns: "192.168.192.78"
      credential:
        username: "root"
        password: "password"
    vesxi60c01:
      hostname: "vesxi60c01"
      ipaddress: "192.168.192.23"
      dns: "192.168.192.78"
      credential:
        username: "root"
        password: "password"

lightwave:
  domain: "photon.lab"
  credential:
    username: "administrator"
    password: "Passw0rd123!"
  controllers:
    lightwave:
      site: "homelab"
      appliance:
        hostref: "vesxi60"
        datastore: "local02"
        memoryMb: 2048
        cpus: 2
        credential:
          username: "root"
          password: "password"
        network-config:
          network: "NAT=VM Network"
          type: "static"
          hostname: "lightwave.photon.lab"
          ipaddress: "192.168.192.78"
          dns: "192.168.192.21"
          ntp: "nl.pool.ntp.org"
          netmask: "255.255.255.0"
          gateway: "192.168.192.1"
photon:
  imagestore:
    img-store-1:
      datastore: "local02"
      enableimagestoreforvms: "true"
  cloud:
    hostref1: "vesxi60c01"
  controllers:
    pc:
      appliance:
        hostref: "vesxi60"
        datastore: "local02"
        memoryMb: 2048
        cpus: 2
        credential:
          username: "root"
          password: "password"
        network-config:
          network: "NAT=VM Network"
          type: "static"
          hostname: "pc.photon.lab"
          ipaddress: "192.168.192.77"
          netmask: "255.255.255.0"
          dns: "192.168.192.78"
          ntp: "95.211.160.148"
          gateway: "192.168.192.1"
loadBalancer:
  plb:
    appliance:
      hostref: "vesxi60"
      datastore: "local02"
      credential:
        username: "root"
        password: "password"
      network-config:
        network: "NAT=VM Network"
        type: "static"
        hostname: "plb.photon.lab"
        ipaddress: "192.168.192.76"
        netmask: "255.255.255.0"
        dns: "192.168.192.78"
        ntp: "nl.pool.ntp.org"
        gateway: "192.168.192.1"

and here is the photon-installer.log:

2017-04-01 12:53:27 DEBUG headers:124 - http-outgoing-17 << HTTP/1.1 200 OK
2017-04-01 12:53:27 DEBUG headers:127 - http-outgoing-17 << Server: Apache-Coyote/1.1
2017-04-01 12:53:27 DEBUG headers:127 - http-outgoing-17 << Cache-Control: no-store
2017-04-01 12:53:27 DEBUG headers:127 - http-outgoing-17 << Pragma: no-cache
2017-04-01 12:53:27 DEBUG headers:127 - http-outgoing-17 << Content-Type: application/json;charset=UTF-8
2017-04-01 12:53:27 DEBUG headers:127 - http-outgoing-17 << Content-Length: 3800
2017-04-01 12:53:27 DEBUG headers:127 - http-outgoing-17 << Date: Sat, 01 Apr 2017 12:53:27 GMT
2017-04-01 12:53:27 DEBUG MainClientExec:284 - Connection can be kept alive indefinitely
2017-04-01 12:53:27 DEBUG PoolingHttpClientConnectionManager:314 - Connection [id: 17][route: {s}->https://192.168.192.78:443] can be kept alive indefinitely
2017-04-01 12:53:27 DEBUG PoolingHttpClientConnectionManager:320 - Connection released: [id: 17][route: {s}->https://192.168.192.78:443][total kept alive: 1; route allocated: 1 of 2; total allocated: 1 of 20]
2017-04-01 12:53:27 DEBUG PoolingHttpClientConnectionManager:388 - Connection manager is shutting down
2017-04-01 12:53:27 DEBUG DefaultManagedHttpClientConnection:81 - http-outgoing-17: Close connection
2017-04-01 12:53:27 DEBUG PoolingHttpClientConnectionManager:394 - Connection manager shut down
2017-04-01 12:53:27 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:53:28 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:53:30 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:53:31 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:53:32 DEBUG PoolingHttpClientConnectionManager:388 - Connection manager is shutting down
2017-04-01 12:53:32 DEBUG PoolingHttpClientConnectionManager:394 - Connection manager shut down
2017-04-01 12:53:32 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:53:33 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:53:34 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:53:35 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557

The last lines are repeated many times, and then:

2017-04-01 12:54:32 DEBUG ControllerInstaller:736 - https://192.168.192.77:9000/tasks/434a3171-178c-4e3c-a72a-b40b72394557
2017-04-01 12:54:33 ERROR ControllerInstaller:540 - Failed in provisioning host: vesxi60c01
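
Reading the tail of the installer log, ControllerInstaller polls the task URL about once a second until the task reaches a terminal state, and after roughly a minute it gives up with "Failed in provisioning host". A hedged sketch of that loop (endpoint shape, field names, and the timeout are assumptions inferred from the log, not the installer's code):

import time
import requests

# Task URL copied from the log; verify=False because the deployment
# uses self-signed certificates.
TASK_URL = ("https://192.168.192.77:9000/tasks/"
            "434a3171-178c-4e3c-a72a-b40b72394557")

deadline = time.time() + 60  # the log shows roughly a one-minute window
while time.time() < deadline:
    task = requests.get(TASK_URL, verify=False).json()
    if task.get("state") in ("COMPLETED", "ERROR"):
        break  # terminal state reached
    time.sleep(1)  # matches the one-second polling cadence in the log
else:
    # No terminal state before the deadline.
    raise RuntimeError("Failed in provisioning host: vesxi60c01")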

tactical-drone commented 7 years ago

@vChrisR I went through it quickly. That is good work.

Would love one covering a redundant setup using 5 servers.

Then one on Kubernetes; I am having huge headaches trying to bootstrap the cluster.

I found a way to hack the installer with the fixes; a rough sketch of the disk-conversion steps follows the list:

  1. Extract the appliance VMDKs from the installer to your dev machine.
  2. Get vmware-vdiskmanager and guestmount.
  3. Convert the VMDK to type 0 using vmware-vdiskmanager.
  4. Mount the VMDK in read-write mode using guestmount.
  5. Apply the lib64 symlink on Lightwave; on the controller, edit /etc/esxcloud/run.sh and change KEYSTORE_PASSWORD=photon-controller.
  6. Load some newly generated certs for the load balancer in the same way, with the correct CN.
  7. Convert the image back to type 5 with vmware-vdiskmanager.
  8. Copy the VMDK back into the installer and just use it. Don't bother with the checksums; they don't check out anyway.
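
Roughly, steps 3, 4, and 7 look like this (a sketch, not a tested procedure; all paths are placeholders, and types 0 and 5 are vmware-vdiskmanager's growable monolithic and compressed stream-optimized disk formats, respectively):

import subprocess

SRC = "appliance-disk1.vmdk"      # placeholder for a disk extracted in step 1
WORK = "work.vmdk"                # mountable intermediate copy
OUT = "appliance-disk1-new.vmdk"  # result to copy back in step 8
MNT = "/mnt/appliance"            # placeholder mount point; must exist

# Step 3: convert to type 0 (growable monolithic) so the disk can be mounted.
subprocess.check_call(["vmware-vdiskmanager", "-r", SRC, "-t", "0", WORK])

# Step 4: mount read-write with libguestfs' guestmount (usually needs root).
subprocess.check_call(["guestmount", "-a", WORK, "-i", "--rw", MNT])

# ... steps 5 and 6 happen here: fix the lib64 symlink, edit
# /etc/esxcloud/run.sh, drop in the regenerated load-balancer certs ...

subprocess.check_call(["guestunmount", MNT])

# Step 7: convert back to type 5 (compressed, stream-optimized), the format
# the installer's appliance disks ship in, then copy OUT back (step 8).
subprocess.check_call(["vmware-vdiskmanager", "-r", WORK, "-t", "5", OUT])
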
LIV2 commented 7 years ago

So is 1.1.1 just broken or what? I'm unable to deploy against my 6.0u3 cluster but can deploy 1.2 against a 6.5 cluster with no issues.

I just get endless logs like this on my hosts:

[2017-05-18 06:12:44,964] [40773:40770:Thread-1] [TNonblockingServer.py:handle:318] thrift.server.TNonblockingServer: error while accepting
ERROR [2017-05-18 06:12:44,964] [40773:40770:Thread-1] [TNonblockingServer.py:handle:318] thrift.server.TNonblockingServer: Traceback (most recent call last):
ERROR [2017-05-18 06:12:44,964] [40773:40770:Thread-1] [TNonblockingServer.py:handle:318] thrift.server.TNonblockingServer: File "/var/lib/jenkins/workspace/photon-controller-python/agent/create_vib.yFQcg/vib/payloads/agent/opt/vmware/photon/controller/2.7/site-packages/thrift/server/TNonblockingServer.py", line 314, in handle
ERROR [2017-05-18 06:12:44,965] [40773:40770:Thread-1] [TNonblockingServer.py:handle:318] thrift.server.TNonblockingServer: AttributeError: 'NoneType' object has no attribute 'handle'
ERROR [2017-05-18 06:14:16,863] [40773:40770:Thread-1] [TNonblockingServer.py:handle:318] thrift.server.TNonblockingServer: error while accepting
ERROR [2017-05-18 06:14:16,863] [40773:40770:Thread-1] [TNonblockingServer.py:handle:318] thrift.server.TNonblockingServer: Traceback (most recent call last):
ERROR [2017-05-18 06:14:16,864] [40773:40770:Thread-1] [TNonblockingServer.py:handle:318] thrift.server.TNonblockingServer: File "/var/lib/jenkins/workspace/photon-controller-python/agent/create_vib.yFQcg/vib/payloads/agent/opt/vmware/photon/controller/2.7/site-packages/thrift/server/TNonblockingServer.py", line 314, in handle

LIV2 commented 7 years ago

Yep, downgrading to build 4600944 fixed it for me...

tactical-drone commented 7 years ago

@LIV2 1.1.1 works perfectly. But you need ESXi 6.5.

LIV2 commented 7 years ago

@pompomJuice Well, the release notes for 1.1.1 state the following:

VMware Photon Platform 1.1.1 Release Notes
Supported Platforms:
• VMware ESXi 6.0.0 Patch 201611001 (ESXi600-201611001), which you can find by searching for patches for ESXi 6.0.0 on My VMware Product Patches at https://my.vmware.com/group/vmware/patch.
• Lightwave version 1.0.3.
• NSX-T version 1.1.
• vSAN for Photon Platform version 1.1.
• ESXi 6.5 is unsupported at this time.

Indeed, from my testing it works only on 6.0 Patch 201611001. It does not work at all on newer versions; it just hangs while joining the ESXi host to the cluster.

It's pretty disappointing, as I'm evaluating this for a production system. The documentation claims that this software is production ready, but if you can't even run up-to-date ESXi with all the latest security patches, I fail to see how that is the case.

tactical-drone commented 7 years ago

@LIV2 My bad: 1.2.0 needs ESXi 6.5, not 1.1.1.