vmware / container-service-extension

Container Service for VMware vCloud Director
https://vmware.github.io/container-service-extension

cse.service failed to start #681

Open ChandraRatra opened 4 years ago

ChandraRatra commented 4 years ago

Issue 1: I am installing CSE 2.6.1 and template deployment is complete. After that, when I try to start cse.service, it fails.

systemctl status cse

● cse.service - Container Service Extension for VMware vCloud Director
   Loaded: loaded (/etc/systemd/system/cse.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sat 2020-08-01 18:04:14 UTC; 17s ago
  Process: 770 ExecStart=/home/vmware/cse.sh (code=exited, status=1/FAILURE)
 Main PID: 770 (code=exited, status=1/FAILURE)

Aug 01 18:04:14 systemd[1]: cse.service: Main process exited, code=exited, status=1/FAILURE
Aug 01 18:04:14 systemd[1]: cse.service: Failed with result 'exit-code'.
Aug 01 18:04:14 systemd[1]: cse.service: Service RestartSec=100ms expired, scheduling restart.
Aug 01 18:04:14 systemd[1]: cse.service: Scheduled restart job, restart counter is at 5.
Aug 01 18:04:14 systemd[1]: Stopped Container Service Extension for VMware vCloud Director.
Aug 01 18:04:14 systemd[1]: cse.service: Start request repeated too quickly.
Aug 01 18:04:14 systemd[1]: cse.service: Failed with result 'exit-code'.
Aug 01 18:04:14 systemd[1]: Failed to start Container Service Extension for VMware vCloud Director.

When I check the list of dependencies for cse.service with

systemctl list-dependencies cse.service

I get a red dot against systemd-networkd-wait-online.service.

Any idea what is causing this issue?

ChandraRatra commented 4 years ago

When I run the command below manually, it completes without any error:

cse run --config encrypted-config.yaml

Required Python version: >= 3.7.3
Installed Python version: 3.7.3 (default, Aug 1 2020, 08:50:56) [GCC 7.3.0]
Password for config file decryption:
Decrypting 'encrypted-config.yaml'
Validating config file 'encrypted-config.yaml'
Connected to AMQP server (X.X.X.X:5672)
InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
Connected to vCloud Director (X.X.X.X:443)
Connected to vCenter Server 'X.X.X.X' as 'administrator@vsphere.local' (X.X.X.X:443)
Config file 'encrypted-config.yaml' is valid
Loading k8s template definition from catalog
Found K8 template 'photon-v2_k8-1.14_weave-2.5.2' at revision 2 in catalog 'cse261'
Found K8 template 'ubuntu-16.04_k8-1.15_weave-2.5.2' at revision 3 in catalog 'cse261'
Found K8 template 'ubuntu-16.04_k8-1.16_weave-2.6.0' at revision 1 in catalog 'cse261'
Found K8 template 'ubuntu-16.04_k8-1.17_weave-2.6.0' at revision 1 in catalog 'cse261'
Processing compute policy for k8s templates.
Removing compute policy from template 'photon-v2_k8-1.14_weave-2.5.2_rev2'.
Removing compute policy from template 'ubuntu-16.04_k8-1.15_weave-2.5.2_rev3'.
Removing compute policy from template 'ubuntu-16.04_k8-1.16_weave-2.6.0_rev1'.
Removing compute policy from template 'ubuntu-16.04_k8-1.17_weave-2.6.0_rev1'.
Validating CSE installation according to config file
AMQP exchange 'CSE' exists
CSE on vCD is currently enabled
Found catalog 'cse261'
CSE installation is valid
Started thread 'MessageConsumer-0 (139737667303168)'
Started thread 'MessageConsumer-1 (139737658648320)'
Started thread 'MessageConsumer-2 (139737650255616)'
Started thread 'MessageConsumer-3 (139737641862912)'
Started thread 'MessageConsumer-4 (139737432061696)'
Started thread 'MessageConsumer-5 (139737423668992)'
Started thread 'MessageConsumer-6 (139737415276288)'
Started thread 'MessageConsumer-7 (139737406883584)'
Started thread 'MessageConsumer-8 (139737398490880)'
Started thread 'MessageConsumer-9 (139737390098176)'
Container Service Extension for vCloud Director
Server running using config file: encrypted-config.yaml
Log files: cse-logs/cse-server-info.log, cse-logs/cse-server-debug.log
waiting for requests (ctrl+c to close)


When I tried running the CSE server as a service, I got the error: Failed to start Container Service Extension for VMware vCloud Director.

systemctl status cse

● cse.service - Container Service Extension for VMware vCloud Director
   Loaded: loaded (/etc/systemd/system/cse.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2020-08-02 04:43:54 UTC; 567ms ago
 Main PID: 738 (bash)
    Tasks: 2 (limit: 2394)
   Memory: 32.3M
   CGroup: /system.slice/cse.service
           ├─738 bash /home/vmware/cse.sh
           └─739 /usr/local/bin/python3.7 /usr/local/bin/cse run --config /home/vmware/encrypted-config.yaml

Aug 02 04:43:54 systemd[1]: cse.service: Service RestartSec=100ms expired, scheduling restart.
Aug 02 04:43:54 systemd[1]: cse.service: Scheduled restart job, restart counter is at 4.
Aug 02 04:43:54 systemd[1]: Stopped Container Service Extension for VMware vCloud Director.
Aug 02 04:43:54 systemd[1]: Started Container Service Extension for VMware vCloud Director.

systemctl status cse

● cse.service - Container Service Extension for VMware vCloud Director
   Loaded: loaded (/etc/systemd/system/cse.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Sun 2020-08-02 04:43:55 UTC; 24s ago
  Process: 738 ExecStart=/home/vmware/cse.sh (code=exited, status=1/FAILURE)
 Main PID: 738 (code=exited, status=1/FAILURE)

Aug 02 04:43:54 systemd[1]: cse.service: Main process exited, code=exited, status=1/FAILURE
Aug 02 04:43:54 systemd[1]: cse.service: Failed with result 'exit-code'.
Aug 02 04:43:55 systemd[1]: cse.service: Service RestartSec=100ms expired, scheduling restart.
Aug 02 04:43:55 systemd[1]: cse.service: Scheduled restart job, restart counter is at 5.
Aug 02 04:43:55 systemd[1]: Stopped Container Service Extension for VMware vCloud Director.
Aug 02 04:43:55 systemd[1]: cse.service: Start request repeated too quickly.
Aug 02 04:43:55 systemd[1]: cse.service: Failed with result 'exit-code'.
Aug 02 04:43:55 systemd[1]: Failed to start Container Service Extension for VMware vCloud Director.

ChandraRatra commented 4 years ago

After disabling the proxy on the second NIC, systemd-networkd-wait-online.service is now green, but cse.service still fails to start.

I observed the same issue after upgrading from CSE 2.5.1 to CSE 2.6.1.

ChandraRatra commented 4 years ago

To be clear, I am able to start the CSE server manually; the only issue is when I try to run the CSE server as a service.

Anirudh9794 commented 4 years ago

Hello, Can you please confirm if the cse.service file has references to /home/vmware/cse.sh or /root/cse.sh?

ChandraRatra commented 4 years ago

cse.service file has references to /home/vmware/cse.sh

cat cse.service
[Unit]
Description=Container Service Extension for VMware vCloud Director
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/home/vmware/cse.sh
Type=simple
User=root
WorkingDirectory=/home/vmware
Restart=always

[Install]
WantedBy=multi-user.target

and below are the details of the cse.sh file

ChandraRatra commented 4 years ago

Issue 2: In the case of the new installation of CSE 2.6.1, I started the CSE server manually and tried to deploy clusters from the Photon and Ubuntu templates. In both cases I get the error below: the master node is deployed and the error occurs on the worker node. To be clear, the CSE server has no direct internet connectivity; I am using a proxy for internet access. It seems like something is missing from node.sh. Can you please look into this issue as well?



The Photon cluster deployment was started using template photon-v2_k8-1.14_weave-2.5.2. The task failed with the error message below:

cluster operation: Error creating cluster 'PH-CSECLS01'. Deleting cluster (rollback=True)
task: 928d3c83-415e-4c6a-a5ce-aa10715f41d7, result: error, message: Join cluster script execution failed on worker node ['node-e09n']:["/tmp/0b46a1fa-d785-11ea-b009-005056010df3.sh: line 5: $'\240': command not found\n"]


Below is the node.sh file for the Photon template (X.X.X.X = proxy IP address):

cat node.sh
#!/usr/bin/env bash

set -e

mkdir /etc/systemd/system/docker.service.d

echo '[Service]' >> /etc/systemd/system/docker.service.d/http-proxy.conf
echo 'Environment="HTTP_PROXY=http://X.X.X.X:8080"' >> /etc/systemd/system/docker.service.d/http-proxy.conf
echo '[Service]' >> /etc/systemd/system/docker.service.d/https-proxy.conf
echo 'Environment="HTTPS_PROXY=http://X.X.X.X:8080"' >> /etc/systemd/system/docker.service.d/https-proxy.conf

systemctl daemon-reload
systemctl restart docker

while [ `systemctl is-active docker` != 'active' ]; do echo 'waiting for docker'; sleep 5; done
kubeadm join --token {token} {ip}:6443 --discovery-token-unsafe-skip-ca-verification


Below are the details from cse-server-info.log:

20-08-06 01:11:33 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:11:40 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:11:56 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:13:08 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:29:11 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:29:30 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:30:53 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:31:18 | WARNING :: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised.
20-08-06 01:35:14 | INFO :: Error creating cluster 'PH-CSECLS01'. Deleting cluster (rollback=True)
20-08-06 01:35:36 | ERROR :: Error creating cluster 'PH-CSECLS01'
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/container_service_extension/vcdbroker.py", line 901, in _create_cluster_async
    template[LocalTemplateKey.REVISION])
  File "/usr/local/lib/python3.7/site-packages/container_service_extension/vcdbroker.py", line 1774, in join_cluster
    f"Join cluster script execution failed on worker node "
container_service_extension.exceptions.ScriptExecutionError: Join cluster script execution failed on worker node ['node-e09n']:["/tmp/0b46a1fa-d785-11ea-b009-005056010df3.sh: line 5: $'\240': command not found\n"]



The Ubuntu cluster deployment was started using template ubuntu-16.04_k8-1.17_weave-2.6.0. The task failed with the error message below:

Error creating cluster 'UB-CSECLS01'. Deleting cluster (rollback=True)
task: 026c40ff-2de8-42e6-bf3c-ea444661f3da, result: error, message: Join cluster script execution failed on worker node ['node-8rxn']:["/tmp/cca0838c-d70a-11ea-b3a6-005056010df3.sh: line 4: $'\240': command not found\n"]
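
In both error messages, `$'\240'` is how bash prints the single byte 0xA0 (octal 240). That byte is a non-breaking space in Latin-1 and the second byte of the UTF-8 non-breaking space (0xC2 0xA0). Such bytes usually arrive when a script is pasted from a web page or a rich-text editor, and the shell treats them as a command name rather than as whitespace. The following is a minimal sketch, not part of CSE, for finding and replacing them; the function names `find_nbsp` and `strip_nbsp` are hypothetical, and GNU grep and GNU sed are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of CSE) for spotting stray non-breaking
# spaces in a shell script. Assumes GNU grep and GNU sed.

# Print "line-number:content" for every line of file $1 that contains
# byte 0xA0 (octal 240). LC_ALL=C makes grep match raw bytes, and -a
# keeps it from treating the file as binary.
find_nbsp() {
  LC_ALL=C grep -an "$(printf '\240')" "$1"
}

# Copy stdin to stdout, replacing the UTF-8 non-breaking space
# (octal 302 240) and any remaining bare 0xA0 byte with a plain space.
strip_nbsp() {
  LC_ALL=C sed -e "s/$(printf '\302\240')/ /g" -e "s/$(printf '\240')/ /g"
}
```

For example, `find_nbsp node.sh` lists the offending lines, and `strip_nbsp < node.sh > node.clean.sh` writes a cleaned copy that can be reviewed before use.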

ChandraRatra commented 4 years ago

Issue 3: In the case of the upgrade from CSE 2.5.1 to 2.6.1, even after manually starting the CSE server, I am not able to deploy a cluster from the command line. I get the error message below:

vcd cse cluster create PH-CSCLS02 --template-name photon-v2_k8-1.14_weave-2.5.2 --template-revision 2 --nodes 1 --network ORG-NW05
Usage: vcd cse cluster create [OPTIONS] NAME
Try "vcd cse cluster create -h" for help.

Error: External service 'cse' failed to respond in the specified timeout (40 SECONDS)


And when I try to add a cluster from the vCD UI, the Create New Cluster page keeps showing a spinning wheel.

Anirudh9794 commented 4 years ago

Issue 1: cannot run CSE as a service
Can you please double-check that the paths mentioned in the cse.service and cse.sh files are correct? The cse.sh is using an encrypted config without a password variable set. You can take a look at the example cse.sh at https://github.com/vmware/container-service-extension/blob/master/cse.sh to see how to set the password environment variable.
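
One way to make this kind of failure visible in the journal is to have the wrapper script stop with an explicit message when the password variable is missing. The sketch below is an illustration under assumptions, not the official cse.sh: it assumes the variable is named `CSE_CONFIG_PASSWORD` (the name that appears as the EnvironmentFile name later in this thread), and `check_password_var` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical guard for a service wrapper script (not the official cse.sh).
# Assumption: the decryption password is passed in CSE_CONFIG_PASSWORD.

# Return 0 if CSE_CONFIG_PASSWORD is set and non-empty; otherwise print a
# message to stderr (which systemd records in the journal) and return 1.
check_password_var() {
  if [ -z "${CSE_CONFIG_PASSWORD:-}" ]; then
    echo "CSE_CONFIG_PASSWORD is not set; cannot decrypt the config file" >&2
    return 1
  fi
}

# Intended use inside the wrapper, before starting the server:
#   check_password_var || exit 1
#   cse run --config /home/vmware/encrypted-config.yaml
```

With a guard like this, the reason for an early exit should show up in the service's journal output instead of only as `status=1/FAILURE`.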

Issue 2: problem with node.sh
I think the node.sh you are using has been modified, and the error is in a line that was added. I guess we can't help there, as the supported node.sh was modified.

Issue 3: cluster create not going through
Can you please give the steps followed to start the CSE server?

ChandraRatra commented 4 years ago

Issue 1: Sure, I will update cse.sh and cse.service and then confirm back.

Issue 2: In my environment, internet connectivity is available only through a proxy. If I use the default node.sh:

#!/usr/bin/env bash

set -e
while [ `systemctl is-active docker` != 'active' ]; do echo 'waiting for docker'; sleep 5; done
kubeadm join --token {token} {ip}:6443 --discovery-token-unsafe-skip-ca-verification

the cluster gets deployed, but the worker node status is NotReady:

kubectl get nodes
NAME        STATUS     ROLES    AGE   VERSION
mstr-s501   Ready      master   30m   v1.14.6
node-2mz7   NotReady   <none>   23m   v1.14.6

Issue 3: Steps used to start the CSE server manually:

cse run --config encrypted-config.yaml

ChandraRatra commented 4 years ago

Regarding Issue 1: I updated cse.sh and cse.service for both my upgraded and new installations. Now I am able to start the CSE server as a service.

Below are the details from cse.sh:

root@CSEVM251 [ ~ ]# cat cse.sh
CSE_CONFIG_PATH=/root/encrypted-config.yaml
cse run --config $CSE_CONFIG_PATH


Below are the details from cse.service:

cat /etc/systemd/system/cse.service
[Unit]
Description=Container Service Extension for VMware vCloud Director
Wants=network-online.target,rabbitmq-server.service
After=network-online.target,rabbitmq-server.service

[Service]
ExecStart=/root/cse.sh
User=root
WorkingDirectory=/root
Type=simple
Restart=always
EnvironmentFile=/home/vmware/CSE_CONFIG_PASSWORD

[Install]
WantedBy=default.target


The file referenced by EnvironmentFile=/home/vmware/CSE_CONFIG_PASSWORD is now a plain text file, and anyone with access to the CSE VM can read the password used to decrypt config.yaml.

Can you please suggest how to secure the EnvironmentFile?

Issue 2: In my environment, internet connectivity is available only through a proxy. If I use the default node.sh:

#!/usr/bin/env bash

set -e
while [ `systemctl is-active docker` != 'active' ]; do echo 'waiting for docker'; sleep 5; done
kubeadm join --token {token} {ip}:6443 --discovery-token-unsafe-skip-ca-verification

the cluster gets deployed, but the worker node status is NotReady:

kubectl get nodes
NAME        STATUS     ROLES    AGE   VERSION
mstr-s501   Ready      master   30m   v1.14.6
node-2mz7   NotReady   <none>   23m   v1.14.6

Issue 4: After upgrading CSE from 2.5.1 to 2.6.1, I tried to upgrade an existing cluster that was created with CSE 2.5.1, but the task failed:

]# vcd cse cluster upgrade PH-CSECLS01 photon-v2_k8-1.14_weave-2.5.2 2
cluster operation: Upgrading cluster 'PH-CSECLS01' software to match template photon-v2_k8-1.14_weave-2.5.2 (revision 2): Kubernetes: 1.14.6 -> 1.14.6, Docker-CE: 18.06.2 -> 18.06.2-6, CNI: weave 2.5.2 -> 2.5.2
cluster operation: Draining master node ['mstr-ij3f']
cluster operation: Upgrading Kubernetes (1.14.6 -> 1.14.6) in master node ['mstr-ij3f']
cluster operation: Uncordoning master node ['mstr-ij3f']
cluster operation: Draining node node-14wi
task: 7425b108-9cfc-48af-8849-36436c7fe115, result: error, message: Unexpected error while upgrading cluster 'PH-CSECLS01': Script execution failed on node ['mstr-ij3f'] Errors: ['Error from server (NotFound): nodes "node-14wi" not found\n']

However, the node is visible from both the command line and the UI:

]# vcd cse node list PH-CSECLS01
ipAddress      name
-------------  ---------
192.168.0.203  mstr-ij3f
192.168.0.202  node-14wi

]# vcd cse cluster info PH-CSECLS01
property            value
------------------  -----
cluster_id          960a7475-c0fe-4447-906b-77dc438c5a3c
cni                 weave
cni_version         2.5.2
cse_version         2.5.1
docker_version      18.06.2
k8s_provider        native
kubernetes          upstream
kubernetes_version  1.14.6
leader_endpoint     192.168.0.203
master_nodes        {'name': 'mstr-ij3f', 'ipAddress': '192.168.0.203'}
name                PH-CSECLS01
nfs_nodes
nodes               {'name': 'node-14wi', 'ipAddress': '192.168.0.202'}
number_of_vms       2
os                  photon-v2
status              POWERED_ON
template_name       photon-v2_k8-1.14_weave-2.5.2
template_revision   1
vapp_href           https://vcdlab.lab65.local/api/vApp/vapp-87c8bef0-99d4-44eb-8145-fbba14225040
vapp_id             87c8bef0-99d4-44eb-8145-fbba14225040
vdc_href            https://vcdlab.lab65.local/api/vdc/c58af5a0-b002-43c5-8f66-8152f21e1a19
vdc_id              c58af5a0-b002-43c5-8f66-8152f21e1a19
vdc_name            ORG05-VDC


ChandraRatra commented 4 years ago

Can you please suggest how to secure the EnvironmentFile? Currently it is a plain text file.
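
A common hardening step, offered here as a sketch rather than official CSE guidance, is to make the file readable by its owner only and to have root own it. Since the cse.service shown earlier runs with User=root, a root-owned file with mode 600 remains readable by the service. The helper name `write_secret_env` is hypothetical, and `CSE_CONFIG_PASSWORD` as the variable name is an assumption based on the file name used in this thread. Anyone with root on the VM can still read the file, so this limits exposure rather than removing it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a KEY=value environment file with mode 600.

write_secret_env() {
  # $1 = target path, $2 = variable name, $3 = secret value
  (
    umask 077                          # new files are created owner-only
    printf '%s=%s\n' "$2" "$3" > "$1"
  ) && chmod 600 "$1"                  # also tightens a pre-existing file
}

# Intended use, run as root (path and variable name are assumptions):
#   write_secret_env /home/vmware/CSE_CONFIG_PASSWORD CSE_CONFIG_PASSWORD 'the-password'
#   chown root:root /home/vmware/CSE_CONFIG_PASSWORD
```

Typing the password as an argument leaves it in shell history, so reading it from a prompt and passing the variable is the more careful variant of the same idea.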

ChandraRatra commented 4 years ago

For Issue 2: can you please suggest what changes I have to make in node.sh, considering my scenario where internet connectivity is available only through a proxy?
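
As a sketch only, and not a supported node.sh: Docker's documented way to set a proxy under systemd is a drop-in file in /etc/systemd/system/docker.service.d/ with `Environment=` lines in a `[Service]` section, followed by `systemctl daemon-reload` and a Docker restart. The function below writes such a file with a here-document instead of chained `echo` lines, which makes stray characters easier to spot. The function name `write_docker_proxy_dropin`, the proxy URL, and the `NO_PROXY` list are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of CSE): write a Docker proxy drop-in file.

write_docker_proxy_dropin() {
  # $1 = drop-in directory, $2 = proxy URL, $3 = comma-separated NO_PROXY list
  mkdir -p "$1" || return 1
  cat > "$1/http-proxy.conf" <<EOF
[Service]
Environment="HTTP_PROXY=$2"
Environment="HTTPS_PROXY=$2"
Environment="NO_PROXY=$3"
EOF
}

# On a node this would be followed by:
#   systemctl daemon-reload
#   systemctl restart docker
```

A call might look like `write_docker_proxy_dropin /etc/systemd/system/docker.service.d http://X.X.X.X:8080 localhost,127.0.0.1`. Because node.sh contains `{token}` and `{ip}` placeholders, literal braces elsewhere in that file may interfere with however CSE fills them in; that is an untested assumption, so inside node.sh the commands would be safer written inline without the function wrapper.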