sujiar37 / AWX-HA-InstanceGroup

Build AWX clustering on Docker Standalone Installation
MIT License

Trouble when adding new instances #28

Closed LeandroSalvas closed 4 years ago

LeandroSalvas commented 4 years ago

I made a fresh deployment of AWX-HA-InstanceGroup with just one instance (task and web). When I try to add two new instances, the new instances sometimes become unavailable, and the output of "docker logs -f build_image_task_1" shows the hosts marked as lost:

2020-06-24 17:34:03,715 DEBUG awx.main.dispatch publish awx.main.tasks.awx_periodic_scheduler(793666d6-b0ce-4550-b3a5-4d763aff4dcd, queue=awx_private_queue)
2020-06-24 17:34:03,727 DEBUG awx.main.dispatch task 793666d6-b0ce-4550-b3a5-4d763aff4dcd starting awx.main.tasks.awx_periodic_scheduler([])
2020-06-24 17:34:03,738 DEBUG awx.main.tasks Starting periodic scheduler
2020-06-24 17:34:03,741 DEBUG awx.main.tasks Last scheduler run was: 2020-06-24 17:29:12.077878+00:00
2020-06-24 17:34:13,738 DEBUG awx.main.dispatch publish awx.main.scheduler.tasks.run_task_manager(7da3cd09-2c56-4ad5-ae03-bed7c82b6450, queue=awx_private_queue)
2020-06-24 17:34:13,752 DEBUG awx.main.dispatch task 7da3cd09-2c56-4ad5-ae03-bed7c82b6450 starting awx.main.scheduler.tasks.run_task_manager([])
2020-06-24 17:34:13,754 DEBUG awx.main.scheduler Running Tower task manager.
2020-06-24 17:34:13,771 DEBUG awx.main.scheduler Starting Scheduler
2020-06-24 17:34:33,768 DEBUG awx.main.dispatch publish awx.main.tasks.cluster_node_heartbeat(f6baef1a-fad8-4595-8fe8-ac276bb358d5, queue=cctdcapllx0828)
2020-06-24 17:34:33,842 DEBUG awx.main.dispatch publish awx.main.tasks.awx_k8s_reaper(93dbb6d2-c462-41fe-898c-5d5ea8b93774, queue=cctdcapllx0828)
2020-06-24 17:34:33,854 DEBUG awx.main.dispatch task f6baef1a-fad8-4595-8fe8-ac276bb358d5 starting awx.main.tasks.cluster_node_heartbeat([])
2020-06-24 17:34:33,855 DEBUG awx.main.tasks Cluster node heartbeat task.
2020-06-24 17:34:33,860 DEBUG awx.main.dispatch task 93dbb6d2-c462-41fe-898c-5d5ea8b93774 starting awx.main.tasks.awx_k8s_reaper([])
2020-06-24 17:34:33,862 DEBUG awx.main.dispatch publish awx.main.tasks.awx_periodic_scheduler(cfc8326a-2958-4b00-b3ff-80bb0bab269f, queue=awx_private_queue)
2020-06-24 17:34:33,875 DEBUG awx.main.dispatch task cfc8326a-2958-4b00-b3ff-80bb0bab269f starting awx.main.tasks.awx_periodic_scheduler([])
2020-06-24 17:34:33,875 DEBUG awx.main.dispatch publish awx.main.scheduler.tasks.run_task_manager(fac73454-c75f-4c77-bd45-32d2fe431eed, queue=awx_private_queue)
2020-06-24 17:34:33,884 DEBUG awx.main.tasks Starting periodic scheduler
2020-06-24 17:34:33,887 DEBUG awx.main.tasks Last scheduler run was: 2020-06-24 17:29:45.704402+00:00
2020-06-24 17:34:33,888 ERROR awx.main.tasks Host cctdcapllx0830 last checked in at 2020-06-24 17:29:15.621609+00:00, marked as lost.
2020-06-24 17:34:33,894 ERROR awx.main.tasks Host cctdcapllx0831 last checked in at 2020-06-24 17:29:11.107072+00:00, marked as lost.
2020-06-24 17:34:38,985 DEBUG awx.main.dispatch task 18e6473b-216b-4e33-9e01-4cb2f4ca8a49 starting awx.main.scheduler.tasks.run_task_manager([])
2020-06-24 17:34:38,987 DEBUG awx.main.scheduler Running Tower task manager.
2020-06-24 17:34:39,005 DEBUG awx.main.scheduler Starting Scheduler
2020-06-24 17:34:53,907 DEBUG awx.main.dispatch publish awx.main.scheduler.tasks.run_task_manager(4d8a5cda-988f-4c24-8a08-1fb52a62fd39, queue=awx_private_queue)
2020-06-24 17:34:53,920 DEBUG awx.main.dispatch task 4d8a5cda-988f-4c24-8a08-1fb52a62fd39 starting awx.main.scheduler.tasks.run_task_manager(*[])
2020-06-24 17:34:53,921 DEBUG awx.main.scheduler Running Tower task manager.
2020-06-24 17:34:53,936 DEBUG awx.main.scheduler Starting Scheduler
RESULT 2 OKREADY

Has anyone seen this before?
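
For anyone triaging the same symptom, here is a minimal sketch for pulling only the lost-host events out of the task container log. The function name lost_hosts is made up for illustration, and the pattern assumes the exact "Host ... last checked in at ..., marked as lost" wording seen in the log above; other AWX versions may word the message differently.

```shell
# lost_hosts (hypothetical helper): read log text on stdin and print
# "<host> <last check-in timestamp>" for every "marked as lost" event.
# Works whether the log has one entry per line or several entries run together.
lost_hosts() {
    grep -oE 'Host [^ ]+ last checked in at [^,]+, marked as lost' |
        sed -E 's/^Host ([^ ]+) last checked in at ([^,]+), marked as lost$/\1 \2/'
}

# Possible usage against the container named in this issue:
#   docker logs build_image_task_1 2>&1 | lost_hosts
```

Comparing the printed check-in time with the timestamp of the ERROR entry shows how long a node had been silent before it was marked as lost (a little over five minutes in the log above).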

zsoterr commented 4 years ago

Yes, I can confirm that the issue exists. I got the same error after the deployment. I have not been able to fix it so far, but I found a workaround for this problem.

Here it is: I had to use the older deployment, V9.1.1, and the new instances worked normally. However, I then got an error message in the web portal (for example, when I tried to add or delete a project): "Call to api/v2/.... returned status 5000". The error messages in the logs (web and task containers) referred to rabbitmq: "Connection to broker lost, trying to re-establish connection... ....amqp.exceptions.AccessRefused: (0, 0): (403) ACCESS_REFUSED - Login was refused using authentication mechanism AMQPLAIN."

The solution for this problem: I had to update the rabbitmq-server version from 3.5 to 3.6. For example, add the rabbitmq repository (on all nodes):

cat << EOF | sudo tee /etc/yum.repos.d/rabbitmq.repo
[bintray-rabbitmq-server]
name=bintray-rabbitmq-rpm
baseurl=https://dl.bintray.com/rabbitmq/rpm/rabbitmq-server/v3.6.x/el/7/
gpgcheck=0
repo_gpgcheck=0
enabled=1
EOF

Then run these commands:

yum check-update; yum install -y rabbitmq-server-3.6.16

If you want, you can check the other available versions:

yum --showduplicates list rabbitmq-server

regards, Robert
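
To confirm that every node actually ended up on 3.6 or newer after the upgrade described above, a version check along these lines may help. This is only a sketch: version_at_least is a hypothetical helper, it relies on sort -V (available in GNU coreutils), and the commented usage assumes an RPM-based install like the one in the comment above.

```shell
# version_at_least CURRENT MINIMUM (hypothetical helper)
# Succeeds (exit 0) when CURRENT is the same as or newer than MINIMUM.
# sort -V orders version strings; if MINIMUM sorts first, CURRENT >= MINIMUM.
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Possible usage on each node (assumes rabbitmq-server was installed via rpm/yum):
#   installed=$(rpm -q --queryformat '%{VERSION}' rabbitmq-server)
#   version_at_least "$installed" 3.6 || echo "rabbitmq-server $installed is older than 3.6"
```

Running the check on all nodes guards against one node being left on the older 3.5 package.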