Second node not running jobs

loceee commented 5 years ago

This project is awesome. I had an AWX cluster with 2 nodes up, backed by RDS, in no time.

I am not seeing any jobs run on the second node though, they time out and fail. Nodes show up in the instance group and seem to be responding. Both web interfaces are up -- but jobs only run on the first node.

Am I missing something? More of an AWX question I guess. Thanks for your great work!

sujiar37 commented 5 years ago

@loceee Thank you for your comments. This is a bug actually and I might come up with a fix soon.

The problem is in the play for the RabbitMQ cluster: it only runs the join tasks on agent nodes, that is, nodes listed under the inventory group [awx_instance_group_task]. In your case, if I guess correctly, both nodes are listed under [awx_instance_group_web].

Here is a workaround until I come up with a fix:

Put the troubled node under [awx_instance_group_task] and the working node under [awx_instance_group_web].

Edit awx_ha.yml and comment out / disable the awx_ha role, like below:

#- { role: awx_ha, when: ansible_os_family == "RedHat" and ansible_distribution_major_version == "7" }

Execute the playbook again; this time the node should join the cluster.
You can also verify the cluster status by running /sbin/rabbitmqctl cluster_status on both nodes and checking whether they are already clustered.
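The verification step above could be scripted roughly as follows. This is only a sketch: the host names awx-node1 and awx-node2 are placeholders, and the NODES and RUNNER variables are hypothetical conveniences introduced here, not part of the project. It assumes SSH access to both nodes and that rabbitmqctl lives at /sbin/rabbitmqctl as mentioned in the comment.

```shell
#!/bin/sh
# Placeholder host names (assumption); replace with your own two AWX nodes.
NODES="${NODES:-awx-node1 awx-node2}"

# Print the RabbitMQ cluster status as seen from each node.
# RUNNER is a hypothetical override hook; it defaults to ssh.
check_cluster() {
  for node in $NODES; do
    echo "== $node =="
    ${RUNNER:-ssh} "$node" /sbin/rabbitmqctl cluster_status \
      || echo "cluster_status failed on $node" >&2
  done
}
```

Calling `check_cluster` after sourcing the script prints one status block per node. If both nodes report the same set of running nodes, they have joined the same cluster.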

Note: I asked you to disable the awx_ha role because I intentionally disabled the AWX web GUI for the inventory group [awx_instance_group_task]. The GUI is independent of the instance group functionality and is only needed on those nodes if someone wants load balancing.

loceee commented 5 years ago

Hey @sujiar37, thanks for a super fast response. You are correct, my goal state is a simple setup of 2x web/agent nodes --> RDS HA instance, which gives me tolerance in case of an AZ failure.

I think I understand here.

Changing the logic on https://github.com/sujiar37/AWX-HA-InstanceGroup/blob/b9cf318bafeb5bb40c6c9b2638120620c88961cc/roles/rabbitmq_cluster/tasks/join_rmq_cluster.yml#L16

and https://github.com/sujiar37/AWX-HA-InstanceGroup/blob/b9cf318bafeb5bb40c6c9b2638120620c88961cc/roles/rabbitmq_cluster/tasks/join_rmq_cluster.yml#L22

would also quickly fix my issue, right? It would join all nodes to the RabbitMQ agent cluster regardless of their membership in awx_instance_group_web.

Thanks heaps for your great work on this!

sujiar37 commented 5 years ago

@loceee Yes, that would fix your problem. However, you may have to restart the containers on both nodes if jobs are still not being picked up:

# ls 
docker-compose.yml  Dockerfile  Dockerfile.task  launch_awx.sh  launch_awx_task.sh  settings.py  system_uuid.txt

# pwd
/var/lib/awx/build_image

# docker-compose restart
Restarting build_image_task_1      ... done
Restarting build_image_memcached_1 ... done
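As a rough sketch, the restart above could be wrapped in a small helper to run on each node. The build directory is taken from the listing above; the AWX_BUILD_DIR and COMPOSE variables are hypothetical names introduced here for illustration, and the follow-up `ps` call is an addition to list the containers after the restart.

```shell
#!/bin/sh
# Build directory from the listing above; override if yours differs.
AWX_BUILD_DIR="${AWX_BUILD_DIR:-/var/lib/awx/build_image}"

restart_awx() {
  # Bail out if the compose file is missing, to avoid restarting the wrong project.
  if [ ! -f "$AWX_BUILD_DIR/docker-compose.yml" ]; then
    echo "no docker-compose.yml in $AWX_BUILD_DIR" >&2
    return 1
  fi
  # COMPOSE is a hypothetical override hook; it defaults to docker-compose.
  # The subshell keeps the caller's working directory unchanged.
  (cd "$AWX_BUILD_DIR" && ${COMPOSE:-docker-compose} restart \
    && ${COMPOSE:-docker-compose} ps)
}
```

Run `restart_awx` on both nodes, then trigger a job to confirm it is picked up.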

Once again, thanks for reporting this bug, I shall come up with a fix along with the release version of AWX 6.1.0 since that is the latest one at the moment.

loceee commented 5 years ago

I commented out those checks for awx_instance_group_web in https://github.com/sujiar37/AWX-HA-InstanceGroup/blob/b9cf318bafeb5bb40c6c9b2638120620c88961cc/roles/rabbitmq_cluster/tasks/join_rmq_cluster.yml

That seems to have solved it. Shutting off the primary node resulted in jobs flicking over to the secondary. Awesome! I moved to 6.1.0 when building this cluster and everything seems to work. I am having some issues changing the base URL in the settings; not sure if that's related to 6.1.0.

Thanks again -- will share this around. It's great!

loceee commented 5 years ago

And once the cluster was working correctly, the problem I had saving the base URL went away too! Awesome. Thanks!

sujiar37 commented 5 years ago

@loceee The fix has been merged into master, so I am closing this issue. Once again, thank you for supporting us, and please do share this project with anyone who wants to explore it.

sujiar37 / AWX-HA-InstanceGroup

Second node not running jobs #7