Heat stuck at bastion

ghost commented 6 years ago

Hi Everybody,

I am trying to deploy OCP 3.5 (and also 3.7) on Red Hat OSP 11. When I run the Heat template, it creates the stack, all the necessary networks, and the bastion. The bastion goes through the usual cloud-init provisioning steps (adding repos, updating, installing basic packages), then cloud-init sends the finished signal and gets an HTTP 200 back.

After that, it gets stuck at:

$> openstack stack resource list -n 2 ocp2 | grep -i progress
| bastion_host                    | 98bd1fee-87c3-4360-bd4b-549e39d1345e | file:///Users/myself/projects/openshift-on-openstack/bastion.yaml                                              | CREATE_IN_PROGRESS | 2017-12-21T16:00:41Z | ocp2                                                     |
| deployment_write_templates      | c8be1435-3125-4e06-8234-b620dd556fa8 | OS::Heat::SoftwareDeployment                                                                                        | CREATE_IN_PROGRESS | 2017-12-21T16:01:12Z | ocp2-bastion_host-n4vsl5fz4maw                           |
| deployment_update_node_count    | 79327e5c-579d-4a95-a0b4-e93c52385afd | OS::Heat::SoftwareDeployment                                                                                        | CREATE_IN_PROGRESS | 2017-12-21T16:01:12Z | ocp2-bastion_host-n4vsl5fz4maw                           |
| deployment_tune_ansible         | a705f997-3cf0-44aa-90f1-af21e3a23ca1 | OS::Heat::SoftwareDeployment

If I force the signal with openstack stack resource signal ... it goes on to the next step, but I see that the Ansible template isn't created and the files that are usually pushed aren't present. The /etc/os-collect-config.conf points to the correct endpoint:

$> cat /etc/os-collect-config.conf
[DEFAULT]
command = os-refresh-config
collectors = ec2
collectors = cfn
collectors = local

[cfn]
metadata_url = https://10.1.3.11:13005/v1/
stack_name = ocp2-bastion_host-n4vsl5fz4maw
secret_access_key = 7e7214750d1a48c9a4cad81010fe2173
access_key_id = 494ab1ed83b441168423aec7d868267c
path = host.Metadata

$> openstack endpoint list | grep heat
| 1b24a4cf65a74e38992c4d8230a6e7da | regionOne | heat-cfn     | cloudformation | True    | internal  | http://172.17.1.16:8000/v1               |
| 2f666c5f3f25445682d8cc6ca51f9488 | regionOne | heat         | orchestration  | True    | admin     | http://172.17.1.16:8004/v1/%(tenant_id)s |
| 557a1fc9ff2549a8bc142bd305ac26bb | regionOne | heat-cfn     | cloudformation | True    | public    | https://10.1.3.11:13005/v1               |
| 622df692e35b424b93cd24f54c577df4 | regionOne | heat         | orchestration  | True    | public    | https://10.1.3.11:13004/v1/%(tenant_id)s |
| da4ed879390b4b6c9d97e114aa011f49 | regionOne | heat         | orchestration  | True    | internal  | http://172.17.1.16:8004/v1/%(tenant_id)s |
| fba19a090ed6437f86513a91e9cdc0ba | regionOne | heat-cfn     | cloudformation | True    | admin     | http://172.17.1.16:8000/v1

After a few hours, it times out and the stack fails.

Does anyone have a clue why?

Thanks a lot for your support,
P.

parameters.yaml

parameters:
  ssh_key_name: myself
  bastion_image: rhel-guest-image-7.2-20160302.0.x86_64
  bastion_flavor: m1.medium
  master_image: rhel-guest-image-7.2-20160302.0.x86_64
  master_flavor: m1.medium
  infra_image: rhel-atomic-cloud-7.2-10.x86_64
  infra_flavor: m1.medium
  node_image: rhel-atomic-cloud-7.2-10.x86_64
  node_flavor: m1.medium
  loadbalancer_image: rhel-atomic-cloud-7.2-10.x86_64
  loadbalancer_flavor: m1.medium
  ocp_version: 3.5
  osp_version: 11

  external_network: internet_access
  container_subnet: 192.168.1.0/24
  loadbalancer_type: neutron

  dns_nameserver: 8.8.4.4,8.8.8.8
  node_count: 2

  rhn_username: ""
  rhn_password: "."
  rhn_pool: ""
  extra_rhn_pools: ""
  deployment_type: openshift-enterprise
  domain_name: "example.com"
  master_hostname: "openshift-master"
  node_hostname: "openshift-node"
  ssh_user: cloud-user
  master_docker_volume_size_gb: 25
  infra_docker_volume_size_gb: 25
  node_docker_volume_size_gb: 25

  system_update: false

resource_registry:
  #OOShift::LoadBalancer: ../openshift-on-openstack/loadbalancer_dedicated.yaml
  OOShift::LoadBalancer: ../openshift-on-openstack/loadbalancer_neutron.yaml
  OOShift::ContainerPort: ../openshift-on-openstack/sdn_openshift_sdn.yaml
  OOShift::IPFailover: ../openshift-on-openstack/ipfailover_keepalived.yaml
  OOShift::DockerVolume: ../openshift-on-openstack/volume_docker.yaml
  OOShift::DockerVolumeAttachment: ../openshift-on-openstack/volume_attachment_docker.yaml
  OOShift::RegistryVolume: ../openshift-on-openstack/registry_ephemeral.yaml

Doc-Savage commented 6 years ago

@pburgisser - Did you ever figure out this problem? I seem to be stuck in exactly the same place...

-Andy

daleking commented 6 years ago

I have the same issue.

What I've noticed is that the wait_handle in bastion.yaml is not set up until after the success signal is sent by fragments/bastion-boot.sh. I can see this in /var/log/containers/heat/heat-engine.log on the controller node(s). Moving the wait_condition resource to the top of the template helps, but I haven't yet worked out the exact dependencies needed to make it work properly.

Doc-Savage commented 6 years ago

@daleking - Thanks for the info. What I have done is switch over to openshift-on-openstack-123, and I have made it a bunch further. Of course, I had to flail about wildly. I may come back to this problem once I get over the hump.

tomassedovic commented 6 years ago

Hey folks, I'm really sorry but none of the past maintainers of this repo are able to dedicate much time to it (including myself).

The good news is that the openshift-ansible project (the main OpenShift installer -- this repo uses it under the hood, too) now includes playbooks for various cloud providers including OpenStack:

https://github.com/openshift/openshift-ansible/tree/master/playbooks/openstack

If it helps any, this is what most Red Hat engineers involved with running OpenShift on OpenStack these days are working on.

I'll update the readme to reflect this, but in the meantime, this project is not really maintained anymore.

hakanelgin commented 6 years ago

Hi Doc,

Maybe I can help you.

My setup is RH OCP 3.7 on OSP 12 with RHEL 7.5. It looks like the ready signal is not getting back to your stack engine.

Which VMs are already deployed? Bastion, master, infra?
If the bastion is deployed and you can log in via the VIP of the OSP console, check /var/log/cloud-init-output.log and search that file for 'part-0'. If you see, e.g., part-012, it means the cloud-init user-data script part-012 had some trouble and was not fully executed. You can find those files in /var/lib/cloud/instance/scripts/. They are Linux commands, so check whether each command ran successfully.
Does the OSP API use TLS? If so, do you have the server certificate on your bastion host? Send a curl command to the heat-cfn endpoint and you will know.
Check that all packages are installed.
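
The first checks above could be scripted roughly as follows. This is only a sketch: the function names (list_parts, metadata_url, probe_url) are made up for illustration, and it assumes grep with -o support and curl are available on the bastion. Note that list_parts only shows which script parts are mentioned in the log. It is a hint about where to look, not proof that a part failed.

```shell
#!/bin/sh
# Hypothetical helpers for the checks described above; names are illustrative.

# List every distinct cloud-init script part (part-NNN) mentioned in a log file.
list_parts() {
    grep -oE 'part-[0-9]+' "$1" | sort -u
}

# Print the metadata_url value from an os-collect-config.conf style file.
metadata_url() {
    sed -n 's/^metadata_url[[:space:]]*=[[:space:]]*//p' "$1"
}

# Print the HTTP status code curl receives from a URL.
# No -k on purpose: a missing CA certificate should surface as a curl error
# instead of being hidden.
probe_url() {
    curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "$1"
}

# Example run on the bastion (paths are the ones quoted in this thread):
#   list_parts /var/log/cloud-init-output.log
#   probe_url "$(metadata_url /etc/os-collect-config.conf)"
```

If probe_url prints a certificate error rather than a status code, that would point at the TLS question above.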

daleking commented 6 years ago

OK, solved my issue - the WaitCondition signals were OK, but the heat agents were not installed in my cloud image (official Red Hat 7.5), so the SoftwareDeployment steps were never run.

The following workaround ensures that openstack-heat-agents is installed so that the OS::Heat::SoftwareDeployment tasks do not time out:

https://github.com/daleking/openshift-on-openstack/commit/475e997628fe8af047ddda1fb57e051f747099a1
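
To see whether an image is affected before applying a workaround like this, a quick check along these lines may help. It is a minimal sketch, assuming an RPM-based image where rpm -q is available. The helper name missing_packages is hypothetical, and the only package name used is the one mentioned above.

```shell
#!/bin/sh
# Hypothetical helper: print each named package that `rpm -q` reports as
# not installed. Prints nothing when every package is present.
missing_packages() {
    for pkg in "$@"; do
        rpm -q "$pkg" >/dev/null 2>&1 || printf '%s\n' "$pkg"
    done
}

# Example run on the bastion; the package name is the one given in this thread:
#   missing_packages openstack-heat-agents
```

Any name printed is a package the image lacks. In that case the SoftwareDeployment resources would be expected to sit in CREATE_IN_PROGRESS until they time out, as described at the top of this issue.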

redhat-openstack / openshift-on-openstack

Heat stuck at bastion #397