openshift / cluster-operator

Building and using your own golden image undocumented #294

Open cben opened 6 years ago

cben commented 6 years ago

Hi. We (cc @nimrodshn) are trying out cluster-operator according to the README, in fake=false mode. MachineSet and Machine objects get created, but no AWS instances appear. The Machine status remains at:

  status:
    lastUpdated: null
    providerStatus: null

Looking at the pod logs, it seems the AWS credentials didn't make it into openshift-ansible:

```
TASK [openshift_aws : fetch master instances] **********************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_aws/tasks/setup_master_group.yml:10
Wednesday 11 July 2018  07:50:04 +0000 (0:00:00.033)       0:00:03.289 ******** 
Using module file /usr/lib/python2.7/site-packages/ansible/modules/cloud/amazon/ec2_instance_facts.py
<127.0.0.1> ESTABLISH LOCAL CONNECTION FOR USER: default
<127.0.0.1> EXEC /bin/sh -c '/usr/bin/python2 && sleep 0'
FAILED - RETRYING: fetch master instances (20 retries left).Result was: {
    "attempts": 1, 
    "changed": false, 
    "instances": [], 
    "invocation": {
        "module_args": {
            "aws_access_key": null, 
            "aws_secret_key": null, 
            "ec2_url": null, 
            "filters": {
                "instance-state-name": "running", 
                "tag:clusterid": "nshneor-gfv8m", 
                "tag:host-type": "master"
            }, 
            "instance_ids": [], 
            "profile": null, 
            "region": "us-east-1", 
            "security_token": null, 
            "validate_certs": true
        }
    }, 
    "retries": 21
}
```

The pod has AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set:

nshneor@dhcp-2-169 ~/workspace/go/src/github.com/openshift/cluster-operator (master) $ oc describe pods master-nshneor-gfv8m-nqts5-gcnk8 
Name:           master-nshneor-gfv8m-nqts5-gcnk8
Namespace:      myproject
Node:           localhost/10.35.2.169
Start Time:     Wed, 11 Jul 2018 10:46:12 +0300
Labels:         controller-uid=798a762e-84de-11e8-a192-28d2448581b1
                job-name=master-nshneor-gfv8m-nqts5
Annotations:    openshift.io/scc=restricted
Status:         Running
IP:             172.17.0.4
Controlled By:  Job/master-nshneor-gfv8m-nqts5
Containers:
  install-masters:
    Container ID:   docker://31a09cd730e09b7e739654cc0fdc497a2d2e569f1142ceba566a38599b993e99
    Image:          cluster-operator-ansible:canary
    Image ID:       docker://sha256:2f0c518288260d1f0026dcc12129fa359b4909c4fbdaab83680d7e62fe295e25
    Port:           <none>
    State:          Running
      Started:      Wed, 11 Jul 2018 10:49:59 +0300
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Wed, 11 Jul 2018 10:48:02 +0300
      Finished:     Wed, 11 Jul 2018 10:49:45 +0300
    Ready:          True
    Restart Count:  2
    Environment:
      INVENTORY_FILE:             /ansible/inventory/hosts
      ANSIBLE_HOST_KEY_CHECKING:  False
      OPTS:                       -vvv --private-key=/ansible/ssh/privatekey.pem -e @/ansible/inventory/vars
      AWS_ACCESS_KEY_ID:          <set to the key 'awsAccessKeyId' in secret 'nshneor-aws-creds'>      Optional: false
      AWS_SECRET_ACCESS_KEY:      <set to the key 'awsSecretAccessKey' in secret 'nshneor-aws-creds'>  Optional: false
      PLAYBOOK_FILE:              /usr/share/ansible/openshift-ansible/playbooks/cluster-operator/aws/install_masters.yml
    Mounts:
      /ansible/inventory/ from inventory (rw)
      /ansible/ssh/ from sshkey (rw)
      /ansible/ssl/ from sslkey (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from cluster-installer-token-fvrqc (ro)
Conditions:
  Type           Status
  Initialized    True 
  Ready          True 
  PodScheduled   True 
Volumes:
  inventory:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      master-nshneor-gfv8m-nqts5
    Optional:  false
  sshkey:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nshneor-ssh-key
    Optional:    false
  sslkey:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  nshneor-certs
    Optional:    false
  cluster-installer-token-fvrqc:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cluster-installer-token-fvrqc
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     <none>
Events:
  Type     Reason                 Age               From                Message
  ----     ------                 ----              ----                -------
  Normal   Scheduled              4m                default-scheduler   Successfully assigned master-nshneor-gfv8m-nqts5-gcnk8 to localhost
  Normal   SuccessfulMountVolume  4m                kubelet, localhost  MountVolume.SetUp succeeded for volume "inventory"
  Normal   SuccessfulMountVolume  4m                kubelet, localhost  MountVolume.SetUp succeeded for volume "sshkey"
  Normal   SuccessfulMountVolume  4m                kubelet, localhost  MountVolume.SetUp succeeded for volume "sslkey"
  Normal   SuccessfulMountVolume  4m                kubelet, localhost  MountVolume.SetUp succeeded for volume "cluster-installer-token-fvrqc"
  Warning  BackOff                36s               kubelet, localhost  Back-off restarting failed container
  Normal   Pulled                 24s (x3 over 4m)  kubelet, localhost  Container image "cluster-operator-ansible:canary" already present on machine
  Normal   Created                23s (x3 over 4m)  kubelet, localhost  Created container

The secrets do exist:

Name:         nshneor-aws-creds
Namespace:    myproject
Labels:       <none>
Annotations:  
Type:         Opaque

Data
====
awsAccessKeyId:      20 bytes
awsSecretAccessKey:  40 bytes

Name:         nshneor-ssh-key
Namespace:    myproject
Labels:       app=cluster-operator
Annotations:  
Type:         Opaque

Data
====
ssh-privatekey:  1674 bytes

How can we troubleshoot it further?

dgoodwin commented 6 years ago

Can you check the logs for your aws-machine-controller pod in the openshift-cluster-operator namespace, on the "root" cluster-operator cluster? (This is where the masters are currently created.)

nimrodshn commented 6 years ago

@dgoodwin @cben

nshneor@dhcp-2-169 ~/workspace/go/src/github.com/openshift/cluster-operator (master) $ oc project openshift-cluster-operator 
Now using project "openshift-cluster-operator" on server "https://127.0.0.1:8443".
nshneor@dhcp-2-169 ~/workspace/go/src/github.com/openshift/cluster-operator (master) $ oc get pods
NAME                                              READY     STATUS    RESTARTS   AGE
aws-machine-controller-1-gjv4b                    1/1       Running   0          26m
cluster-api-controller-manager-7dddc65c96-4z7px   1/1       Running   0          30m
cluster-operator-apiserver-1-6pcgz                2/2       Running   0          26m
cluster-operator-controller-manager-1-tz9cb       1/1       Running   0          26m
playbook-mock-6bf5c6f9d6-mnhms                    1/1       Running   0          26m
nshneor@dhcp-2-169 ~/workspace/go/src/github.com/openshift/cluster-operator (master) $ oc log aws-machine-controller-1-gjv4b
W0711 13:48:13.786632   14837 cmd.go:358] log is DEPRECATED and will be removed in a future version. Use logs instead.
ERROR: logging before flag.Parse: W0711 10:21:31.525882       1 controller.go:64] environment variable NODE_NAME is not set, this controller will not protect against deleting its own machine
ERROR: logging before flag.Parse: E0711 10:21:31.830068       1 reflector.go:205] github.com/openshift/cluster-operator/vendor/sigs.k8s.io/cluster-api/pkg/controller/sharedinformers/zz_generated.api.register.go:57: Failed to list *v1alpha1.MachineSet: the server could not find the requested resource (get machinesets.cluster.k8s.io)
ERROR: logging before flag.Parse: E0711 10:21:31.830457       1 reflector.go:205] github.com/openshift/cluster-operator/vendor/sigs.k8s.io/cluster-api/pkg/controller/sharedinformers/zz_generated.api.register.go:56: Failed to list *v1alpha1.MachineDeployment: the server could not find the requested resource (get machinedeployments.cluster.k8s.io)
ERROR: logging before flag.Parse: E0711 10:21:31.831315       1 reflector.go:205] github.com/openshift/cluster-operator/vendor/sigs.k8s.io/cluster-api/pkg/controller/sharedinformers/zz_generated.api.register.go:55: Failed to list *v1alpha1.Machine: the server could not find the requested resource (get machines.cluster.k8s.io)
...
...
ERROR: logging before flag.Parse: I0711 10:39:20.655436       1 controller.go:91] Running reconcile Machine for nshneor-zxq8g-master-b8b5n
time="2018-07-11T10:39:20Z" level=debug msg="checking if machine exists" controller=awsMachine machine=myproject/nshneor-zxq8g-master-b8b5n
time="2018-07-11T10:39:27Z" level=debug msg="instance does not exist" controller=awsMachine machine=myproject/nshneor-zxq8g-master-b8b5n
ERROR: logging before flag.Parse: I0711 10:39:27.125829       1 controller.go:134] reconciling machine object nshneor-zxq8g-master-b8b5n triggers idempotent create.
time="2018-07-11T10:39:27Z" level=info msg="creating machine" controller=awsMachine machine=myproject/nshneor-zxq8g-master-b8b5n
time="2018-07-11T10:39:27Z" level=debug msg="Obtaining EC2 client for region \"us-east-1\"" controller=awsMachine machine=myproject/nshneor-zxq8g-master-b8b5n
time="2018-07-11T10:39:27Z" level=debug msg="Describing AMI ami-0dd8ad483cef75c18" controller=awsMachine machine=myproject/nshneor-zxq8g-master-b8b5n
time="2018-07-11T10:39:27Z" level=error msg="error creating machine: Unexpected number of images returned: 0" controller=awsMachine machine=myproject/nshneor-zxq8g-master-b8b5n
cben commented 6 years ago

Fuller log: https://gist.github.com/cben/c2a3a7d6e364d010e2b9f8825bb75087

dgoodwin commented 6 years ago

This indicates it cannot find the AMI configured in your cluster version. If you're using our direct development playbooks, the cluster version we load points to an AMI that is only available in our rh-dev account. Are you using a different AWS account?

If so, you need to create your own cluster version. See `oc get clusterversions -o yaml` for what they look like, or this link for what we create them from: https://github.com/openshift/cluster-operator/blob/master/contrib/examples/cluster-versions-template.yaml
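Roughly, that means copying the existing cluster version and swapping in an AMI your account can see. A minimal sketch, assuming the schema from the template above (double-check the exact field names there; the name and AMI ID are placeholders):

```yaml
# Sketch only -- verify the schema against contrib/examples/cluster-versions-template.yaml.
apiVersion: clusteroperator.openshift.io/v1alpha1
kind: ClusterVersion
metadata:
  name: my-cluster-version        # placeholder name
spec:
  vmImages:
    awsImages:
      regionAMIs:
      - region: us-east-1
        ami: ami-xxxxxxxx         # an AMI that actually exists in *your* AWS account
```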

I don't think the AWS key is your issue at this point; pretty sure it's something else.

cben commented 6 years ago

Yep, we're using a different AWS account. Thanks, will look into it!

dgoodwin commented 6 years ago

You kind of need a "golden" image; we've been building our own (@abutcher has) for development, which is what you see in our clusterversion by default.

I just cc'd you on an email, trying to track down where or what you could use on another account.

abutcher commented 6 years ago

I've been building our AMIs with https://github.com/openshift/openshift-ansible/blob/master/playbooks/aws/openshift-cluster/build_ami.yml.
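For reference, the provisioning vars for that playbook look roughly like this (a sketch only; see the README under playbooks/aws for the authoritative list, and treat the values here as placeholders):

```yaml
# provisioning_vars.yml -- rough sketch with placeholder values
openshift_deployment_type: origin
openshift_aws_clusterid: mycluster        # placeholder cluster id
openshift_aws_region: us-east-1
openshift_aws_ssh_key_name: my-keypair    # an EC2 key pair in your account (placeholder)
openshift_aws_base_ami: ami-xxxxxxxx      # base AMI to build from (see below)

# Run with something like:
#   ansible-playbook -i <inventory> playbooks/aws/openshift-cluster/build_ami.yml \
#     -e @provisioning_vars.yml
```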

cben commented 6 years ago

@abutcher Thanks! The READMEs under https://github.com/openshift/openshift-ansible/tree/master/playbooks/aws are pretty great, except for one Catch-22: "A base AMI is required for AMI building. Please ensure `openshift_aws_base_ami` is defined." I can't find any explanation in the openshift-ansible repo of what is expected from a "base AMI" or where to find one :confused: (I'm really new to AWS and have zero idea what I'm doing :smile:)

Gonna try a CentOS image...

cben commented 6 years ago

Tried a CentOS 7 AMI from the CentOS wiki as the base AMI and got: `Instance creation failed => AuthFailure: Not authorized for images: [ami-4bf3d731]`

abutcher commented 6 years ago

My base AMI is `openshift_aws_base_ami: ami-b81dbfc5`, which is a CentOS 7 AMI on the Marketplace.
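That is, in the provisioning vars sketched above:

```yaml
# Base image for playbooks/aws/openshift-cluster/build_ami.yml
openshift_aws_base_ami: ami-b81dbfc5   # CentOS 7 AMI from the AWS Marketplace
```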

cben commented 6 years ago

Thanks!! I also had to click Subscribe and accept the terms on the Marketplace, and then I was able to use it.

[STATUS: I'm taking a break from this, but I still intend to get the whole process tested & documented eventually]