openshift / installer

Install an OpenShift 4.x cluster
https://try.openshift.com
Apache License 2.0

Azure OpenShift Installer issue #2334

Closed harshjay04 closed 4 years ago

harshjay04 commented 5 years ago

Version

4.1


Platform:

azure

What happened?

The OpenShift installer is installing only the master nodes; no worker nodes are getting deployed, and the Terraform scripts don't show a worker node config for deployment. I downloaded the Sep 2 build that was available at http://mirror.openshift.com/pub/openshift-v4/clients/ocp-dev-preview/latest/: openshift-install-linux-4.2.0-0.nightly-2019-09-05-234433.tar.gz (02-Sep-2019 11:01, 68M).


What you expected to happen?

Expected to have a Dev Preview OpenShift 4.1 cluster installed and running in Azure as per the documented steps.

How to reproduce it (as minimally and precisely as possible)?

Just follow the steps at https://cloud.redhat.com/openshift/install/azure/installer-provisioned to reproduce (see .openshift_install.log).


abhinavdahiya commented 5 years ago

The installer doesn't create the workers directly; the machine-api-operator cluster operator creates them via the cluster-api.

Try looking at the logs of the controllers in openshift-machine-api and at the Machine objects.

Also make sure the service principal is correctly set up: https://github.com/openshift/installer/blob/master/docs/user/azure/credentials.md#step-2-request-permissions-for-the-service-principal-from-tenant-administrator
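
As a concrete sketch of those checks (resource and pod names vary per cluster; the container name passed to -c is an assumption and may differ by release):

$ oc get machinesets -n openshift-machine-api
$ oc get machines -n openshift-machine-api
$ oc logs -n openshift-machine-api deploy/machine-api-controllers -c machine-controller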

harshjay04 commented 5 years ago

This is what we found as part of the troubleshooting. We also have the service principal set up as per the documentation; our understanding is that if the service principal is not set up properly, the master nodes themselves won't get deployed correctly. I may be wrong; I need help getting through this.

$ ./oc logs ingress-operator-58478cc77f-ckss5 -n openshift-ingress-operator
2019-09-09T18:37:05.575Z        INFO    operator        log/log.go:26   started zapr logger
2019-09-09T18:37:07.497Z        INFO    operator.entrypoint     ingress-operator/main.go:62     using operator namespace        {"namespace": "openshift-ingress-operator"}
2019-09-09T18:37:07.514Z        ERROR   operator.entrypoint     ingress-operator/main.go:105    failed to create DNS manager    {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"}

abhinavdahiya commented 5 years ago

This is what we found as part of the troubleshooting. We also have the service principal set up as per the documentation; our understanding is that if the service principal is not set up properly, the master nodes themselves won't get deployed correctly.

The master nodes will be created even if you haven't done step 2 from https://github.com/openshift/installer/blob/master/docs/user/azure/credentials.md#step-2-request-permissions-for-the-service-principal-from-tenant-administrator

If you read that section, it is required so that the operators can be provided new, tightly scoped credentials to contact the Azure APIs...

These new creds are minted by the cloud-credential-operator; check out its logs: oc logs -n openshift-cloud-credential-operator deploy/cloud-credential-operator
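
A sketch of a few follow-up checks (the cloud-credential-operator owns CredentialsRequest objects, and the ingress error above points at a missing cloud-credentials secret in openshift-ingress-operator; exact resource names may differ by release):

$ oc logs -n openshift-cloud-credential-operator deploy/cloud-credential-operator
$ oc get credentialsrequests -n openshift-cloud-credential-operator
$ oc get secret cloud-credentials -n openshift-ingress-operator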

I may be wrong; I need help getting through this.

$ ./oc logs ingress-operator-58478cc77f-ckss5 -n openshift-ingress-operator
2019-09-09T18:37:05.575Z        INFO    operator        log/log.go:26   started zapr logger
2019-09-09T18:37:07.497Z        INFO    operator.entrypoint     ingress-operator/main.go:62     using operator namespace        {"namespace": "openshift-ingress-operator"}
2019-09-09T18:37:07.514Z        ERROR   operator.entrypoint     ingress-operator/main.go:105    failed to create DNS manager    {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"}

harshjay04 commented 5 years ago

I have the permissions enabled through the Azure console as per the documentation for my service principal Osp41. When I ran the CLI command it gave the output below; I have also attached a screenshot from the Azure portal.

[oseadmin@osejumpserver /]$ az ad app permission add --id 911f3d60-e8e8-4881-a433-329949270436 --api 00000002-0000-0000-c000-000000000000 --api-permissions 824c81eb-e3f8-4ee6-8f6d-de7f50d565b7=Role
Invoking "az ad app permission grant --id 911f3d60-e8e8-4881-a433-329949270436 --api 00000002-0000-0000-c000-000000000000" is needed to make the change effective
[oseadmin@osejumpserver /]$ az ad app permission grant --id 911f3d60-e8e8-4881-a433-329949270436 --api 00000002-0000-0000-c000-000000000000
{
  "clientId": "840bfed3-acb1-42f8-8ae9-5665b5640281",
  "consentType": "AllPrincipals",
  "expiryTime": "2020-09-09T19:34:41.997267",
  "objectId": "0_4LhLGs-EKK6VZltWQCgYXZsX09AdJFjcopS24DevE",
  "odata.metadata": "https://graph.windows.net/72e0e644-3484-447a-9c89-7530f692cf5f/$metadata#oauth2PermissionGrants/@Element",
  "odatatype": null,
  "principalId": null,
  "resourceId": "7db1d985-013d-45d2-8dca-294b6e037af1",
  "scope": "user_impersonation",
  "startTime": "2019-09-09T19:34:41.997267"
}
[oseadmin@osejumpserver /]$ az ad app permission add --id 911f3d60-e8e8-4881-a433-329949270436 --api 00000002-0000-0000-c000-000000000000 --api-permissions 824c81eb-e3f8-4ee6-8f6d-de7f50d565b7=Role
Invoking "az ad app permission grant --id 911f3d60-e8e8-4881-a433-329949270436 --api 00000002-0000-0000-c000-000000000000" is needed to make the change effective
[oseadmin@osejumpserver /]$ az ad app permission add --id 911f3d60-e8e8-4881-a433-329949270436 --api 00000002-0000-0000-c000-000000000000 --api-permissions 824c81eb-e3f8-4ee6-8f6d-de7f50d565b7=Role
Invoking "az ad app permission grant --id 911f3d60-e8e8-4881-a433-329949270436 --api 00000002-0000-0000-c000-000000000000" is needed to make the change effective
[oseadmin@osejumpserver /]$

(Screenshot attached: Screen Shot 2019-09-09 at 12 40 52 PM, showing the Azure portal API permissions for the service principal)

abhinavdahiya commented 5 years ago

The API permissions for Active Directory look sufficient.

Do you see errors in the credential operator? oc logs -n openshift-cloud-credential-operator deploy/cloud-credential-operator

nichochen commented 5 years ago

I experienced the same error for ingress operator.

$ ./oc --config ./ocp42/auth/kubeconfig get pod --all-namespaces|grep -i off
openshift-ingress-operator                              ingress-operator-7c7cb5dfdc-n7m26                                 0/1     CrashLoopBackOff   13         45m

$ ./oc --config ./ocp42/auth/kubeconfig  logs -n openshift-ingress-operator  ingress-operator-7c7cb5dfdc-n7m26
2019-09-13T17:06:46.824Z        INFO    operator        log/log.go:26   started zapr logger
2019-09-13T17:06:48.756Z        INFO    operator.entrypoint     ingress-operator/main.go:62     using operator namespace        {"namespace": "openshift-ingress-operator"}
2019-09-13T17:06:48.787Z        ERROR   operator.entrypoint     ingress-operator/main.go:105    failed to create DNS manager    {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"}

Installation error.

INFO Waiting up to 30m0s for the cluster at https://api.ocp4.az-devops.org:6443 to initialize...
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring

Also no worker node was created.

$ ./oc --config ./ocp42/auth/kubeconfig get nodes
NAME                  STATUS   ROLES    AGE   VERSION
ocp4-6mdvt-master-0   Ready    master   56m   v1.14.6+5a523078f
ocp4-6mdvt-master-1   Ready    master   56m   v1.14.6+5a523078f
ocp4-6mdvt-master-2   Ready    master   56m   v1.14.6+5a523078f

Pods.

openshift-kube-scheduler                                openshift-kube-scheduler-ocp4-6mdvt-master-1                      1/1     Running            0          52m
openshift-kube-scheduler                                openshift-kube-scheduler-ocp4-6mdvt-master-2                      1/1     Running            0          55m
openshift-kube-scheduler                                revision-pruner-2-ocp4-6mdvt-master-0                             0/1     Completed          0          55m
openshift-kube-scheduler                                revision-pruner-4-ocp4-6mdvt-master-0                             0/1     Completed          0          50m
openshift-kube-scheduler                                revision-pruner-4-ocp4-6mdvt-master-1                             0/1     OOMKilled          0          51m
openshift-kube-scheduler                                revision-pruner-4-ocp4-6mdvt-master-2                             0/1     Completed          0          52m
openshift-machine-api                                   cluster-autoscaler-operator-85c88fcbdf-7z9jk                      1/1     Running            0          50m
openshift-machine-api                                   machine-api-controllers-6684f88794-j5kz8                          3/3     Running            0          56m
openshift-machine-api                                   machine-api-operator-598fd56f46-q5jdt                             1/1     Running            0          57m
openshift-machine-config-operator                       etcd-quorum-guard-5f8c9b48f8-62zlw                                1/1     Running            0          55m
openshift-machine-config-operator                       etcd-quorum-guard-5f8c9b48f8-f4vc7                                1/1     Running            0          55m
openshift-machine-config-operator                       etcd-quorum-guard-5f8c9b48f8-n2728                                1/1     Running            0          55m
openshift-machine-config-operator                       machine-config-controller-5ddc9cf57-sc29c                         1/1     Running            0          56m
openshift-machine-config-operator                       machine-config-daemon-h7nxr                                       1/1     Running            0          56m
openshift-machine-config-operator                       machine-config-daemon-nqkjv                                       1/1     Running            0          56m
openshift-machine-config-operator                       machine-config-daemon-v9wnc                                       1/1     Running            0          56m
openshift-machine-config-operator                       machine-config-operator-6f9775d7c6-7t2v9                          1/1     Running            0          57m
openshift-machine-config-operator                       machine-config-server-7jtsc                                       1/1     Running            0          56m
openshift-machine-config-operator                       machine-config-server-hnkmr                                       1/1     Running            0          56m
openshift-machine-config-operator                       machine-config-server-tgj58                                       1/1     Running            0          56m
openshift-marketplace                                   certified-operators-6757fc8c95-rd2n2                              0/1     Pending            0          50m
openshift-marketplace                                   community-operators-764fddfcd7-ptsbl                              0/1     Pending            0          50m
openshift-marketplace                                   marketplace-operator-777cb7fd85-2gcp9                             1/1     Running            0          51m
openshift-marketplace                                   redhat-operators-74865497dc-mvwn4                                 0/1     Pending            0          51m
openshift-monitoring                                    cluster-monitoring-operator-655f555fdc-7nt6p                      1/1     Running            0          51m
openshift-monitoring                                    kube-state-metrics-57d8c7766b-pww8g                               0/3     Pending            0          50m
openshift-monitoring                                    node-exporter-6wlz5                                               2/2     Running            0          51m
openshift-monitoring                                    node-exporter-v826b                                               2/2     Running            0          50m
openshift-monitoring                                    node-exporter-zc4kr                                               2/2     Running            0          50m
openshift-monitoring                                    openshift-state-metrics-84c7f8c5d8-5v4p5                          0/3     Pending            0          51m
openshift-monitoring                                    prometheus-adapter-749cdcf9b5-hfq46                               0/1     Pending            0          45m
openshift-monitoring                                    prometheus-adapter-749cdcf9b5-sfh6q                               0/1     Pending            0          45m
openshift-monitoring                                    prometheus-operator-696c9ddfb4-vq6xv                              1/1     Running            0          50m
openshift-monitoring                                    telemeter-client-748475b66-wbnrh                                  0/3     Pending            0          45m
openshift-monitoring                                    telemeter-client-8f8bdcd7c-f8rd6                                  0/3     Pending            0          50m
openshift-multus                                        multus-2s5g9                                                      1/1     Running            0          57m
openshift-multus                                        multus-admission-controller-5842x                                 1/1     Running            0          57m
openshift-multus                                        multus-admission-controller-5dgzh                                 1/1     Running            0          57m
openshift-multus                                        multus-admission-controller-grmvr                                 1/1     Running            0          57m
openshift-multus                                        multus-ll9tb                                                      1/1     Running            0          57m
openshift-multus                                        multus-mprf7                                                      1/1     Running            0          57m
openshift-network-operator                              network-operator-8d9d7ddc5-thh24                                  1/1     Running            0          57m
openshift-operator-lifecycle-manager                    catalog-operator-5697fc6c88-tqgfc                                 1/1     Running            0          57m
openshift-operator-lifecycle-manager                    olm-operator-df6fddccd-hhdmb                                      1/1     Running            0          57m
openshift-operator-lifecycle-manager                    packageserver-8b695d794-65qpb                                     1/1     Running            0          55m
openshift-operator-lifecycle-manager                    packageserver-8b695d794-tlvsf                                     1/1     Running            0          55m
openshift-sdn                                           ovs-2dd9h                                                         1/1     Running            0          57m
openshift-sdn                                           ovs-6v9q6                                                         1/1     Running            0          57m
openshift-sdn                                           ovs-hbghh                                                         1/1     Running            0          57m
openshift-sdn                                           sdn-5mlt6                                                         1/1     Running            1          57m
openshift-sdn                                           sdn-controller-887b4                                              1/1     Running            0          57m
openshift-sdn                                           sdn-controller-qm2jf                                              1/1     Running            0          57m
openshift-sdn                                           sdn-controller-srvc7                                              1/1     Running            0          57m
openshift-sdn                                           sdn-g6f9t                                                         1/1     Running            0          57m
openshift-sdn                                           sdn-qdx4c                                                         1/1     Running            1          57m
openshift-service-ca-operator                           service-ca-operator-7b4f5bf9f4-xfmhz                              1/1     Running            0          57m
openshift-service-ca                                    apiservice-cabundle-injector-5b848f5bc8-rg9kc                     1/1     Running            0          56m
openshift-service-ca                                    configmap-cabundle-injector-84bf66575b-kncg4                      1/1     Running            0          56m
openshift-service-ca                                    service-serving-cert-signer-5575b77cc4-89rdz                      1/1     Running            0          56m
openshift-service-catalog-apiserver-operator            openshift-service-catalog-apiserver-operator-675c4ccf8b-xh2cz     1/1     Running            0          52m
openshift-service-catalog-controller-manager-operator   openshift-service-catalog-controller-manager-operator-59d79fdmx   1/1     Running            0          52m
nichochen commented 5 years ago

@abhinavdahiya Do you have any ideas on this issue? There are a couple of people who have been having the same issue for a while.

abhinavdahiya commented 5 years ago

Can you make sure the appID for which you have requested and received the admin consent matches the one in ~/.azure/osServicePrincipal.json and the secret in the cluster (oc get secret -n kube-system azure-credentials -oyaml)?
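
A hedged way to compare the two values side by side (assuming jq is installed, and assuming the installer's file uses a clientId field and the secret uses an azure_client_id key; check your own files if the names differ):

$ jq -r .clientId ~/.azure/osServicePrincipal.json
$ oc get secret -n kube-system azure-credentials -o jsonpath='{.data.azure_client_id}' | base64 -d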

We have a bug report where the user followed the docs and created a new service principal, overriding the credentials in the default location, which sadly wouldn't have the permissions from https://github.com/openshift/installer/blob/master/docs/user/azure/credentials.md#step-2-request-permissions-for-the-service-principal-from-tenant-administrator

see https://github.com/openshift/installer/pull/2388

nichochen commented 5 years ago

I tried again today with the latest nightly build of the installer, got the same error again. The SP is correct for both ~/.azure/osServicePrincipal.json and azure-credentials.

$ oc --config /home/nico/ocp0921/test1/auth/kubeconfig log -n openshift-ingress-operator                              ingress-operator-68d49d478d-zqjfx
2019-09-21T10:44:54.036Z        INFO    operator        log/log.go:26   started zapr logger
2019-09-21T10:44:55.956Z        INFO    operator.entrypoint     ingress-operator/main.go:62     using operator namespace        {"namespace": "openshift-ingress-operator"}
2019-09-21T10:44:55.972Z        ERROR   operator.entrypoint     ingress-operator/main.go:105    failed to create DNS manager    {"error": "failed to get cloud credentials from secret /: secrets \"cloud-credentials\" not found"}
abhinavdahiya commented 5 years ago

I tried again today with the latest nightly build of the installer, got the same error again. The SP is correct for both ~/.azure/osServicePrincipal.json and azure-credentials.

Can you make sure the appID for which you have requested and received the admin consent matches the one in ~/.azure/osServicePrincipal.json and the secret in the cluster (oc get secret -n kube-system azure-credentials -oyaml)?

I'm not sure this is what you meant, but just to be sure: the appID of the service principal that has the OwnedBy permission matches the clientID in the azure-credentials secret. The ~/.azure/osServicePrincipal.json and azure-credentials will always tend to match.

nichochen commented 5 years ago

@abhinavdahiya Yes, they matched and have the permission. The result was that the installation failed.

joaotomazio commented 4 years ago

I am experiencing a similar issue on the openshift-install 4.2 on Ubuntu 18.04.

$ openshift-install version
./openshift-install v4.2.0
built from commit f96afb99f1ce4f8976ce62f7df44acb24d2062d6
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:b3ba58c53a3f5e98f53dff425e7e4c87b60f5d49d66213853b79f00f7a8a9448

Following the documentation, an initialization of the cluster was attempted with:

./openshift-install create cluster --dir OCP4 --log-level debug

However, the installation times out when initializing the cluster, on the "Waiting up to 30m0s for the cluster" phase.

INFO Waiting up to 30m0s for the cluster at https://api.openshift4.oc-demo.ml:6443 to initialize... 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 99% complete 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 99% complete, waiting on authentication, console, image-registry, ingress, monitoring 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 99% complete 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 99% complete 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 99% complete 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring 
DEBUG Still waiting for the cluster to initialize: Working towards 4.2.0-0.nightly-2019-09-23-154647: 99% complete 
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring 
FATAL failed to initialize the cluster: Some cluster operators are still updating: authentication, console, image-registry, ingress, monitoring

It looks like initializations/updates of some operators are on endless loops, timing out the process. While the installation was stalled, the following output was produced on a parallel terminal:

$ oc --config ./OCP4/auth/kubeconfig get nodes
NAME                        STATUS   ROLES    AGE   VERSION
openshift4-6rxh9-master-0   Ready    master   16m   v1.14.6+c4799753c
openshift4-6rxh9-master-1   Ready    master   16m   v1.14.6+c4799753c
openshift4-6rxh9-master-2   Ready    master   16m   v1.14.6+c4799753c
$ oc --config ./OCP4/auth/kubeconfig get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 Unknown     Unknown       True       15m
cloud-credential                           4.2.0-0.nightly-2019-09-23-154647   True        True          True       18m
cluster-autoscaler                         4.2.0-0.nightly-2019-09-23-154647   True        False         False      12m
console                                    4.2.0-0.nightly-2019-09-23-154647   Unknown     True          False      14m
dns                                        4.2.0-0.nightly-2019-09-23-154647   True        False         False      18m
image-registry                                                                 False       True          False      14m
insights                                   4.2.0-0.nightly-2019-09-23-154647   True        False         False      18m
kube-apiserver                             4.2.0-0.nightly-2019-09-23-154647   True        False         False      17m
kube-controller-manager                    4.2.0-0.nightly-2019-09-23-154647   True        False         False      15m
kube-scheduler                             4.2.0-0.nightly-2019-09-23-154647   True        False         False      16m
machine-api                                4.2.0-0.nightly-2019-09-23-154647   True        False         False      18m
machine-config                             4.2.0-0.nightly-2019-09-23-154647   True        False         False      17m
marketplace                                4.2.0-0.nightly-2019-09-23-154647   True        False         False      13m
monitoring                                                                     False       True          True       8m55s
network                                    4.2.0-0.nightly-2019-09-23-154647   True        False         False      17m
node-tuning                                4.2.0-0.nightly-2019-09-23-154647   True        False         False      15m
openshift-apiserver                        4.2.0-0.nightly-2019-09-23-154647   True        False         False      14m
openshift-controller-manager               4.2.0-0.nightly-2019-09-23-154647   True        False         False      16m
openshift-samples                          4.2.0-0.nightly-2019-09-23-154647   True        False         False      11m
operator-lifecycle-manager                 4.2.0-0.nightly-2019-09-23-154647   True        False         False      17m
operator-lifecycle-manager-catalog         4.2.0-0.nightly-2019-09-23-154647   True        False         False      17m
operator-lifecycle-manager-packageserver   4.2.0-0.nightly-2019-09-23-154647   True        False         False      16m
service-ca                                 4.2.0-0.nightly-2019-09-23-154647   True        False         False      18m
service-catalog-apiserver                  4.2.0-0.nightly-2019-09-23-154647   True        False         False      15m
service-catalog-controller-manager         4.2.0-0.nightly-2019-09-23-154647   True        False         False      15m
storage                                    4.2.0-0.nightly-2019-09-23-154647   True        False         False      13m

This installation was attempted in several Azure regions, with every one timing out on the "Waiting up to 30m0s for the cluster" phase.

During the installation, the master nodes were successfully created and the bootstrap node was destroyed, but the worker nodes never appeared in the resource group (checking in the Azure portal).

nichochen commented 4 years ago

@abhinavdahiya Please kindly suggest how we should move forward on this, or do we need to escalate this to someone in Red Hat engineering to help out? This has been pending for more than 2 weeks. 4.2 is going to be GA soon; if this is a real issue, it's going to affect many OpenShift 4 users on Azure. If this is not an issue, please kindly suggest how we should get around it.

nichochen commented 4 years ago

The installer is critical for the OpenShift 4 on Azure experience. Escalating for more visibility @smarterclayton. We are also escalating this via an internal Red Hat contact point.

wking commented 4 years ago

@joaotomazio:

NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
cloud-credential                           4.2.0-0.nightly-2019-09-23-154647   True        True          True       18m

To understand why an operator like this is degraded, fetch its ClusterOperator (as suggested in our troubleshooting docs):

$ oc --config=${INSTALL_DIR}/auth/kubeconfig get -o yaml clusteroperator cloud-credential

which will give you the cred operator's description for why it is degraded. You can also gather the cred-operator logs (as Abhinav suggested above).
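
For example, a sketch (standard jsonpath, with ${INSTALL_DIR} as in the command above) that prints just the condition types, statuses, and messages, followed by the cred-operator logs:

$ oc --config=${INSTALL_DIR}/auth/kubeconfig get clusteroperator cloud-credential -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
$ oc --config=${INSTALL_DIR}/auth/kubeconfig logs -n openshift-cloud-credential-operator deploy/cloud-credential-operator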

nstielau commented 4 years ago

Thanks @wking . @joaotomazio can you get us those logs?

wking commented 4 years ago

We have a bug report...

Linking https://bugzilla.redhat.com/show_bug.cgi?id=1753419#c5, since I think that's what @abhinavdahiya was referencing.

nstielau commented 4 years ago

@nichochen @joaotomazio we believe this is a credentials issue due to inaccurate/unhelpful docs.
Here is the GitHub PR where the docs were updated two days ago: https://github.com/openshift/installer/pull/2388

Can you run through that new permissions flow and see if you can get the cluster up?

Thanks

joaotomazio commented 4 years ago

I've managed to finish the installation successfully by granting a certain permission to the Azure App.

On the portal: Azure AD -> App Registration -> App -> API Permissions -> Delegated Permission on user_impersonation.

I don't know if this is among the best practices, but for demo purposes, it now works just fine!!! Thank you everyone
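
If it helps anyone else, a hedged way to inspect from the CLI which API permissions and grants the app registration actually has (reusing the app id that appears earlier in this thread):

$ az ad app permission list --id 911f3d60-e8e8-4881-a433-329949270436
$ az ad app permission list-grants --id 911f3d60-e8e8-4881-a433-329949270436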

nichochen commented 4 years ago

@nstielau Thanks for the information. @abhinavdahiya mentioned this bug last Saturday, and I verified last Saturday that the SP I used in my installation was the one with the admin consent; see here.

@joaotomazio Awesome!

@abhinavdahiya Could you kindly confirm that the permission user_impersonation is required? I appreciate your clarification and help.

abhinavdahiya commented 4 years ago

@abhinavdahiya Could you kindly confirm that the permission user_impersonation is required? I appreciate your clarification and help.

cc @dgoodwin @joelddiaz @ingvagabund, who are the owners of the credential-operator for Azure... hopefully they can shed more light.

joelddiaz commented 4 years ago

FWIW, here are the permissions I've had for the Azure clusters I've installed in the past:

(Screenshot: API permissions on the app registration)

dgoodwin commented 4 years ago

@joelddiaz has provided the permissions we were successful with. I notice that Application.ReadWrite.All is not present under Azure Active Directory Graph in the screenshot in https://github.com/openshift/installer/issues/2334#issuecomment-529634802, but it does appear under Microsoft Graph. I do not know how to interpret this, but it doesn't look correct, and the Azure UI is still showing me the version Joel sees, with Azure Active Directory Graph -> Application.ReadWrite.All.

We dealt with permissions in the UI; it looks like someone boiled these down to az commands, but I'm wondering if a mistake was made and the command is somehow granting the wrong Application.ReadWrite.All permission (Microsoft Graph, instead of Azure Active Directory Graph).

dgoodwin commented 4 years ago

According to https://blogs.msdn.microsoft.com/aaddevsup/2018/06/06/guid-table-for-windows-azure-active-directory-permissions/, it looks like the permission ID that was missing is 1cda74f2-2616-4834-b122-5cb1b07f8a59 (Read and write all applications).

This does not appear in our docs.
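
If that GUID is the gap, a hedged sketch of requesting it with the same az syntax used earlier in this thread (the <appId> placeholder stands for the application ID of your service principal; this only illustrates how that specific permission would be requested, not that it is required):

$ az ad app permission add --id <appId> --api 00000002-0000-0000-c000-000000000000 --api-permissions 1cda74f2-2616-4834-b122-5cb1b07f8a59=Role
$ az ad app permission admin-consent --id <appId>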

dgoodwin commented 4 years ago

@joelddiaz clarified on the scrum call this morning.

Microsoft Graph -> Application.ReadWrite.All is the new-style permission. Azure Active Directory Graph -> Application.ReadWrite.All is the legacy one. However, the Go SDK used in the cred operator requires the legacy permission.

It would appear the translation to CLI commands to request the permissions got the wrong UUID for the API. This explains why adding the impersonation permission got past it, as that is likely part of the legacy ReadWrite.All.

nstielau commented 4 years ago

@dgoodwin I'm not clear on the next step. Do we need to update our docs again?

joelddiaz commented 4 years ago

I went through the instructions at https://github.com/openshift/installer/blob/master/docs/user/azure/credentials.md, and I was able to deploy a cluster out to Azure.

The extra creds that @dgoodwin mentions above would bring the list of permissions up to the level of what he and I have both been using previously, but those extra permissions appear to be unnecessary.

These perms: (screenshot of the API permissions)

plus adding the App Registration as a Contributor and User Access Administrator on the Subscription being installed into, are enough permissions to get a cluster up and running.

TL;DR: the docs appear to be okay.
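
For completeness, a hedged sketch of how those two subscription-level role assignments might be created from the CLI (the <appId> and <subscription-id> placeholders are illustrative, not values from this thread):

$ az role assignment create --assignee <appId> --role "Contributor" --scope "/subscriptions/<subscription-id>"
$ az role assignment create --assignee <appId> --role "User Access Administrator" --scope "/subscriptions/<subscription-id>"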

harshjay04 commented 4 years ago

The installer doesn't create the workers directly; the machine-api-operator cluster operator creates them via the cluster-api.

Try looking at the logs of the controllers in openshift-machine-api and at the Machine objects.

Also make sure the service principal is correctly set up: https://github.com/openshift/installer/blob/master/docs/user/azure/credentials.md#step-2-request-permissions-for-the-service-principal-from-tenant-administrator

Hi Abhinav,

I need one more piece of information related to OpenShift 4.1: do you have binaries that I can use to deploy the UPI version for one of our client's POC requirements? Since OCP 4.2 is not GA yet, we have a requirement from one of our clients to do a POC in the coming weeks, and I am having trouble finding the 4.1 binaries for Azure.

Thanks Jay

joelddiaz commented 4 years ago

And to come full circle, I also followed the instructions at https://github.com/openshift/openshift-docs/blob/master/modules/installation-azure-service-principal.adoc.

(Screenshot: the resulting API permissions)

This also worked (without the delegated Microsoft Graph User.Read permissions).

nichochen commented 4 years ago

I tried installing 4.2 on Azure today with the GA release, and the installation finished without error. The cluster got deployed successfully onto Azure. I appreciate the great work!

abhinavdahiya commented 4 years ago

/close

I tried installing 4.2 on Azure today with the GA release, and the installation finished without error. The cluster got deployed successfully onto Azure. I appreciate the great work!

openshift-ci-robot commented 4 years ago

@abhinavdahiya: Closing this issue.

In response to this (https://github.com/openshift/installer/issues/2334#issuecomment-550004273):

> /close
>
> > I tried installing 4.2 on Azure today with the GA release, and the installation finished without error. The cluster got deployed successfully onto Azure. I appreciate the great work!

Instructions for interacting with me using PR comments are available here: https://git.k8s.io/community/contributors/guide/pull-requests.md. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository (https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:).

enkicoma commented 4 years ago

Hi Guys,

I will try to explain step by step what I did so far and hope that will clarify my issue.

So, I followed these pre-steps from the OpenShift documentation.

I created and configured the Service Principal successfully, and I even checked that it has the right permissions.

I checked that everything in Azure Active Directory looks good... I even gave more permissions as per this GitHub issue, #2334.

I even have full access for my Service Principal at the Subscription level (Owner, Contributor, Administrator).

Following the next stage for 4.2, Installing a cluster quickly on Azure, I managed to deploy the "Default OpenShift cluster" successfully, which proves that my Service Principal is configured and working as expected.

DEBUG OpenShift console route is created           

INFO Install complete!    

INFO Access the OpenShift web-console here: https:......

Unfortunately, my task is to create a customised OpenShift cluster, because I have to deploy Cloud Pak for Integration, which requires a very high configuration. So I followed these steps: Installing a cluster on Azure with customizations.

I used the same Service Principal and increased the CPU quotas for the "Dsv3-series", but I am getting a timeout on deploying OpenShift (stuck at 99%), and when I checked the resource group in the Azure UI there were only 3 master nodes and no worker nodes at all!

Here are my install-config.yaml values:

apiVersion: v1
baseDomain: poc-*****
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_D16s_v3
      osDisk:
        diskSizeGB: 512 
      zones: 
      - "1"
      - "2"
      - "3"
  replicas: 3
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    azure:
      osDisk:
        diskSizeGB: 512 
      type: Standard_D18s_v3
  replicas: 3
metadata:
  creationTimestamp: null
  name: sec-****-**
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineCIDR: 10.0.0.0/16
  networkType: OpenShiftSDN
  serviceNetwork:
  - 172.30.0.0/16
platform:
  azure:
    baseDomainResourceGroupName: dns
    region: ukwest
pullSecret: '{"auths":{"******'
sshKey: |
  ssh-rsa *******

Maybe I did something wrong?

This is the same issue as described on this page: "OpenShift Installer is installing only the master nodes and no worker nodes are getting deployed". I tried everything but still can't manage to deploy a configurable cluster.

I hope you can help me understand my issue, guys!

enkicoma commented 4 years ago

Looks like Azure doesn't have Availability Zones for region: ukwest (https://azure.microsoft.com/en-us/global-infrastructure/regions/), and if you specify them in your install-config.yaml as

      zones: 
      - "1"
      - "2"
      - "3"

it will fail with a timeout at 99%.

I tried with region: uksouth and it did work!

It would be great to add a little warning to the docs!
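
For what it's worth, a sketch of the same compute stanza from the install-config.yaml above with the zones list simply omitted (the assumption being that, for a region without Availability Zones such as ukwest, leaving zones unset avoids the invalid configuration):

compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    azure:
      type: Standard_D16s_v3
      osDisk:
        diskSizeGB: 512
  replicas: 3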

steffencircle commented 4 years ago

I don't really get that?! Does the installer only support Azure regions with AZs or not? From my tests it looks like I can only deploy to regions with an AZ; others fail.

However, the docs say the installer is tested in several regions that do not have Availability Zones: https://docs.openshift.com/container-platform/4.3/installing/installing_azure/installing-azure-account.html#installation-azure-regions_installing-azure-account

So can somebody let me know if and how we can deploy to an Azure region that has no Availability Zones?

abhinavdahiya commented 4 years ago

If you don't set any zones, the installer will pick the zones for a region when they are available, and skip using zones when they are not.

If a user explicitly sets zones for a region where there are none, that's when the workers fail to come up, because of the user error.

So this is not about whether we support that region; it's more to do with user misconfiguration.

Maybe we could help by warning or failing when a user sets such an invalid configuration... Not sure how important or useful that's going to be.

steffencircle commented 4 years ago

Weird,

I tried that the other day with OCP 4.3.5, deploying a private cluster to an existing VNet in Germany West Central. My install-config.yaml definitely had no zone specifications inside. It still failed while trying to deploy the VMs, with a message that the region does not support zones!

Are there some defaults for the masters behind the scenes that we somehow need to override, or did I miss anything else?

enkicoma commented 4 years ago

I don't think you missed anything. It could be some magic behind the scenes that defaults to always using zones when you deploy a cluster with install-config.yaml, because I spent 2 days trying to deploy to region: ukwest, with zones and without them, and still had no clue what was going on.