odpi / egeria-charts

Helm chart repository
https://odpi.github.io/egeria-charts
Apache License 2.0

Update lab chart docs with notes on openshift security context #18

Closed juergenhemelt closed 2 years ago

juergenhemelt commented 4 years ago

I am trying to get started with ODPi using the descriptions found here: https://egeria.odpi.org/open-metadata-resources/open-metadata-labs/

I installed the lab in my Openshift cluster using helm. On startup of the pod lab-odpi-egeria-lab-jupyter I get the event:

"Error creating: pods "lab-odpi-egeria-lab-jupyter-56f7fb969f-" is forbidden: unable to validate against any security context constraint: [fsGroup: Invalid value: []int64{100}: 100 is not an allowed group]"

and the pod is not starting.

Any suggestions?

planetf1 commented 4 years ago

I am running the chart in OpenShift regularly.

I am using OpenShift 4, but do have my own dev cluster. It looks as if we have a security limitation here.

I can try and respond here, but if you are ok with it, slack (slack.odpi.org) may be a more interactive approach to work through some ideas. I'd very much like to help you get it working and improve our docs or charts accordingly.

planetf1 commented 4 years ago

To elaborate on jupyter in particular - we reuse the base jupyter image (jupyter/base-notebook:latest on dockerhub) and just add a few python modules & load up some notebooks.

That base container is defined at https://github.com/jupyter/docker-stacks/tree/master/base-notebook and there are some docs at https://jupyter-docker-stacks.readthedocs.io/en/latest/index.html

The container does indeed use group=100

So can you tell me anything about the security constraints you have in your environment, given that the error indicates group 100 is not allowed?

planetf1 commented 4 years ago

It's not something I've investigated in my environment; however, openshift can be secured to permit only certain groups.

This can be modified (by a cluster admin) with:

oc edit scc restricted

Or specifically

oc adm policy add-scc-to-group restricted 100

However I don't know if you may hit any other restrictions.

planetf1 commented 4 years ago

There are some docs on SCC strategies (openshift 4) at https://docs.openshift.com/container-platform/4.3/authentication/managing-security-context-constraints.html#authorization-SCC-strategies_configuring-internal-oauth

In my test environment I have:

fsGroup:
  type: MustRunAs

planetf1 commented 4 years ago

In the Deployment yaml spec for our jupyter container I define:

    spec:
      securityContext:
        fsGroup: 100

which I think all hangs together and should work.

So I need to understand more about your environment, restrictions and versions.

juergenhemelt commented 4 years ago

I am still on Openshift 3.11

My restricted scc shows this for fsGroup:

fsGroup:
  type: MustRunAs

No ranges defined.

juergenhemelt commented 4 years ago

Ok. I found it. My namespace had a default value for openshift.io/sa.scc.supplemental-groups:

apiVersion: project.openshift.io/v1
kind: Project
metadata:
  annotations:
    openshift.io/description: ""
    openshift.io/display-name: ""
    openshift.io/requester: xce3579
    openshift.io/sa.scc.mcs: s0:c73,c32
    openshift.io/sa.scc.supplemental-groups: 1005320000/10000

So I changed fsGroup to 1005320000 in template/jupyter.yaml and it works.
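For anyone reading the annotation for the first time, here is a minimal Python sketch of how a value such as 1005320000/10000 can be read, assuming the usual `<start>/<size>` form. The helper name `parse_scc_range` is made up for illustration. My understanding is that with fsGroup set to MustRunAs and no explicit ranges, only the first ID of the block validates, so treat membership in the block as necessary rather than sufficient and check the SCC docs for your version.

```python
def parse_scc_range(annotation):
    """Parse a '<start>/<size>' annotation value into a Python range of IDs.

    Hypothetical helper: assumes the '<start>/<size>' form seen in this thread.
    """
    start_text, size_text = annotation.split("/")
    start = int(start_text)
    return range(start, start + int(size_text))


# Value copied from the Project annotations quoted in this thread.
block = parse_scc_range("1005320000/10000")

print(block.start, block[-1])     # first and last ID in the namespace's block
print(100 in block)               # is the chart's default fsGroup inside the block?
print(block.start == 1005320000)  # the value that was accepted is the block's first ID
```

Group 100 falls well outside the block, which matches the "100 is not an allowed group" event.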

juergenhemelt commented 4 years ago

Still not fixed. The container crashes now with denied access:

Fail to get yarn configuration. {"type":"error","data":"Could not write file \"/opt/conda/lib/python3.7/site-packages/jupyterlab/yarn-error.log\": \"EACCES: permission denied, open '/opt/conda/lib/python3.7/site-packages/jupyterlab/yarn-error.log'\""}
{"type":"error","data":"An unexpected error occurred: \"EACCES: permission denied, scandir '/home/jovyan/.config/yarn/link'\"."}
{"type":"info","data":"Visit https://yarnpkg.com/en/docs/cli/config for documentation about this command."}

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/traitlets/traitlets.py", line 528, in get
    value = obj._trait_values[self.name]
KeyError: 'runtime_dir'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/jupyter-lab", line 10, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.7/site-packages/jupyter_core/application.py", line 270, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/traitlets/config/application.py", line 663, in launch_instance
    app.initialize(argv)
  File "</opt/conda/lib/python3.7/site-packages/decorator.py:decorator-gen-7>", line 2, in initialize
  File "/opt/conda/lib/python3.7/site-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/notebook/notebookapp.py", line 1766, in initialize
    self.init_configurables()
  File "/opt/conda/lib/python3.7/site-packages/notebook/notebookapp.py", line 1380, in init_configurables
    connection_dir=self.runtime_dir,
  File "/opt/conda/lib/python3.7/site-packages/traitlets/traitlets.py", line 556, in __get__
    return self.get(obj, cls)
  File "/opt/conda/lib/python3.7/site-packages/traitlets/traitlets.py", line 535, in get
    value = self._validate(obj, dynamic_default())
  File "/opt/conda/lib/python3.7/site-packages/jupyter_core/application.py", line 100, in _runtime_dir_default
    ensure_dir_exists(rd, mode=0o700)
  File "/opt/conda/lib/python3.7/site-packages/jupyter_core/utils/__init__.py", line 13, in ensure_dir_exists
    os.makedirs(path, mode=mode)
  File "/opt/conda/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.7/os.py", line 211, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local'

planetf1 commented 4 years ago

Can you share any more about your security configuration in the openshift 3 cluster? There's some info at https://docs.openshift.com/enterprise/3.0/admin_guide/manage_scc.html

Can you follow through and determine

Within the dev team we've used simpler, developer-centric, default-configured openshift clusters, and have also deployed onto minikube, IKS, microk8s and k3s, so I'm unable to determine exactly which setting is causing this issue. I'd still like to help you get it working so we can improve our docs.

My openshift environment is v4. I will try v3 also.

There is a specific blog on running jupyter containers on openshift at https://blog.openshift.com/jupyter-openshift-part-2-using-jupyter-project-images/

From the error above it looks as if you have permissions issues writing data within the container

The blog entry documents a very similar error, along with the required steps to ensure the container runs as the jovyan user. Could you try those steps and see if it fixes your problem?
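To narrow down this kind of permission failure, a small diagnostic can be run inside the container (for example from a Python prompt in the pod). This is only a sketch: the function name `writable_report` is invented here, and the paths are simply the ones that appear in the traceback, so adjust them to your image.

```python
import os


def writable_report(paths):
    """Map each path to True/False: can the current process write there?

    For a path that does not exist yet, the nearest existing ancestor is
    checked instead, because that is where os.makedirs would fail first.
    """
    report = {}
    for path in paths:
        probe = path
        while not os.path.exists(probe) and os.path.dirname(probe) != probe:
            probe = os.path.dirname(probe)
        report[path] = os.access(probe, os.W_OK)
    return report


if __name__ == "__main__":
    # The uid/gid the container was actually started with (not necessarily jovyan).
    print("uid:", os.getuid(), "gid:", os.getgid(), "groups:", os.getgroups())
    paths = [
        "/home/jovyan/.local",
        "/home/jovyan/.config/yarn/link",
        "/opt/conda/lib/python3.7/site-packages/jupyterlab",
    ]
    for path, ok in writable_report(paths).items():
        print("writable    " if ok else "NOT writable", path)
```

If the uid printed is a large number from the namespace's range rather than the image's default user, the directories owned by that default user will typically show up as not writable.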

planetf1 commented 4 years ago

One specific suggestion -- this allows containers to run under any user:

oc adm policy add-scc-to-group anyuid system:authenticated

Obviously this does change the security profile of the platform. There are other alternatives, including extending the namespace annotation openshift.io/sa.scc.uid-range or creating a service account with a dedicated scc.

I verified this change on a clean openshift 3 cluster (IBM Cloud). Prior to the change I found the zookeeper image (we use the chart from Bitnami), with its default user of 1001, didn't start either. A clean openshift 4 cluster worked fine by default - but clearly may fail as security is tightened beyond the defaults.

This is an area we could benefit from documenting in more detail in future. The underlying info is relevant for all environments, though the defaults, and mechanisms of modification are openshift specific - or even cloud provider specific (and interestingly varied in my default installs of openshift 3 & openshift 4)

Therefore I propose to extend the documentation and testing in this area, though this won't be addressed immediately.

I would like to ensure you have enough to keep working (as will anyone else who finds this issue report).

Do the links above help?

juergenhemelt commented 4 years ago

Thx. The links do actually help. As I am not allowed to do this setting for my own namespace I asked the admins. I will let you know if it works as soon as possible.

planetf1 commented 4 years ago

Further reports of another issue where it was noted the restricted security context was in use:

[image: image (1)]

In this case the encrypted filestore connector is throwing an error. The code is manipulating permissions, so this looks like a security issue.

As a workaround the following was added to the common check notebook

def useClearConfigStore(platformURL, adminUserId):
    # Switch the platform's config store from the encrypted connector to the plain file-based one.
    # postAndPrintResult and the platform URL variables are expected to be defined earlier in the notebook.
    adminCommandURLRoot = platformURL + '/open-metadata/admin-services/users/' + adminUserId
    print("   ... switching config store to unencrypted...")
    url = adminCommandURLRoot + '/stores/connection'
    jsonContentHeader = {'content-type': 'application/json'}
    clearConfigStore = {
        "class": "Connection",
        "connectorType": {
            "class": "ConnectorType",
            "connectorProviderClassName": "org.odpi.openmetadata.adapters.adminservices.configurationstore.file.FileBasedServerConfigStoreProvider"
        },
        "endpoint": {
            "class": "Endpoint",
            "address": "omag.server.{0}.config"
        }
    }
    postAndPrintResult(url, json=clearConfigStore, headers=jsonContentHeader)

# Apply the workaround to each of the three lab platforms.
useClearConfigStore(corePlatformURL, adminUserId)
useClearConfigStore(devPlatformURL, adminUserId)
useClearConfigStore(dataLakePlatformURL, adminUserId)

No errors were then shown in relation to the config connector. However a little later on we get:

[image: image-2]

It looks to me as if something must be failing silently during the config step, i.e.:

  "   ...... (POST https://lab-dev:9443/open-metadata/admin-services/users/garygeeke/servers/cocoMDS1/local-repository/mode/in-memory-repository )\n",
      "   ...... Response:  {'class': 'VoidResponse', 'relatedHTTPCode': 200}\n",

Cool - we’ve just configured the in memory repository… then …

      "   ... configuring the short descriptive name of the metadata stored in this server...\n",
      "   ...... (POST https://lab-dev:9443/open-metadata/admin-services/users/garygeeke/servers/cocoMDS1/local-repository/metadata-collection-name/Data Lake Catalog )\n",
      "   ...... Response:  {'class': 'VoidResponse', 'relatedHTTPCode': 400, 'exceptionClassName': 'org.odpi.openmetadata.adminservices.ffdc.exception.OMAGConfigurationErrorException', 'actionDescription': 'setLocalMetadataCollectionName', 'exceptionErrorMessage': 'OMAG-ADMIN-400-008 The local repository mode has not been set for OMAG server cocoMDS1', 'exceptionErrorMessageId': 'OMAG-ADMIN-400-008', 'exceptionErrorMessageParameters': ['cocoMDS1'], 'exceptionSystemAction': 'The local repository mode must be enabled before the event mapper connection or repository proxy connection is set.  The system is unable to configure the local server.', 'exceptionUserAction': 'The local repository mode is supplied by the caller to the OMAG server. This call to enable the local repository needs to be made before the call to set the event mapper connection or repository proxy connection.'}\n",
      "   ... configuring the membership of the cohort...\n",

(awkward cut/paste!)

That error is complaining the local repository isn’t enabled!

So somehow (I’ve never seen it) it seems the config isn’t being saved? Maybe the error checking in the code isn’t good enough, but it is surely down to security within the container.
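On the error-checking point, a minimal sketch of what a stricter check on each admin response might look like in the notebook, based only on the response shape visible in the log above (`relatedHTTPCode`, `exceptionErrorMessage`). The names `ConfigStepError` and `check_response` are hypothetical and not part of the lab notebooks. It would not explain why the repository mode was lost, but it would stop the run at the first step that reports a failure instead of carrying on.

```python
class ConfigStepError(RuntimeError):
    """Raised when an admin call reports a relatedHTTPCode other than 200."""


def check_response(step, body):
    """Return the parsed JSON body if it reports success, otherwise raise."""
    code = body.get("relatedHTTPCode")
    if code != 200:
        message = body.get("exceptionErrorMessage", "no error message in response")
        raise ConfigStepError(f"{step} failed with {code}: {message}")
    return body


# Response bodies shaped like the two shown in the log above.
ok = {"class": "VoidResponse", "relatedHTTPCode": 200}
failed = {
    "class": "VoidResponse",
    "relatedHTTPCode": 400,
    "exceptionErrorMessage": "OMAG-ADMIN-400-008 The local repository mode "
                             "has not been set for OMAG server cocoMDS1",
}

check_response("set in-memory repository", ok)
try:
    check_response("set metadata collection name", failed)
except ConfigStepError as err:
    print(err)
```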

planetf1 commented 4 years ago

A kubectl logs & describe doesn't show anything untoward other than the use of the secure context ....

planetf1 commented 4 years ago

I suspect the issue in this case is writing to storage. So far we have not set up volumes for the lab notebook, but should do so. I suspect this may well address this problem.

For all the issues addressed here the first step is to set up a user/config to use the restricted context, then to write up the docs & add volumes as needed.

planetf1 commented 4 years ago

Looking at this further: I

If the scc change is not made, the openshift console will show that the kafka, zookeeper and jupyter pods could not be created because of the users and groups they request:

create Pod lab-kafka-0 in StatefulSet lab-kafka failed error: pods "lab-kafka-0" is forbidden: unable to validate against any security context constraint: [fsGroup: Invalid value: []int64{1001}: 1001 is not an allowed group spec.containers[0].securityContext.securityContext.runAsUser: Invalid value: 1001: must be in the ranges: [1004250000, 1004259999]]

Error creating: pods "lab-odpi-egeria-lab-jupyter-744469ccbb-" is forbidden: unable to validate against any security context constraint: [fsGroup: Invalid value: []int64{100}: 100 is not an allowed group]

create Pod lab-zookeeper-0 in StatefulSet lab-zookeeper failed error: pods "lab-zookeeper-0" is forbidden: unable to validate against any security context constraint: [fsGroup: Invalid value: []int64{1001}: 1001 is not an allowed group spec.containers[0].securityContext.securityContext.runAsUser: Invalid value: 1001: must be in the ranges: [1004250000, 1004259999]]

These are all third party components (egeria itself is well behaved).

The range or fixed users can also be set - some references as above.

There were no issues in this environment with config or storage
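When comparing several of these events it can help to pull out just the rejected IDs and the permitted range. The following Python sketch does that with regular expressions written against the message text quoted above. The function name `summarize_scc_error` is made up, and the patterns are an assumption that may need adjusting if the wording differs between openshift versions.

```python
import re

# Patterns matched against the event text as quoted in this thread.
FS_GROUP = re.compile(r"fsGroup: Invalid value: \[\]int64\{(\d+)\}")
RUN_AS_USER = re.compile(
    r"runAsUser: Invalid value: (\d+): must be in the ranges: \[(\d+), (\d+)\]"
)


def summarize_scc_error(message):
    """Extract the rejected fsGroup / runAsUser and the allowed uid range, if present."""
    summary = {}
    fs_match = FS_GROUP.search(message)
    if fs_match:
        summary["rejected_fsGroup"] = int(fs_match.group(1))
    user_match = RUN_AS_USER.search(message)
    if user_match:
        summary["rejected_runAsUser"] = int(user_match.group(1))
        summary["allowed_uid_range"] = (int(user_match.group(2)), int(user_match.group(3)))
    return summary


# The kafka event from above, shortened to the part inside the brackets.
kafka_event = (
    "[fsGroup: Invalid value: []int64{1001}: 1001 is not an allowed group "
    "spec.containers[0].securityContext.securityContext.runAsUser: "
    "Invalid value: 1001: must be in the ranges: [1004250000, 1004259999]]"
)
print(summarize_scc_error(kafka_event))
```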

github-actions[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 20 days if no further activity occurs. Thank you for your contributions.

planetf1 commented 3 years ago

The current error on OpenShift 4.8, if the security context is not changed, for the base chart is:

  Warning  FailedCreate      14s (x13 over 35s)  statefulset-controller  create Pod egeria-base-platform-0 in StatefulSet egeria-base-platform failed error: pods "egeria-base-platform-0" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.fsGroup: Invalid value: []int64{0}: 0 is not an allowed group, provider "ibm-restricted-scc": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-scc": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-hostpath-scc": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-hostaccess-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "ibm-privileged-scc": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

planetf1 commented 3 years ago

A reminder of the current change required on scc restricted (or a new policy):

fsGroup:
  type: MustRunAs

This must be changed to 'RunAsAny'.

And

runAsUser:
  type: MustRunAsRange

Similarly, this must change to 'RunAsAny'.
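Expressed as data, the two changes look like the Python sketch below, which applies them to a copy of an SCC represented as a plain dict. The helper name `relax_scc` and the cut-down `restricted` dict are illustrative assumptions, not something read from a live cluster. How the result is applied (for instance by hand with oc edit) is left to the cluster admin.

```python
import copy


def relax_scc(scc):
    """Return a copy of an SCC dict with fsGroup and runAsUser set to RunAsAny.

    Hypothetical helper: the input is left untouched, and the whole strategy
    entry is replaced so that no stale ranges remain.
    """
    relaxed = copy.deepcopy(scc)
    relaxed["fsGroup"] = {"type": "RunAsAny"}
    relaxed["runAsUser"] = {"type": "RunAsAny"}
    return relaxed


# Cut-down stand-in for the 'restricted' SCC, holding only the fields discussed here.
restricted = {
    "fsGroup": {"type": "MustRunAs"},
    "runAsUser": {"type": "MustRunAsRange"},
}

relaxed = relax_scc(restricted)
print(relaxed["fsGroup"]["type"], relaxed["runAsUser"]["type"])
```

Editing a copy (or a new policy) rather than the shared restricted SCC keeps the change scoped to the workloads that need it.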

planetf1 commented 3 years ago

Also worth noting that kafka failed otherwise with:

  Warning  FailedCreate      4m41s (x17 over 10m)  statefulset-controller  create Pod base-kafka-0 in StatefulSet base-kafka failed error: pods "base-kafka-0" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.fsGroup: Invalid value: []int64{1001}: 1001 is not an allowed group, spec.containers[0].securityContext.runAsUser: Invalid value: 1001: must be in the ranges: [1000660000, 1000669999], provider "ibm-restricted-scc": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-scc": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-hostpath-scc": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-hostaccess-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "ibm-privileged-scc": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

and zookeeper with:

  Warning  FailedCreate      5m16s (x17 over 10m)  statefulset-controller  create Pod base-zookeeper-0 in StatefulSet base-zookeeper failed error: pods "base-zookeeper-0" is forbidden: unable to validate against any security context constraint: [provider "anyuid": Forbidden: not usable by user or serviceaccount, provider restricted: .spec.securityContext.fsGroup: Invalid value: []int64{1001}: 1001 is not an allowed group, spec.containers[0].securityContext.runAsUser: Invalid value: 1001: must be in the ranges: [1000660000, 1000669999], provider "ibm-restricted-scc": Forbidden: not usable by user or serviceaccount, provider "nonroot": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-scc": Forbidden: not usable by user or serviceaccount, provider "hostmount-anyuid": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-hostpath-scc": Forbidden: not usable by user or serviceaccount, provider "machine-api-termination-handler": Forbidden: not usable by user or serviceaccount, provider "hostnetwork": Forbidden: not usable by user or serviceaccount, provider "hostaccess": Forbidden: not usable by user or serviceaccount, provider "ibm-anyuid-hostaccess-scc": Forbidden: not usable by user or serviceaccount, provider "node-exporter": Forbidden: not usable by user or serviceaccount, provider "ibm-privileged-scc": Forbidden: not usable by user or serviceaccount, provider "privileged": Forbidden: not usable by user or serviceaccount]

planetf1 commented 2 years ago

The egeria-base chart now runs with the default, restricted security context. Working on the lab chart (issues with nginx, egeria-ui [based on nginx] and jupyter)

planetf1 commented 2 years ago

The two main user charts

planetf1 commented 2 years ago

When running the Egeria Dojo yesterday (2022-01-17), it appeared that the egeria-base chart, and possibly lab, did not in fact work on OpenShift.

Reopening and creating a new cluster to validate

planetf1 commented 2 years ago

Checked with a clean deployment of 4.8.21_1537 (no changes to any security context)

Both egeria-base & odpi-egeria-lab work fine.

I suspect my failure was user error, either in manually editing security context, or deploying a chart, whilst trying to present and test at the same time!

Closing