odpi / egeria-coco-labs

Egeria Jupyter notebooks used in the Open Metadata Labs
Apache License 2.0

SSLError issue when have a test in lab chart (local jupyter) #3

Closed Kelukin closed 2 years ago

Kelukin commented 2 years ago

When I deployed the lab chart and ran common/environment-check.ipynb in Jupyter, I got the following exceptions:

Checking OMAG Server Platform availability...
Exception: HTTPSConnectionPool(host='lab-core', port=9443): Max retries exceeded with url: /open-metadata/platform-services/users/garygeeke/server-platform/origin (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:997)')))
    Core Platform is down - start it before proceeding
Exception: HTTPSConnectionPool(host='lab-datalake', port=9443): Max retries exceeded with url: /open-metadata/platform-services/users/garygeeke/server-platform/origin (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:997)')))
    Data Lake Platform is down - start it before proceeding
Exception: HTTPSConnectionPool(host='lab-dev', port=9443): Max retries exceeded with url: /open-metadata/platform-services/users/garygeeke/server-platform/origin (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:997)')))
    Dev Platform is down - start it before proceeding
Done.

I am unsure whether the cause is a mismatch between the Jupyter image and Egeria versions, since the chart sets the Jupyter tag to latest and the Jupyter image has just been upgraded to 3.10-SNAPSHOT.

planetf1 commented 2 years ago

I presume you installed the 3.9 version of the odpi-egeria-lab chart, i.e. the latest one:

➜  ~ helm  search repo lab
NAME                    CHART VERSION   APP VERSION     DESCRIPTION           
egeria/odpi-egeria-lab  3.9.0           3.9             Egeria lab environment

Did this work as-is?

You also mentioned you'd updated Jupyter to 3.10-SNAPSHOT. This image is based on a core Jupyter image upon which we've added our notebooks - nothing else.

As such it would probably work just fine, unless there are changes in egeria that make it incompatible. That's very unlikely, though it's not something we'd test.

If you're having issues reaching the Egeria platforms (which is what the errors you are seeing suggest), the first thing I'd check is that those pods are running, i.e. with:

kubectl get pods
kubectl get services

If they are not, you could also try:

kubectl describe pod <podid>

It may be that a pod is not starting due to an error, which could be related to security, for example, or to a resource constraint.

What is your k8s host environment?

planetf1 commented 2 years ago

In my first reply I missed one of your points - I had presumed you had modified your values.

As you point out, the values.yaml for that chart does indeed use 'latest' for the tag of the jupyter image.

I think we should change that back to be aligned to the release - I see it as a bug and will fix.

You can override the value locally via

helm install lab egeria/odpi-egeria-lab --set-string image.jupyter.tag=3.9

I'll check this out tomorrow, though I expect the reason your environment isn't working is different.

Kelukin commented 2 years ago

Thank you, @planetf1, for your quick reply. When I override the Jupyter tag to 3.9, it works without the SSL error.

Kelukin commented 2 years ago

When I set the Jupyter tag back to its default value, latest, the SSL error comes back. I deployed the lab chart on Azure Kubernetes Service.

This is quite strange, since all the services are in the Running state. Moreover, this failed deployment and the successful deployment above with a specific Jupyter tag both ran on the same AKS cluster.
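
A quick way to tell an unreachable platform apart from a certificate verification failure is to probe the port directly from the notebook. The following is a minimal sketch using only the Python standard library. `probe_platform` is a hypothetical helper, not part of the lab notebooks, and the default port is taken from the error output above. It verifies against the system trust store, which may differ from the CA bundle the notebooks use.

```python
import socket
import ssl


def probe_platform(host, port=9443, timeout=5.0):
    """Classify a platform endpoint.

    Returns one of: 'unreachable', 'tls-verify-failed', 'tls-error', 'ok'.
    """
    try:
        # Step 1: plain TCP connect. Failure here means the service is down
        # or not reachable from this pod.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"

    # Step 2: TLS handshake with default (verifying) settings.
    context = ssl.create_default_context()
    try:
        with sock, context.wrap_socket(sock, server_hostname=host):
            return "ok"
    except ssl.SSLCertVerificationError:
        # The server answered, but its certificate chain is not trusted.
        return "tls-verify-failed"
    except (ssl.SSLError, OSError):
        return "tls-error"
```

If the platforms are up but their self-signed chain is not trusted, a call such as `probe_platform("lab-core")` would be expected to report `tls-verify-failed` rather than `unreachable`.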

planetf1 commented 2 years ago

I've fixed the released charts & published 3.9.1 to correct the errors - thanks for reporting.

By setting only the Jupyter image to latest, I can reproduce the problem as originally reported.

To install with our current development code:

helm install lab egeria/odpi-egeria-lab --set-string egeria.version=3.10-SNAPSHOT

This also fails - even though we should be using a consistent set. I couldn't see any change in our certificates, so I wonder if something is different in the container environment. Our Jupyter image is based on docker.io/jupyter/base-notebook:latest. The main change at https://github.com/jupyter/docker-stacks/commits/master/base-notebook is a Python version bump.

I'll take a look at our notebooks to fix....

FYI, the medium-term plan is to:

planetf1 commented 2 years ago

Using the 'master' version of the containers with the latest charts now works OK, i.e.:

helm install lab egeria/odpi-egeria-lab --set-string egeria.version=3.10-SNAPSHOT

This follows the PR above, which pinned the Jupyter version to the one we previously used, i.e. before some SSL and Python library changes.

I will revisit this when we refactor the notebooks and (hopefully) get rid of our customized container - within the next few months.

Thanks for the report. I think all the changes are made now so will close.

planetf1 commented 2 years ago

If you get any further issues on Azure let us know.

Kelukin commented 2 years ago

Thank you, @planetf1! Everything looks fine now.

planetf1 commented 2 years ago

I'm going to re-open this, as the problem still occurs when trying to run the notebooks locally.

However, any fix will be rolled into the proposed changes to migrate the notebooks to their own repository and use a stock image.

We should also ensure the docs on running locally are updated at the same time.

Re-opening and moving to base, as the charts are no longer affected now that a workaround has been implemented.

dwolfson commented 2 years ago

Been doing some research and haven't yet found what changed. Ultimately, the call seems to use the requests package (https://requests.readthedocs.io/en/latest/user/advanced/#ssl-cert-verification), which uses the urllib3 package (https://urllib3.readthedocs.io/en/stable/user-guide.html).

There are some hints that the package certifi is used and that it changes often. There are additional hints that the .pem file needed for validation must contain the full chain of certs from root to intermediate to local.
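
To check the hint about the full chain, a small script can count how many certificates a .pem file actually contains. This is a minimal sketch using only the standard library. `count_pem_certificates` and `bundle_has_chain` are hypothetical helpers, and the check is a count-only heuristic that does not validate the certificates or their order.

```python
import re

# Matches one PEM-encoded certificate block.
CERT_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----.+?-----END CERTIFICATE-----", re.DOTALL
)


def count_pem_certificates(pem_text):
    """Return the number of certificate blocks found in PEM text."""
    return len(CERT_BLOCK.findall(pem_text))


def bundle_has_chain(path, minimum=2):
    """Heuristic: True if the file holds at least `minimum` certificates,
    e.g. a root plus an intermediate. Checks the count only, not validity."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return count_pem_certificates(handle.read()) >= minimum
```

If the bundle looks complete, its path can be given to requests through the `verify` argument or the `REQUESTS_CA_BUNDLE` environment variable. `certifi.where()` reports the location of the default bundle.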

planetf1 commented 2 years ago

As part of the refactoring I plan to no longer build a special container, but rather use the standard Jupyter containers, which will retrieve our notebooks via an init script (supported by Jupyter container) or if that fails, a k8s init container.

Once that is done, I will address any issues with SSL certs. Leaving open until then.

planetf1 commented 2 years ago

The refactoring is mostly done. Unfortunately I wasn't able to get the certificates sorted in the time available - in part because we need to create certs at deployment time.

Nor was it easy to control the behaviour within the environment, in part due to changes in the requests module and python. There's a capability to create a session - and manage context that way which was appealing, but also too much refactoring for now.
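
For reference, the session approach mentioned above could look roughly like the sketch below. It assumes the third-party requests package. `make_platform_session` and the `ca_bundle` path are hypothetical and are not what the notebooks currently do.

```python
import requests


def make_platform_session(ca_bundle=None):
    """Create a Session whose TLS verification is configured in one place.

    ca_bundle: path to a PEM file holding the CA chain (hypothetical, e.g.
    generated at deployment time). None keeps the default verification.
    """
    session = requests.Session()
    # Every request made through this session inherits this setting.
    session.verify = ca_bundle if ca_bundle is not None else True
    return session
```

Calls would then go through `session.get(...)` and `session.post(...)` instead of the module-level functions, which is where most of the refactoring effort would lie.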

Therefore the first pass (along with moving the code) has disabled cert checking directly in the calls to the requests module.

As such I think this issue is now addressed.
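
A workaround of this kind, with certificate checking disabled in the calls to requests, might look like the following. This is a minimal sketch, not the notebooks' actual code. `origin_url` and `check_platform` are hypothetical names, and the URL path and default user are taken from the error output earlier in this issue.

```python
import requests
import urllib3


def origin_url(platform_url, user="garygeeke"):
    """Build the platform origin URL seen in the error output."""
    return (f"{platform_url.rstrip('/')}/open-metadata/platform-services"
            f"/users/{user}/server-platform/origin")


def check_platform(platform_url, user="garygeeke"):
    """Return True if the platform answers; certificate checks are skipped."""
    # Silence the warning urllib3 emits for every unverified HTTPS request.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    try:
        response = requests.get(origin_url(platform_url, user),
                                verify=False, timeout=10)
    except requests.exceptions.RequestException:
        return False
    return response.status_code == 200
```

Passing `verify=False` removes protection against an impersonated server, so it suits a lab environment with self-signed certificates rather than a production deployment.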