opendatahub-io / odh-dashboard

Dashboard for ODH
Apache License 2.0

Enable GRPC support for model serving routes #1149

Closed lucferbux closed 11 months ago

lucferbux commented 1 year ago

Goal

Enable support for gRPC model serving, both when creating the model server and when displaying the routes of deployed models.

Dependency issue

Itemized goals

  1. Support gRPC for deployed models in a DS Project
  2. Support gRPC for models in the global Model Serving view
  3. Enable gRPC when deploying a model

Mocks

Additional details

lucferbux commented 1 year ago

@heyselbi @Jooho Let's have this conversation here. Do you have a target release for gRPC? Today we are aiming to merge Custom Model Templates, so this will be left out of the feature freeze's PR. It would be nice to have a roadmap so we know whether we can get this in next sprint.

cc. @andrewballantyne

andrewballantyne commented 1 year ago

cc @vconzola (@kywalker-rh) & @jedemo

GRPC is missing this release as it stands.

jedemo commented 1 year ago

I thought this was on track for 1.27? Are there any issues we need to discuss to move this forward? Thanks.

cc @taneem-ibrahim

heyselbi commented 1 year ago

https://github.com/heyselbi/odh-model-controller/tree/rhods-6401-grpc

This is on track for 1.27, considering that the code freeze is this coming Friday. I am currently testing it with gRPC calls to make sure the port is working as intended.

The UI part can be enabled by annotations on the ServingRuntime: enable-route, enable-http, enable-auth, and enable-grpc, all of which have to be set to true for HTTPS and gRPC routes to be exposed. If the user desires unsecured HTTP, then enable-route, enable-http, and enable-grpc have to be set to true.
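As a sketch, those annotations would sit on the ServingRuntime roughly like this; only the four flag names come from the comment above, while the resource name and the rest of the metadata are illustrative:

```yaml
# Sketch only: flag names taken from the comment above; other values illustrative
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-runtime
  annotations:
    enable-route: "true"
    enable-http: "true"
    enable-auth: "true"   # together with enable-grpc, exposes secured HTTPS and gRPC routes
    enable-grpc: "true"
spec: {}                  # runtime spec omitted for brevity
```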

shalberd commented 1 year ago

@heyselbi so enable-auth makes the route passthrough? I am asking because of this background article: https://cloud.redhat.com/blog/grpc-or-http/2-ingress-connectivity-in-openshift I have never tested TLSTerminationReencrypt instead of passthrough, and I am not sure whether the issue under the heading "Known Limitation" in that article still applies to end-to-end client communication via HTTP/2. I am also not sure why any client would still use WebSocket, since libraries like Akka HTTP have full HTTP/2 support now. And I guess the re-negotiation issue does not arise if a client uses HTTP/2 and protocol buffers from the start.

By the way, very neat to see grpc and http/2 support via dashboard.

heyselbi commented 1 year ago

Update for yesterday: we were able to make gRPC work as a PoC, and now I am working on implementing those changes in the repo. It was done by changing the mm container port to HTTPS by adding a TLS cert. For this I needed to update the deployment template in the manifests, which happens to live in four different places across two repos. Here are the draft PRs: https://github.com/opendatahub-io/modelmesh-serving/pull/106 https://github.com/opendatahub-io/odh-manifests/pull/839 My dev cluster is down for some reason, but I will re-create a new one today (Tue) and test that these changes don't affect modelmesh-serving functioning overall.

@shalberd yes, the route will change to passthrough. I also confirmed (and tested successfully) that if we use passthrough in route and TLS in mm container, we don't need to enable http2 on cluster wide ingress.
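In Route terms, that combination corresponds to a passthrough TLS route in front of the TLS-enabled modelmesh port, roughly as in the following sketch; the resource name, service name, and port name are illustrative:

```yaml
# Sketch only: names and port are illustrative, not the actual manifests
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: example-model-grpc
spec:
  to:
    kind: Service
    name: modelmesh-serving   # illustrative service name
  port:
    targetPort: grpc          # the TLS-enabled gRPC port on the mm container
  tls:
    termination: passthrough  # TLS terminates in the pod, so HTTP/2 survives end to end
```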

lucferbux commented 1 year ago

Hey @heyselbi do you have any ETA for the completion of gRPC, just to start testing with your backend code in order to validate the UI changes.

heyselbi commented 1 year ago

Hi @lucferbux .

In short, unfortunately no ETA yet; we are running into cert issues on the modelmesh-serving side with the passthrough route. Jeff was notified, and the decision we made is that I test the re-encrypt route; from the community updates, it sounds like it might work now (it was previously not recommended and labeled as a limitation). Another test for me is to try to create the custom self-signed cert in multiple namespaces, but that altogether might not be preferable due to the self-signed nature of the certificate.

Long answer is what I have shared with Jeff to keep him in the loop:

Just wanted to give you an update on gRPC-related efforts and get your thoughts. It has gotten a lot more complicated as I was working through it.

  1. gRPC is already available through port-forwarding: (see ref: https://github.com/kserve/modelmesh-serving/blob/main/docs/quickstart.md#3-perform-an-inference-request )

  2. If we want to expose a gRPC route, it is recommended to be passthrough, which means a client request is sent to the target service without terminating TLS. The reencrypt option is not advised because it can cause issues (ref: see “Known Limitation” in https://cloud.redhat.com/blog/grpc-or-http/2-ingress-connectivity-in-openshift ).

This means that, for this to work, the receiving web server needs to be HTTPS, which means it needs a TLS secret (i.e., a certificate). So I tried re-using the self-signed certificate that we already generate for oauth-proxy; OpenShift generates it automatically. The gRPC route was finally able to connect to the modelmesh service (which is awesome news), but then the inference service was unable to communicate with the serving runtime, because now all internal communications needed to be secure and the DNS had to have a specific custom name. kserve/modelmesh suggests two options for this (ref: https://github.com/kserve/modelmesh-serving/blob/main/docs/configuration/tls.md#tls-configuration ):

a) Generate a self-signed certificate using openssl. It must have a custom DNS name. Our controller (inherited from kserve) apparently already takes care of all internal communications, making them secure. b) Use cert-manager.

So option a) is a bit tricky, since self-signed certificates are not preferable in general; some industries may not allow self-signed certificates on external routes at all. We also tested it, and it worked only in one namespace; the test failed when trying to do the same in multiple namespaces. Option b), cert-manager, is not owned by Red Hat; adding cert-manager to ODH is a major undertaking, and it is even trickier for RHODS.
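A minimal sketch of option a): generating a self-signed cert with openssl whose subject alternative names cover the in-cluster service DNS names. The service name, namespace, and secret name below are illustrative placeholders, not the exact values modelmesh expects:

```shell
# Sketch only: names are illustrative; adjust NAMESPACE and SANs to your setup
NAMESPACE="modelmesh-serving"
openssl req -x509 -newkey rsa:4096 -sha256 -days 365 -nodes \
  -keyout tls.key -out tls.crt \
  -subj "/CN=modelmesh-serving.${NAMESPACE}" \
  -addext "subjectAltName=DNS:modelmesh-serving,DNS:modelmesh-serving.${NAMESPACE},DNS:modelmesh-serving.${NAMESPACE}.svc"
# Then store the pair as a TLS secret for the serving stack (requires oc/kubectl):
# oc create secret tls modelmesh-certificate -n "${NAMESPACE}" --cert=tls.crt --key=tls.key
```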

If we agree to use self-signed certificates, we have two options: i) we hard-code self-signed certificate generation, with the custom DNS name, into our controller, per modelmesh requirements; ii) (exploring phase) we look into modifying the self-signed certificate that OpenShift generates for oauth-proxy to use a custom DNS name. It is unclear from the OpenShift docs how, or whether, this is possible.

lucferbux commented 1 year ago

> In short, unfortunately no ETA yet, we are running into cert issues on modelmesh-serving side with passthrough route. […]

Ok, I was asking because @jedemo had this as one of the top priorities. I'm still in the loop, so when you have the changes we can merge the frontend PR asap. Thanks!

lucferbux commented 1 year ago

Hi @heyselbi I'll be on PTO most of this week; I'll be back at the beginning of the next sprint. What's the status of the gRPC implementation? Are we aiming for a specific sprint, or are we still blocked by the issues you mentioned last time? Thanks in advance!

heyselbi commented 1 year ago

An update on the dev side: I have been able to run gRPC inferencing in modelmesh-serving on OpenShift with ODH in multiple namespaces, using a passthrough route. The only caveat is that it requires a certificate. For dev work we use a self-signed cert generated with openssl. The cert needs to have a specific custom DNS name, which modelmesh-serving recognizes. I am currently testing my instructions for gRPC inferencing, and two of my teammates will have to test them too to verify that everything works as we expect; I will confirm once the verification is complete.
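For reference, an external inference call through such a passthrough route might look like the following command sketch. Everything here is a placeholder: it assumes grpcurl is installed, the KServe v2 inference proto (as shipped in the modelmesh-serving repo) is available locally, and the route host is whatever OpenShift assigned:

```shell
# Sketch only: host and proto path are hypothetical; tls.crt is the CA backing the route
grpcurl \
  -cacert tls.crt \
  -proto kfs_inference_v2.proto \
  model-grpc-myproject.apps.example.com:443 \
  inference.GRPCInferenceService/ServerLive
```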

This might necessitate some discussions around the UX/UI of it: we will need to ask the user/customer to provide a certificate with a custom DNS name, and/or we could offer a self-signed one with a warning that it is meant only for testing (not for production). I am also looking into cert-manager; this alternative implies the customer will have to install the cert-manager operator on their own (i.e., we won't be providing it as part of ODH/RHODS).

As these discussions happen, I will start adding the gRPC instructions logic into our odh-model-controller.

@lucferbux suggested we start these discussions next sprint. QE said that they could start looking into it for 1.33 release the earliest due to operator work.

cc @taneem-ibrahim @Jooho

shalberd commented 1 year ago

@heyselbi I, too, think passthrough is the way to go.

> if we use passthrough in route and TLS in mm container, we don't need to enable http2 on cluster wide ingress

That is what the cluster admins and network folks at my company confirmed, too. As long as we can have true end-to-end HTTP/2 that way (say, an API consumer at a customer partner coming in through the corporate WAF and being directed to the OpenShift ingress), that is a good way to go. I am not sure what you mean by not enabling HTTP/2 on the cluster-wide ingress; wouldn't a lot of the benefits be lost by not doing that?

https://cloud.redhat.com/blog/grpc-or-http/2-ingress-connectivity-in-openshift

Requiring a certificate is not a bad thing. The certificate itself could possibly be added somewhere in the odh-dashboard code, and there could be a UI section for pasting it in if wished. It doesn't matter whether it is self-signed or generated by a PKI (on-prem or public). I have not looked at the cert-manager operator, whether it is for cert issuance or for mounting certs into the pod, but see my next few lines here. If it is for cert issuance, I'd posit: leave that to the OpenShift cluster admins or corporate IT admins.

Do you have an architecture draft or an architecture group on this whole topic? If yes, I'd love to participate.

> Some industries may altogether not allow self-signed certificates with external routes. We also tested it, and it worked only in one namespace. The test failed when trying to do the same in multiple namespaces.

Yes, that is an absolute no-go. Same for some operator code generating certificates. See my issue with certificate trust in corporate environments further below, too.

> The cert needs to have a specific custom DNS, which modelmesh-serving recognizes

Looking forward to details on this. Kubeflow folks upstream are often very hacky when it comes to security and certificates and all that.

We need to take CA trust into account. In OpenShift, cluster admins already have the possibility to inject a combined bundle of all publicly trusted CAs (at the CoreOS level) together with any additionally trusted CA certificates, such as self-signed certs or corporate PKI root and intermediate CAs that are not publicly trusted, into a ConfigMap, and then mount that ConfigMap into a volume path in the container.

Do you have that on the radar as well? @harshad16 and @VaishnaviHire worked on this at the level of notebook containers spawned by the notebook controller in https://github.com/opendatahub-io/kubeflow/pull/43/files#diff-447c1a1ddad4e46669c4371d0d9714dad4a3368c5ccf2292b356e3b5c7441ce1, and I can say it is working beautifully.

For Go, once you take that CA bundle and mount it to /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem, you have got all trust issues covered, even for certificates issued by private PKIs or self-signed ones. For other languages like Python, you give that full path to the env variables used by Python tooling, like PIP_CERT and REQUESTS_CA_BUNDLE. The approach requires working with OpenShift cluster admins, but in corporate contexts that happens anyway, basically any time you go beyond dev scenarios :-)
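The injection mechanism described above can be sketched roughly as follows; the ConfigMap label is the standard OpenShift one, while the ConfigMap and volume names are illustrative:

```yaml
# ConfigMap labeled for injection; OpenShift populates ca-bundle.crt automatically
apiVersion: v1
kind: ConfigMap
metadata:
  name: odh-trusted-ca-bundle            # illustrative name
  labels:
    config.openshift.io/inject-trusted-cabundle: "true"
---
# Pod spec fragment (illustrative): mount the bundle where TLS libraries look
# volumes:
# - name: trusted-ca
#   configMap:
#     name: odh-trusted-ca-bundle
#     items:
#     - key: ca-bundle.crt
#       path: tls-ca-bundle.pem
# volumeMounts:
# - name: trusted-ca
#   mountPath: /etc/pki/ca-trust/extracted/pem
#   readOnly: true
```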

lucferbux commented 1 year ago

@kywalker-rh @vconzola Tagging you guys since we might need UX work here.

@heyselbi confirm us if we need certificates for this, and if so, we can add this topic to the ux meeting to start a discussion.

Thanks a lot!

lucferbux commented 1 year ago

@Jooho @heyselbi Hi! I was reviewing this to unblock grpc asap, Selbi mentioned that we need self-signed certificates, is this somewhat related to https://github.com/opendatahub-io/odh-dashboard/issues/1711 ?

If so, we can include @xianli123 in the conversation in case we need to add some ux here too @vconzola

xianli123 commented 1 year ago

Thanks for being involved, @lucferbux. I will take some time to research this issue.

xianli123 commented 1 year ago

@lucferbux I am wondering who designed these mockups.

https://www.sketch.com/s/5c26573b-fd37-4d54-a4f7-aab1cc17ff5d/a/WdKx7Vk https://www.sketch.com/s/5c26573b-fd37-4d54-a4f7-aab1cc17ff5d/a/EeLgEmq https://www.sketch.com/s/5c26573b-fd37-4d54-a4f7-aab1cc17ff5d/a/x0rPwZ0

vconzola commented 1 year ago

Hi @xianli123, those are mine. Did you have questions about them?

xianli123 commented 1 year ago

@vconzola LoL, sorry. I just thought I needed to do something about this issue. I didn't have any questions. And now it can be returned to its rightful owner.

dgutride commented 11 months ago

Migrated to: https://issues.redhat.com/browse/RHOAIENG-560