Support for service resources using virtual private endpoints, public access, and service-to-service controls

davidbreitgand commented 4 months ago

Original title:

Support of Services as Communicating Endpoints via VPC Endpoints (a.k.a. Virtual Private Endpoints)

Support communication across VPCs using Virtual Private Endpoints. In the first stage, support this within the same cloud across different regions. In the second stage, support this across clouds. The detailed proposal is here.

Sub-tasks:

[ ] #447
[x] #417
[ ] #446
[ ] #445
[ ] #444

divega commented 4 months ago

Triage: Made some updates to the title to make it a bit more generic and use this issue to track general support for PaaS resources in Paraglider, within and across clouds.

divega commented 4 months ago

Some minor feedback on the one pager:

1. I agree we should tackle this in Paraglider, and I like the general approach.
2. I would prefer us to adopt the term "PaaS resources" instead of "services" when we talk/think about this. On one hand, "services" is a bit too overloaded; on the other hand, the items that we want customers to be able to connect to using Paraglider are not the services themselves but the individual instances, such as a cloud database server, a cloud storage account, or a cloud AI services account. I find that the common word "resources" makes sense because it provides conceptual unification: from the perspective of Paraglider/connectivity intent, they are just resources in the same way that VMs and K8s clusters are resources. They just happen to be implemented and connected in a different way under the hood.
3. For both K8s clusters and VMs we have the same createResource API, so maybe we can get away with keeping only minimal differences in the API and CLI and just infer what needs to happen under the hood from the resource type, rather than requiring users to differentiate on every call.
4. Besides providing private connectivity via virtual private endpoints, I think setting up PaaS resources with public connectivity (and firewall rules to control that traffic), or with both private and public connectivity, should be in scope.
5. I know from working on related technologies that supporting controls for PaaS services doesn't stop at the private endpoint but also requires controls for PaaS-to-PaaS communication.
6. Technologies such as VPC service controls offer uniform (within-cloud) solutions for (4) and (5). I think it would be good to keep this in the picture for Paraglider, at least for the clouds that have a way to express it.
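
To make the idea of inferring the mechanism from the resource type more concrete, here is a minimal Go sketch. All names in it (`ResourceKind`, `Attachment`, `attachmentFor`) are hypothetical and are not taken from the Paraglider codebase; it only illustrates that a single create call could pick direct attachment, a private endpoint, or public access with firewall rules without the caller naming the mechanism.

```go
package main

import "fmt"

// ResourceKind is a hypothetical classification used only for this sketch.
type ResourceKind int

const (
	KindVM ResourceKind = iota
	KindK8sCluster
	KindManagedService // e.g. a cloud database server or storage account
)

// Attachment names how a resource would be wired to the network (assumed values).
type Attachment string

const (
	AttachDirect          Attachment = "direct"
	AttachPrivateEndpoint Attachment = "private-endpoint"
	AttachPublicFirewall  Attachment = "public-with-firewall"
)

// attachmentFor chooses the mechanism from the resource kind, so callers
// use the same create call for every kind of resource.
func attachmentFor(kind ResourceKind, wantPublic bool) (Attachment, error) {
	switch kind {
	case KindVM, KindK8sCluster:
		return AttachDirect, nil
	case KindManagedService:
		if wantPublic {
			return AttachPublicFirewall, nil
		}
		return AttachPrivateEndpoint, nil
	default:
		return "", fmt.Errorf("unknown resource kind %d", kind)
	}
}

func main() {
	a, err := attachmentFor(KindManagedService, false)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a) // private-endpoint
}
```

A resource that needs both private and public access would need a richer return type than a single `Attachment`; the sketch leaves that out.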

davidbreitgand commented 4 months ago

Thanks for the feedback.

1. Great!
2. I see the value of a unifying "resource" concept, and I actually thought about a service as another type of resource. I don't think "PaaS resources" captures it, because people often debate how cloud services divide into IaaS, PaaS, and SaaS, and VPC Endpoints (VPEs) are used in all three of these contexts; the unifying theme is "as-a-Service". To exemplify:

IaaS context: instead of connecting a Consumer instance (say, a VM) to an individual Producer instance (say, a VM), one uses an LB at the Producer side as a front-end, with the Producer VM as a back-end. This is useful when you need to migrate a VM, create a horizontal auto-scaling group, take a VM off for maintenance and swap it for another one transparently to the Consumer, or allow a local IP address space that can overlap with that of the Consumer. In the special case, there is just an LB and a single VM instance registered with it.
PaaS context: this is what you describe.
SaaS context: VPC Endpoints are also used in the SaaS context (see: https://docs.aws.amazon.com/vpc/latest/privatelink/privatelink-access-saas.html)

In all three cases, there is the same underlying structure: a frontend (LB) and (possibly a number of) backend instances. "Service" captures exactly this structure. The value proposition of the VPC Endpoint is that it is not just a private endpoint address, but a fully managed network with all the good properties mentioned in the design note.

I think that down the road Paraglider needs to support all three use cases to merge seamlessly with the ecosystem. (a) VPE for the IaaS use case. Pros: see above. Cons: additional cost for the LB, which is not warranted if you are connecting a single VM rather than an auto-scaling group. However, if you want control over the underlay in all circumstances, you might wish to pay the cost of the LB even when connecting two individual VM instances. (b) VPE for the PaaS use case: pros, no cons(?). (c) VPE for the SaaS use case: the same.

The beauty of using it across the board, when you need control over the underlay and want the cloud provider to manage the connectivity, is that VPE is exactly the same API in all three use cases.

3. I agree. It would be good to keep the changes to a minimum, and I think that for use cases (b) and (c) above, this is entirely possible. However, in the case of connecting a resource instance to an auto-scaling group of VMs as described above, I do not see how you can preserve the current createResource API, which simply unmarshals a JSON stanza and applies it as-is to the respective cloud. The problem is that in this case we need a composition capability that would execute a mini-flow of creating a few resources (VM, LB, autoscaling group) and configuring them. But this might be a natural and even nice extension, in which the plugin executes part of the flow "by delegation", resulting in a hierarchical structure.
4. Agree, but given what I wrote above, I would rather call them "service resources" :)) that can be accessed privately or publicly. It would be a matter of the connectivity rules that you set.
5. This is true, but then we need to decide what Paraglider's scope is in the short term, medium term, and long term, and which specific PaaS resources you want to interconnect. E.g., if this is about K8s-to-K8s cluster interconnectivity, then it's a whole bunch of work on top of VPE. Think ClusterLink, Skupper, Istio. IMHO, they all can leverage the private connectivity offered by VPE, but they also do their own things w.r.t. networking on top of the provider network. Makes sense?
6. Are VPC Service Controls-like offerings also provided in AWS and Azure? This is very interesting. In GCP, VPC Service Controls is already integrated with Private Google Access, which is the Google name for the VPE functionality: https://cloud.google.com/vpc-service-controls/docs/private-connectivity I think we definitely need to take a look at this, and if this is something supported universally, then this will make the value proposition even stronger.

@smcclure20 , @praveingk , what do you think?

divega commented 3 months ago

@davidbreitgand, thanks for the writeup, and sorry I didn't catch up with it earlier. You make good points, and I have updated the title to remove the term "PaaS".

Anyway, I'll try to explain my point of view, hoping it will help address/reconcile some conceptual or terminology differences:

On Azure, private endpoints can be used in two scenarios:

1. To access cloud provider-owned PaaS services that have built-in "private link" capabilities.
2. To access customer-owned private link services.

In my day-to-day, "PaaS resource" generally implies that there is a multi-tenant service running on the cloud provider's shared infrastructure. The unit of consumption is a resource, but generally, that resource is not associated with a unique, customer-usable IP address unless they use a private endpoint to connect to it from a virtual network.

This contrasts with my typical use of "IaaS resource" to describe customer-owned compute resources directly connected to a customer-owned virtual network.

My knowledge of private link services (point 2 above) is limited, but I believe that on Azure, private link services would be used in both of the scenarios you referred to as "private endpoints to IaaS" and "private endpoints to SaaS". The main difference between them is that in the case of a SaaS offering, the private link service belongs to a different tenant, who built it for consumption by other customers on the same cloud.

Although I see the commonality between (1) and (2) (i.e., they both use private endpoints) and would be happy to see Paraglider leverage that commonality to get "more bang for the buck," I am having a difficult time seeing how to do it.

For cloud-owned PaaS services (1), my main focus is on solving the impedance mismatch between "IaaS" and "PaaS" that today forces customers to deal with too many moving parts to achieve an otherwise simple intent. For example, "my application on 'vm1' needs access to my database 'sql2' and my cloud storage 'blob3'" should be easily expressed in Paraglider and result in all the necessary virtual networks and private endpoints being provisioned. In this scenario, the existence of a load balancer and the backend architecture are already abstracted away by the cloud platform, and I would love to see Paraglider also abstract away the usage of private endpoints.

I would hope that Paraglider could also help model the customer-owned private link service case (2). I just don't see a similar opportunity to abstract away much of the complexity: If a customer wants to use Paraglider to define their service, I expect them to need #274 to model the load balancer explicitly and other Paraglider constructs to model their backend's topology. Paraglider could then provide a way to specify that this entire thing is "packaged" as a service and exposed in such a way that access to it can be private/not require going through the internet. Will this require Paraglider customers to explicitly add private links as a network function? I don't know. It would be great to hear how you imagine the Paraglider story for it looking.

Support for private endpoints to SaaS should look pretty much like the PaaS scenario (1) from the perspective of a customer that is only interested in consuming the service, and like the IaaS scenario (2) for the SaaS vendor defining the service.

Regarding GCP's VPC service controls, I mentioned them because the intent often also involves PaaS-to-PaaS. For example, "my application on 'vm1' needs access to my database 'sql2', and both my application on 'vm1' and my database 'sql2' need to access my cloud storage 'blob3'." Talking to @smcclure20, I mentioned that although we might initially not support this (that is, based on my typical use of the terminology, support "IaaS-to-IaaS" and "IaaS-to-PaaS," but not "PaaS-to-PaaS"), from the Paraglider perspective this will translate to not supporting certain combinations of rules between specific endpoints, which may feel quirky and difficult for Paraglider customers to predict. Also, what would the defined behavior be when you can't specify PaaS-to-PaaS controls? Should 'sql2' and 'blob3' be able to communicate with each other or be isolated by default? We can talk in person about options here. I am not aware of what AWS offers on this.
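
The default-behavior question can be made concrete with a small Go sketch. It assumes one possible answer (isolation by default, with explicit directional permits) purely for illustration; the thread leaves the actual choice open, and `Policy`, `Allow`, and `Permits` are hypothetical names.

```go
package main

import "fmt"

// pair is a directed (source, destination) combination of resource names.
type pair struct{ From, To string }

// Policy holds explicit permits plus the behavior for everything else.
type Policy struct {
	allowed      map[pair]bool
	DefaultAllow bool // false means resources are isolated unless permitted
}

// NewPolicy returns an empty policy with the given default.
func NewPolicy(defaultAllow bool) *Policy {
	return &Policy{allowed: map[pair]bool{}, DefaultAllow: defaultAllow}
}

// Allow records an explicit, one-directional permit.
func (p *Policy) Allow(from, to string) {
	p.allowed[pair{from, to}] = true
}

// Permits reports whether traffic from one resource to another is allowed.
func (p *Policy) Permits(from, to string) bool {
	if p.allowed[pair{from, to}] {
		return true
	}
	return p.DefaultAllow
}

func main() {
	p := NewPolicy(false) // assumed: isolated by default
	p.Allow("vm1", "sql2")
	p.Allow("vm1", "blob3")
	p.Allow("sql2", "blob3") // the PaaS-to-PaaS rule from the example

	fmt.Println(p.Permits("sql2", "blob3")) // true
	fmt.Println(p.Permits("blob3", "sql2")) // false
}
```

If a cloud cannot express the `sql2` to `blob3` permit, the open question is whether that pair should fall back to `DefaultAllow` or be rejected as an unsupported rule.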

smcclure20 commented 3 months ago

Thanks for the discussion on this! I will try to quickly summarize how I see this issue and how Paraglider should tackle it.

PaaS resources (or whatever we would like to call them :) ). To me, the defining feature that (regardless of their name) delineates them from other resources is their multitenancy. Using this as the definition, I think there are only two interesting subgroups: public and private. For public multitenant services, Paraglider can use these today (barring maybe some additions like using an FQDN in rules and resources having public IPs). You just add a permit list rule allowing traffic to the service from your private endpoint and the connection will work. The more interesting part is when they are private, and how these services are exposed privately varies greatly across clouds. But the minimal requirement that Paraglider needs is true regardless of the implementation: the service is reachable via a private IP connected to your virtual network. I think, at least as a first cut, everything beyond this can be hidden from the user. Now, I do think there are interesting questions about how to integrate more tightly with these private link-like services and the policy they expose. But we are then moving increasingly into the application layer, which Paraglider does not necessarily have to do. As we've discussed in other design meetings, I think it is okay if each cloud takes its own stance on where this line is drawn for Paraglider.
In the above discussion, I saw some mention of services created by the tenant. These fall into a different category from the PaaS resources, for the reasons given above. I agree that the CreateResource operation today would not immediately work for such services. But I tend to think about this more in alignment with what we proposed in the paper. We can (and do intend to) support in-network functions like load balancers. Then, all that's left is to support VM scale sets and expose an API to connect the load balancer to the VMs (which we proposed in the paper and have ideas for how to do in this new implementation). I don't see a strong need to immediately jump to exposing an API that just provisions an entire "service" for the user. In fact, I might argue that is outside the scope of Paraglider, which should only do the networking side. But I could see a longer-term goal where we have some nice API on top that uses all the steps I just mentioned to create a basic setup for you.
PaaS-to-PaaS connectivity. This one is far less clear to me, even though I do understand it is a problem. I think I need to better understand what the status quo is on this front before proposing what Paraglider's role in it is.

smcclure20 commented 3 months ago

Updating the thread with slides from our TSC meeting today: https://docs.google.com/presentation/d/1UOTN5ONsOdcPaS_g_FTNbYKclwVMRbEuxb3pr61PXTQ/edit?usp=sharing

divega commented 2 months ago

@smcclure20 would it make sense to split this issue into multiple subtasks so we can mark some as done soon and move others to backlog?

smcclure20 commented 2 months ago

Yeah, we should create one for each cloud, especially since this is partially implemented already.

paraglider-project / paraglider

Support for service resources using virtual private endpoints, public access, and service-to-service controls #344