Create an abstract NBI for infrastructure orchestration in Nephio

gvbalaji commented 9 months ago

In R2 we created an abstract interface for NF orchestration called NFTopology. However, we do not have that kind of abstract interface for infrastructure orchestration (cluster creation, deletion, etc.). We need to define the Nephio NBI for this. It can be used by any northbound application, such as a service orchestrator or an e2e orchestrator. More immediately, as we plan O-RAN use cases, cluster management use cases in O-RAN will greatly benefit from such an interface.

This interface needs to be defined based on our experience with:

NFTopology interface
Cluster creation in various cloud environments (GCP, AWS, Azure, OpenShift, WindRiver, etc.)
Our understanding of the requirements of packet core and RAN use cases

Acceptance Criteria:

Design document for the NBI
NBI CRDs defined, agreed upon and checked into the API repo

Stakeholders:

Infra providers
Operators
Immediately, the Nephio team working on O-RAN cluster use cases

henderiw commented 9 months ago

Why is this not just a package?

s3wong commented 9 months ago

@henderiw Generally I agree that the NB should be a package. But particularly for common infra types (such as a cluster), it would be useful for Nephio to expose a common, simple structure whose specification, and more importantly whose operational status, the system components of Nephio can interpret. For example, for a Kubernetes cluster, it would be great to have a more uniform way to poll whether a workload cluster is ready.

So there are three parts to this:

1.) an NB cluster package, such as one for O-RAN, would contain the Nephio base cluster CR and all the extension CRs pertaining to that particular high-level system requesting Nephio services
2.) the Nephio base cluster CR, which all the infra providers would support (and if we have an infra operator in the future, this could be the one CR an infra cluster controller would watch and reconcile)
3.) Nephio can then have a software module or controller watching the status of this cluster resource to take next-step actions (such as installing configsync, for example)
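A minimal Go sketch of the "uniform way to poll readiness" idea, under stated assumptions: `BaseClusterStatus`, `Condition` and `IsReady` are hypothetical names, and `Condition` is a simplified local stand-in rather than any existing Nephio or Kubernetes type.

```go
package main

import "fmt"

// Condition is a simplified, hypothetical stand-in for a status condition.
type Condition struct {
	Type   string
	Status string // "True", "False", or "Unknown"
}

// BaseClusterStatus is a hypothetical provider-agnostic status block that
// every infra provider would fill in on the base cluster CR.
type BaseClusterStatus struct {
	Conditions []Condition
}

// IsReady reports whether a "Ready" condition is present and set to "True".
func IsReady(s BaseClusterStatus) bool {
	for _, c := range s.Conditions {
		if c.Type == "Ready" {
			return c.Status == "True"
		}
	}
	return false
}

func main() {
	provisioning := BaseClusterStatus{Conditions: []Condition{{Type: "Ready", Status: "False"}}}
	ready := BaseClusterStatus{Conditions: []Condition{{Type: "Ready", Status: "True"}}}
	fmt.Println(IsReady(provisioning), IsReady(ready)) // false true
}
```

A watcher that wants to trigger a follow-up step (for example installing configsync) would only need this one check, regardless of which provider created the cluster.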

henderiw commented 9 months ago

I would say our Northbound is a package and we have a PV to actuate it. In the current package we have a Nephio workload resource that is used as a Workload Cluster CR. We don't need this resource to install configSync. E.g. we can do this on the basis that the workload package is instantiated. This is way more generic and applicable to any resource and object.

henderiw commented 9 months ago

So my suggestion is to extend this resource and add the parameterRefs as we did for NFDeployment. Now here you can question if this needs to be done or if the resources in the package serve this purpose. The latter approach is more flexible and does not require anything to be done.

electrocucaracha commented 9 months ago

Trying to put some context here, we already have the workload-cluster package that deploys, via Cluster API, Kind clusters with additional plugins like multus and configsync. This package contains a WorkloadCluster CR which is underutilized. So offering a WorkloadCluster CRD will provide a standard definition for NBI, right?

s3wong commented 9 months ago

@henderiw wrote: "So my suggestion is to extend this resource and add the parameterRefs as we did for NFDeployment. Now here you can question if this needs to be done or if the resources in the package serve this purpose. The latter approach is more flexible and does not require anything to be done."

We won't be taking away the flexibility. Recall that during R1, when we sat together designing XXXDeployment, we discussed that it was perfectly fine for XXXDeployment to be an empty structure with just a vendorRef (back then we called it vendor reference or even vendor blob); but what an XXXDeployment signifies is:

a contract between Nephio and the NF operator that a successful XXXDeployment reconciliation is of a determined behavior (in this case, the NF is deployed on workload cluster and "ready")
that Nephio has a unified way to detect if the intent is not met via reading the status field of the CR

So this is the same --- in fact, given that the current WorkloadCluster definition is so multus-centric, I would even advocate moving the current spec into a different struct that is either linked via a parameterRef or made part of WorkloadCluster as a completely omitempty field.

But now, particularly for this issue, there is a NEW ask: the application layer (O-RAN orchestration system, SOM or whatever) wants a northbound API in which it can fill in some info to trigger a cluster CRUD operation. We can of course push this onto the northbound system and give them our usual "just fill out a PV, and YOU are responsible for providing all the input to the infra cluster package of the infra provider of your choice". My current thinking is that, given that O-RAN is also a standards body, it would define a cluster config interface for O-RAN orchestrators; infra providers that support O-RAN standards should consume the O-RAN standard cluster input and either provide kpt functions to convert the O-RAN input into input for their packages, or support it directly. A provider name field should be given, but not much else. That way, Nephio can support an O-RAN cluster package and map it directly to an O-RAN-supporting infra provider's package.

s3wong commented 9 months ago

@electrocucaracha wrote: "Trying to put some context here, we already have the workload-cluster package that deploys, via Cluster API, Kind clusters with additional plugins like multus and configsync. This package contains a WorkloadCluster CR which is underutilized. So offering a WorkloadCluster CRD will provide a standard definition for NBI, right?"

@electrocucaracha we will still have workload cluster packages for a number of different infra providers. As @henderiw said, we already have a WorkloadCluster CRD that we expect all workload clusters for NFDeploy purposes to contain. See the Slack thread here on the history of the info we included for ClusterContext (the previous name of WorkloadCluster) --- but it is very multus-centric, and even for that I would advocate pushing it to a different struct and linking it via a parameterRef, or, as I would somewhat prefer, making it an omitempty struct field in the WorkloadCluster CRD (so it is more pronounced, and perhaps we can read status off of it also).

Having some basic common structure in the Nephio-defined WorkloadCluster (and other infra-related CRDs) allows Nephio as a system to have some knowledge of the infra. For example, if Nodepool is included, Nephio can in theory understand the number of nodes in a workload cluster, and whether # of nodes available == # of nodes intended by the user. This information is interesting for observability reasons, but moving forward it could potentially influence placement (as an application on top of Nephio).
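As a rough illustration of the node-count check described above, here is a Go sketch. The `Nodepool` fields and the `Mismatched` helper are assumptions made for the example; the thread does not define a Nodepool schema.

```go
package main

import "fmt"

// Nodepool is a hypothetical shape: the intended size would come from the
// spec and the observed size from the status of the same resource.
type Nodepool struct {
	Name           string
	DesiredNodes   int
	AvailableNodes int
}

// Mismatched returns the names of node pools whose available node count
// differs from the count the user intended.
func Mismatched(pools []Nodepool) []string {
	var names []string
	for _, p := range pools {
		if p.AvailableNodes != p.DesiredNodes {
			names = append(names, p.Name)
		}
	}
	return names
}

func main() {
	pools := []Nodepool{{"control", 3, 3}, {"workers", 5, 4}}
	fmt.Println(Mismatched(pools)) // [workers]
}
```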

henderiw commented 9 months ago

NFDeployment is a bit different from a workload cluster. NFDeployment is a CR that is conveyed to the workload cluster and actuated there. The workload cluster package or CR stays in the mgmt cluster. So this is a major difference.

So I see 2 different approaches:

1.) Use a package that has all the vendor-specific KRM inside and instantiates a cluster. You write the business logic through specialisation and instantiate your cluster. This is basically what we do today and it is very flexible. We spit out a WorkloadCluster CRD today to convey info, but I believe using packages for this is even better, since it is more flexible and does not limit you in defining CRs. The way you know this is related to the cluster is by the references we use. The thing we need here is what I call a WorkloadClusterClaim/NodePoolClaim that provides the requirements; this is used in the business logic to call upon the proper packages, which could be vendor specific.
2.) Define a WorkloadCluster CR like NFDeployment that spits out the relevant CR(s) from which things get derived. This is more the operator pattern.

Now there is a completely different discussion on how to convey status from the workload cluster to the mgmt cluster, but this is a separate can of worms that will need its own discussion. Both options can be used for this. But for me the fundamental discussion is about approach 1 versus 2 first.

s3wong commented 8 months ago

@henderiw

Actually I am a big fan of the class-claim pattern --- NFTopology follows the class-claim pattern (as did its predecessor, the FiveGTopology CRD we defined back at ONE Summit '22), where capacity serves as the common claim language and NFClass specifies the provider (we use a package reference instead of just a provider name, like Gateway API). As the providers need to understand capacity, NFTopology uses search-replace to set capacity in the provider packages. So there is no disagreement on using class-claim to match a workload cluster claim with a provider (the class implementor, which would be one of the different infra providers) that will try to satisfy the claim.

Do note that the class-claim pattern basically leads to the following essential results:

the WorkloadClusterClaim basically becomes our northbound CRD, which users / 3P orchestrators would use (based on the Nephio community definition) to specify their workload cluster needs
the provider (class implementor) would essentially NEED to understand the claim structure and actuate the create/update of the workload cluster based on its understanding of the claim. In essence, all providers need to understand the claim structure, which means they all have to operate on the claim (i.e., a common CRD) and would expose events to track the claim statuses (both PVC and Gateway do that)
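The two consequences above can be sketched in Go as follows. This is only an illustration of the class-claim shape: `WorkloadClusterClaim` is the name used in the thread, but its fields, the `ClusterClass` interface, `maxNodesClass` and `Reconcile` are all hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// WorkloadClusterClaim is the common northbound structure (fields assumed).
type WorkloadClusterClaim struct {
	Name      string
	ClassName string // which class (provider) should satisfy the claim
	Nodes     int    // the shared "claim language" every provider understands
	Phase     string // status: "Pending", "Bound", or "Failed"
}

// ClusterClass is what each provider (class implementor) would implement.
type ClusterClass interface {
	Provision(claim WorkloadClusterClaim) error
}

// maxNodesClass is a toy provider that rejects claims above a node limit.
type maxNodesClass struct{ max int }

func (c maxNodesClass) Provision(claim WorkloadClusterClaim) error {
	if claim.Nodes > c.max {
		return errors.New("claim exceeds class capacity")
	}
	return nil
}

// Reconcile looks up the class and records the outcome on the claim status,
// so callers track every provider through the same field.
func Reconcile(claim *WorkloadClusterClaim, classes map[string]ClusterClass) {
	class, ok := classes[claim.ClassName]
	if !ok || class.Provision(*claim) != nil {
		claim.Phase = "Failed"
		return
	}
	claim.Phase = "Bound"
}

func main() {
	classes := map[string]ClusterClass{"small": maxNodesClass{max: 3}}
	claim := WorkloadClusterClaim{Name: "edge-1", ClassName: "small", Nodes: 2, Phase: "Pending"}
	Reconcile(&claim, classes)
	fmt.Println(claim.Phase) // Bound
}
```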

I can draft a doc on what we have talked about thus far, and we can then iterate on the details of the approach.

henderiw commented 8 months ago

In NF2Infra this is what I am leaning towards. This is very basic and allows us to do many things. This does not have to be actuated in the cluster. E.g. we can use this as a way to select the package that implements the specifics based on the selector. You could actuate this in the cluster and use a controller, but then you start having to write a bunch of code, whereas in the first approach this is achieved just through expressions.

I changed from req -> claim but this is a detail. If we have a claim or req for the various packages, we can build powerful logic.

apiVersion: claim.nephio.org/v1alpha1
kind: WorkloadCluster
metadata:
  name: example
  namespace: x
spec:
  location: us-west1
  workloadCluster:
    selector:

s3wong commented 8 months ago

@henderiw I see the flexibility of your structure: one can specify the provider, the list of statically known required resources, and even other needs:

spec:
  location: us-west1
  workloadCluster:
    selector:
      provider: aws
      dpu: true
      gpu: true
      sriov: false
...

However, while I like it for specifying static resource needs and potentially allowing the provider to select a package corresponding to one of its offerings (e.g., AWS may opt for EKS vs. a K8s distro on EC2 vs. Outpost, etc., based on the selectors), I think the structure you have is overwhelmingly designed for the use case of dynamically matching a claim to a provider package. I think we need to balance system control / expectations with flexibility. Particularly for the ask in this issue, we want a simple and (hopefully) stable CRD (with extensions) for different types of orchestrators. Use of selectors gives flexibility, but could make the API unstable. Imagine, for example, that users don't set the provider field above: since Nephio isn't in the business of picking a cloud provider, if multiple providers can satisfy the set of needs specified by the selectors, Nephio would have to fail the request. In other words, the provider field is effectively required, and if so, we should absolutely make provider a required field in our CRD so that we can check it at the API level instead of offloading that check to a controller. I think this is likewise true for a user needing to specify the number of homogeneous nodes in a cluster.
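The ambiguity concern can be made concrete with a small Go sketch. Everything here is hypothetical (the `Package` type, the `SelectPackage` helper and the catalog entries); it only shows why a selector without a provider key may match more than one package and has to be rejected.

```go
package main

import (
	"errors"
	"fmt"
)

// Package is a hypothetical catalog entry advertising labels.
type Package struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every selector key/value is present in labels.
func matches(selector, labels map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// SelectPackage returns the single matching package and fails when the
// selector matches no package or more than one.
func SelectPackage(selector map[string]string, catalog []Package) (string, error) {
	var found []string
	for _, p := range catalog {
		if matches(selector, p.Labels) {
			found = append(found, p.Name)
		}
	}
	switch len(found) {
	case 0:
		return "", errors.New("no package satisfies the selector")
	case 1:
		return found[0], nil
	default:
		return "", fmt.Errorf("ambiguous selector: %d packages match", len(found))
	}
}

func main() {
	catalog := []Package{
		{Name: "aws-gpu-cluster", Labels: map[string]string{"provider": "aws", "gpu": "true"}},
		{Name: "gcp-gpu-cluster", Labels: map[string]string{"provider": "gcp", "gpu": "true"}},
	}
	_, err := SelectPackage(map[string]string{"gpu": "true"}, catalog)
	fmt.Println(err) // ambiguous selector: 2 packages match
	name, _ := SelectPackage(map[string]string{"gpu": "true", "provider": "aws"}, catalog)
	fmt.Println(name) // aws-gpu-cluster
}
```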

I would advocate the following as a starting point:

type WorkloadClusterSpec struct {
    ClusterName string `...`
    Location string `...`
    Provider string `...`
    Nodepools []Nodepool `..., omitempty`
    Selectors []metav1.LabelSelector `..., omitempty`
}

Let me draft a doc to specify how it would work based on our conversation thus far.

henderiw commented 8 months ago

@s3wong A provider should be a selector. We don't have to define these parameters explicitly. A very loose API is more open to various use cases. The references to packages and node pools should be in the reverse direction, similar to how we do it today. Say I want to deploy multus or multiple node pools: these packages reference the cluster, rather than the cluster referencing the items. It is much more flexible. With the current proposal every change requires the cluster to change, while you want to dynamically add or delete these referenced packages w/o touching the cluster. So the reverse mapping is more flexible.
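A small Go sketch of the reverse-reference idea, assuming hypothetical `Cluster` and `AddOn` types: each add-on (node pool, multus, and so on) names the cluster it belongs to, so adding or removing one never modifies the cluster object.

```go
package main

import "fmt"

// Cluster carries no list of add-ons; it stays unchanged as add-ons come and go.
type Cluster struct{ Name string }

// AddOn is a hypothetical package-level resource that points at its cluster.
type AddOn struct {
	Name       string
	ClusterRef string
}

// AddOnsFor derives the add-ons of a cluster by following references
// from the add-on side.
func AddOnsFor(cluster Cluster, all []AddOn) []string {
	var names []string
	for _, a := range all {
		if a.ClusterRef == cluster.Name {
			names = append(names, a.Name)
		}
	}
	return names
}

func main() {
	c := Cluster{Name: "edge-1"}
	addOns := []AddOn{{"multus", "edge-1"}, {"nodepool-a", "edge-1"}, {"multus", "edge-2"}}
	fmt.Println(AddOnsFor(c, addOns)) // [multus nodepool-a]
}
```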

andersheric commented 8 months ago

I am trying to follow this discussion but I am missing a piece. Where do we intend to use these types? On the FOCOM side of O2-ims, on the IMS side, or both? I miss the cluster templates being introduced in O-RAN for O2-ims in this discussion. These will affect what makes sense to model on the FOCOM side of an NBI, since these templates will limit what a FOCOM can configure. Types that make sense in an IMS implementation are not likely a good fit for FOCOM and a NEPHIO NBI.

henderiw commented 8 months ago

We don't want to design this specifically for ORAN. We should be agnostic of this, but if you want to create a cluster using Nephio, this will be the way you would specify it.

andersheric commented 8 months ago

That's why I ask where we intend to use these types. Supporting non-O-RAN cluster deployments is one thing, but we still expect NEPHIO to be O-RAN compliant; that implies we need to be able to talk to an O-Cloud over O2-ims to get a cluster provisioned.

henderiw commented 8 months ago

Why should the base Nephio construct be dependent on O-RAN? I don't get this. So far we don't do anything specific to ORAN and we should keep it that way. If ORAN wants to use Nephio, that is an ORAN problem, not a Nephio problem.

andersheric commented 8 months ago

I am not saying we should be ORAN dependent in these types, but since we expect to be able to use an NBI also in an O-RAN environment (this is made quite clear on the NEPHIO about page), we need types that can also support O-RAN use cases. This is where I see a gap if we assume we always have full control of every config option of the workload cluster in the NBI.

If these types are on an NBI and don't support a template-based deployment, they will not be usable in an O-RAN integration over O2-ims. That does not imply that they need to support only template-based deployments, though: similar to how clusterclass is used in CAPI, templates can be an optional alternative that would be used with infrastructure providers that support or require templating.

Conceptually, the use of a template in an API is equivalent to a function call with named parameters, regardless of whether the IMS is CAPI, O2-ims or something else, which means the API type can still be generic.
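The "template as a function call with named parameters" analogy can be sketched in Go as below. `ClusterTemplate`, its `Allowed` set and `Instantiate` are hypothetical names for illustration, not O2-ims or CAPI types; the point is only that the template limits which parameters a caller may set while the calling shape stays generic.

```go
package main

import (
	"fmt"
	"sort"
)

// ClusterTemplate is a hypothetical template that declares which named
// parameters a caller is allowed to set.
type ClusterTemplate struct {
	Name    string
	Allowed map[string]bool
}

// Instantiate behaves like a call with named parameters: it rejects any
// parameter the template does not expose.
func (t ClusterTemplate) Instantiate(params map[string]string) error {
	var unknown []string
	for k := range params {
		if !t.Allowed[k] {
			unknown = append(unknown, k)
		}
	}
	if len(unknown) > 0 {
		sort.Strings(unknown) // stable error text regardless of map order
		return fmt.Errorf("template %q does not expose: %v", t.Name, unknown)
	}
	return nil
}

func main() {
	tmpl := ClusterTemplate{Name: "small-edge", Allowed: map[string]bool{"location": true, "nodeCount": true}}
	fmt.Println(tmpl.Instantiate(map[string]string{"location": "us-west1"})) // <nil>
	fmt.Println(tmpl.Instantiate(map[string]string{"cni": "multus"}))
}
```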

henderiw commented 8 months ago

The way to instantiate these packages in Nephio is using a PackageVariant CR that references the package and in which you can customise the input. This is the mechanism we use for this. So the package is a collection of KRM that gets instantiated using the PVAR CR.

https://github.com/nephio-project/porch/blob/main/controllers/packagevariants/api/v1alpha1/packagevariant_types.go

This can also be consumed from PVSET.

https://github.com/nephio-project/porch/blob/main/controllers/packagevariantsets/api/v1alpha2/packagevariantset_types.go

henderiw commented 8 months ago

Also the parameters with the selectors are very flexible and you could consume it with ORAN specific keys e.g.

nephio-project / nephio

Create an abstract NBI for infrastructure orchestration in Nephio #527