nephio-project / nephio

Nephio is a Kubernetes-based automation platform for deploying and managing highly distributed, interconnected workloads such as 5G Network Functions, and the underlying infrastructure on which those workloads depend.
Apache License 2.0

Implement edge watcher GRPC server #20

Open gvbalaji opened 1 year ago

gvbalaji commented 1 year ago

Implement the gRPC server for the edge watcher. EdgeWatcher will be a service/pod running on the management cluster that exposes a List/Watch interface for clients to access statuses.
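
A minimal sketch of what such a server might look like in Go, assuming a hypothetical `edgewatcher.proto` that defines an `EdgeWatcher` service with a server-streaming `Watch` RPC; the `pb` package, its types, and the port are placeholders, not the actual Nephio API:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"

	// Hypothetical stubs generated from an edgewatcher.proto that defines:
	//   service EdgeWatcher { rpc Watch(WatchRequest) returns (stream StatusUpdate); }
	pb "example.com/edgewatcher/proto"
)

// server implements the hypothetical generated pb.EdgeWatcherServer interface.
type server struct {
	pb.UnimplementedEdgeWatcherServer
	// updates would be fed by whatever component collects status
	// from the workload clusters.
	updates <-chan *pb.StatusUpdate
}

// Watch streams status updates to the client until the client disconnects.
func (s *server) Watch(req *pb.WatchRequest, stream pb.EdgeWatcher_WatchServer) error {
	for {
		select {
		case <-stream.Context().Done():
			return stream.Context().Err()
		case u := <-s.updates:
			if err := stream.Send(u); err != nil {
				return err
			}
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":9443") // arbitrary port for the sketch
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	grpcServer := grpc.NewServer()
	pb.RegisterEdgeWatcherServer(grpcServer, &server{updates: make(chan *pb.StatusUpdate)})
	log.Fatal(grpcServer.Serve(lis))
}
```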

grzegorzpnk commented 1 year ago

@gvbalaji could you please provide more info regarding this task? What does the watcher agent refer to? Is it meant to monitor the infrastructure layer, CNF-specific metrics, or the Nephio platform itself?

gvbalaji commented 1 year ago

I think this is about watching CNF orchestration status on workload clusters. @johnbelamaric can you please confirm?

johnbelamaric commented 1 year ago

Both "intent realization status" and also possibly other types of status. See https://github.com/GoogleContainerTools/kpt/issues/3543 for a discussion of some options. That is centered around "intent realization status".

Seed code should include an implementation of this that uses gRPC for the transport, though whether we can use it directly may depend on our CRD structure decisions. Personally, I would like to look into leveraging the metrics pipeline rather than implementing separate code for this; I like avoiding new code.
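
As an illustration of the "reuse the metrics pipeline" idea, a per-cluster agent could publish intent-realization status as ordinary Prometheus metrics and let the existing scrape/remote-write path carry them back to the management cluster. A rough sketch; the metric name, labels, and port are invented for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// readyGauge reports per-resource readiness (1 = Ready, 0 = not Ready).
// Metric name and labels are invented for this sketch.
var readyGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nephio_package_resource_ready",
		Help: "Whether a package resource has reached Ready state on this cluster.",
	},
	[]string{"package", "kind", "name"},
)

func main() {
	prometheus.MustRegister(readyGauge)

	// In a real agent this would be driven by watches on the workload
	// cluster's API server; here we just set a sample value.
	readyGauge.WithLabelValues("free5gc-upf", "Deployment", "upf").Set(1)

	// The existing metrics pipeline (Prometheus scrape, remote write, etc.)
	// picks the values up from the standard /metrics endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```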

Also, from this doc:

JeanMarieCalmel commented 1 year ago

How is the status described here different from the status that would be in any well-designed CRD? Are we saying that we want to create an alternate mechanism to a GET operation on the corresponding k8s cluster API for the CR?

johnbelamaric commented 1 year ago

At the base of it, that is necessary. But we need a few more things:

  1. A package will consist of an arbitrary number of resources. We need to be sure each of those has been successfully applied and reaches "Ready" state at the workload cluster, and we would like to report that as a single status back to the user, so it wouldn't always be clear which CRD to use for that. We do have https://github.com/GoogleContainerTools/kpt-resource-group as one way to do this.
  2. I don't think we want to rely on outbound network access from the management cluster to the Kube API server on each workload cluster (although the seed code does rely on this).
  3. At the scale we hope to achieve, you can't do live polling of workload clusters for each user request; it won't perform well from the user's point of view. You need a management-cluster-local representation of that status.
  4. If we consider a "Set" style deployment, we want to aggregate the status of the individual NFs that are part of the set up to a status for the whole set ("100% sync'd to WL clusters, 85% ready state, 80% healthy", etc.), and allow drill-in from that point of aggregation (see the sketch after this list).
  5. Since we have different types of status, we may need different CRDs. For example, some status from the git-syncer (ConfigSync, etc.) itself for "successfully applied", resource group for "all resources ready", Pod health checks for health, something NF-specific for NF details?
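
To make points 1 and 4 concrete, here is a rough sketch of per-resource readiness plus set-level aggregation using the kstatus library from cli-utils (the machinery kpt's resource-group work builds on); the input slice and the percentage output are simplified for illustration:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/cli-utils/pkg/kstatus/status"
)

// aggregateReady computes the fraction of resources that kstatus
// considers "Current" (i.e. fully reconciled and ready).
func aggregateReady(resources []*unstructured.Unstructured) (float64, error) {
	if len(resources) == 0 {
		return 0, fmt.Errorf("no resources to aggregate")
	}
	ready := 0
	for _, r := range resources {
		res, err := status.Compute(r)
		if err != nil {
			return 0, err
		}
		if res.Status == status.CurrentStatus {
			ready++
		}
	}
	return float64(ready) / float64(len(resources)), nil
}

func main() {
	// In practice these would come from status reported back by each
	// workload cluster, not from a live poll (see point 3 above).
	var fromWorkloadClusters []*unstructured.Unstructured

	frac, err := aggregateReady(fromWorkloadClusters)
	if err != nil {
		fmt.Println("aggregate:", err)
		return
	}
	fmt.Printf("%.0f%% ready\n", frac*100)
}
```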

By the way, the "apply" stage seems simple, but if we have webhooks, apply-time mutations, etc., it can easily fail.
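
For example, even a plain server-side apply through the dynamic client can be rejected by an admission webhook, so "successfully applied" has to be derived from the actual API response rather than assumed. A minimal sketch; the field manager name is an arbitrary choice, and the error classification is illustrative:

```go
package apply

import (
	"context"
	"encoding/json"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// applyObject server-side-applies a single manifest and reports whether
// the API server (including any admission webhooks) accepted it.
func applyObject(ctx context.Context, dc dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured) error {
	data, err := json.Marshal(obj)
	if err != nil {
		return err
	}
	_, err = dc.Resource(gvr).Namespace(obj.GetNamespace()).Patch(
		ctx, obj.GetName(), types.ApplyPatchType, data,
		metav1.PatchOptions{FieldManager: "edge-watcher-sketch"},
	)
	switch {
	case err == nil:
		return nil
	case apierrors.IsForbidden(err) || apierrors.IsInvalid(err):
		// Typical shapes of a webhook or validation rejection; this is
		// the kind of failure that must be reported back as status.
		return fmt.Errorf("apply rejected: %w", err)
	default:
		return err
	}
}
```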