nephio-project / nephio

Nephio is a Kubernetes-based automation platform for deploying and managing highly distributed, interconnected workloads such as 5G Network Functions, and the underlying infrastructure on which those workloads depend.
Apache License 2.0

Implement edge watcher GRPC server #20

Open gvbalaji opened 1 year ago

gvbalaji commented 1 year ago

Implement the gRPC server for the edge watcher. EdgeWatcher will be a service/pod running on the management cluster that exposes a List/Watch interface for clients to access statuses.
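
A minimal sketch of what such a server might look like in Go, assuming a hypothetical `edgewatcher.proto` that defines an `EdgeWatcher` service with a server-streaming `Watch` RPC; the `pb` package, its types, and the port are placeholders, not the actual Nephio API:

```go
package main

import (
	"log"
	"net"

	"google.golang.org/grpc"

	// Hypothetical stubs generated from an edgewatcher.proto that defines:
	//   service EdgeWatcher { rpc Watch(WatchRequest) returns (stream StatusUpdate); }
	pb "example.com/edgewatcher/proto"
)

// server implements the hypothetical generated pb.EdgeWatcherServer interface.
type server struct {
	pb.UnimplementedEdgeWatcherServer
	// updates would be fed by whatever component collects status
	// from the workload clusters.
	updates <-chan *pb.StatusUpdate
}

// Watch streams status updates to the client until the client disconnects.
func (s *server) Watch(req *pb.WatchRequest, stream pb.EdgeWatcher_WatchServer) error {
	for {
		select {
		case <-stream.Context().Done():
			return stream.Context().Err()
		case u := <-s.updates:
			if err := stream.Send(u); err != nil {
				return err
			}
		}
	}
}

func main() {
	lis, err := net.Listen("tcp", ":9443") // arbitrary port for the sketch
	if err != nil {
		log.Fatalf("listen: %v", err)
	}
	grpcServer := grpc.NewServer()
	pb.RegisterEdgeWatcherServer(grpcServer, &server{updates: make(chan *pb.StatusUpdate)})
	log.Fatal(grpcServer.Serve(lis))
}
```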

grzegorzpnk commented 1 year ago

@gvbalaji could you please provide more info regarding this task? What does the watcher agent refer to? Is it meant to monitor the infrastructure layer, CNF-specific metrics, or the Nephio platform itself?

gvbalaji commented 1 year ago

I think this is about watching CNF orchestration status on workload clusters. @johnbelamaric can you please confirm?

johnbelamaric commented 1 year ago

Both "intent realization status" and also possibly other types of status. See https://github.com/GoogleContainerTools/kpt/issues/3543 for a discussion of some options. That is centered around "intent realization status".

Seed code should include an implementation of this that uses gRPC for the transport, though whether we can use it directly may depend on our CRD structure decisions. Personally, I would like to look into leveraging the metrics pipeline rather than implementing separate code for this; I like avoiding new code.
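
As an illustration of the "reuse the metrics pipeline" idea, a per-cluster agent could publish intent-realization status as ordinary Prometheus metrics and let the existing scrape/remote-write path carry them back to the management cluster. A rough sketch; the metric name, labels, and port are invented for illustration:

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// readyGauge reports per-resource readiness (1 = Ready, 0 = not Ready).
// Metric name and labels are invented for this sketch.
var readyGauge = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "nephio_package_resource_ready",
		Help: "Whether a package resource has reached Ready state on this cluster.",
	},
	[]string{"package", "kind", "name"},
)

func main() {
	prometheus.MustRegister(readyGauge)

	// In a real agent this would be driven by watches on the workload
	// cluster's API server; here we just set a sample value.
	readyGauge.WithLabelValues("free5gc-upf", "Deployment", "upf").Set(1)

	// The existing metrics pipeline (Prometheus scrape, remote write, etc.)
	// picks the values up from the standard /metrics endpoint.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```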

Also, from this doc:

JeanMarieCalmel commented 1 year ago

How is the status described here different from the status that would be in any well-designed CRD? Are we saying that we want to create an alternate mechanism to a GET operation on the corresponding k8s cluster API for the CR?

johnbelamaric commented 1 year ago

At the base of it, that is necessary. But we need a few more things:

  1. A package will consist of an arbitrary number of resources. We need to be sure each of those has been successfully applied and reaches "Ready" state at the workload cluster, and we would like to report that as a single status back to the user, so it wouldn't always be clear which CRD to use for that. We do have https://github.com/GoogleContainerTools/kpt-resource-group as one way to do this.
  2. I don't think we want to rely on outbound network access from the management cluster to the Kube API server on each workload cluster (although the seed code does rely on this).
  3. At the scale we hope to achieve, you can't do live polling of workload clusters for each user request; it won't perform well from the user's point of view. You need a management-cluster-local representation of that status.
  4. If we consider a "Set" style deployment, we want to aggregate the status of the individual NFs that are part of the set up to a status for the whole set ("100% sync'd to WL clusters, 85% ready state, 80% healthy", etc.), and allow drill-in from that point of aggregation (see the sketch after this list).
  5. Since we have different types of status, we may need different CRDs. For example, some status from the git-syncer (ConfigSync, etc.) itself for "successfully applied", resource group for "all resources ready", Pod health checks for health, something NF-specific for NF details?
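
To make points 1 and 4 concrete, here is a rough sketch of per-resource readiness plus set-level aggregation using the kstatus library from cli-utils (the machinery kpt's resource-group work builds on); the input slice and the percentage output are simplified for illustration:

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"sigs.k8s.io/cli-utils/pkg/kstatus/status"
)

// aggregateReady computes the fraction of resources that kstatus
// considers "Current" (i.e. fully reconciled and ready).
func aggregateReady(resources []*unstructured.Unstructured) (float64, error) {
	if len(resources) == 0 {
		return 0, fmt.Errorf("no resources to aggregate")
	}
	ready := 0
	for _, r := range resources {
		res, err := status.Compute(r)
		if err != nil {
			return 0, err
		}
		if res.Status == status.CurrentStatus {
			ready++
		}
	}
	return float64(ready) / float64(len(resources)), nil
}

func main() {
	// In practice these would come from status reported back by each
	// workload cluster, not from a live poll (see point 3 above).
	var fromWorkloadClusters []*unstructured.Unstructured

	frac, err := aggregateReady(fromWorkloadClusters)
	if err != nil {
		fmt.Println("aggregate:", err)
		return
	}
	fmt.Printf("%.0f%% ready\n", frac*100)
}
```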

By the way, the "apply" stage seems simple, but if we have webhooks, apply-time mutations, etc., it can easily fail.
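
For example, even a plain server-side apply through the dynamic client can be rejected by an admission webhook, so "successfully applied" has to be derived from the actual API response rather than assumed. A minimal sketch; the field manager name is an arbitrary choice, and the error classification is illustrative:

```go
package apply

import (
	"context"
	"encoding/json"
	"fmt"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/dynamic"
)

// applyObject server-side-applies a single manifest and reports whether
// the API server (including any admission webhooks) accepted it.
func applyObject(ctx context.Context, dc dynamic.Interface, gvr schema.GroupVersionResource, obj *unstructured.Unstructured) error {
	data, err := json.Marshal(obj)
	if err != nil {
		return err
	}
	_, err = dc.Resource(gvr).Namespace(obj.GetNamespace()).Patch(
		ctx, obj.GetName(), types.ApplyPatchType, data,
		metav1.PatchOptions{FieldManager: "edge-watcher-sketch"},
	)
	switch {
	case err == nil:
		return nil
	case apierrors.IsForbidden(err) || apierrors.IsInvalid(err):
		// Typical shapes of a webhook or validation rejection; this is
		// the kind of failure that must be reported back as status.
		return fmt.Errorf("apply rejected: %w", err)
	default:
		return err
	}
}
```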