openconfig / reference

This repository contains reference implementations, specifications and tooling related to OpenConfig-based network management.

Preventing overlap in Subscription lists #211

Open ejbrever opened 3 weeks ago

ejbrever commented 3 weeks ago

Today, a gNMI.Subscribe request can carry a list of subscriptions. However, it is possible to create overlap between paths in that list that specify different subscription behaviors.

For example, assume a subscription list specifies these two:

  1. components/component - SAMPLE
  2. components/component/linecard - ON_CHANGE

In this case, would (1) override (2) and provide all data as SAMPLE? Or would (2) override (1) for its specific leaves because it comes after (1)? That also raises the question of whether the subscriptions within the list are order dependent.

I didn't see this topic discussed within the specifications or the proto, but please let me know if this has been discussed before.

My thought is that the simple approach might be to define that overlap should not exist, and that the device should return an error if a subscription with overlapping paths is sent. This ensures the device does not assume any behavior and forces clients to be explicit in their subscriptions where needed.

LimeHat commented 3 weeks ago

Is there an assumption that these two subscriptions should have any sort of conflict or override? If so, why?

Each subscription can (and, in my opinion, should) be treated independently.

ejbrever commented 3 weeks ago

Ah, sorry, I should have clarified. I agree, each gNMI.Subscribe() should be treated independently.

I was thinking in the case where a client might send a single RPC like:

```
gNMI.Subscribe(&gNMI.SubscribeRequest{
  subscribe: {
    subscription: {
      path: { elem: { name: "components" } elem: { name: "component" } }
      mode: SAMPLE
    }
    subscription: {
      path: { elem: { name: "components" } elem: { name: "component" } elem: { name: "linecard" } }
      mode: ON_CHANGE
    }
  }
})
```

robshakir commented 3 weeks ago

Within the same Subscription RPC, this does seem like an error -- either the client wanted ON_CHANGE or it wanted SAMPLE. If it wants both it can make two RPC calls. I'd be in favour of just declaring this case an error, and therefore returning INVALID_ARGUMENT.

A clarification in 3.5.1.1 where we talk about having an empty SubscriptionList would probably be the easiest way to make this amendment.
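For illustration, here is a minimal sketch (not from the spec or any implementation) of how a target could perform the proposed check, assuming a simple prefix-based definition of overlap; the helper names are mine, and a real check would also have to reason about wildcards and list keys:

```go
package subscribe

import (
	"strings"

	gpb "github.com/openconfig/gnmi/proto/gnmi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// pathString renders the elem names of a gNMI Path. Keys are ignored
// here for brevity; a real implementation would need to consider them.
func pathString(p *gpb.Path) string {
	var b strings.Builder
	for _, e := range p.GetElem() {
		b.WriteString("/")
		b.WriteString(e.GetName())
	}
	return b.String()
}

// checkOverlap returns INVALID_ARGUMENT if any subscription path in the
// list is a strict prefix of another, per the proposal above.
func checkOverlap(list *gpb.SubscriptionList) error {
	paths := make([]string, len(list.GetSubscription()))
	for i, s := range list.GetSubscription() {
		paths[i] = pathString(s.GetPath())
	}
	for i, a := range paths {
		for j, b := range paths {
			if i != j && strings.HasPrefix(b, a+"/") {
				return status.Errorf(codes.InvalidArgument,
					"subscription %q overlaps with %q", b, a)
			}
		}
	}
	return nil
}
```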

LimeHat commented 3 weeks ago

No, I understand what you mean; I just don't see a problem with this.

If a client wants to subscribe to both, he can receive the updates accordingly.

E.g. the /components tree will be sampled every ~10 seconds (or whatever interval the client specified), and /components/component/linecard will be streamed in ON_CHANGE mode to the same client. As an implementer, I don't have any problem with that.

Of course the client can open two parallel RPCs to retrieve the same data, but is there a reason to force that behavior?

robshakir commented 3 weeks ago

Of course the client can open two parallel RPCs to retrieve the same data, but is there a reason to force that behavior?

Is there a reason to force the complexity of having to deal with this? It's not clear to me what the use case for such subscriptions would be -- in the case that you care about fast updates, use ON_CHANGE, and in the case that you care about periodic updates, use SAMPLE -- ON_CHANGE already gets you the most up-to-date data, so SAMPLE+ON_CHANGE is essentially identical to ON_CHANGE with a heartbeat_interval. It's cleanest if we don't have two ways to do the same thing.
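For reference, the ON_CHANGE-with-heartbeat equivalent would look roughly like this using the Go gNMI protos (gpb aliasing github.com/openconfig/gnmi/proto/gnmi; the 10s value is illustrative):

```go
sub := &gpb.Subscription{
	Path: &gpb.Path{Elem: []*gpb.PathElem{
		{Name: "components"}, {Name: "component"},
	}},
	Mode: gpb.SubscriptionMode_ON_CHANGE,
	// heartbeat_interval forces the current value to be re-sent at this
	// cadence even if nothing changed; gNMI expresses it in nanoseconds.
	HeartbeatInterval: uint64((10 * time.Second).Nanoseconds()),
}
```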

LimeHat commented 3 weeks ago

Is there a reason to force the complexity of having to deal with this?

I wouldn't necessarily agree about the complexity part: if we simply treat a subscription as a subscription, and return results accordingly, I don't even need to care where a subscription came from (the same list/RPC or a different one).

And if an implementation has challenges handling two different subscription modes for the same path, it will face the same challenge with two RPCs.

One of the use cases we see among our customers, for instance, is a subscription to the interface tree to retrieve packet counters and all other state periodically; at the same time, they subscribe specifically to the oper-status leaf in ON_CHANGE mode. I'm not sure why we should force them to use two streams instead of one, or refactor the subscription into multiple more specific paths[^1].

[^1]: And depending on the specific data model (since gNMI is widely used across the board, and not just with OC models) this refactoring might be very tricky.
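Concretely, that use case is a single SubscribeRequest along these lines (a sketch using the Go gNMI protos, with an illustrative 10s sample interval):

```go
req := &gpb.SubscribeRequest{
	Request: &gpb.SubscribeRequest_Subscribe{
		Subscribe: &gpb.SubscriptionList{
			Mode: gpb.SubscriptionList_STREAM,
			Subscription: []*gpb.Subscription{
				{
					// Periodic sampling of the whole interface tree.
					Path: &gpb.Path{Elem: []*gpb.PathElem{
						{Name: "interfaces"},
						{Name: "interface", Key: map[string]string{"name": "*"}},
					}},
					Mode:           gpb.SubscriptionMode_SAMPLE,
					SampleInterval: uint64((10 * time.Second).Nanoseconds()),
				},
				{
					// Event-driven updates for oper-status only.
					Path: &gpb.Path{Elem: []*gpb.PathElem{
						{Name: "interfaces"},
						{Name: "interface", Key: map[string]string{"name": "*"}},
						{Name: "state"},
						{Name: "oper-status"},
					}},
					Mode: gpb.SubscriptionMode_ON_CHANGE,
				},
			},
		},
	},
}
```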

robshakir commented 3 weeks ago

Is there a reason to force the complexity of having to deal with this?

I wouldn't necessarily agree about the complexity part: if we simply treat a subscription as a subscription, and return the results based on that, I don't even want to care about where this subscription came from (the same list/rpc or a different list/rpc).

And if an implementation has challenges with handling two different subscription modes to the same path, they will face the same challenge with two RPCs.

I don't necessarily agree with this. Asserting this kind of thing always ends up with us assuming more complexity at the target, which means more likelihood of divergence between implementations, which means a more inconsistent user experience. Thus, I'd like to try and make sure that we design for use cases that seem reasonable.

One of the use cases we see among our customers, for instance, is a subscription to the interface tree to retrieve packet counters and all other state periodically; at the same time, they subscribe specifically to the oper-status leaf in ON_CHANGE mode. I'm not sure why we should force them to use two streams instead of one, or refactor the subscription into multiple more specific paths[^1].

OK - so this makes sense, I assume the subscription is at two different nodes though - i.e., I do not allow ON_CHANGE -> /interfaces AND SAMPLE -> /interfaces at the same time? This will come down to how we define what overlapping means.

[^1]: And depending on the specific data model (since gNMI is widely used across the board, and not just with OC models) this refactoring might be very tricky.

Agreed -- but we also have to deal with "reasonable" data models :-) Catering to auto-generated schemas that have never considered their user experience seems a way to again negatively impact gNMI's usage.

LimeHat commented 3 weeks ago

Thus, I'd like to try and make sure that we design for use cases that seem reasonable.

That's fair; I think we just may have slightly different views on what is reasonable :-)

OK - so this makes sense, I assume the subscription is at two different nodes though [...] This will come down to how we define what overlapping means.

yes, in OC that would be something like

  1. /interfaces/interface[name=*]
  2. /interfaces/interface[name=*]/state/oper-status

(very similar to the example from @ejbrever).

i.e., I do not allow ON_CHANGE -> /interfaces AND SAMPLE -> /interfaces at the same time?

I think we may allow this currently, but I agree that this is probably not required. However, the original proposal covers a much broader scope (to disallow any overlaps).

hellt commented 3 weeks ago

I agree with @LimeHat here; a single subscribe request with an enclosed path list like

/interfaces/interface[name=*]                   (SAMPLE)
/interfaces/interface[name=*]/state/oper-status (ON_CHANGE)

is what I have seen gNMI users doing as well. Often, the reason they pack these partially overlapping paths into a single request has to do with scaling considerations as well as operational simplicity.

The scaling consideration is often rooted in the fact that there is a finite number of subscription requests a device may allow a client to open. This limit is distinct from the limit on subscription paths.

The operational simplicity of such a request comes from the fact that users have fewer subscriptions to open and manage.

ejbrever commented 3 weeks ago

Perhaps there is an underlying challenge here regarding how to succinctly write a single subscription for a container which has some leaves or sub-containers that a client would want streamed differently than the other pieces of that container?

TARGET_DEFINED certainly reduces the complexity of writing the subscription, but personally I prefer the behavior to be explicit, both to avoid misunderstandings and to avoid potential changes from release to release. Also, if we want to write tests to validate the behavior, this creates challenges as well, since we wouldn't really know what to test.

ON_CHANGE with a heartbeat solves many of my issues.

The last issue I see is very fast-changing data (e.g. some counters or analog values). In these cases, ON_CHANGE is problematic because it would constantly be sending data. I wonder if an inverse of the heartbeat concept would be interesting here. Essentially, if there were an option to set a "maximum updates per leaf per second", one could write a subscription with ON_CHANGE for anything at that point.
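To make the idea concrete, a purely hypothetical target-side gate could look like the following; nothing like this exists in gNMI today:

```go
// Hypothetical sketch only. A per-leaf gate that enforces a minimum
// spacing between ON_CHANGE updates, i.e. a "maximum updates per leaf
// per second".
type updateGate struct {
	minInterval time.Duration        // e.g. 1s for at most one update/s
	lastSent    map[string]time.Time // keyed by leaf path
}

func newUpdateGate(min time.Duration) *updateGate {
	return &updateGate{minInterval: min, lastSent: map[string]time.Time{}}
}

func (g *updateGate) allow(leafPath string, now time.Time) bool {
	if last, ok := g.lastSent[leafPath]; ok && now.Sub(last) < g.minInterval {
		return false // too soon: suppress (or coalesce) this update
	}
	g.lastSent[leafPath] = now
	return true
}
```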

I still struggle with the idea of overlapping requests because the expected behavior doesn't feel obvious, but having it defined in the spec would at least be quite helpful. However, if clients could create ON_CHANGE subscriptions with a heartbeat and with max-updates-per-leaf-per-sec, would there still be a need for overlap?

LimeHat commented 3 weeks ago

Perhaps there is an underlying challenge here regarding how to succinctly write a single subscription for a container which has some leaves or sub-containers that a client would want streamed differently than the other pieces of that container?

Is there? :)

We don't see a problem with treating each subscription independently. Why does it matter that a series of subscriptions (with different parameters) came in the same SubscriptionList message?

Or, why does "overlap" matter? Is there a fundamental reason why a subscription to a child of a container (with another active subscription) should be any different from a subscription to a leaf that is not streamed to the client in any form yet?

ejbrever commented 3 weeks ago

Yeah, the issue I ran into came down to deciphering what the behavior should be, and also to testability.

The first issue, deciphering the correct behavior, can certainly be solved just by updating the specification, even if overlap is allowed. Today, each vendor, or even each platform, can make its own choice about what the behavior should be, which is difficult to manage in a multi-vendor environment.

Regarding testability, I do have some concerns that this behavior is rather difficult to test. Each message streamed over the channel carries no association with the subscription it derives from, so writing a test to check that a SAMPLE leaf produced an update every X seconds, or that an ON_CHANGE leaf reacted as expected, gets quite messy. I don't really see a way to test this other than asserting that we receive at least the amount of data needed. And since we are hitting scaling limits on some platforms, an approach that sends unnecessary extra data doesn't seem ideal either.

LimeHat commented 3 weeks ago

I don't really see a way to test this other than asserting that we receive at least the amount of data needed.

If we stick to the example with

/interfaces/interface[name=*]                   (SAMPLE)
/interfaces/interface[name=*]/state/oper-status (ON_CHANGE)

the test should expect an update for all interfaces every x seconds (± tolerance), and for the ON_CHANGE subscription you have to simulate an event to trigger the update (with the exception of the initial update). So I think it is entirely possible to separate the two conditions; at the very least, you know when a simulated event takes place and can expect an extra update at that moment.
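Sketching that test shape in Go (all helper functions here are hypothetical, and req stands for the mixed-mode request discussed earlier in the thread):

```go
func TestMixedModeSubscription(t *testing.T) {
	stream := mustSubscribe(t, req) // hypothetical helper around gNMI.Subscribe

	// SAMPLE leg: updates under /interfaces should arrive every ~10s.
	ts := sampleTimestamps(t, stream, "/interfaces/interface", 35*time.Second)
	for i := 1; i < len(ts); i++ {
		if d := ts[i].Sub(ts[i-1]); d < 9*time.Second || d > 11*time.Second {
			t.Errorf("sample interval %v outside 10s tolerance", d)
		}
	}

	// ON_CHANGE leg: simulate the event, then expect a prompt update.
	flapInterface(t, "eth0") // hypothetical event trigger
	expectUpdateWithin(t, stream,
		"/interfaces/interface[name=eth0]/state/oper-status", 5*time.Second)
}
```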

the idea of an approach that sends more data unnecessarily doesn't seem ideal either.

I wouldn't call this "more data unnecessarily". I think it is fair to assume that if a client requests data at a certain frequency, it has a reason to do so (otherwise it wouldn't ask).

There are many ways to hit the scaling limits, and eliminating this one behavior on the server side doesn't seem to solve anything: