openconfig / reference

This repository contains reference implementations, specifications and tooling related to OpenConfig-based network management.
Apache License 2.0
155 stars 88 forks source link

Clarify gNMI expectation for persisting configuration #192

Open dplore opened 1 year ago

dplore commented 1 year ago

Currently there is no explicit requirement in gNMI regarding if a successful SetRequest should be saved to persistent storage on a device. gNMI intentionally doesn't define a method to make a configuration change that isn't persisted.

The gNMI reference should be clarified to require that a device must persist the configuration before sending a successful SetResponse.

hellt commented 1 year ago

@dplore are you referring to a behavior when a set request leads to saving dynamic configuration to a persistent storage -- e.g. making current running config saved to a file that is used when the device boots?

robshakir commented 1 year ago

couple of comments on this issue:

dplore commented 1 year ago

@hellt yes, saving dynamic configuration to a persistent storage which is used to load the configuration when the device reboots.

couple of comments on this issue:

  • (nit, but important) the RPC is Set, so we should think about this as a "successful Set" not a successful SetRequest.

This is correct of course. I'll use Set when referring the to RPC overall and the SetRequest message and SetResponse message when specifically referring to those messages and expected behavior. This issue is about fine grained behavior expectations of the Set RPC's Request and the Response behavior. So I've been using SetRequest and SetResponse terms to differentiate.

  • you say this should be before the SetResponse is sent. We should think carefully about this -- this will make the RPC blocking until the write of config has happened (could be latent) and means that you are mandating 1:1 write to Set correspondence. Consider rather whether this should be allowed asynchronously - i.e., the promise that is made is that the config will be persisted and telemetry can be used to determine if that write has been completed. This allows for higher Set throughput and optimisations like batch writing if you ended up with lots of simultaenous Set RPC calls.

This is a good point! The question you point out here is if the "persist" of the configuration should be part of the Set transaction or occur asynchronously from of the Set transaction. If persisting the configuration data is not part of the Set transaction, then the Set transaction could succeed and the persisting could fail.

Operationally we want the configuration to take effect and we want it persisted. Set is supposed to do both of these things. If persisting fails, we definitely still want the configuration to take effect. If the persisting fails, we want a warning/error which triggers an action to repair whatever it is that is causing the "persist" of the configuration to fail.

If we allow a Set to succeed asynchronously from the the configuration is being persisted, then we are reliant on undefined logging / debug information to detect failure of persisting a configuration. It seems bad to require a behavior in an RPC which is not also guaranteed. If we want to have the status of "persisting" a configuration to be explicit in a Set RPC, then I can think of three options:

  1. Require the Set to block on persisting a configuration. If the "persist" cannot be completed, the configuration should still be applied and an indication in the SetResponse indicating the configuration is applied, but not persisted. Probably this should also be a gRPC error condition.
  2. Do not require gNMI Set to persist a configuration, instead create another RPC specifically to "persist" a configuration.
  3. Define Set RPC to be asynchronous and introduce a new SetResponseStatus message which indicates the status of each step in the Set operation and if the total operation is complete. ie: configuration is 'validated' and 'persisted' as two separate steps.
  4. ClarifySet as a synchronous operation for "persisting" a configuration and introduce option 2 or 3 as a new RPC.

Option 1 keeps things simple and synchronous, but as you said, has tradeoffs. Options 2 and 3 allow a Set to asynchronous from "persisting" a configuration. (and are more complex) Option 4 is just a combination of 1 and (2 or 3).

IMO, Set should not be a frequent transaction. A configuration change once per minute would be a very high rate of configuration change. Given the latency we see for configuration changes today, depending on the platform and total size of a configuration, once per minute is within the same order of magnitude of the performance we are seeing for Seting large configurations on large devices.

Maybe there are other options.

My thoughts are to proceed with 1 now and add 2 or 3 based on OC Operator feedback. (Resulting in option 4, assuming something is added).

robshakir commented 1 year ago

I find the use of SetRequest and SetResponse in this context confusing :-) Set is an RPC that takes a payload of SetRequest, the response to the RPC is a message called SetResponse. SetRequest and SetResponse are just protos, they don't have "behaviours". The Set RPC has behaviours.

w.r.t your options... I would like to see data as to:

a. implementations conformance to each option b. performance implications of each option

and consideration of backwards compatibility. we have expected Set to persist configuration before - so something like 2 causes there to be cases that do not implement this.

I don't think relying on telemetry is a bad thing, which you assume above, I believe. Today, if we look at the consistency model we have we are saying:

thus, I'd argue that architecturally we've already said that telemetry is something that is required to understand the state of a configuration on a device. Equally, we've said that the Set being successful is a promise that the configuration will try to be applied.

We also see cases where we'd like faster Set operations, e.g., for fault mitigation. Requiring that write to persistent storage are synchronous seems counter to that aim.

hellt commented 1 year ago

There are applications that might require quite some number of consecutive sets (e.g. atomic-based configuration management systems that manage "small" configuration regions and thus responsible for sending changes to these regions). With this being said, I think ultimately requiring cfg persistency without being able to disable it is not flexible enough.

To add to the bucket of vendor's implementations, in SR Linux we have a configurable knob under the grpc server stanza (set to false by default) that when set will persist each successful Set.

robshakir commented 1 year ago

Based on the comments here, @dplore - I suggest we write a short document that describes the constraints on the design space, the existing guidance, and the potential solutions and review this with the various gNMI stakeholders.

dplore commented 1 year ago

Based on the comments here, @dplore - I suggest we write a short document that describes the constraints on the design space, the existing guidance, and the potential solutions and review this with the various gNMI stakeholders.

I am happy to do this. But this task has now grown to something that has to be scheduled in a future sprint.

robshakir commented 1 year ago

Thanks.

hellt commented 1 year ago

@robshakir you mentioned

we have expected Set to persist configuration before

Since this was not expressed in the spec I fear it might not be the case with implementations in the wild