gNMI.Set - Config Diff Extension

ejbrever commented 1 month ago

I'd like to propose a new gNMI extension to perform config diff on a device. Details are described below.

Objective Define an approach to validating that a configuration candidate sent using gNMI.Set() would result in zero configuration differences if applied to a device. If differences are identified, then they would need to be reported back.

Background In a production environment, migrating to OpenConfig based configuration from a CLI based configuration is a potentially risky activity. When a feature is migrated to OpenConfig, the migration should functionally be a no-op, however, there isn't a guarantee that the OpenConfig matches exactly with its CLI equivalent. The OpenConfig may be accepted and return success. However, a success could mean:

Config was applied and is exactly equivalent to the configuration currently on the device.
Config changes resulting in immediate undesired behavior.
No immediate/obvious changes, but small unknown changes are made.

Proper testing would provide a level of confidence for these migrations, however, it puts a burden on the tests being complete and comprehensive. In order to definitively validate that a feature works identically, test cases are required that exhaustively cover all edge cases. Whilst this is a desirable end state - the shorter-term requirement is to validate that existing behaviors have not changed.

Utilizing a defense-in-depth approach, relying on an additional config diff would provide more confidence to migrate configuration. Additionally, this would be helpful during development of new OpenConfig config generators as developers would have an easy mechanism to compare their code output against existing CLI based config generators.

Proposal A new mode could be supported within a gNMI.Set() to have the device perform a diff of the provided configuration against the existing configuration. This would result in two outcomes:

Zero configuration changes on the device.
The device describes what the diff would be if applied.

In either case, a configuration change is NEVER allowed to be applied to a device in this mode.

Specifically, the proposal of the mechanism to achieve this is to add an "impact" extension to the gNMI.SetRequest. It is suggested to utilize the gNMI.Set() RPC because an important need is to use the exact details within the SetRequest message to define the candidate release. This ensures the configuration sent in the normal mode or in this new proposed mode are identical. The SetRequest message could be reused in another RPC, but that seems more complex than simply having a new impact extension within the existing gNMI.Set().

ejbrever commented 1 month ago

@dplore @robshakir @marcushines

LimeHat commented 1 month ago

Can you please elaborate on a few things:

what is the format of the diff? openconfig, cli, or something else?
is there an assumption that the device uses separate datastores for openconfig and cli/native configurations?

In either case, a configuration change is NEVER allowed to be applied to a device in this mode.

I have a concern that the proposal may not be aligned with the following statement in the spec:

An extension that is defined to a gNMI RPC MUST NOT modify the base behaviour of
the RPC. That is to say fields MUST NOT be interpreted in a manner which does
not match the base specification. The success or failure of an RPC MAY be
impacted by the extension. If the base specification's behaviour is to be
modified, an implementation MUST define a new service which specifies the
modified RPCs.

https://github.com/openconfig/reference/blob/067d53c6250d198853a00677f1c087b1dd4e04e9/rpc/gnmi/gnmi-extensions.md?plain=1#L18-L24

dplore commented 1 month ago

Why not do a gnmi.Get and compare that to your planned config? ygot.Diff provides functionality to compare two OC structs.

ejbrever commented 1 month ago

Ah, yea, I think the device datastore is an important piece here. The running config may have been created using any combination of CLI, OC, native model, etc. A candidate config is also likely a combination of OC, native model, or even CLI via a union_replace. So from the start we know we are likely not comparing OC to OC or even YANG to YANG.

I think from experience we've seen any time we need the device to convert between config types (i.e. CLI -> OC) we introduce the risk of bugs. For this reason, I'd be hesitant to suggest we'd want to do something such as adding an extra step to normalize the entirety of both configs to something common like OC or CLI in an effort to perform this config diff.

The ultimate goal here is to ensure the candidate config will result in zero differences to the running config. I think the only source of truth at that point is the device's underlying datastore. I'd suspect what we'd really be asking to compare is the device's running config (as defined by its lowest layer datastore) against the candidate config (as defined by how the device would convert it into the lowest layer datastore).

In the case of a diff, that would result in output that perhaps won't be as readable or obvious as a bunch of clean OC paths, but that might be the tradeoff to guarantee and truly give the confidence that the diff is accurate. There shouldn't be any requirements for the output to be machine readable though.

Regarding the RPC, I'm not tied to using an extension, it just seemed like an option that would be on the less intrusive side, but if a new RPC is preferred (or even a new gNOI service) I'm open to that. We should give some consideration to the need to have something that matches 1:1 with the SetRequest and wouldn't deviate over time, so that might sway the choice one way or another.

LimeHat commented 1 month ago

The ultimate goal here is to ensure the candidate config will result in zero differences to the running config. I think the only source of truth at that point is the device's underlying datastore. I'd suspect what we'd really be asking to compare is the device's running config (as defined by its lowest layer datastore) against the candidate config (as defined by how the device would convert it into the lowest layer datastore).

I'm wondering if you can provide an example of how you plan to use this. Let's imagine you're starting with a config that has cli_statement_X=true, which can be converted to openconfig_statement_X=true

If we assume that the device has this extension, and you send a gnmi.set with an update (or replace) OC request openconfig_statement_X=true, then, depending on the implementation, a few things can happen: 1) the device accepts this config, and converts it to whatever format it uses internally, and this results in no-diff / no-op (without the extension) as you expect 2) the device accepts this config, and stores it in a separate OC datastore, and you still get a no-diff. the catch is that the cli native config can still live in a separate datastore and take precedence (or not, who knows at this point?)

in fact, i can imagine a scenario where you can send the oc openconfig_statement_X=false, and get a no-diff, just because the cli origin cli_statement_X=true is still in the candidate 3) the device rejects this config because there's a conflict with another origin

In the case of a diff, that would result in output that perhaps won't be as readable or obvious as a bunch of clean OC paths, but that might be the tradeoff to guarantee and truly give the confidence that the diff is accurate.

Still, what would be the format of the diff? Are we saying it can be vendor-specific and defined by the vendor?

ejbrever commented 1 month ago

Sure, I'll try to explain the use case better and how I'd think operators would use it:

=========================================================== Starting point: Device is in-service and configured with two features (featureA is using CLI, featureB is using OC).

Work needed: We'd like to migrate featureA from CLI -> OC.

Challenge: Sending a new config to an in-service device is really risky. The migration of featureA to use OC must be a no-op, but guaranteeing this is the challenge.

Migration work: First run a "config-diff" by providing the candidate config that sends both featureA and featureB using OC only.

If config diff reports zero diff, then proceed to migrate config to always send both featureA and featureB using OC.
If config diff reports a diff, then halt migration, go into an investigation mode to resolve issues with the config generation.

===========================================================

I think your points above highlight why I'd be nervous about the diff happening at any layer other than the lowest layer database in a device. Guaranteeing behavior between whatever higher layer config databases (OC, CLI, etc.) seems problematic as you highlighted.

===========================================================

As for the format of the diff, I see it really as just vendor-specific, referencing the lowest layer database. I'd hope the output could be minimally human readable, but the action taken from a diff is probably that dev's on the operator's side need to investigate the config gen. That investigation likely would also involve vendor dev's as well. But the diff's should not have a need for anything fancy or machine readable.

LimeHat commented 1 month ago

Migration work: First run a "config-diff" by providing the candidate config that sends both featureA and featureB using OC only.

How does that SetRequest look like? What is going to disable featureA, which is currently enabled via CLI, while simultaneously enabling featureA via OC[^1]? [^1]: assuming that the NOS makes that distinction, which may or may not be the case.

You can, potentially, find a way to do this via union replace[^2]. [^2]: assuming that you can specify the full configuration in a single union_replace transaction, and you tested that before on that specific vendor/device, and this is how you got your current running config, etc.

But even then, you'll face more challenges: not all config is actually specified via gNMI, according to the bootz specification and configuration namespaces defined there

The comlete "running config" is merged from many sources

boot configuration in native format
boot configuration in oc format
dynamic configuration in native format
dynamic configuration in oc format
security namespace

So now cli feature A can be enabled in boot config or in dynamic config or in both. And we can't solve that with union_replace or gnmi.

ejbrever commented 1 month ago

There isn't anything special with the SetRequest here. I also don't view it as enabling/disabling...just doing it with a different mechanism.

For example, trying to take the simplest config knob I can think of, the initial config generator may be:

Over SSH and via CLI:

config hostname lx01wr01.sfo01

Then when we are ready to migrate this config to OC, we could re-work the config generator and instead start doing this going forward (if deemed safe):

Over gRPC and via gNMI.Set():

replace: { path: { elem: { name: "openconfig-system:system" } } val: { json_val: "{ "openconfig-system:config": { "hostname": "lx01wr01.sfo01" }, } }

===========================================================

To some of your other questions:

I'm not sure I follow how union_replace would help here, but interested in what you are thinking there.
It's not actually a problem that all config cannot be sent through gNMI. This is because we are interested in comparing the complete running config from all sources (which the device already has) against the complete candidate config from all sources (which the device should know how to derive based on the gNMI.SetRequest going through the full logic the device does to process an incoming gNMI.SetRequest under normal circumstances).

LimeHat commented 1 month ago

The problem with multiple sources is that you don't know which source contributed the value to the running config.

If we stick with your example, and you start with config hostname lx01wr01.sfo01.

and then send a gnmi set/replace + diff

replace: {
path: {
elem: {
name: "openconfig-system:system"
}
}
val: {
json_val: "{
"openconfig-system:config": {
"hostname": "lx01wr01.sfo01"
},
}
}

you can receive a response that the final config is identical, and the change is no-op. But it doesn't guarantee that the first config statement can be replaced with the second.

If we modify the OC request to

replace: {
path: {
elem: {
name: "openconfig-system:system"
}
}
val: {
json_val: "{
"openconfig-system:config": {
"hostname": "THIS_SHOULD_FAIL"
},
}
}

You would expect to see a diff, but, depending on the specific implementation and configuration approach, it's not guaranteed. You can receive the "zero changes" response and think it is a no-op because the previous config lives in another datastore with higher precedence (e.g. bootconfig). And only when you remove it from the other source, the change will be visible.

In both scenarios, you can also receive a grpc error on gnmi set (with or without diff extension). If an implementation uses separate data stores for cli+oc and does not allow overlaps, that would be the result of trying to overwrite the value defined in another data store.

ejbrever commented 1 month ago

I think the underlying assumption here is that there is a datastore that each device maintains at the lowest layer (i.e. the datastore that has the highest precedence). This would be where the diff happens.

I should note, I did speak with one vendor who indicated this to be the case and seemed like supporting a feature like this would be relatively easy. NETCONF apparently offers a similar feature, although I don't have much knowledge of it, but because of that this vendor seems to have looked into doing this already.

That is an interesting scenario though where there might be a zero diff, but if some previous config persists in a datastore, you wouldn't really know its contribution to the final lowest layer datastore until it somehow got removed. The main goal is to know that the candidate config will be a zero diff operation, which it would still do though. There is now the risk that some future change doesn't behave as desired, but I think this could be acceptable (or at least dealt with using another approach).

LimeHat commented 1 month ago

NETCONF apparently offers a similar feature, although I don't have much knowledge of it, but because of that this vendor seems to have looked into doing this already.

NETCONF doesn't offer this as a part of the protocol, but a recent RFC (rfc9144), indeed, introduced a similar (optional) feature, and prior to that, there were a couple of other initiatives. The major difference, however, is that netconf (/ietf nmda) uses explicit and well-defined datastores for all operations (rfc8342), including the <compare> rpc, and has concepts of candidate datastore and commits. Everything that <compare> does can be as well implemented on the client side, by comparing the results of two <get-data> outputs; everything is explicit.

This proposal, on the other hand, introduces a layer of abstractions that is likely to produce different results in different implementations and environments, all while possibly not solving the original problem.

That is an interesting scenario though where there might be a zero diff, but if some previous config persists in a datastore, you wouldn't really know its contribution to the final lowest layer datastore until it somehow got removed. The main goal is to know that the candidate config will be a zero diff operation, which it would still do though.

Perhaps I don't understand the use case. My assumption was that in a config migration scenario (e.g. you want to replace cli config hostname with OC /system/config/hostname), you want to know what will happen when the two things occur simultaneously: 1) the cli config is removed (explicitly or implicitly), and 2) the new OC config is added

However, based on the conversation in this thread, it seems that you're ok to validate only the second part (config addition, which may coexist with the old config).

Personally, I don't think this solves the problem that you described initially.

liulk commented 1 month ago

I also think that gnmi.Set should focus on a single concern, namely to modify config in the simplest way as a parallel to Get. If we want to try a config, then maybe a separate call is better:

rpc Try(TryRequest) TryResponse;

message TryRequest {
  SetRequest set_request = 1;
  // More attributes to customize what to validate or compare.
}

message TryResponse {
  string diff = 1; // Could be more specific; open to suggestion.
}

I haven't looked into NETCONF, but it does sound more well-defined.

Somewhat off-topic, my other wishlist for gNMI that could have saved me a lot of headache during development is the ability to do config snapshots and auto-rollback if the config knocks the device offline for some reason. Being able to review the config diff during release qualification would also help greatly to catch programmer errors. Many vendors already support this in CLI, but it may require some thought how to design this in gRPC.

I think we would at least need a few more methods to make that possible: Stage/Compare, Commit/Abort/Rollback. It adds operational complexity that it might be a better fit for a new service.

Any opinions? @dplore @robshakir

robshakir commented 1 month ago

Let's keep this PR/issue focused on the core topic.

We deliberately did not expose these semantics (long-lived candidates, rollback across different transactions) in gNMI. They raise many corner cases that happen when humans are experimenting/during development, but generally are overhead otherwise. I'm sympathetic to the problems described, but do not support adding significant complexity to gNMI.

dplore commented 1 month ago

Regarding auto-rollback, there is a gnmi extension for commit (and confirm) which is for that purpose. https://github.com/openconfig/gnmi/blob/master/proto/gnmi_ext/gnmi_ext.proto#L97-L103

ejbrever commented 1 month ago

It seems there is some consensus to avoid adding to gNMI through an extension. In talking with @robshakir it sounds like a good alternative might be a new gNOI service dedicated to this. I'll explore that.

liulk commented 1 month ago

Is the plan to deprecate the gnmi_ext.proto and have this new gNOI subservice implement config staging, comparison and rollback? I agree that gnmi extension adds a lot of complexity to gnmi.Set semantics. On the other hand, if we split that to a separate service, then I think limiting the scope of that to just comparing might be a missed opportunity to streamline the config store lifecycle across all vendors.

robshakir commented 1 month ago

gNMI extensions provide more than this capability, so no, we will not deprecate gNMI extensions based on any decision in this issue.

liulk commented 1 month ago

Notwithstanding gNMI extension, will this new service be a solution to config store lifecycle? Limiting the scope to config comparison would be a missed opportunity to catch up to NETCONF.

dplore commented 1 month ago

The proposed scope of this issue / requirement for Diff encompasses back end data stores which is a significant expansion of scope for gNMI and what is IMO the root issue preventing this from being a simple addition or extension of gNMI.

We want to keep gNMI scoped to an external view of the configuration. One can compare a Get with a Set using ygot.Diff today which returns gNMI formatted notification messages.

One could also get the vendor native yang and CLI config as well using CLI. The CLI config will just be an ascii blob which could only be a text based diff on the client side. better than nothing, but not a 'semantic' diff like we get with yang based schema.

LimeHat commented 1 month ago

@ejbrever

It seems there is some consensus to avoid adding to gNMI through an extension. In talking with @robshakir it sounds like a good alternative might be a new gNOI service dedicated to this. I'll explore that.

For the record, I don't necessarily object to adding a diff operation to gNMI (via an extension or by other means)

But I do have a number of other concerns/questions about the proposal, such as

does it solve the problem stated in the description (I have doubts).
is it aligned with the overall architecture defined by openconfig standards (including the gnmi datastores and bootz namespaces).
opaque definitions and complexity that will lead to (potentially significant) differences in implementations and outcomes. this, by the way, will lead to validation difficulties as well (I think @ejbrever had raised this concern in another topic recently)
possibility of getting false negative (false "no diff") results due to the previous two points

And these challenges will not disappear if this idea will be moved out to gNOI

openconfig / gnmi

gNMI.Set - Config Diff Extension #179