Discuss resetting gRIBI election ID on Cisco

liulk commented 2 years ago

This is to continue discussion in PR #125 about how to reset gRIBI election ID on Cisco. @mojiiba @mingyangcisco

mingyangcisco commented 2 years ago

Hi @liulk , we didn't see API in gribi to reset the election ID. Is there specific way you want to get that reset? also adding @ashpatcisco to the discussion.

liulk commented 2 years ago

It's not currently part of gRIBI API for sure. I can see several options:

Add gRIBI reset to the API.
Restart gRIBI over gNOI, maybe KillProcess: https://github.com/openconfig/gnoi/blob/master/system/system.proto#L86
Restart gRIBI when applying gNMI config changes to shutdown and "no shutdown" gRIBI agent.

Adding @robshakir to comment on which approach is the correct one.

FYI, @greg-dennis this may need to be part of device config reset between Ondatra tests.

robshakir commented 2 years ago

I'm not sure that resetting the election ID makes sense for production use cases, but I could be convinced, we haven't had any specific ask for this.

Killing the gRIBI daemon seems a reasonable step here - via gNOI. We'd have to figure out here how to reference gRIBI specifically. Restarting the daemon via disabling/enabling the daemon sounds a bit more kludgy, but also possible.

Primarily -- let's start out with the discussion of whether we think that this is a production requirement. Could you also describe what the use case is in featureprofiles too please? //cc @xw-g

liulk commented 2 years ago

@robshakir this came up in a testing situation where each test should start with a well-defined state, free from the leftover state of a previous test whether the previous test crashed or just didn't clean up.

@dplore, @greg-dennis and I discussed these options today, and we think that config disabling/enabling seem like a good idea because it also standardizes on how to configure the other g* agents in OpenConfig, something that I think is missing.

It could be something like how we configure /system/ssh-server or /system/ntp.

http://ops.openconfig.net/branches/models/master/docs/openconfig-system.html

dplore commented 2 years ago

gNOI process restart sounds elegant and more aligned with the desired operational action, but won't it be vendor specific due to process names etc? With the config approach it is openconfig generic. Now, we still need some way in openconfig to turn on gRIBI as that is currently not defined.

On Thu, Apr 21, 2022 at 3:15 PM Likai Liu @.***> wrote:

@robshakir https://github.com/robshakir this came up in a testing situation where each test should start with a well-defined state, free from the leftover state of a previous test whether the previous test crashed or just didn't clean up.

@dplore https://github.com/dplore, @greg-dennis https://github.com/greg-dennis and I discussed these options today, and we think that config disabling/enabling seem like a good idea because it also standardizes on how to configure the other g* agents in OpenConfig, something that I think is missing.

It could be something like how we configure /system/ssh-server or /system/ntp.

http://ops.openconfig.net/branches/models/master/docs/openconfig-system.html

— Reply to this email directly, view it on GitHub https://github.com/openconfig/featureprofiles/issues/147#issuecomment-1105813565, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABMCGM64CQ3AMP6HARN6YB3VGHHP3ANCNFSM5T5QXL5Q . You are receiving this because you were mentioned.Message ID: @.***>

mojiiba commented 2 years ago

While, we may need a mechanism to reset the election id on the server side. I still think assumptions based on a certain election id for tests is unnecessary. Gribi provides a way to learn the last known election id so we need to use the last know election id as the base when we define test cases. For example rather than assuming "the clientB to sends election id 10", we can assume "clientB sends election id = last known_election_id +1". I have add a function BecomeLeader that handles this scenario.

xw-g commented 2 years ago

Sorry, missed this thread. +1 to @mojiiba's point.

Also, an option to further improve the solution is to wrap the value in an atomic uint64 that we can only increase it during the test. Here is one example with electionID.Inc()

Agree with @robshakir that currently we don't have a prod use case for resetting the election_id.

liulk commented 2 years ago

Sure, we can make the test more resilient about not hard coding election ID (and we need to update the test plan to do that), but from the test infrastructure's point of view, we still need a way to reset device state. There is a long-standing Google testing philosophy that testing environment should be hermetic, so tests are reproducible.

https://testing.googleblog.com/2012/10/hermetic-servers.html

xw-g commented 2 years ago

I agree, also I see there might be different level of "hermetic" here, each with different methods of doing it:

Reset gRIB content in RIB/FIB -- To avoid the influence of gRIBI objects from previous tests. This can be done via Flush().
Reset election_id -- To make sure correct Primary and Secondary clients are decided. My question is that if we can archive the same goal via @mojiiba 's helper function, why we want to reset it?
Reset the whole gRIBI server state -- gRIBI server might be in a bad/problem state from previous test. This can be done either via gNOI or the configs. Delay of shutdown/start the daemon should be expected.
Reset the whole device -- The device might be in a bad/problem state from previous test. This can be done by rebooting the device. If we can afford the relatively long delay of rebooting the device between tests, of course this would be cleanest way.

My point is that sounds like Reset gRIB content RIB/FIB and Reset election_id are good enough for us for now to maintain the development velocity? Meanwhile, we can start developing gNOI and gRIBI OC config (which takes time).

liulk commented 2 years ago

After speaking to @greg-dennis yesterday, we decided that the first step is to implement a hook in the static binding to run something to the effect of "copy startup-config running-config" or other user-specified commands that could facilitate resetting the device config and state at the beginning of a wbb test. This would be similar to KNE config_data except it will be run over SSH.

https://github.com/google/kne/blob/main/proto/topo.proto#L126

greg-dennis commented 2 years ago

I will be clarifying the Ondatra Reserve() contract and will probably be adding a Config().Reset() method as well.

mojiiba commented 2 years ago

Support for loading base config is necessary and useful. However using ssh will be vendor-dependent in terms of the configuration text and the way ssh session is handled by devices. I would suggest using GNMI Set rather than SSH. As far as I know, we can even push text configuration (ascii_val) using GNMI Set.

liulk commented 2 years ago

While we do require the test proper to be vendor neutral, the infrastructure code that supports these tests will be interfacing with some vendor dependent parts. I think the static binding could provide SSH, gNMI SetRequest, and maybe some gNOI hooks to be called, and these need not be mutually exclusive.

openconfig / featureprofiles

Discuss resetting gRIBI election ID on Cisco #147