napalm-automation / napalm

Network Automation and Programmability Abstraction Layer with Multivendor support
Apache License 2.0
2.26k stars 554 forks source link

commit confirm support #1118

Open rbcollins123 opened 4 years ago

rbcollins123 commented 4 years ago

I'm interested in contributing a PR to add this ability (and subsequently pass it upstream to napalm_ansible via another PR also), but wanted to gauge interest from maintainers on accepting something along those lines? We have a number of use-cases where it would be advantageous to have the commit auto-reverted quickly if connectivity to the target EOS device was lost due to a change in the config-session (this is a lot cleaner than scheduling and cancelling a reload), but I'd like to understand if efforts to add would be well received here or if we should fork to do what we need?

This feature was added back in EOS v4.18.0F which already went end-of-support in August 2019. So, it could be made the default operation method for all commits and the min-supported release of EOS for Napalm could be raised from 4.15.0F to 4.18.0F fairly safely these days. Or, leave min release at 4.15.0F and add optional keyword var to enable/disable it. In all cases new keyword vars would be needed to control the timer value at minimum. Ideally that would be via something like revert_in to control when it auto-reverts and confirm_in to control how long you wait before confirming the session (which would default to 0 secs but allow you to add some delay if you want to allow for network convergence events before confirming the commit).

I see there was similar work in #550 on this previously for some non-EOS drivers, but it was abandoned and I couldn't see why from the issue comments?

tlund42 commented 4 years ago

This would be fantastically useful for us. We currently do the scheduled reload trick to revert in case we make a bad deploy and it seems a bit messy.

cdorry commented 4 years ago

Agreed this would be so much cleaner than the scheduled reload and also reduce the impact as config replace would be so much quicker than device reload.

rbcollins123 commented 4 years ago

Bump! @ktbyers i noticed you did a lot of work on #550. Any thoughts on the below?

I'm interested in contributing a PR to add this ability (and subsequently pass it upstream to napalm_ansible via another PR also), but wanted to gauge interest from maintainers on accepting something along those lines? We have a number of use-cases where it would be advantageous to have the commit auto-reverted quickly if connectivity to the target EOS device was lost due to a change in the config-session (this is a lot cleaner than scheduling and cancelling a reload), but I'd like to understand if efforts to add would be well received here or if we should fork to do what we need?

This feature was added back in EOS v4.18.0F which already went end-of-support in August 2019. So, it could be made the default operation method for all commits and the min-supported release of EOS for Napalm could be raised from 4.15.0F to 4.18.0F fairly safely these days. Or, leave min release at 4.15.0F and add optional keyword var to enable/disable it. In all cases new keyword vars would be needed to control the timer value at minimum. Ideally that would be via something like revert_in to control when it auto-reverts and confirm_in to control how long you wait before confirming the session (which would default to 0 secs but allow you to add some delay if you want to allow for network convergence events before confirming the commit).

I see there was similar work in #550 on this previously for some non-EOS drivers, but it was abandoned and I couldn't see why from the issue comments?

ktbyers commented 4 years ago

Yes, we tried to do this in late 2017 and it eventually got abandoned due to how much work it was (abandoned was mostly be me i.e. implementing it + tests and documentation was going to take too much time).

We could open up discussions on it, but some caveats:

  1. Any solution has to apply to all drivers (excluding NX-OS which I don't think has any viable solution for this). This doesn't mean we have to have an implementation for all drivers at the the beginning, but the methods/arguments we decide on need to consider all the core platforms (excluding NX-OS).
  2. Probably would start by agreeing on new methods and new arguments.
  3. After that tests would need built out for all the new methods and new arguments.
  4. At that point we would stub out all of the drivers with the methods and with a NotImplementedError (but that would get all of the underlying structure and tests in place).
  5. commit_config() would default to its current behavior (i.e. no commit confirm by default). Reason for that is a lot of people don't wouldn't want commit confirm to be the default behavior and since NX-OS has no way of supporting this consistency across drivers would require commit_config() to remain as it is.

Definitely open to more discussions on this, but it is a lot of work.

ktbyers commented 4 years ago

Here is relevant info from issue #550


Did some more research and review (which also helped me remember past research):

IOS - Should be able to do reasonable system-wide solution.

NX-OS - Does not have any reasonable solution (AFAIK). The only thing I see is a kludge that we shouldn't do. This is also what I found in past research. We can ask in Cisco channel or NAPALM channel and see if anyone know of a reasonable solution, but I am not seeing one.

EOS - I think there is support, but a specific version is required. I believe I had this discussion with @bewing previously.

Junos - should be fine

IOS-XR - Can you commit-confirm in a different SSH session on IOS-XR? I wasn't seeing a good way of doing this in brief testing.

rbcollins123 commented 4 years ago

Yes, we tried to do this in late 2017 and it eventually got abandoned due to how much work it was (abandoned was mostly be me i.e. implementing it + tests and documentation was going to take too much time).

We could open up discussions on it, but some caveats:

  1. Any solution has to apply to all drivers (excluding NX-OS which I don't think has any viable solution for this). This doesn't mean we have to have an implementation for all drivers at the the beginning, but the methods/arguments we decide on need to consider all the core platforms (excluding NX-OS).
  2. Probably would start by agreeing on new methods and new arguments.
  3. After that tests would need built out for all the new methods and new arguments.
  4. At that point we would stub out all of the drivers with the methods and with a NotImplementedError (but that would get all of the underlying structure and tests in place).
  5. commit_config() would default to its current behavior (i.e. no commit confirm by default). Reason for that is a lot of people don't wouldn't want commit confirm to be the default behavior and since NX-OS has no way of supporting this consistency across drivers would require commit_config() to remain as it is.

Definitely open to more discussions on this, but it is a lot of work.

Thanks @ktbyers for the feedback. My main concern was if I could avoid the implementation for the other drivers as we need the EOS use-case 1st. I'm fine with just adding NotImplemented stubs for all the other core drivers and tests, etc. I'll start taking a look at it next week and making some plans for initial implementation

ktbyers commented 4 years ago

Yes, I think we need to consider all of the core NAPALM platforms as far as the behavior and methods we define (so that as they eventually get implemented for other platforms the methods make sense/don't need to be changed).

That being said we should not require all of the platforms to actually be implemented i.e. we should just implement them one at a time (as someone who has interests in that platform decides to do a PR). This was probably the biggest problem we did in 2017 (i.e. tried to say we needed N-platforms to have this support before we integrated the PR).

mayuresh82 commented 4 years ago

so why not just add it to junos to begin with ? Junos has supported this since day 1, all the other vendors are playing catch-up. Why are we imposing the need for additional methods here ? A revert_commit() method is not required, as it defeats the purpose of doing a commit confirmed which by definition, will revert the commit anyway after the given time. The correct way of doing is is by simply requesting a rollback(1) before the timer expires, which should happen in client code. A has_pending_commit() method is also a nice-to-have and I see no reason why this is being imposed as a requirement. It doesnt make sense to impose the burden of figuring out pending commits on the library when it should be done by client by other means (e.g running a show|compare on junos) Lets not block an essential feature for the sake of it especially when this can be easily achieved by simply passing an optional argument similar to how ignore_warning was implemented for junos.

ktbyers commented 4 years ago

@mayuresh82 This already exists on Juniper's PyEZ and consequently is accessible by NAPALM (i.e. you can access the underlying PyEZ driver via NAPALM). Consequently, if your primary concern is Juniper, then this is already available.

A main purpose of napalm (probably the main purpose) is to make a common set of methods across different dissimilar platforms. This implies that the we need to consider the other platforms when creating the methods (including which methods should exist or not exist).

rbcollins123 commented 4 years ago

Sorry, took me a while to get back to this. I took a 1st pass at the base methods/args from the past work from @ktbyers and threw PR #1179 up as initial code for discussion. Let me know thoughts or suggestions for edits. Once we agree on the base implementation I'll submit a PR for the EOS driver to implement all of these and then others can do the same or other drivers. Thanks!

ktbyers commented 4 years ago

@rbcollins123 Nag me if I don't start looking at this by mid-May. I am worried it might just slip through the cracks...

rbcollins123 commented 4 years ago

@rbcollins123 Nag me if I don't start looking at this by mid-May. I am worried it might just slip through the cracks...

OK will do @ktbyers!

mirceaulinic commented 4 years ago

Just wanted to add a note to make sure this doesn't lead to false assumptions. For example, I noticed this:

We currently do the scheduled reload trick to revert in case we make a bad deploy and it seems a bit messy.

If you do a bad deploy and you lost access to the device because of it, NAPALM won't be able to save you unless you're using a platform that supports commit confirm natively. In other words, if you're on, for example, Cisco IOS or others similar, NAPALM can at most revert to the previous config IF and only if still able to connect to the device.

rbcollins123 commented 4 years ago

Yep understood. I assumed we would address and document it accordingly in the future PRs for each OS driver. This PR was to ONLY get the base class methods and args agreed upon and in place so we could all do PRs to enhance each driver in the coming months. I’ve already stubbed our the EOS stuff but won’t submit until we all agree and merge this base PR first.

On Thu, Apr 30, 2020 at 6:19 AM Mircea Ulinic notifications@github.com wrote:

Just wanted to add a note to make sure this doesn't lead to false assumptions. For example, I noticed this:

We currently do the scheduled reload trick to revert in case we make a bad deploy and it seems a bit messy.

If you do a bad deploy and you lost access to the device because of it, NAPALM won't be able to save you unless you're using a platform that supports commit confirm natively. In other words, if you're on, for example, Cisco IOS or others similar, NAPALM can at most revert to the previous config IF and only if still able to connect to the device.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/napalm-automation/napalm/issues/1118#issuecomment-621744869, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC2S4CUUZDSCGHLAIY3BR2TRPFGDNANCNFSM4KQUVLKA .

rbcollins123 commented 4 years ago

@rbcollins123 Nag me if I don't start looking at this by mid-May. I am worried it might just slip through the cracks...

Hey @ktbyers, bumping this and PR #1179 on your stack. Just looking for thoughts on the methods/args/naming with everything set to raise NotImplementedError for now. Then we will do small individual PRs for each OS driver once this is merged.

ktbyers commented 4 years ago

Correcting something here (from statement above)...Cisco IOS can do this (I basically implemented it in a POC at one point).

Will try to start working on this a bit today through Wednesday.

ktbyers commented 4 years ago

@rbcollins123 Haven't forgotten...just haven't got to it yet.

gcotone commented 4 years ago

:+1: for this feature. We've implemented this in a napalm-community driver by passing the confirm (optional) argument.

It would be nice if we could standardize behavior/naming as @rbcollins123 was suggesting.

ktbyers commented 4 years ago

@gcotone Relevant PR is here:

https://github.com/napalm-automation/napalm/pull/1179

gcotone commented 3 years ago

Description for method confirm_commit() should be updated since commit_confirm=True is not used at all: https://github.com/napalm-automation/napalm/blob/b44332f88336b6fd42fdb967958a2661bcf2c958/napalm/base/base.py#L248

ktbyers commented 3 years ago

Yes, that needs updated... @gcotone

Regards, Kirk

mirceaulinic commented 3 years ago

As this has been merged into the develop branch, before releasing rc1, would anyone mind adding some documentation around the new features? Thanks.

mirceaulinic commented 3 years ago

@ktbyers @rbcollins123 any thoughts? I'd like to release 3.3.0 soon, but I can't do this without having this major feature documented; I'll check back in a couple of weeks - and unless anyone that has been in the loop contributes with some documentation on this, I'm afraid I'll revert it from the develop branch, and leave it open as a PR till someone will come up with the docs. Sounds good?

ktbyers commented 3 years ago

I will see if I can write some docs on it. Give me 10 days though. Sound reasonable?

mirceaulinic commented 3 years ago

Sure, thanks @ktbyers - as long as you need, just a confirmation is all I needed. :smile:

ktbyers commented 3 years ago

First pass at docs are here:

https://github.com/napalm-automation/napalm/pull/1379

Kircheneer commented 2 years ago

This comment can be ignored, work is being done in #1538.

I would be willing to give commit confirm for config replacement for the IOS driver a shot. I see that there was some work on that done by @DanSheps here but the effort seems to have slowed down.

Proposed solution

Use the built-in Cisco IOS feature to accomplish automatic rollback on error/timeout:

config replace flash:candidate.cfg force revert trigger error time $SOMETHING

Caveats

_See here._

This solution comes with the following caveats:


Any thoughts on this?

DanSheps commented 2 years ago

I am not sure what you mean by slowed down. I am pretty sure what I provided does exactly what it needs to, plus, unlike your proposal it also covers merging and not just replace.

I believe that PR just needs to be reviewed for completeness by the team.

DanSheps commented 2 years ago

Nevermind, I see that it failed on a check and I didn't notice, I will see if I can fix that up shortly.

Kircheneer commented 2 years ago

Very sorry, I did not see that there was a PR for it - I thought that you just had commits on a branch and never submitted a PR. In that case you can ignore that comment. Thanks!