nickrusso42518 / nots

[Ansible] Nick's OSPF TroubleShooter
BSD 3-Clause "New" or "Revised" License

Nick's OSPF TroubleShooter (nots)

A simple but powerful Ansible playbook to troubleshoot OSPF network problems on a variety of platforms. It is simple because it does not require extensive preparatory configuration for individual host state checking. It is powerful because despite not having the aforementioned level of granularity, it rapidly discovers the vast majority of OSPF problems.

Contact information:
Email: njrusmc@gmail.com
Twitter: @nickrusso42518

Supported platforms

Today, Cisco IOS/IOS-XE, IOS-XR, and NX-OS are supported. Valid device_type options used for inventory groups are enumerated below. Each platform has a folder in the devices/ directory, such as devices/ios/. Within each folder, main.yml is the task list that the main playbook includes to begin the device-specific tasks.

Testing was conducted on the following platforms and versions:

Control machine information:

$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.4 (Maipo)

$ uname -a
Linux ip-10-125-0-100.ec2.internal 3.10.0-693.el7.x86_64 #1 SMP
  Thu Jul 6 19:56:57 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

$ ansible --version
ansible 2.8.7
  config file = /home/ec2-user/racc/ansible.cfg
  configured module search path = ['/home/ec2-user/.ansible/plugins/modules',
    '/usr/share/ansible/plugins/modules']
  ansible python module location =
    /home/ec2-user/environments/racc287/lib/python3.7/site-packages/ansible
  executable location = /home/ec2-user/environments/racc287/bin/ansible
  python version = 3.7.3 (default, Aug 27 2019, 16:56:53)
    [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]

Summarized test cases

The following tests are run in sequence. Note that the exact items tested vary between platforms since command outputs and feature sets also vary. Administrative tasks, such as creating directories and logging on the control machine, are not detailed here for brevity.

Per device testing

Ansible logs into each OSPF router for the purpose of collecting information and validating its correctness based on a small amount of pre-identified state configuration. As discussed in the "variables" section, some of these tests can be skipped by modifying the appropriate key/value pairs.

The tests run on each device type are enumerated in the README.md file inside the corresponding subdirectory of devices/.

Whole network testing

After individual routers are validated, additional tests based on the aggregated data from all routers are run. It is possible to run these tests on a per-host basis, but that would effectively cause the same test to be run N times rather than one time.
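
The reasoning above can be sketched in a few lines of Python. This is only an illustration of "aggregate once, test once": the duplicate router ID check, the `duplicate_router_ids` function, and the `collected` data structure are hypothetical and are not taken from the playbook.

```python
from collections import Counter


def duplicate_router_ids(host_facts):
    """Return router IDs claimed by more than one host (hypothetical check)."""
    counts = Counter(facts["router_id"] for facts in host_facts.values())
    return sorted(rid for rid, seen in counts.items() if seen > 1)


# Hypothetical per-host data gathered during the per-device phase.
collected = {
    "csr1": {"router_id": "10.0.0.1"},
    "csr2": {"router_id": "10.0.0.2"},
    "csr3": {"router_id": "10.0.0.2"},
}

# Evaluated a single time over the aggregated data, not once per host.
duplicates = duplicate_router_ids(collected)
print(duplicates)
```

Running the same function inside a per-host loop would produce the identical answer N times, which is the redundancy the whole-network phase avoids.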

Operations

This solution uses a GNU Makefile to simplify setup and daily operations. The following make targets are supported.

Variables

The following subsections detail the different types of variables, their scopes, and their purposes within the playbook.

Process-level

This playbook assumes that all OSPF routers are in a single process, and if they are not, only a single process can be checked at a time.

Process-level variables differ between device types. For a list of supported process variables, reference the individual README.md files in each devices/ subdirectory correlated to each device type.

Area-level

This playbook allows an unlimited number of areas to be specified, each with its own area-specific configuration. The playbook assumes that there are no duplicate areas in the network. For example, while it is possible to have two disparate area 1 sections of the network tied into area 0, this playbook does not support it.

The top-level key is the area ID, specified as a string in the format "area#" where # is the ID itself. For example: "area0" and "area51"
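
As a small illustration of the key format, the Python sketch below builds and parses keys such as "area0" and "area51". The function names are hypothetical, and accepting dotted-decimal area IDs is an assumption made for this example rather than documented playbook behavior.

```python
import ipaddress


def area_key(area_id):
    """Build the top-level key for an area, e.g. 51 -> 'area51'.

    Accepting dotted-decimal input such as '0.0.0.51' is an assumption
    of this sketch, not something the playbook is known to do.
    """
    text = str(area_id)
    if "." in text:
        text = str(int(ipaddress.IPv4Address(text)))
    return f"area{int(text)}"


def parse_area_key(key):
    """Recover the integer area ID from a key such as 'area51'."""
    if not key.startswith("area") or not key[4:].isdigit():
        raise ValueError(f"not an area key: {key!r}")
    return int(key[4:])
```

Because the key is a string, a quoted form such as "area0" avoids any ambiguity in YAML about whether the value is a number.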

Device group level

Each device type (ios, iosxr, etc.) has its own group_vars/ file which contains OS-specific parameters. These should never be changed by consumers as their main purpose is abstraction, not user input.

One special consideration is extended to ios. Because classic IOS and IOS-XE have minor differences in the commands they support, they use different command lists. The IOS-related group variables are as follows:

An example inventory might look like this:

all:
  children:
    ospf_routers:
      children:
        ios:
          children:
            iosxe:
              hosts:
                CSR1000:
                ISR4451:
            iosclassic:
              hosts:
                C3945E:
                C3750X:
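
In Ansible, a host inherits variables from every group it belongs to, with child groups taking precedence over their parents. The Python sketch below only illustrates which groups each host in the example inventory falls under; the nested dictionary mirrors the inventory above, and `groups_for` is a hypothetical helper, not how Ansible itself resolves groups.

```python
# Nested groups copied from the example inventory; None marks a host.
inventory = {
    "all": {
        "ospf_routers": {
            "ios": {
                "iosxe": {"CSR1000": None, "ISR4451": None},
                "iosclassic": {"C3945E": None, "C3750X": None},
            }
        }
    }
}


def groups_for(host, tree, path=()):
    """Return the chain of groups containing host, outermost first."""
    for name, child in tree.items():
        if child is None:
            if name == host:
                return list(path)
        else:
            found = groups_for(host, child, path + (name,))
            if found is not None:
                return found
    return None


print(groups_for("CSR1000", inventory))
print(groups_for("C3750X", inventory))
```

Both hosts share the ios group, so common settings can live there, while the iosxe and iosclassic groups carry only what differs between the two.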

Note that some extra commands are appended to the end of the commands list which are used for collection only. The output from these commands is written to the host log which can assist with troubleshooting, but it is not parsed or checked in any way within the logic of the playbook.

Host level

This playbook aims to minimize the number of host-specific variables as managing these inventory variables becomes burdensome in large networks.

Logging

Given the generic nature of the playbook, some tests will fail with generic error messages. For example, one host may fail because a router had an incorrect number of actual neighbors, either greater than or less than the user-configured my_nbr_count expectation. By design, the playbook lacks the granularity to determine which neighbor failed and on which interface. Logging can be toggled on and off by setting the log variable to true or false.
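
To make the trade-off concrete, here is a minimal Python sketch of a count-only neighbor check. Only the my_nbr_count name comes from this document; the function name and the message wording are hypothetical.

```python
def check_nbr_count(host, actual_nbrs, my_nbr_count):
    """Return None on success, otherwise a generic failure message.

    Only the totals are compared, so the message cannot say which
    neighbor is missing or unexpected, or on which interface.
    """
    if len(actual_nbrs) == my_nbr_count:
        return None
    return (
        f"{host}: expected {my_nbr_count} OSPF neighbors, "
        f"saw {len(actual_nbrs)}"
    )


print(check_nbr_count("csr1", ["10.0.0.2"], 2))
```

A check like this needs a single integer per host, which is why the failure detail has to come from the log files rather than from the error message.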

CLI output from all commands is written to a file in the logs/ directory. Every execution of the playbook creates a subdirectory named nots_<date/time>/ which contains all the individual log files. The date/time uses ISO8601 short format, such as 20180522T134558. Log files are not version controlled and are excluded from git automatically. For example, three playbook runs against an inventory of two hosts (csr1 and csr2) would yield a log directory like this:

$ tree logs/
logs
├── nots_20180522T192916
│   ├── csr1.txt
│   └── csr2.txt
├── nots_20180522T194610
│   ├── csr1.txt
│   └── csr2.txt
└── nots_20180522T197133
    ├── csr1.txt
    └── csr2.txt
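
The naming scheme can be reproduced with a short Python sketch. The function names and the `base` parameter are hypothetical; only the logs/ directory, the nots_ prefix, the timestamp format, and the per-host .txt files come from the description above.

```python
from datetime import datetime
from pathlib import Path


def run_log_dir(now, base="logs"):
    """Directory for one playbook run, e.g. logs/nots_20180522T134558."""
    return Path(base) / f"nots_{now.strftime('%Y%m%dT%H%M%S')}"


def host_log_path(now, host, base="logs"):
    """Log file for a single host within that run's directory."""
    return run_log_dir(now, base) / f"{host}.txt"


stamp = datetime(2018, 5, 22, 19, 29, 16)
print(host_log_path(stamp, "csr1").as_posix())
```

Because the timestamp is zero-padded and ordered from year down to second, sorting the directory names alphabetically also sorts the runs chronologically.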

Within each log file, the output of every command is wrapped in leading and trailing comment blocks that name the command issued. These logs are useful for finding out why the playbook failed without having to manually log into failing hosts. The example below shows the beginning of an IOS-based platform log file with many redactions for brevity:

$ cat logs/nots_20180522T194610/csr1.txt
!!!
!!! Start command: show ip ospf 1
!!!
Routing Process "ospf 1" with ID 10.0.0.1
 Start time: 00:02:24.532, Time elapsed: 00:48:30.920
 Supports only single TOS(TOS0) routes
[snip, more output]
!!!
!!! End command:   show ip ospf 1
!!!
!!!
!!! Start command: show ip ospf 1 neighbor
!!!
Neighbor ID     Pri   State           Dead Time   Address         Interface
10.0.0.2          1   FULL/DR         00:00:39    192.168.102.2   Tunnel102
10.0.0.2          0   FULL/  -        00:00:37    192.168.101.2   Tunnel101
!!!
!!! End command:   show ip ospf 1 neighbor
!!!
[snip, more commands]
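
Since every command is bracketed by the same markers, a log file is easy to split back into per-command output when reviewing a failure. The Python sketch below is a hypothetical helper, not part of the playbook, and it assumes the marker text matches the IOS example above exactly. It also drops any output line consisting solely of "!!!".

```python
import re

START = re.compile(r"^!!! Start command: (.+)$")
END = re.compile(r"^!!! End command:\s+(.+)$")


def split_log(text):
    """Map each command to the output lines found between its markers."""
    sections = {}
    current = None
    for line in text.splitlines():
        start = START.match(line)
        if start:
            current = start.group(1).strip()
            sections[current] = []
        elif END.match(line):
            current = None
        elif current is not None and line.strip() != "!!!":
            sections[current].append(line)
    return sections


sample = "\n".join([
    "!!!",
    "!!! Start command: show ip ospf 1 neighbor",
    "!!!",
    "Neighbor ID     Pri   State           Dead Time   Address         Interface",
    "10.0.0.2          1   FULL/DR         00:00:39    192.168.102.2   Tunnel102",
    "!!!",
    "!!! End command:   show ip ospf 1 neighbor",
    "!!!",
])
sections = split_log(sample)
print(list(sections))
```

The same function could be pointed at a real file with `split_log(open(path).read())`, where `path` is one of the per-host log files.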

FAQ

Q: Most code across IOS, IOS-XR, and NX-OS is the same. Why not combine it?
A: The goal is to support more platforms in the future such as Cisco ASA-OS, and possibly non-Cisco devices. These devices will likely return different sets of information. This tool is designed to be simple, not particularly advanced through layered abstractions.

Q: Why not use an API like RESTCONF or NETCONF instead of SSH + CLI?
A: This tool is designed for risk-averse users or managers that are not rapidly migrating to API-based management. It is not an infrastructure-as-code solution and does not manage device configurations. All of the commands used in the playbook can be issued at privilege level 1 to further reduce risk. With the exception of updating the login credentials and populating the necessary variables, there is no complex setup work required.

Q: Why not parse the OSPF interfaces? Many errors occur at this level.
A: Parsing individual interfaces would require state declarations on a per-host basis to determine what each interface should have. This defeats the purpose of a simple, low-effort solution which uses only area and process level parameters for verification. Furthermore, the detailed statistics checking will alert the user to many errors (authentication, MTU mismatch, etc.) at a more general level. The user can check the logs to see the exact command output, which includes the non-parsed interface text.

Q: For NX-OS, why didn't you use the | json filter from the CLI?
A: While this would have saved a lot of parsing code, I did not want to have an inconsistent overall strategy for one network device. Additionally, the filter does not render milliseconds properly (e.g., SPF throttle timers), which reduced my confidence in its overall accuracy.