Rare ConnectionError causing playbook failures

chris-tomkins-flexgrid commented 4 months ago

ISSUE TYPE

Bug Report

SOFTWARE VERSIONS

pynautobot

pip freeze | grep pynautobot
pynautobot==2.2.0

Ansible:

ansible --version
ansible [core 2.16.0]
  config file = /root/flexgrid-netbuild/ansible.cfg
  configured module search path = ['/root/.ansible/plugins/modules', '/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/local/lib/python3.10/dist-packages/ansible
  ansible collection location = /root/.ansible/collections:/usr/share/ansible/collections
  executable location = /usr/local/bin/ansible
  python version = 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (/usr/bin/python3)
  jinja version = 3.1.2
  libyaml = True

Nautobot:

2.2.6

Collection:

ansible-galaxy collection list | grep nautobot
networktocode.nautobot 5.2.1

SUMMARY

We have runs that make hundreds of API calls to Nautobot. Very rarely, one of these calls fails with a ConnectionError, which makes the entire run appear to have failed and causes reporting issues.

STEPS TO REPRODUCE

- name: Creating prefix records in Nautobot
  when: not temporaryskiprun and upstream_agg_switch is defined
  networktocode.nautobot.prefix:
    api_version: "2.1"
    prefix: "{{ wansubnetusuallyslash29 }}/29"
    location:
      name: "{{ nautobot_site }}"
    namespace: "INTERNET.inet.0"
    state: present
    status: "{{ 'Active' if not awaitingfirstprovision else 'Reserved' }}"
    tenant: "{{ nautobot_tenant }}"
    token: "{{ nautobot_read_write_token }}"
    type: Network
    url: https://{{ nautobot_api_ip }}
    validate_certs: false  # TODO - sort the certs on Nautobot so this isn't required
    vlan:
      name: "{{ inventory_hostname | replace(\"-\", \".\") | lower }}.inner"
      site: "{{ hostvars[upstream_agg_switch].nautobot_site }}"
      tenant: "{{ nautobot_tenant }}"
  register: result_for_tagging

EXPECTED RESULTS

ok: [cpe.xxx.xxx.xxx]

(this is what I see almost every time)

ACTUAL RESULTS

Error scenario (very roughly 1 in 1,000 calls):

      fatal: [cpe.xxx.xxx.xxx]: FAILED! =>
        msg: |-
          Traceback (most recent call last):
            File "/usr/local/lib/python3.10/dist-packages/ansible/module_utils/connection.py", line 210, in send
              response = recv_data(sf)
            File "/usr/local/lib/python3.10/dist-packages/ansible/module_utils/connection.py", line 79, in recv_data
              d = s.recv(header_len - len(data))
          ConnectionResetError: [Errno 104] Connection reset by peer

          During handling of the above exception, another exception occurred:

          Traceback (most recent call last):
            File "/usr/local/lib/python3.10/dist-packages/ansible/cli/scripts/ansible_connection_cli_stub.py", line 315, in main
              conn.set_options(direct=options)
            File "/usr/local/lib/python3.10/dist-packages/ansible/module_utils/connection.py", line 194, in __rpc__
              response = self._exec_jsonrpc(name, *args, **kwargs)
            File "/usr/local/lib/python3.10/dist-packages/ansible/module_utils/connection.py", line 155, in _exec_jsonrpc
              out = self.send(data)
            File "/usr/local/lib/python3.10/dist-packages/ansible/module_utils/connection.py", line 214, in send
              raise ConnectionError(
          ansible.module_utils.connection.ConnectionError: unable to connect to socket /root/.ansible/pc/c317ea50ac. See the socket path issue category in Network Debug and Troubleshooting Guide

          During handling of the above exception, another exception occurred:

          Traceback (most recent call last):
            File "/usr/local/bin/ansible-connection", line 8, in <module>
              sys.exit(main())
            File "/usr/local/lib/python3.10/dist-packages/ansible/cli/scripts/ansible_connection_cli_stub.py", line 318, in main
              raise ConnectionError('Unable to decode JSON from response set_options. See the debug log for more information.')
          ansible.module_utils.connection.ConnectionError: Unable to decode JSON from response set_options. See the debug log for more information.

Running the exact same command again will result in a successful run.

This happens regardless of whether the API call was going to make any actual change.

joewesch commented 4 months ago

I believe using ansible retries should work. Can you try that?

We also have a retries argument in pynautobot, but we don't expose it (outside of the lookup plugin), in favor of the built-in Ansible retry option.
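
For reference, a minimal sketch of what task-level retries could look like on the task from the report. The `retries` and `delay` values are assumptions chosen for illustration, and the task is trimmed to a few of its original arguments. Whether the retry loop catches this particular persistent-connection failure has not been verified here.

```yaml
- name: Creating prefix records in Nautobot
  networktocode.nautobot.prefix:
    url: https://{{ nautobot_api_ip }}
    token: "{{ nautobot_read_write_token }}"
    prefix: "{{ wansubnetusuallyslash29 }}/29"
    namespace: "INTERNET.inet.0"
    status: Active
    state: present
  register: result_for_tagging
  # Re-run the task until it stops failing, up to `retries` extra attempts.
  until: result_for_tagging is not failed
  retries: 3  # assumed value, tune to taste
  delay: 5    # assumed value: seconds to wait between attempts
```

`retries`, `delay`, and `until` are task keywords, so they have to be repeated on each task that needs them.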

chris-tomkins-flexgrid commented 4 months ago

Thanks. It's undesirable to add this to every single play (as any of them could be vulnerable to the issue), but I'll investigate if we can somehow do it globally.

joewesch commented 4 months ago

One possible solution would be to let pynautobot read the number of retries from an environment variable (e.g. PYNAUTOBOT_RETRIES).
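
If such a variable were added, it could be set once per play instead of on every task. The sketch below is purely hypothetical: `PYNAUTOBOT_RETRIES` is only a proposal in this thread and is not read by pynautobot 2.2.0 or the collection as described above. The play name, `hosts: all`, and the value `"3"` are placeholders.

```yaml
- name: Provision records in Nautobot  # placeholder play name
  hosts: all                           # placeholder host pattern
  environment:
    # Hypothetical variable proposed in this issue; not an existing option.
    PYNAUTOBOT_RETRIES: "3"
  tasks:
    - name: Creating prefix records in Nautobot
      networktocode.nautobot.prefix:
        url: https://{{ nautobot_api_ip }}
        token: "{{ nautobot_read_write_token }}"
        prefix: "{{ wansubnetusuallyslash29 }}/29"
        status: Active
        state: present
```

The play-level `environment` keyword is standard Ansible and applies to every task in the play, which is what would make this approach global.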
