kernelci / kernelci-pipeline

Modular pipeline based on the KernelCI API
GNU Lesser General Public License v2.1

Sudden connection error in pipeline services #506

Open JenySadadia opened 7 months ago

JenySadadia commented 7 months ago

After starting the API and pipeline services, everything worked fine for some time. Then the monitor, tarball, and scheduler-k8s services suddenly stopped. The other pipeline and API services were still running normally when this issue was observed.

Error logs:

today at 10:13:04 03/27/2024 04:43:04 AM UTC [ERROR] Traceback (most recent call last):
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 198, in _new_conn
today at 10:13:04    sock = connection.create_connection(
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 60, in create_connection
today at 10:13:04    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
today at 10:13:04               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/socket.py", line 962, in getaddrinfo
today at 10:13:04    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
today at 10:13:04               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04 socket.gaierror: [Errno -5] No address associated with hostname
today at 10:13:04
today at 10:13:04 The above exception was the direct cause of the following exception:
today at 10:13:04
today at 10:13:04 Traceback (most recent call last):
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 793, in urlopen
today at 10:13:04    response = self._make_request(
today at 10:13:04               ^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 491, in _make_request
today at 10:13:04    raise new_e
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 467, in _make_request
today at 10:13:04    self._validate_conn(conn)
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 1099, in _validate_conn
today at 10:13:04    conn.connect()
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 616, in connect
today at 10:13:04    self.sock = sock = self._new_conn()
today at 10:13:04                       ^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 205, in _new_conn
today at 10:13:04    raise NameResolutionError(self.host, self, e) from e
today at 10:13:04 urllib3.exceptions.NameResolutionError: <urllib3.connection.HTTPSConnection object at 0x7f073630b1d0>: Failed to resolve 'staging.kernelci.org' ([Errno -5] No address associated with hostname)
today at 10:13:04
today at 10:13:04 The above exception was the direct cause of the following exception:
today at 10:13:04
today at 10:13:04 Traceback (most recent call last):
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 486, in send
today at 10:13:04    resp = conn.urlopen(
today at 10:13:04           ^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 847, in urlopen
today at 10:13:04    retries = retries.increment(
today at 10:13:04              ^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 515, in increment
today at 10:13:04    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
today at 10:13:04    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04 urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='staging.kernelci.org', port=9000): Max retries exceeded with url: /latest/listen/18845 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f073630b1d0>: Failed to resolve 'staging.kernelci.org' ([Errno -5] No address associated with hostname)"))
today at 10:13:04
today at 10:13:04 During handling of the above exception, another exception occurred:
today at 10:13:04
today at 10:13:04 Traceback (most recent call last):
today at 10:13:04  File "/home/kernelci/pipeline/base.py", line 69, in run
today at 10:13:04    status = self._run(context)
today at 10:13:04             ^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/home/kernelci/./pipeline/monitor.py", line 60, in _run
today at 10:13:04    event = self._api.receive_event(sub_id)
today at 10:13:04            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/kernelci/api/latest.py", line 138, in receive_event
today at 10:13:04    resp = self._get(path)
today at 10:13:04           ^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/kernelci/api/__init__.py", line 66, in _get
today at 10:13:04    resp = requests.get(
today at 10:13:04           ^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/api.py", line 73, in get
today at 10:13:04    return request("get", url, params=params, **kwargs)
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/api.py", line 59, in request
today at 10:13:04    return session.request(method=method, url=url, **kwargs)
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
today at 10:13:04    resp = self.send(prep, **send_kwargs)
today at 10:13:04           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
today at 10:13:04    r = adapter.send(request, **kwargs)
today at 10:13:04        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
today at 10:13:04  File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 519, in send
today at 10:13:04    raise ConnectionError(e, request=request)
today at 10:13:04 requests.exceptions.ConnectionError: HTTPSConnectionPool(host='staging.kernelci.org', port=9000): Max retries exceeded with url: /latest/listen/18845 (Caused by NameResolutionError("<urllib3.connection.HTTPSConnection object at 0x7f073630b1d0>: Failed to resolve 'staging.kernelci.org' ([Errno -5] No address associated with hostname)"))
today at 10:13:04
today at 10:24:09 Container stopped

It seems like something is blocking the pipeline services from accessing the API. Maybe it is a sysadmin-related issue? @nuclearcat
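The innermost exception in the traceback is `socket.gaierror`, so the failure happens at name resolution, before any connection to the API is attempted. A minimal sketch of a check that could be run from inside an affected container to see whether resolution fails intermittently is shown below. The function name `resolve_with_retry` and its parameters are hypothetical and not part of kernelci-pipeline; the `resolver` argument exists only so the logic can be exercised without network access.

```python
import socket
import time


def resolve_with_retry(host, port, attempts=3, delay=1.0,
                       resolver=socket.getaddrinfo):
    """Resolve host up to `attempts` times and return its sorted addresses.

    Hypothetical helper for diagnosis only. Re-raises the last
    socket.gaierror if every attempt fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            infos = resolver(host, port, type=socket.SOCK_STREAM)
            # Each entry is (family, type, proto, canonname, sockaddr);
            # sockaddr[0] is the IP address for both IPv4 and IPv6.
            return sorted({info[4][0] for info in infos})
        except socket.gaierror as exc:
            last_error = exc
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


if __name__ == "__main__":
    # Host and port taken from the traceback above.
    try:
        print(resolve_with_retry("staging.kernelci.org", 9000))
    except socket.gaierror as exc:
        print(f"resolution failed: {exc}")
```

If the lookup succeeds on a later attempt, that would point to flaky DNS rather than a firewall or API outage.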

nuclearcat commented 7 months ago

I noticed that DNS resolution has been unreliable for the last few days on Azure services in general; it is even affecting deploy scripts. Unfortunately there is not much we can do yet. We might add more DNS servers to the network config.

r-c-n commented 7 months ago

It's happening again, it seems. If these services are meant to be long-lived, could we introduce some kind of mechanism to re-launch them before we move to production? It's not a good idea at this moment, since some of them are still under development and could exit due to a programming error, and we don't want to keep re-launching them in those cases.
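One possible middle ground, sketched below under stated assumptions, is to retry inside the service loop only on network-level errors and let every other exception terminate the service as it does today. This is not how kernelci-pipeline currently behaves; `run_with_reconnect`, `step`, and all parameters are hypothetical names. The sketch relies on `socket.gaierror` and `requests.exceptions.ConnectionError` both being subclasses of `OSError`.

```python
import time


def run_with_reconnect(step, max_failures=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call step() in a loop, backing off on network errors only.

    step() returns True to keep looping and False to stop cleanly.
    Hypothetical sketch: exceptions that are not OSError (for example
    KeyError or TypeError from a programming error) propagate at once.
    """
    failures = 0
    while True:
        try:
            keep_going = step()
        except OSError:
            failures += 1
            if failures > max_failures:
                # Too many consecutive failures: give up and exit loudly.
                raise
            # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay.
            sleep(min(max_delay, base_delay * 2 ** (failures - 1)))
            continue
        # A successful step resets the consecutive-failure counter.
        failures = 0
        if not keep_going:
            return
```

A caveat with this design is that `OSError` also covers unrelated problems such as `FileNotFoundError`, so a real implementation would probably want to catch a narrower set of exceptions.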

nuclearcat commented 7 months ago

I added three more resolver entries on the staging host, but I'm not sure whether that will help with the Docker services. I will investigate further now.
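To see whether resolver entries added on the host actually reach a container, the nameservers listed in each `/etc/resolv.conf` can be compared. The sketch below is a hypothetical helper, not part of the project. Containers on user-defined Docker networks typically list only Docker's embedded resolver (`127.0.0.11`), which forwards to the host's configured servers, so the container's file may look different from the host's even when the new entries are in use.

```python
def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text.

    Hypothetical helper: skips blank lines and comment lines that
    start with '#' or ';'.
    """
    servers = []
    for raw_line in resolv_conf_text.splitlines():
        line = raw_line.strip()
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


if __name__ == "__main__":
    # Run on the host and inside a container, then compare the output.
    with open("/etc/resolv.conf", encoding="utf-8") as handle:
        print(parse_nameservers(handle.read()))
```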