lipoja / URLExtract

URLExtract is a Python class for collecting (extracting) URLs from a given text by locating TLDs.
MIT License

[Errno 11002] Temporary failure in name resolution after using URLExtract #163

Open jackjyq opened 5 months ago

jackjyq commented 5 months ago

After running URLExtract, the requests module raises [Errno 11002] Temporary failure in name resolution.

See the code below:

import requests
from urlextract import URLExtract

def call_extract_url():
    extractor = URLExtract()
    urls = extractor.find_urls(
        "https://www.baidu.com", check_dns=True, get_indices=False
    )
    print(f"call_extract_url() returns {urls}")

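def call_request() -> None:
    # Plain HTTP request; name resolution goes through socket.getaddrinfo.
    r = requests.get("https://qr.1688.com/s/Q7XG2SzD", timeout=30)
    print(f"call_request() returns {r.status_code}")

call_request()      # succeeds (200)
call_extract_url()  # runs URLExtract with check_dns=True
call_request()      # fails with [Errno 11002]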
The results:

```shell
call_request() returns 200
call_extract_url() returns ['https://www.baidu.com']
Traceback (most recent call last):
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\dns\resolver.py", line 1874, in _getaddrinfo
    answers = _resolver.resolve_name(host, family)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\dns\resolver.py", line 1440, in resolve_name
    v6 = self.resolve(
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\dns\resolver.py", line 1321, in resolve
    timeout = self._compute_timeout(start, lifetime, resolution.errors)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\dns\resolver.py", line 1075, in _compute_timeout
    raise LifetimeTimeout(timeout=duration, errors=errors)
dns.resolver.LifetimeTimeout: The resolution lifetime expired after 2.002 seconds: Server Do53:172.24.248.17@53 answered The DNS operation timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connection.py", line 203, in _new_conn
    sock = connection.create_connection(
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\util\connection.py", line 60, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\dns\resolver.py", line 1883, in _getaddrinfo
    raise socket.gaierror(socket.EAI_AGAIN, "Temporary failure in name resolution")
socket.gaierror: [Errno 11002] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connectionpool.py", line 790, in urlopen
    response = self._make_request(
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connectionpool.py", line 491, in _make_request
    raise new_e
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connectionpool.py", line 467, in _make_request
    self._validate_conn(conn)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connectionpool.py", line 1096, in _validate_conn
    conn.connect()
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connection.py", line 611, in connect
    self.sock = sock = self._new_conn()
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connection.py", line 210, in _new_conn
    raise NameResolutionError(self.host, self, e) from e
urllib3.exceptions.NameResolutionError: : Failed to resolve 'qr.1688.com' ([Errno 11002] Temporary failure in name resolution)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\requests\adapters.py", line 486, in send
    resp = conn.urlopen(
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\connectionpool.py", line 844, in urlopen
    retries = retries.increment(
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\urllib3\util\retry.py", line 515, in increment
    raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='qr.1688.com', port=443): Max retries exceeded with url: /s/Q7XG2SzD (Caused by NameResolutionError(": Failed to resolve 'qr.1688.com' ([Errno 11002] Temporary failure in name resolution)"))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\Git\chatwoot-connector\scratch.py", line 20, in
    call_request()
  File "d:\Git\chatwoot-connector\scratch.py", line 14, in call_request
    r = requests.get("https://qr.1688.com/s/Q7XG2SzD", timeout=30)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\requests\api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\requests\api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\requests\sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\requests\sessions.py", line 703, in send
    r = adapter.send(request, **kwargs)
  File "D:\Git\chatwoot-connector\venv\lib\site-packages\requests\adapters.py", line 519, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='qr.1688.com', port=443): Max retries exceeded with url: /s/Q7XG2SzD (Caused by NameResolutionError(": Failed to resolve 'qr.1688.com' ([Errno 11002] Temporary failure in name resolution)"))
```

As the output shows, call_request() succeeds before call_extract_url() is called and fails afterwards. I think this is caused by dns_cache_install(): when I comment out dns_cache_install(), the second call_request() succeeds. Could this side effect be removed from URLExtract?
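A minimal sketch of one possible caller-side workaround, under stated assumptions. The traceback shows urllib3's `socket.getaddrinfo` call landing in dnspython's `_getaddrinfo`, which suggests the resolver functions on the `socket` module were replaced process-wide. The context manager below (`preserve_system_resolver`, a hypothetical helper that is not part of URLExtract) snapshots those functions and puts them back on exit. `fake_library_call` is a stand-in that imitates the override so the sketch runs without URLExtract or network access; it is not the library's actual code.

```python
import contextlib
import socket

# Name-resolution entry points on the socket module that a process-wide
# resolver override could replace (assumption: this list covers the override).
_RESOLVER_FUNCS = (
    "getaddrinfo",
    "getnameinfo",
    "getfqdn",
    "gethostbyname",
    "gethostbyname_ex",
    "gethostbyaddr",
)


@contextlib.contextmanager
def preserve_system_resolver():
    """Snapshot socket's resolver functions and restore them on exit."""
    saved = {name: getattr(socket, name) for name in _RESOLVER_FUNCS}
    try:
        yield
    finally:
        for name, func in saved.items():
            setattr(socket, name, func)


def fake_library_call():
    """Hypothetical stand-in for a library that patches the resolver globally."""

    def failing_getaddrinfo(*args, **kwargs):
        raise socket.gaierror(
            socket.EAI_AGAIN, "Temporary failure in name resolution"
        )

    socket.getaddrinfo = failing_getaddrinfo


original = socket.getaddrinfo

with preserve_system_resolver():
    fake_library_call()
    # Inside the block the override is active.
    patched_inside = socket.getaddrinfo is not original

# After the block the original function is back in place.
restored = socket.getaddrinfo is original
print(patched_inside, restored)
```

In the reproduction above, the same idea would mean wrapping both the `URLExtract()` construction and the `find_urls(..., check_dns=True)` call in the `with` block, so that later `requests` calls use the stock resolver again. dnspython also provides `dns.resolver.restore_system_resolver()`, which undoes `override_system_resolver()`; calling it after extraction may be a simpler alternative. Either approach drops any caching benefit outside the block and does not fix the side effect in the library itself.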

Environment

lipoja commented 5 months ago

Thank you for reporting it. I would like to ask @jayvdb whether he has time to take a look at this, since he implemented the DNS check.