nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License

Allow custom proxy settings with requests sessions #501

Open maawoo opened 7 months ago

maawoo commented 7 months ago

I'm trying to download GEDI data on my university's HPC system. The following sample code results in a ConnectionError:

results = earthaccess.search_data(
    short_name='GEDI02_A',
    bounding_box=(31.52,-25.08,31.64,-24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1
)
ConnectionError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: [/search/granules.umm_json](https://vscode-remote+ssh-002dremote-002bdraco2.vscode-resource.vscode-cdn.net/search/granules.umm_json)?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f69ef2d40b0>: Failed to establish a new connection: [Errno 111] Connection refused'))

My initial thought was that the API is not whitelisted in our HTTP/HTTPS proxies, which are set via environment variables. However, according to our sysadmin this should not be an issue. I was able to confirm by requesting the same URL via curl:

>> curl "https://cmr.earthdata.nasa.gov/search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0"
{"hits":92,"took":394,"items":[]}

Any ideas / workarounds would be appreciated!
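For reference, the same CMR query can be issued with requests directly, which helps isolate whether the problem is earthaccess-specific. A sketch: preparing the request is entirely offline, and only the final `get()` would touch the network (and, by default, honor the `http_proxy`/`https_proxy` env vars):

```python
import requests

# Same query parameters as the failing earthaccess.search_data call.
params = {
    "short_name": "GEDI02_A",
    "bounding_box": "31.52,-25.08,31.64,-24.99",
    "temporal[]": "2019-01-01T00:00:00Z,2024-01-01T00:00:00Z",
    "page_size": 0,
}

# Build the URL without sending anything.
prepared = requests.Request(
    "GET",
    "https://cmr.earthdata.nasa.gov/search/granules.umm_json",
    params=params,
).prepare()
print(prepared.url)

# resp = requests.get(prepared.url)  # would perform the actual request
```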

mfisher87 commented 7 months ago

It's really hard to say what's going on here without knowing more about your university HPC system. Based on the error, it looks like VSCode is somehow involved?

https://vscode-remote+ssh-002dremote-002bdraco2.vscode-resource.vscode-cdn.net/search/granules.umm_json

Can you provide some more detail on how VSCode is involved in your workflow? host='cmr.earthdata.nasa.gov' indicates that earthaccess is at least attempting to talk to the correct host, and the Requests library seems to agree!

maawoo commented 7 months ago

Hi @mfisher87, I overlooked that, so thanks for pointing it out. However, I still get an error when executing the code outside of VSCode.

Here is the error traceback (intermediate frames elided):
```python
---------------------------------------------------------------------------
ConnectionRefusedError                    Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py:73, in create_connection(address, timeout, source_address, socket_options)
     72     sock.bind(source_address)
---> 73 sock.connect(sa)

ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

NewConnectionError                        Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py:218, in HTTPConnection._new_conn(self)
    217 except OSError as e:
--> 218     raise NewConnectionError(
    219         self, f"Failed to establish a new connection: {e}"
    220     ) from e
...
NewConnectionError: Failed to establish a new connection: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

MaxRetryError                             Traceback (most recent call last)
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/retry.py:515, in Retry.increment(self, method, url, response, error, _pool, _stacktrace)
    514 reason = error or ResponseError(cause)
--> 515 raise MaxRetryError(_pool, url, reason) from reason  # type: ignore[arg-type]

MaxRetryError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: /search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError(...))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
Cell In[3], line 1
----> 1 results = earthaccess.search_data(
      2     short_name='GEDI02_A',
      3     bounding_box=(31.52,-25.08,31.64,-24.99),
      4     temporal=("2019-01-01", "2024-01-01"),
      5     count=-1
      6 )

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/earthaccess/api.py:120, in search_data(count, **kwargs)
--> 120 granules_found = query.hits()

File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/earthaccess/search.py:388, in DataGranules.hits(self)
--> 388 response = self.session.get(url, headers=self.headers, params={"page_size": 0})
...
File ~/micromamba/envs/woody_env/lib/python3.12/site-packages/requests/adapters.py:519, in HTTPAdapter.send(self, request, stream, timeout, verify, cert, proxies)
--> 519 raise ConnectionError(e, request=request)

ConnectionError: HTTPSConnectionPool(host='cmr.earthdata.nasa.gov', port=443): Max retries exceeded with url: /search/granules.umm_json?short_name=GEDI02_A&bounding_box=31.52,-25.08,31.64,-24.99&temporal%5B%5D=2019-01-01T00:00:00Z,2024-01-01T00:00:00Z&page_size=0 (Caused by NewConnectionError(...))
```

I get the same error in a clean environment with Python 3.11.8 instead of 3.12.2.

I also tried downgrading the package (to 0.7.0) and noticed that it prints out the number of granules found before the error:

>>> earthaccess.search_data(
...     short_name='GEDI02_A',
...     bounding_box=(31.52,-25.08,31.64,-24.99),
...     temporal=("2019-01-01", "2024-01-01"),
...     count=-1
... )
Granules found: 92
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
...

Any other ideas of what I could do?

maawoo commented 7 months ago

Okay, I found the explanation in this icepyx discussion. Ping @betolink 🙂 Any suggestion on using earthaccess.search_data and earthaccess.download with an updated requests session?

betolink commented 7 months ago

Hi @maawoo, I think this could be resolved if we let users pass proxy settings through to requests. In the meantime, you can manually get a session, modify it, and download the files yourself, though that somewhat defeats the purpose:

import shutil
from itertools import chain  # to flatten the results

import earthaccess

earthaccess.login()

# Define your proxy
proxy = {
    'http': 'http://your_proxy_address:port',
    'https': 'https://your_proxy_address:port'
}

results = earthaccess.search_data(
    short_name='GEDI02_A',
    bounding_box=(31.52,-25.08,31.64,-24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1
)

links = list(chain.from_iterable(r.data_links() for r in results))
session = earthaccess.get_requests_https_session()
session.proxies.update(proxy)

for url in links:
    local_filename = url.split("/")[-1]
    path = f"temp_dir/{local_filename}"  # assumes temp_dir/ already exists
    with session.get(url, stream=True, allow_redirects=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            shutil.copyfileobj(r.raw, f, length=1024 * 1024)

This is not concurrent, so there is room for improvement. As I said, we should implement the proxy here, but my guess is that it won't be ready in the next week.
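The sequential loop above could be parallelized with a thread pool. A minimal sketch of the pattern, where `download_one` is a hypothetical stand-in for the streaming `session.get` logic in the workaround:

```python
from concurrent.futures import ThreadPoolExecutor

def download_one(url: str) -> str:
    # Stand-in: in the real workaround this would stream `url` to disk
    # through the proxied session and return the local path.
    return url.split("/")[-1]

urls = [
    "https://example.com/granule_a.h5",
    "https://example.com/granule_b.h5",
]

# map() preserves the input order of urls in its results.
with ThreadPoolExecutor(max_workers=4) as pool:
    names = list(pool.map(download_one, urls))
print(names)
```

Threads are a reasonable fit here because the work is I/O-bound; the pool size should stay modest to avoid hammering the proxy or the data servers.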

maawoo commented 7 months ago

Thank you for the possible workaround!

> my guess is that it won't be ready in the next week

No worries! I already have the data I need. My plan was to integrate earthaccess into some scripts, but that can wait for now.

chuckwondo commented 1 month ago

The requests library makes use of urllib, and urllib recognizes environment variables of the form <scheme>_proxy (either uppercase or lowercase). Therefore, you should be able to simply set the environment variable https_proxy or HTTPS_PROXY to the appropriate value.

However, whether those env vars are used is determined by the boolean value of trust_env on the requests.Session object used for making requests. By default, trust_env is True and the proxy env vars are honored; when trust_env is False, they are ignored. earthaccess sets trust_env to False in some situations, so there may be cases where it will not pick up the env vars.
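The trust_env behavior can be demonstrated offline with `Session.merge_environment_settings`, the method requests uses to fold environment settings into each request (the proxy address below is made up):

```python
import os
import requests

# Make the environment deterministic for the demo: clear competing
# proxy vars, then set a single (hypothetical) HTTPS proxy.
for var in ("https_proxy", "http_proxy", "HTTP_PROXY",
            "all_proxy", "ALL_PROXY", "no_proxy", "NO_PROXY"):
    os.environ.pop(var, None)
os.environ["HTTPS_PROXY"] = "http://proxy.example:3128"

session = requests.Session()  # trust_env is True by default
env_settings = session.merge_environment_settings(
    "https://cmr.earthdata.nasa.gov/search", {}, None, None, None
)
print(env_settings["proxies"])  # the HTTPS_PROXY value shows up here

session.trust_env = False  # proxy env vars are now ignored
isolated_settings = session.merge_environment_settings(
    "https://cmr.earthdata.nasa.gov/search", {}, None, None, None
)
print(isolated_settings["proxies"])  # {}
```

No request is sent; the method only computes the settings that would be applied.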

I suggest attempting to export your https_proxy env var appropriately, and retrying your example to see if that works for you.

chuckwondo commented 1 month ago

@maawoo, if you want a workaround until we can come up with a robust and secure solution, here's something based on the thread in https://github.com/nsidc/earthaccess/pull/823. It combines code from a few comments in that PR with some minor renaming/refactoring.

First, define a set_proxies function:

import os
from functools import cache, wraps
from typing import Callable
from typing_extensions import ParamSpec

import earthaccess
import requests

P = ParamSpec("P")

def set_proxies(f: Callable[P, requests.Session]) -> Callable[P, requests.Session]:
    @wraps(f)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> requests.Session:
        session = f(*args, **kwargs)
        # For each scheme, prefer the lowercase env var (e.g. https_proxy)
        # and fall back to the uppercase one (HTTPS_PROXY); schemes with
        # no proxy configured are skipped.
        session.proxies.update(
            {
                scheme: v
                for scheme in ("http", "https")
                if (
                    v := os.environ.get(
                        k := f"{scheme}_proxy", os.environ.get(k.upper())
                    )
                )
            }
        )

        return session

    return wrapper

Now you can use set_proxies to decorate the earthaccess.Auth.get_session method after you log in (so that you have an authenticated Auth instance to get a session from):

earthaccess.login()
auth: earthaccess.Auth = earthaccess.__store__.auth
auth.get_session = cache(set_proxies(auth.get_session))

From here, any further earthaccess calls to open or download files will use the same requests session with the proxies set on the session by set_proxies.
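The env-var resolution inside set_proxies can be sanity-checked offline. A minimal sketch of the same dict comprehension in isolation (the proxy address is hypothetical):

```python
import os

# Deterministic environment for the demo: clear both cases, then set
# only the lowercase https_proxy variable.
for var in ("http_proxy", "HTTP_PROXY", "https_proxy", "HTTPS_PROXY"):
    os.environ.pop(var, None)
os.environ["https_proxy"] = "http://proxy.example:3128"

# Same resolution as set_proxies: lowercase first, uppercase fallback,
# and schemes without a configured proxy are omitted entirely.
proxies = {
    scheme: v
    for scheme in ("http", "https")
    if (v := os.environ.get(k := f"{scheme}_proxy", os.environ.get(k.upper())))
}
print(proxies)  # only schemes with a configured proxy appear
```

Note that the walrus assignment `k := ...` is evaluated before `k.upper()` because arguments are evaluated left to right, which is what makes the one-liner fallback work.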