maawoo opened this issue 7 months ago
It's really hard to say what's going on here without knowing more about your university HPC system. Based on the error, it looks like VSCode is somehow involved?
https://vscode-remote+ssh-002dremote-002bdraco2.vscode-resource.vscode-cdn.net/search/granules.umm_json
Can you provide some more detail on how VSCode is involved in your workflow? host='cmr.earthdata.nasa.gov' indicates that earthaccess is at least attempting to talk to the correct host, and the Requests library seems to agree!
Hi @mfisher87, I overlooked that, so thanks for pointing it out. However, I still get an error when executing the code outside of VSCode. Same error also in a clean environment with Python 3.11.8 instead of 3.12.2.

I also tried downgrading the package (to 0.7.0) and noticed that it prints out the number of granules found before the error:
>>> earthaccess.search_data(
... short_name='GEDI02_A',
... bounding_box=(31.52,-25.08,31.64,-24.99),
... temporal=("2019-01-01", "2024-01-01"),
... count=-1
... )
Granules found: 92
Traceback (most recent call last):
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/connection.py", line 203, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/home/du23yow/micromamba/envs/woody_env/lib/python3.12/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:
...
Any other ideas of what I could do?
Okay, I found the explanation in this icepyx discussion. Ping @betolink 🙂 Any suggestion on using earthaccess.search_data and earthaccess.download with an updated requests session?
Hi @maawoo, I think this could be resolved if we let users pass the proxy settings to requests. In the meantime, you can manually get a session, modify it, and use it to get the files, but that somewhat defeats the purpose!
import os
import shutil
from itertools import chain  # to flatten the results

import earthaccess

earthaccess.login()

# Define your proxy
proxy = {
    'http': 'http://your_proxy_address:port',
    'https': 'https://your_proxy_address:port'
}

results = earthaccess.search_data(
    short_name='GEDI02_A',
    bounding_box=(31.52, -25.08, 31.64, -24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1
)

# Flatten the per-granule lists of data links into a single list of URLs
links = list(chain.from_iterable([r.data_links() for r in results]))

# Reuse the authenticated session earthaccess already created and attach
# the proxy settings to it
session = earthaccess.get_requests_https_session()
session.proxies.update(proxy)

os.makedirs("temp_dir", exist_ok=True)

for url in links:
    local_filename = url.split("/")[-1]
    path = f"temp_dir/{local_filename}"
    with session.get(
        url,
        stream=True,
        allow_redirects=True,
    ) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            # Stream the response to disk in 1 MiB chunks
            shutil.copyfileobj(r.raw, f, length=1024 * 1024)
This is not concurrent, so there is room for improvement. As I said, we should implement proxy support in earthaccess itself, but my guess is that it won't be ready in the next week.
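In case it helps, here is a rough sketch of a concurrent variant of that loop (just an illustration, not earthaccess functionality; it reuses the proxy-patched session and the links list from the snippet above and writes into the same temp_dir directory):

import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

out_dir = Path("temp_dir")
out_dir.mkdir(exist_ok=True)

def fetch(url: str) -> Path:
    # Stream one granule to disk through the shared, proxy-aware session
    path = out_dir / url.split("/")[-1]
    with session.get(url, stream=True, allow_redirects=True) as r:
        r.raise_for_status()
        with open(path, "wb") as f:
            shutil.copyfileobj(r.raw, f, length=1024 * 1024)
    return path

# A handful of worker threads is usually enough; tune max_workers as needed
with ThreadPoolExecutor(max_workers=4) as pool:
    for downloaded in pool.map(fetch, links):
        print(f"Downloaded {downloaded}")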
Thank you for the possible workaround!

> my guess is that it won't be ready in the next week

No worries! I already have the data I need. My plan was to integrate earthaccess into some scripts, but that can wait for now.
The requests library makes use of urllib, and urllib recognizes environment variables of the form <scheme>_proxy (either uppercase or lowercase). Therefore, you should be able to simply set the environment variable https_proxy or HTTPS_PROXY to the appropriate value.

However, whether or not those env vars are used is determined by the boolean value of trust_env on the requests.Session object used for making requests. By default, trust_env is True, and the env vars for proxies are used, but if trust_env is False, they are not used. Thus there might be situations in which earthaccess will not use the env vars, because there are situations where it sets trust_env to False.

I suggest attempting to export your https_proxy env var appropriately, and retrying your example to see if that works for you.
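For example, something along these lines may work (subject to the trust_env caveat above; the proxy URL below is only a placeholder for whatever your HPC system actually uses, and you could equally export the variables in your shell before starting Python):

import os

# Placeholder proxy URL; substitute your site's real proxy host and port
os.environ["HTTPS_PROXY"] = "http://proxy.example.edu:3128"
os.environ["HTTP_PROXY"] = "http://proxy.example.edu:3128"

import earthaccess

earthaccess.login()
results = earthaccess.search_data(
    short_name="GEDI02_A",
    bounding_box=(31.52, -25.08, 31.64, -24.99),
    temporal=("2019-01-01", "2024-01-01"),
    count=-1,
)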
@maawoo, if you want a workaround until we can come up with a robust and secure solution, here's something based upon the thread from https://github.com/nsidc/earthaccess/pull/823. This is pulled from a combination of code from a few comments in that PR, and some minor renaming/refactoring.
First, define a set_proxies function:
import os
from functools import cache, wraps
from typing import Callable

from typing_extensions import ParamSpec

import earthaccess
import requests

P = ParamSpec("P")


def set_proxies(f: Callable[P, requests.Session]) -> Callable[P, requests.Session]:
    """Wrap a session factory so every session it returns has its proxies set
    from the http_proxy/https_proxy (or HTTP_PROXY/HTTPS_PROXY) env vars."""

    @wraps(f)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> requests.Session:
        session = f(*args, **kwargs)
        session.proxies.update(
            {
                # Prefer the lowercase env var, fall back to the uppercase one,
                # and skip schemes for which neither is set
                scheme: v
                for scheme in ("http", "https")
                if (
                    v := os.environ.get(
                        k := f"{scheme}_proxy", os.environ.get(k.upper())
                    )
                )
            }
        )
        return session

    return wrapper
Now you can use set_proxies to decorate the earthaccess.Auth.get_session method after you login (so that you can get an authenticated Auth instance to get a session from):
earthaccess.login()

# Grab the authenticated Auth instance and patch its session factory so every
# session it hands out has the proxies applied (cached so the same session is
# reused across calls)
auth: earthaccess.Auth = earthaccess.__store__.auth
auth.get_session = cache(set_proxies(auth.get_session))
From here, any further earthaccess calls to open or download files will use the same requests session with the proxies set on the session by set_proxies.
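For example, reusing the results list from the GEDI search earlier in this thread (just a sketch; "temp_dir" is a placeholder output directory):

# Downloads now go through sessions produced by the patched auth.get_session,
# so they carry the proxies that set_proxies read from the environment
files = earthaccess.download(results, "temp_dir")
print(f"Downloaded {len(files)} files")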
I'm trying to download GEDI data on my university's HPC system. The following sample code results in a ConnectionError:

My initial thought was that the API is not whitelisted in our HTTP/HTTPS proxies, which are set via environment variables. However, according to our sysadmin this should not be an issue. I was able to confirm by requesting the same URL via curl:
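A rough Python equivalent of that kind of direct check against the CMR search endpoint (the query parameters here are only illustrative, not the exact curl command that was used):

import requests

# Hit the CMR granule search endpoint directly to verify basic connectivity;
# short_name and page_size are illustrative query parameters
r = requests.get(
    "https://cmr.earthdata.nasa.gov/search/granules.umm_json",
    params={"short_name": "GEDI02_A", "page_size": 1},
    timeout=30,
)
print(r.status_code)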
Any ideas / workarounds would be appreciated!