Open JimFluke opened 3 weeks ago
@JimFluke thanks for reporting this issue.
It looks like you are missing lxml. Can you try pip install lxml and try again? Hopefully it is just a dependency issue.
EDIT:
I recently moved beautifulsoup4 and lxml to be installed as extra dependencies (and not as required dependencies) to make pydap more lightweight. This may have caused some trouble with authentication. Will investigate and report.
Alternatively, you can try installing the complete server dependencies (as opposed to the minimal dependencies) via conda:
conda install pydap-server
Let me know if that works
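A quick way to confirm whether the two optional packages are visible to the interpreter that runs the script. This is only a sketch using the standard library; the helper name missing_modules is made up for illustration, and bs4 is the import name of the beautifulsoup4 package.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Check the two packages that became optional extras.
missing = missing_modules(["lxml", "bs4"])
if missing:
    print("Missing optional dependencies:", ", ".join(missing))
else:
    print("lxml and bs4 are both importable")
```

Running this inside the same container or environment as the failing script avoids checking the wrong Python installation.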
@Mikejmnez This is what I get when I pip install the lxml package:
2024-11-07 21:23:36,466 INFO __main__: url: https://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf
/usr/local/lib/python3.11/site-packages/pydap/cas/get_cookies.py:129: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
soup = BeautifulSoup(resp.content, "lxml")
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 199, in _new_conn
sock = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 113] No route to host
I can get to the host with a browser from the same host I'm running the python script on, so I don't know why it's giving this error.
I'll try the conda install pydap-server method next.
But if I try the same thing with the dap2 protocol, it gives me this:
2024-11-07 22:08:49,782 INFO __main__: url: https://gcin01.cira.colostate.edu/thredds/dodsC/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf
Traceback (most recent call last):
File "/app/opendap_pydap.py", line 50, in <module>
dataset = open_url(url, session=session, protocol=od_protocol)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/client.py", line 78, in open_url
handler = pydap.handlers.dap.DAPHandler(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 98, in __init__
self.make_dataset()
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 134, in make_dataset
self.dataset_from_dap2()
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 178, in dataset_from_dap2
raise_for_status(r)
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 37, in raise_for_status
raise HTTPError(
webob.exc.HTTPError: 401 Unauthorized
<!doctype html><html lang="en"><head><title>HTTP Status 401 – Unauthorized</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 401 – Unauthorized</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Description</b> The request has not been applied to the target resource because it lacks valid authentication credentials for that resource.</p><hr class="line" /><h3>Apache Tomcat</h3></body></html>
Again, the authentication works through the browser, so I'm still confused.
The semantics of HTTP 401 Unauthorized include that the 401 error is an invitation for the client to resubmit the request with credentials if the client has them. I wonder - if the server that pyDAP is accessing is using a Single Sign-on Service for authentication, then the URL which returns the 401 may not be the same URL as the DAP service:
https://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf
But rather the URL of the authentication service.
I see that pretty frequently as an issue, but I don't know how pyDAP does it.
It might be the auth service URL could/would be passed into this call:
session = setup_session(username, password, check_url=url)
@Mikejmnez ?.
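The idea above, keeping the authentication URL separate from the data URL, could be sketched roughly as follows. The wrapper name open_with_check_url is hypothetical, and the two pydap callables are passed in as arguments because this thread does not show which module setup_session is imported from. The call shapes are copied from the snippets quoted in this thread.

```python
def open_with_check_url(url, check_url, username, password,
                        setup_session, open_url, protocol="dap2"):
    """Authenticate against check_url, then open the dataset at url.

    setup_session and open_url are expected to behave like the pydap
    functions of the same names used elsewhere in this thread.
    """
    # Log in against the authentication endpoint, not the data URL.
    session = setup_session(username, password, check_url=check_url)
    # Reuse the authenticated session for the actual DAP request.
    return open_url(url, session=session, protocol=protocol)
```

To use it, pass in pydap's own setup_session and open_url (the tracebacks show open_url living in pydap/client.py) together with the data URL and whatever URL the server's login flow actually hits.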
@Mikejmnez When I try this with conda install pydap-server I get the same results - with both dap2 and dap4 - as with adding lxml to the pip install. I'll look into the "auth service URL" and see what I find. Thanks!
Thanks @JimFluke that was useful - lxml needs to be included, but overall that does not fix your issue.
Like @ndp-opendap mentioned, we need to look at the auth process. I am not very familiar with this aspect, so I will need some time to look at it and test.
@Mikejmnez @ndp-opendap That worked! I was eventually able to figure out what the check_url should be set to:
https://gcin01.cira.colostate.edu/thredds/restrictedAccess/DPCData
in my case. I got this from looking at the tomcat localhost_access_log.* file for the URL it was accessing when I was logging in with the browser. I was expecting setup_session() to need my digested password since I have the server configured to use those, but it requires my undigested password instead.
Thanks for all your help!
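For anyone else hunting for the right check_url, the access-log approach described above can be roughly automated. This is a sketch only: it assumes the log uses Tomcat's "common" pattern (host, ident, user, [time], "request", status, bytes), the helper name requested_paths is made up, and the sample lines are invented for illustration rather than taken from a real log.

```python
import re
from collections import Counter

# Matches the quoted request and the status code of a common-format line.
LINE_RE = re.compile(r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" (?P<status>\d{3})\b')


def requested_paths(lines):
    """Count (path, status) pairs, ignoring query strings."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            path = match.group("path").split("?", 1)[0]
            counts[(path, match.group("status"))] += 1
    return counts


# Invented sample lines in the assumed format.
sample = [
    '192.0.2.10 - - [07/Nov/2024:21:20:01 +0000] "GET /thredds/restrictedAccess/DPCData HTTP/1.1" 401 1024',
    '192.0.2.10 - jim [07/Nov/2024:21:20:05 +0000] "GET /thredds/restrictedAccess/DPCData HTTP/1.1" 302 -',
    '192.0.2.10 - jim [07/Nov/2024:21:20:06 +0000] "GET /thredds/catalog/catalog.html HTTP/1.1" 200 5120',
]
for (path, status), n in sorted(requested_paths(sample).items()):
    print(status, n, path)
```

Paths that show a 401 followed by a successful response during a browser login are reasonable candidates for check_url.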
Nice work @JimFluke - It's a lot easier when the SSO is made a more visible part of the recipe. NASA's Earth Data Login requires similar invocation, but NASA makes a big deal about documenting EDL and how to use it.
@JimFluke Great news!
But, it only works with dap2. With dap4 I get the same No route to host error I got before.
And this happens when you use:
url = 'https://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf'
or
url = 'dap4://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf'
???
@ndp-opendap The dap4 instead of https is effectively the same as specifying protocol='dap4' as argument. I am certain that that is how it is being used based on the original comment:
session = setup_session(username, password, check_url=url)
dataset = open_url(url, session=session, protocol='dap4')
It is very odd that you get two different behaviors with dap2 and dap4, because the check_url = "https://gcin01.cira.colostate.edu/thredds/restrictedAccess/DPCData" has no indication of dodsC or dap4 there... From pydap's perspective the auth is going through the same function, independent of dap2 or dap4. I wonder if it may be a TDS thing, since the URLs for DAP2 and DAP4 differ (as opposed to Hyrax, where the URL to the data is exactly the same).
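To make the URL difference concrete: in the URLs posted in this thread, the DAP2 and DAP4 endpoints differ only in the service segment after /thredds/ (dodsC versus dap4). A small sketch of switching between them, assuming that pattern holds for the server in question; the helper name swap_tds_service is made up.

```python
def swap_tds_service(url, old="dodsC", new="dap4"):
    """Replace the TDS service segment that follows /thredds/ in a data URL."""
    marker = f"/thredds/{old}/"
    if marker not in url:
        raise ValueError(f"{marker!r} not found in {url!r}")
    # Only the first occurrence is the service segment.
    return url.replace(marker, f"/thredds/{new}/", 1)


dap2_url = (
    "https://gcin01.cira.colostate.edu/thredds/dodsC/cloudsat-data/"
    "2B-GEOPROF.P1_R05/2013/180/"
    "2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf"
)
print(swap_tds_service(dap2_url))
```

The check_url is unaffected by this swap, which is why the same session setup would be expected to work for either service.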
@ndp-opendap Substituting in dap4:// for the web protocol did not make any difference. Here is the full exception traceback:
2024-11-11 16:15:39,495 INFO __main__: url: dap4://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf
2024-11-11 16:15:39,495 INFO __main__: check_url: https://gcin01.cira.colostate.edu/thredds/restrictedAccess/DPCData
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 199, in _new_conn
sock = connection.create_connection(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
raise err
File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
sock.connect(sa)
OSError: [Errno 113] No route to host
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 495, in _make_request
conn.request(
File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 441, in request
self.endheaders()
File "/usr/local/lib/python3.11/http/client.py", line 1298, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/usr/local/lib/python3.11/http/client.py", line 1058, in _send_output
self.send(msg)
File "/usr/local/lib/python3.11/http/client.py", line 996, in send
self.connect()
File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 279, in connect
self.sock = self._new_conn()
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 214, in _new_conn
raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fab500e2310>: Failed to establish a new connection: [Errno 113] No route to host
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 843, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 519, in increment
raise MaxRetryError(_pool, url, reason) from reason # type: ignore[arg-type]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='gcin01.cira.colostate.edu', port=80): Max retries exceeded with url: /thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf.dmr (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fab500e2310>: Failed to establish a new connection: [Errno 113] No route to host'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/opendap_pydap.py", line 57, in <module>
dataset = open_url(url, session=session, protocol=od_protocol)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/client.py", line 78, in open_url
handler = pydap.handlers.dap.DAPHandler(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 98, in __init__
self.make_dataset()
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 132, in make_dataset
self.dataset_from_dap4()
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 148, in dataset_from_dap4
r = GET(
^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 26, in GET
response = follow_redirect(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 78, in follow_redirect
req = create_request(url, session=session, timeout=timeout, verify=verify)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 125, in create_request
return create_request_from_session(url, session, timeout=timeout, verify=verify)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 139, in create_request_from_session
session.head(url, allow_redirects=True, timeout=timeout, verify=verify)
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 624, in head
return self.request("HEAD", url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 700, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='gcin01.cira.colostate.edu', port=80): Max retries exceeded with url: /thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf.dmr (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fab500e2310>: Failed to establish a new connection: [Errno 113] No route to host'))
I know it's long, but I think I need to include it to show you something else that is confusing me. The top traceback - from the original exception - does not start with the open_url() call in my script. The last one does but not the first one. So maybe that's a clue?
@Mikejmnez I don't know why the TDS would be different, but it sure seems like it is.
I have now managed to notice that it is trying to connect through port 80! For both 'https://' and 'dap4://'. When I specify port 443 it still doesn't work, but I get a different error:
2024-11-11 16:50:06,283 INFO __main__: url: dap4://gcin01.cira.colostate.edu:443/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf
2024-11-11 16:50:06,283 INFO __main__: check_url: https://gcin01.cira.colostate.edu:443/thredds/restrictedAccess/DPCData
2024-11-11 16:50:06,283 INFO __main__: od_protocol: dap4
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 536, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 507, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/http/client.py", line 1395, in getresponse
response.begin()
File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/http/client.py", line 286, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/socket.py", line 718, in readinto
return self._sock.recv_into(b)
^^^^^^^^^^^^^^^^^^^^^^^
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 667, in send
resp = conn.urlopen(
^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 843, in urlopen
retries = retries.increment(
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/util/retry.py", line 474, in increment
raise reraise(type(error), error, _stacktrace)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/util/util.py", line 38, in reraise
raise value.with_traceback(tb)
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 789, in urlopen
response = self._make_request(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connectionpool.py", line 536, in _make_request
response = conn.getresponse()
^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 507, in getresponse
httplib_response = super().getresponse()
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/http/client.py", line 1395, in getresponse
response.begin()
File "/usr/local/lib/python3.11/http/client.py", line 325, in begin
version, status, reason = self._read_status()
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/http/client.py", line 286, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/socket.py", line 718, in readinto
return self._sock.recv_into(b)
^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/app/opendap_pydap.py", line 59, in <module>
dataset = open_url(url, session=session, protocol=od_protocol)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/client.py", line 78, in open_url
handler = pydap.handlers.dap.DAPHandler(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 98, in __init__
self.make_dataset()
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 132, in make_dataset
self.dataset_from_dap4()
File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 148, in dataset_from_dap4
r = GET(
^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 26, in GET
response = follow_redirect(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 78, in follow_redirect
req = create_request(url, session=session, timeout=timeout, verify=verify)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 125, in create_request
return create_request_from_session(url, session, timeout=timeout, verify=verify)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 139, in create_request_from_session
session.head(url, allow_redirects=True, timeout=timeout, verify=verify)
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 624, in head
return self.request("HEAD", url, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
resp = self.send(prep, **send_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/sessions.py", line 703, in send
r = adapter.send(request, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/requests/adapters.py", line 682, in send
raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
And now I see that port 443 is being specified for the check_url. I'll see what happens without that.
Leaving out the port :443 string for the check_url does not make any difference.
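On the port question: the earlier exception names HTTPConnectionPool with port=80, which points at a plain-HTTP request rather than HTTPS. A small standard-library sketch for checking which port a URL string implies before it is handed to the client; the helper name effective_port is made up, and it says nothing about how pydap itself rewrites a dap4:// URL.

```python
from urllib.parse import urlsplit

# Well-known defaults; dap4 is not a registered scheme, so it has none here.
DEFAULT_PORTS = {"http": 80, "https": 443}


def effective_port(url):
    """Return the explicit port if present, else the scheme default, else None."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)


for candidate in (
    "https://gcin01.cira.colostate.edu/thredds/restrictedAccess/DPCData",
    "dap4://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/",
    "dap4://gcin01.cira.colostate.edu:443/thredds/dap4/cloudsat-data/",
):
    print(effective_port(candidate), candidate)
```

A result of None for a dap4:// URL means the string itself carries no port information, so the port actually used depends on how the client maps that scheme to HTTP or HTTPS.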
@Mikejmnez @ndp-opendap When I went to upgrade from the thredds-docker:5.4 image to the 5.5 image I saw that we never had the dap4 service enabled, so it's no surprise that it did not work for me. Sorry for the red herring.
Note that adding dap4 does not work well for us. It ignores our authentication configuration. At least when using the website. And I still can't get it to work from Python.
@JimFluke - thanks for the heads up. I have not had much time to look at the authentication issue with pydap and Thredds. I think it makes sense to stick with DAP2 for now, as I have come to understand that Thredds has focused more on DAP2 than DAP4 in the past. Full disclosure: neither @ndp-opendap nor I are well versed in Thredds, so the Thredds, DAP4 and authentication issue is taking us a bit of time. We are developers of Hyrax, the OPeNDAP server developed and maintained by OPeNDAP, Inc., and through many years of working with NASA, the Hyrax data server has focused more on DAP4.
That said - we are working closely with the Unidata folks to offer better pydap support/access to Thredds with DAP4.
I am trying to use authentication credentials to connect to our TDS. I have tried embedding the credentials into the url, but I get this error:
But I understand this authentication method is from old documentation and will not work. So I have recently tried setting up a connection session:
With this result:
This is an HDF4-EOS file being accessed from a THREDDS server, so the problem described in issue #401 will probably show up, but only after the code gets past this authentication problem.
Thanks!