pydap / pydap

A Python library implementing the Data Access Protocol (DAP, aka OPeNDAP).
https://pydap.github.io/pydap/
MIT License

Can't get an authenticated connection to work to a THREDDS server #411

Open JimFluke opened 1 day ago

JimFluke commented 1 day ago

I am trying to use authentication credentials to connect to our TDS. I first tried embedding the credentials in the URL, but I get this error:

url: https://fluke:d1ef3ce7e7c41de74192a362524ad0a460692a222d9dd796ee383b56e446d749%241%24d03ce0f88475505a68bd0eb37fa570df8120e59ccf62a4f580a55ad612f695c0e385893fe7205f7c181b221ab49bc817d4a33a2b2bb727fdc0ee3420e7e5b99e@gcin01.cira.colostate.edu/thredds/dodsC/cloudsat-data/2B-GEOPROF.P1_R05/2008/366/2008366031107_14239_CS_2B-GEOPROF_GRANULE_P1_R05_E02_F00.hdf

Traceback (most recent call last):
  File "/app/opendap_pydap_two.py", line 64, in <module>
    dataset = open_url(url)
              ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydap/client.py", line 68, in open_url
    handler = pydap.handlers.dap.DAPHandler(url, application, session, output_grid,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 71, in __init__
    self.make_dataset()
  File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 96, in make_dataset
    self.dataset_from_dap2()
  File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 109, in dataset_from_dap2
    pydap.net.raise_for_status(r)
  File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 38, in raise_for_status
    raise HTTPError(
webob.exc.HTTPError: 401 Unauthorized
<!doctype html><html lang="en"><head><title>HTTP Status 401 – Unauthorized</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 401 – Unauthorized</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Description</b> The request has not been applied to the target resource because it lacks valid authentication credentials for that resource.</p><hr class="line" /><h3>Apache Tomcat</h3></body></html>

But I understand this authentication method is from old documentation and will not work. So I have recently tried setting up a connection session:

from pydap.client import open_url
from pydap.cas.urs import setup_session

url = 'https://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf'

session = setup_session(username, password, check_url=url)
dataset = open_url(url, session=session, protocol='dap4')

With this result:

Traceback (most recent call last):
  File "/app/opendap_pydap.py", line 49, in <module>
    session = setup_session(username, password, check_url=url)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydap/cas/urs.py", line 25, in setup_session
    session = get_cookies.setup_session(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydap/cas/get_cookies.py", line 81, in setup_session
    response = soup_login(
               ^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydap/cas/get_cookies.py", line 144, in soup_login
    soup = BeautifulSoup(resp.content, "lxml")
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/bs4/__init__.py", line 250, in __init__
    raise FeatureNotFound(
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

This is an HDF4-EOS file being accessed from a THREDDS server, so the problem described in issue #401 will probably show up, but only after the code gets past this authentication problem.

Thanks!

Mikejmnez commented 1 day ago

@JimFluke thanks for reporting this issue.

It looks like you are missing lxml. Can you run pip install lxml and try again? Hopefully it is just a dependency issue.

EDIT: I recently moved beautifulsoup4 and lxml to be installed as extra dependencies (and not as required dependencies) to make pydap more lightweight. This may have caused some trouble with authentication. I will investigate and report back.
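A quick way to confirm whether the parser dependencies are present before calling setup_session is sketched below. The helper name missing_parser_deps is hypothetical and not part of pydap; it only checks that the bs4 and lxml modules named in the traceback can be imported.

```python
import importlib.util


def missing_parser_deps(modules=("bs4", "lxml")):
    """Return the names in `modules` that cannot be imported here.

    Hypothetical helper, not part of pydap: it only checks importability.
    """
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = missing_parser_deps()
if missing:
    # Suggest the same fix proposed in this thread.
    print("Missing:", ", ".join(missing))
    print("Try: pip install lxml beautifulsoup4")
else:
    print("bs4 and lxml are both importable")
```

If the list is empty, the FeatureNotFound error above should no longer come from a missing package, and any remaining failure lies elsewhere.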

Mikejmnez commented 1 day ago

Alternatively, you can try installing the complete server dependencies (as opposed to the minimal dependencies) via conda:

conda install pydap-server

Let me know if that works

JimFluke commented 1 day ago

@Mikejmnez This is what I get when I pip install the lxml package:

2024-11-07 21:23:36,466 INFO    __main__: url: https://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf

/usr/local/lib/python3.11/site-packages/pydap/cas/get_cookies.py:129: XMLParsedAsHTMLWarning: It looks like you're parsing an XML document using an HTML parser. If this really is an HTML document (maybe it's XHTML?), you can ignore or filter this warning. If it's XML, you should know that using an XML parser will be more reliable. To parse this document as XML, make sure you have the lxml package installed, and pass the keyword argument `features="xml"` into the BeautifulSoup constructor.
  soup = BeautifulSoup(resp.content, "lxml")
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/urllib3/connection.py", line 199, in _new_conn
    sock = connection.create_connection(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 85, in create_connection
    raise err
  File "/usr/local/lib/python3.11/site-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
OSError: [Errno 113] No route to host

I can get to the host with a browser from the same host I'm running the python script on, so I don't know why it's giving this error.

I'll try the conda install pydap-server method next.

JimFluke commented 1 day ago

But if I try the same thing with the dap2 protocol, it gives me this:

2024-11-07 22:08:49,782 INFO    __main__: url: https://gcin01.cira.colostate.edu/thredds/dodsC/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf

Traceback (most recent call last):
  File "/app/opendap_pydap.py", line 50, in <module>
    dataset = open_url(url, session=session, protocol=od_protocol)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydap/client.py", line 78, in open_url
    handler = pydap.handlers.dap.DAPHandler(
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 98, in __init__
    self.make_dataset()
  File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 134, in make_dataset
    self.dataset_from_dap2()
  File "/usr/local/lib/python3.11/site-packages/pydap/handlers/dap.py", line 178, in dataset_from_dap2
    raise_for_status(r)
  File "/usr/local/lib/python3.11/site-packages/pydap/net.py", line 37, in raise_for_status
    raise HTTPError(
webob.exc.HTTPError: 401 Unauthorized
<!doctype html><html lang="en"><head><title>HTTP Status 401 – Unauthorized</title><style type="text/css">body {font-family:Tahoma,Arial,sans-serif;} h1, h2, h3, b {color:white;background-color:#525D76;} h1 {font-size:22px;} h2 {font-size:16px;} h3 {font-size:14px;} p {font-size:12px;} a {color:black;} .line {height:1px;background-color:#525D76;border:none;}</style></head><body><h1>HTTP Status 401 – Unauthorized</h1><hr class="line" /><p><b>Type</b> Status Report</p><p><b>Description</b> The request has not been applied to the target resource because it lacks valid authentication credentials for that resource.</p><hr class="line" /><h3>Apache Tomcat</h3></body></html>

Again, the authentication works through the browser, so I'm still confused.

ndp-opendap commented 1 day ago

By the semantics of HTTP, a 401 Unauthorized response is an invitation for the client to resubmit the request with credentials, if it has them. I wonder: if the server that pyDAP is accessing uses a single sign-on service for authentication, then the URL that returns the 401 may not be the URL of the DAP service:

https://gcin01.cira.colostate.edu/thredds/dap4/cloudsat-data/2B-GEOPROF.P1_R05/2013/180/2013180111833_38146_CS_2B-GEOPROF_GRANULE_P1_R05_E06_F00.hdf

But rather the URL of the authentication service.

I see that pretty frequently as an issue, but I don't know how pyDAP does it.

It might be that the auth service URL should be passed into this call:

session = setup_session(username, password, check_url=url)

@Mikejmnez?
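To see what the server is actually asking for, one option is to send a request without credentials and look at the WWW-Authenticate header that accompanies the 401. The sketch below uses only the Python standard library; the function names are hypothetical, and it assumes the server sends a standard challenge header. Note that urllib follows redirects, so a server that redirects to a login page may answer 200 with no challenge at all.

```python
import urllib.error
import urllib.request


def challenge_scheme(header):
    """Return the first auth scheme named in a WWW-Authenticate header, or None."""
    if not header:
        return None
    parts = header.split(None, 1)
    return parts[0] if parts else None


def fetch_challenge(url, timeout=10):
    """Send an unauthenticated GET and return (status, WWW-Authenticate header)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.headers.get("WWW-Authenticate")
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses are raised as HTTPError but still carry headers.
        return err.code, err.headers.get("WWW-Authenticate")


# Usage (needs network access), with the DAP URL from this thread:
#   status, header = fetch_challenge(url)
#   print(status, challenge_scheme(header))
```

A scheme of Basic or Digest would at least indicate which kind of credentials the endpoint expects; it does not by itself reveal which URL should be passed as check_url.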

JimFluke commented 1 day ago

@Mikejmnez When I try this with conda install pydap-server I get the same results - with both dap2 and dap4 - as with adding lxml to the pip install. I'll look into the "auth service URL" and see what I find. Thanks!

Mikejmnez commented 1 day ago

Thanks @JimFluke that was useful - lxml needs to be included, but overall that does not fix your issue.

Like @ndp-opendap mentioned, we need to look at the auth process. I am not very familiar with this aspect, so I will need some time to look into it and test.

JimFluke commented 5 hours ago

@Mikejmnez @ndp-opendap That worked! I was eventually able to figure out what the check_url should be set to:

https://gcin01.cira.colostate.edu/thredds/restrictedAccess/DPCData

in my case. I got this from the Tomcat localhost_access_log.* file, by looking for the URL it was accessing when I logged in with the browser. I was expecting setup_session() to need my digested password, since I have the server configured to use those, but it requires my undigested password instead.

Thanks for all your help!
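For anyone landing here later, the working combination described above can be sketched roughly as follows. The environment variable names TDS_USERNAME and TDS_PASSWORD and both helper functions are hypothetical; the setup_session and open_url calls are the same ones used earlier in this thread, with check_url pointing at the restricted-access URL and the data URL using the dap2 (dodsC) endpoint.

```python
import os

# check_url found in the Tomcat access log, as described above.
CHECK_URL = "https://gcin01.cira.colostate.edu/thredds/restrictedAccess/DPCData"


def read_credentials(env=os.environ):
    """Read the plain (undigested) credentials from hypothetical env vars."""
    return env["TDS_USERNAME"], env["TDS_PASSWORD"]


def open_restricted(data_url, check_url=CHECK_URL):
    """Authenticate against check_url, then open data_url over dap2."""
    # Imported lazily so read_credentials() works without pydap installed.
    from pydap.cas.urs import setup_session
    from pydap.client import open_url

    username, password = read_credentials()
    session = setup_session(username, password, check_url=check_url)
    return open_url(data_url, session=session, protocol="dap2")


# Usage (needs network access and valid credentials):
#   dataset = open_restricted(url)  # url is the .../thredds/dodsC/... address
```

The check_url value is specific to this server; other THREDDS installations will likely use a different restricted-access path.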

ndp-opendap commented 5 hours ago

Nice work @JimFluke - It's a lot easier when the SSO is made a more visible part of the recipe. NASA's Earthdata Login requires a similar invocation, but NASA makes a big deal about documenting EDL and how to use it.

Mikejmnez commented 3 hours ago

@JimFluke Great news!

JimFluke commented 3 hours ago

But it only works with dap2. With dap4 I get the same "No route to host" error I got before.
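The cause of the "No route to host" error is not established in this thread. One way to narrow it down, sketched below under that caveat, is to try a plain TCP connection to every address the hostname resolves to from the machine (or container) running the script. The probe function is a hypothetical helper built on the standard socket module, and port 443 is assumed because the URLs use https.

```python
import socket


def probe(host, port, timeout=5.0):
    """Try a TCP connection to every address `host` resolves to.

    Returns a list of (address, error) pairs; error is None on success.
    """
    results = []
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
        host, port, type=socket.SOCK_STREAM
    ):
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            results.append((sockaddr[0], None))
        except OSError as err:
            # Keep the message (e.g. "No route to host") for later inspection.
            results.append((sockaddr[0], str(err)))
    return results


# Usage (needs network access):
#   for address, error in probe("gcin01.cira.colostate.edu", 443):
#       print(address, error or "ok")
```

If some addresses connect and others fail, the difference between the browser and the script may come down to which address or network path each one uses.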