pydata / xarray

N-D labeled arrays and datasets in Python
https://xarray.dev
Apache License 2.0

Use xarray.open_dataset() for password-protected Opendap files #1068

Closed jenfly closed 6 years ago

jenfly commented 7 years ago

I've been using xarray.open_dataset() to read Opendap netcdf files from NASA's MERRA-2 data archive. Recently they changed their site so that now you must enter a username and password to read any files. They describe here how to access data with Pydap: http://disc.sci.gsfc.nasa.gov/registration/registration-for-data-access#python.

I experimented with a similar approach (adding username and password to the url) with xarray.open_dataset() and specifying engine='pydap', but no luck. Is there a way to use xarray.open_dataset() to read password-protected Opendap files? Thanks!

shoyer commented 7 years ago

If you write engine='pydap' in open_dataset, the URL should be passed directly on to pydap, but you'll still need to follow all of their other instructions. If you're getting an error message from xarray, let us know but otherwise I'm at a loss -- you should check with the folks at NASA.

jenfly commented 7 years ago

Thanks very much for your reply! I still get an error from xarray when I use the engine='pydap' option. Here's a minimum (almost) working example (almost because you need an account with the server so you can substitute your username/password into the url string):

import xarray
from pydap.client import open_url

url = 'http://<username>:<password>@goldsmr5.sci.gsfc.nasa.gov/opendap/MERRA2/M2I3NPASM.5.12.4/1986/01/MERRA2_100.inst3_3d_asm_Np.19860101.nc4'

ds1 = open_url(url)    # Works but data isn't in xarray format
ds2 = xarray.open_dataset(url, engine='pydap')    # Error message, see attached

I've attached the error message here -- error_msg.txt. I don't know enough about the inner workings of xarray to trace through it. Please let me know if any of this means anything to you and has a reasonably easy fix or workaround. Thank you!

shoyer commented 7 years ago

If the dataset has a "time" dimension, try accessing the first few values. Can you view them in pydap? Xarray's open_dataset does a little more work than pydap's open_url, insofar as it actually downloads some array data.

jenfly commented 7 years ago

Ah, I see. Thanks for the suggestion. Using Pydap I'm able to see all the variables and their metadata, so I thought it was working, but when I try to actually access the data values, I get the same error message as from Xarray. The issue must be something unrelated to Xarray -- I'll keep investigating. Thanks for your help!
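The distinction above (metadata loads, but reading values fails) can be checked with a small probe. This is a minimal sketch: `can_read_values` is a hypothetical helper, not part of pydap or xarray, and it only assumes that indexing a variable is what triggers the actual data request, as the pydap traceback later in this thread suggests.

```python
def can_read_values(dataset, name="time", count=3):
    """Return True if the first few values of a variable can be fetched.

    `dataset` is anything that supports dataset[name][:count], for example
    the object returned by pydap's open_url. Hypothetical helper.
    """
    try:
        dataset[name][:count]  # indexing is what forces a data request
    except Exception as err:  # broad on purpose: this is only a diagnostic
        print(f"metadata loaded, but reading {name!r} failed: {err}")
        return False
    return True


class _FailingVariable:
    """Stand-in for a remote variable whose data request is rejected."""

    def __getitem__(self, index):
        raise RuntimeError("302 Found")


# Local stand-ins so the sketch runs without a server or credentials.
ok_result = can_read_values({"time": [0, 180, 360, 540]})
bad_result = can_read_values({"time": _FailingVariable()})
```

If the probe fails on the pydap object itself, the problem is upstream of xarray.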

j08lue commented 7 years ago

@jenfly did you find a way to make opendap authentication work with xarray? It might be worthwhile posting it here, even though the issue has to do with the backends.

jenfly commented 7 years ago

@j08lue no, not yet. I've been in touch with the folks at NASA who run the server, but their suggestions didn't work for me and I haven't had time to keep troubleshooting. I will need to sort out this issue in the next couple of months to get some data that I need, so if/when I ever resolve it, I'll post the solution here.

jenfly commented 7 years ago

I've finally found something useful online and am able to use Pydap to open these files -- hoping someone can help me find a way to integrate this into an xarray.open_dataset() function call and then I will be a very happy camper!

Turns out much of the info posted by NASA online is out of date and based on a different implementation of Pydap than what is actually being used currently (argh). Here is something that actually works, from http://www.pydap.org/en/latest/client.html#urs-nasa-earthdata:

from pydap.client import open_url
from pydap.cas.urs import setup_session

url = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2016/06/MERRA2_400.tavg1_2d_slv_Nx.20160601.nc4'

session = setup_session(username, password)
dataset = open_url(url, session=session)

where I've assigned the username and password variables with the appropriate values in another function.

I've tested this and it is working, but I would prefer to do things within Xarray since all my code is already using it. Just for fun, I tried ds = xarray.open_dataset(url, engine='pydap', session=session), to see if the extra keyword would be magically sent to the pydap engine, but got an error message. Is there a way to incorporate this functionality into xarray.open_dataset? Thank you so much for any assistance!

rabernat commented 7 years ago

Hi @jenfly, it's great to see that you have tracked down this root issue! I agree we should be able to support direct access to these sort of opendap resources within xarray. It should not be too tricky to implement, and in fact, if you are interested, it could be a great opportunity for you to open a pull request and become directly involved in the project. We would be very happy to gain another contributor.

You can see the line where pydap.open_url gets called here: https://github.com/pydata/xarray/blob/master/xarray/backends/pydap_.py#L64

We just need a mechanism to pass the username and password from open_dataset to the pydap backend. There are two possible options I see:

  1. we could add new username and password keyword args to open_dataset. This is the most straightforward, but open_dataset already has a ton of arguments, so maybe it is not ideal.
  2. we could parse out the username and password from a url like https://username:password@... within the pydap backend.

It would be good to get some other opinions on which approach would be preferable.
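Option 2 could be sketched roughly as follows, using only the standard library. `split_credentials` is a hypothetical name, and nothing here reflects how xarray or pydap actually handle URLs; it only illustrates pulling `username:password@` out of a URL and returning the URL without them.

```python
from urllib.parse import unquote, urlsplit, urlunsplit


def split_credentials(url):
    """Return (url_without_credentials, username, password).

    Hypothetical helper: username and password are None when the URL
    carries no user-info part.
    """
    parts = urlsplit(url)
    username = unquote(parts.username) if parts.username else None
    password = unquote(parts.password) if parts.password else None
    # Keep everything after the last "@" so host and port stay untouched.
    host = parts.netloc.rpartition("@")[2]
    clean = urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))
    return clean, username, password


# Placeholder host and credentials, for illustration only.
example = split_credentials("https://alice:s%40cret@example.invalid:443/opendap/file.nc4")
```

Percent-encoded characters in the password are decoded, so a password containing "@" or ":" has to be percent-encoded in the URL for this to work.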

jenfly commented 7 years ago

Thanks, @rabernat! I'd be happy to try implementing this in the project. I'm a newbie when it comes to contributing to big projects like this (so far I've just used Github for my own little projects) so I might have some naive questions as I figure out how things work.

The two options you mentioned for passing username and password info to open_dataset both sound good to me. I don't have any strong preference between them. How do I get other opinions on which approach to use? Should I start a new issue thread?

Also, I realized that there is another hiccup along the way. When I try to specify engine='pydap' in open_dataset, I get the same error message as mentioned in #1174, that the object has no attribute iteritems. When I wrote the first post in this thread, back in October, I was able to use engine='pydap' without any problems. This seems to be related to recent upstream changes in Pydap: https://github.com/pydap/pydap/issues/43 and I presume might require more substantial changes either in Xarray or Pydap so that they can work together again. Any thoughts on how to handle this?

shoyer commented 7 years ago

Parsing username/password from the URL would be very easy to add.

We need to figure out a solution for the proliferating arguments on open_dataset before we add many more, so I would prefer that for now.

Another option is to add session as an argument on xarray.backends.PydapDataStore, and encourage passing PydapDataStore objects into xarray.open_dataset for extra customizability, e.g.,

store = xarray.backends.PydapDataStore(url, session)
ds = xarray.open_dataset(store)

shoyer commented 7 years ago

Pydap has a new v3.2 release, but it still needs some fixes to work with xarray -- or xarray needs to be updated to work with the new version of pydap. I think https://github.com/pydap/pydap/pull/48 once merged would probably be enough to restore xarray compatibility.

laliberte commented 7 years ago

I like the idea of passing PydapDataStore objects that include the session object. It seems more likely to be forward compatible, especially if Central Authentication Services multiply (as one would expect) with different authentication mechanisms.

jenfly commented 7 years ago

I also like the idea of passing PydapDataStore objects that include the session object. Delving deeper into the pydap authentication, I found that there are already several different setup_session functions available to create the session object, corresponding to different authentication procedures (pydap.cas.get_cookies.setup_session, pydap.cas.urs.setup_session, pydap.cas.esgf.setup_session) as well as additional arguments to setup_session beyond username and password. Best to deal with all this separately with pydap rather than trying to embed it within xarray.
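To illustrate why keeping authentication on the pydap side scales better, here is a minimal sketch that picks one of the three `setup_session` functions named above by a short key. The keys and the `load_setup_session` helper are assumptions made for this example; the module paths are the ones listed in the comment, and each function takes its own arguments, which this sketch does not try to unify.

```python
import importlib

# Module paths as listed in the comment above; the short keys are made up here.
SESSION_MODULES = {
    "urs": "pydap.cas.urs",
    "esgf": "pydap.cas.esgf",
    "get_cookies": "pydap.cas.get_cookies",
}


def load_setup_session(kind):
    """Return the setup_session function for one authentication scheme.

    Hypothetical helper. pydap is imported lazily, so it is only needed
    once a scheme is actually requested.
    """
    try:
        module_name = SESSION_MODULES[kind]
    except KeyError:
        raise ValueError(f"unknown authentication scheme: {kind!r}") from None
    return importlib.import_module(module_name).setup_session
```

A caller would then build the session with the returned function and hand the resulting session to pydap, leaving xarray unaware of which scheme was used.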

I'm still having problems trying to get xarray.open_dataset to work with pydap. Using the latest commit on pydap/master (in which https://github.com/pydap/pydap/pull/48 is merged) I'm now getting a new error: AttributeError: '<class 'pydap.model.BaseType'>' object has no attribute 'encode'. When I have some time, I'll look into it further and try to see what else is needed to restore compatibility.

shoyer commented 7 years ago

> I'm still having problems trying to get xarray.open_dataset to work with pydap. Using the latest commit on pydap/master (in which pydap/pydap#48 is merged) I'm now getting a new error: AttributeError: '<class 'pydap.model.BaseType'>' object has no attribute 'encode'. When I have some time, I'll look into it further and try to see what else is needed to restore compatibility.

Indeed, it would be great if someone using pydap could take a look into this. You can find our logic for interoperating with pydap here: https://github.com/pydata/xarray/blob/master/xarray/backends/pydap_.py

laliberte commented 7 years ago

@shoyer @jenfly: Good news, I think I was able to track down the bug in pydap that was preventing compatibility. I'm putting a PR together and we could expect it to be merged pretty soon into the master. I wanted to give you a heads up so that you don't waste more time on this.

jenfly commented 7 years ago

Awesome, thanks so much @laliberte!

laliberte commented 7 years ago

@jenfly and @shoyer pydap version 3.2.2 (newly released last week) should have fixed this issue. Could you verify?

shoyer commented 7 years ago

I spent a few minutes on this but am still getting AttributeError. It would be great if someone could put some time into debugging this. Should be as simple as installing pydap (in both python 2 and 3 virtual/conda environments) and getting py.test -k PydapTest to pass.

shoyer commented 7 years ago

Never mind, I figured it out (I was using an old version of pydap by mistake). See #1439 for the pydap fix.
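Since the confusion here came from an outdated install, a quick version check before debugging can save time. This is a minimal sketch using `importlib.metadata` (Python 3.8+); `parse_version` and `installed_at_least` are hypothetical helpers, and the 3.2.2 threshold is simply the release mentioned above.

```python
from importlib.metadata import PackageNotFoundError, version
from itertools import takewhile


def parse_version(text):
    """Turn '3.2.2' into (3, 2, 2); stops at the first non-numeric piece."""
    numbers = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        numbers.append(int(digits))
    return tuple(numbers)


def installed_at_least(package, minimum):
    """Return True if `package` is installed with a version >= `minimum`."""
    try:
        found = version(package)
    except PackageNotFoundError:
        return False
    return parse_version(found) >= minimum


# Reports False when pydap is missing or older than the release named above.
pydap_is_recent = installed_at_least("pydap", (3, 2, 2))
```

This deliberately ignores pre-release suffixes; a real project would more likely rely on a dedicated version-parsing library.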

mrpgraae commented 7 years ago

@shoyer @jenfly Has this been implemented? I can't see any open PRs relating to this, so I guess no one is working on it?

I would be happy to try and implement it, if that's fine with you? It seems like you settled on the solution of passing a session object to a PydapDataStore and then passing that to open_dataset(), correct?

Thanks in advance!

shoyer commented 7 years ago

@mrpgraae no, I don't think this has been implemented yet.

Please take a look at #1508 for an example of the model to use:

You are also welcome to add any keyword parameters (e.g., session) that open_url accepts to the open method.

So the user API becomes:

pydap_ds = pydap.client.open_url(url, session=session)
store = xarray.backends.PydapDataStore(pydap_ds)
ds = xarray.open_dataset(store)

or

store = xarray.backends.PydapDataStore.open(url, session=session)
ds = xarray.open_dataset(store)

mrpgraae commented 7 years ago

Thank you @shoyer, I'll start work on the implementation.

juliancanellas commented 5 years ago

Dear all, thank you very much for all the time you've put into fixing this issue. I'm a fresh PhD student who started working on solar radiation forecasting four months ago, and right now I'm trying to use MERRA-2 aerosol data to initialize WRF Solar. The bug fix in this thread has helped me a lot, since I was trying to avoid the straightforward method of downloading the files by date and then merging them into a single Python object. This way I can create my Python object directly, without downloading the files one by one and then merging them! It's awesome! Thank you all very much!

mrpgraae commented 5 years ago

@juliancanellas Great! Good to see that someone else actually benefits from this feature, years after it was implemented 😄

rabernat commented 4 years ago

I am trying to load MERRA2 data via the NASA password-protected opendap server. Although it sounds like both pydap and xarray have been fixed to support this, I am still having basically the same problem @jenfly described over three years ago. At this point it feels like a pydap issue, but I'm asking on this thread anyway.

Here's a fully reproducible example, password and all 😄

from pydap.client import open_url
from pydap.cas.urs import setup_session

username = 'rabernat'
password = '%8rTMU6VT37r&%3e'
url = 'https://goldsmr5.gesdisc.eosdis.nasa.gov:443/opendap/MERRA2_MONTHLY/M2IMNPANA.5.12.4/2019/MERRA2_400.instM_3d_ana_Np.201901.nc4'

session = setup_session(username, password, check_url=url)
dataset = open_url(url, session=session)
assert 'USVS' in dataset
_ = dataset['USVS'][:]

raises

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-7-56bfca618586> in <module>
----> 1 _ = dataset['USVS'][:]

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/model.py in __getitem__(self, index)
    318     def __getitem__(self, index):
    319         out = copy.copy(self)
--> 320         out.data = self._get_data_index(index)
    321         return out
    322 

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/model.py in _get_data_index(self, index)
    347             return np.vectorize(decode_np_strings)(self._data[index])
    348         else:
--> 349             return self._data[index]
    350 
    351     def _get_data(self):

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/handlers/dap.py in __getitem__(self, index)
    140         logger.info("Fetching URL: %s" % url)
    141         r = GET(url, self.application, self.session, timeout=self.timeout)
--> 142         raise_for_status(r)
    143         dds, data = r.body.split(b'\nData:\n', 1)
    144         dds = dds.decode(r.content_encoding or 'ascii')

/srv/conda/envs/notebook/lib/python3.7/site-packages/pydap/net.py in raise_for_status(response)
     37             detail=response.status+'\n'+response.text,
     38             headers=response.headers,
---> 39             comment=response.body
     40         )
     41 

HTTPError: 302 Found
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid&amp;app_type=401&amp;client_id=e2WVk8Pw6weeLUKZYOxvTQ&amp;response_type=code&amp;redirect_uri=http%3A%2F%2Fgoldsmr5.gesdisc.eosdis.nasa.gov%2Fdata-redirect&amp;state=aHR0cHM6Ly9nb2xkc21yNS5nZXNkaXNjLmVvc2Rpcy5uYXNhLmdvdi9vcGVuZGFwL01FUlJBMl9NT05USExZL00ySU1OUEFOQS41LjEyLjQvMjAxOS9NRVJSQTJfNDAwLmluc3RNXzNkX2FuYV9OcC4yMDE5MDEubmM0LmRvZHM%2FVVNWUyU1QjA6MTowJTVEJTVCMDoxOjQxJTVEJTVCMDoxOjM2MCU1RCU1QjA6MTo1NzUlNUQ">here</a>.</p>
</body></html>

Is this a problem with pydap? Or the NASA server?

dcherian commented 4 years ago

https://en.wikipedia.org/wiki/HTTP_302

Looks like you need a better URL? and that pydap can't deal with redirects?

j08lue commented 4 years ago

Yes, seems like a redirect issue. The URL is fine.

rabernat commented 4 years ago

No, actually the problem was with my authorization. I had to accept a EULA before my password was valid. Once I did that, everything worked.
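When a 302 like the one above shows up, it can help to look at where it points before blaming the URL or pydap. The sketch below is a hypothetical helper built on the standard library only. It assumes, based on the redirect in the traceback, that the login host is `urs.earthdata.nasa.gov` and that the `state` query parameter appears to be the originally requested URL encoded as base64; neither is documented in this thread.

```python
import base64
import html
from urllib.parse import parse_qs, urlsplit

LOGIN_HOST = "urs.earthdata.nasa.gov"  # host taken from the redirect in the traceback


def inspect_redirect(href):
    """Summarise where a 302 response points (hypothetical helper)."""
    parts = urlsplit(html.unescape(href))  # the HTML body escapes "&" as "&amp;"
    state = parse_qs(parts.query).get("state", [""])[0]
    state = state.replace(" ", "+")  # parse_qs turns "+" into a space; undo that
    requested = None
    if state:
        padded = state + "=" * (-len(state) % 4)  # restore stripped padding
        try:
            requested = base64.b64decode(padded).decode("utf-8", "replace")
        except ValueError:
            requested = None  # the state value was not base64 after all
    return {
        "host": parts.hostname,
        "is_login_redirect": parts.hostname == LOGIN_HOST,
        "requested_url": requested,
    }
```

A redirect to the login host on a data request, after a session was already set up, suggests an authorization problem (such as an unaccepted EULA) rather than a bad URL.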

ahahmann commented 4 years ago

One can also add username and password to the .netrc file and all works very smoothly, without a need for explicit username and password in the script.
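For reference, a `.netrc` entry has the form `machine <host> login <user> password <password>`. The sketch below writes a throwaway file in that format and reads it back with Python's standard `netrc` module. The host `urs.earthdata.nasa.gov` is an assumption taken from the redirect URL earlier in this thread, the credentials are placeholders, and `read_login` is a hypothetical helper.

```python
import netrc
import os
import tempfile


def read_login(netrc_path, host="urs.earthdata.nasa.gov"):
    """Return (login, password) for `host` from a netrc-style file."""
    auth = netrc.netrc(netrc_path).authenticators(host)
    if auth is None:
        raise LookupError(f"no entry for {host} in {netrc_path}")
    login, _account, password = auth
    return login, password


# Demonstrate with a temporary file rather than touching the real ~/.netrc.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "netrc")
    with open(demo_path, "w") as handle:
        handle.write("machine urs.earthdata.nasa.gov login demo_user password demo_pass\n")
    demo_login = read_login(demo_path)
```

The real file normally lives at `~/.netrc` and should be readable only by its owner (for example, mode 600).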

However, there was one more issue. With Python 3.7.6, I was getting the following error:

  File "MERRA2.py", line 16, in <module>
    session = setup_session(username, password, check_url=url)
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/pydap/cas/urs.py", line 19, in setup_session
    verify=verify)
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/pydap/cas/get_cookies.py", line 75, in setup_session
    password_field=password_field)
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/pydap/cas/get_cookies.py", line 123, in soup_login
    soup = BeautifulSoup(resp.content, 'lxml')
  File "/groups/FutureWind/xesmf_env/lib/python3.7/site-packages/bs4/__init__.py", line 228, in __init__
    % ",".join(features))
bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

That was solved by pip install lxml
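A missing parser like this can be caught up front. The sketch below is a minimal check with a hypothetical `missing_modules` helper; the module names `bs4` and `lxml` come from the traceback above.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# bs4 and lxml are the modules that appear in the traceback above.
missing = missing_modules(["bs4", "lxml"])
if missing:
    print("install before calling setup_session:", ", ".join(missing))
```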

juliancanellas commented 4 years ago

So, I tried Ryan's example and got the same error. Where do you accept the EULA? It doesn't pop up on screen.

rabernat commented 4 years ago

> So, I tried Ryan's example, and got to the same error, where do you accept the EULA? It doesn't pop up on screen.

https://urs.earthdata.nasa.gov/app_eula/nasa_gesdisc_data_archive

wallissoncarvalho commented 3 years ago

> No, actually the problem was with my authorization. I had to accept a EULA before my password was valid. Once I did that, everything worked.

I'm trying this example:

import xarray as xr
from pydap.client import open_url
from pydap.cas.urs import setup_session

# username and password are assumed to be defined earlier
url = 'https://gpm1.gesdisc.eosdis.nasa.gov:443/opendap/hyrax/GPM_L3/GPM_3IMERGHH.06/2019/087/3B-HHR.MS.MRG.3IMERG.20190328-S000000-E002959.0000.V06B.HDF5'
try:
    session = setup_session(username, password, check_url=url)
    pydap_ds = open_url(url, session=session)
    store = xr.backends.PydapDataStore(pydap_ds)
    ds = xr.open_dataset(store)
except Exception as err:
    print(err)

which returns:

302 Found
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid&amp;app_type=401&amp;client_id=e2WVk8Pw6weeLUKZYOxvTQ&amp;response_type=code&amp;redirect_uri=https%3A%2F%2Fgpm1.gesdisc.eosdis.nasa.gov%2Fdata-redirect&amp;state=aHR0cHM6Ly9ncG0xLmdlc2Rpc2MuZW9zZGlzLm5hc2EuZ292L29wZW5kYXAvaHlyYXgvR1BNX0wzL0dQTV8zSU1FUkdISC4wNi8yMDE5LzA4Ny8zQi1ISFIuTVMuTVJHLjNJTUVSRy4yMDE5MDMyOC1TMDAwMDAwLUUwMDI5NTkuMDAwMC5WMDZCLkhERjUuZG9kcz90aW1lX2JuZHMlNUIwOjE6MCU1RCU1QjA6MTowJTVE">here</a>.</p>
</body></html>
/usr/local/lib/python3.8/site-packages/xarray/backends/common.py:87: FutureWarning: The ``variables`` property has been deprecated and will be removed in xarray v0.11.
  return len(self.variables)

The error only appears when I call xr.open_dataset. I've already accepted the EULA. Does anyone know what the cause could be?

ikhomyakov commented 3 years ago

Dear all, does anyone know whether it is possible in xarray.open_dataset (pydap or netcdf engines) to pass an Authorization or Cookie header along with the opendap request? For example: Authorization: Bearer u32t4o3tb3gg43 or Cookie: foo=u32t4o3tb3gg43

vlvalenti commented 3 years ago

@wallissoncarvalho Were you ever able to make that example work? I have been getting this error using the same example as well and haven't been able to find a solution

AyrtonB commented 3 years ago

I'm also getting the same error when running xr.open_dataset(store) even though I have accepted the EULA. Has anyone had success solving this?

I'm using pydap==3.2.2 and xarray==0.18.0, any help would be much appreciated!

import xarray as xr
from pydap.client import open_url
from pydap.cas.urs import setup_session

username = "my_username"
password = "my_password"

url = 'https://goldsmr4.gesdisc.eosdis.nasa.gov/opendap/MERRA2/M2T1NXSLV.5.12.4/2016/06/MERRA2_400.tavg1_2d_slv_Nx.20160601.nc4'

session = setup_session(username, password, check_url=url)
pydap_ds = open_url(url, session=session)

store = xr.backends.PydapDataStore(pydap_ds)
ds = xr.open_dataset(store)

HTTPError: 302 Found
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>302 Found</title>
</head><body>
<h1>Found</h1>
<p>The document has moved <a href="https://urs.earthdata.nasa.gov/oauth/authorize/?scope=uid&amp;app_type=401&amp;client_id=e2WVk8Pw6weeLUKZYOxvTQ&amp;response_type=code&amp;redirect_uri=http%3A%2F%2Fgoldsmr4.gesdisc.eosdis.nasa.gov%2Fdata-redirect&amp;state=aHR0cHM6Ly9nb2xkc21yNC5nZXNkaXNjLmVvc2Rpcy5uYXNhLmdvdi9vcGVuZGFwL01FUlJBMi9NMlQxTlhTTFYuNS4xMi40LzIwMTYvMDYvTUVSUkEyXzQwMC50YXZnMV8yZF9zbHZfTnguMjAxNjA2MDEubmM0LmRvZHM%2FdGltZSU1QjA6MTowJTVE">here</a>.</p>
</body></html>

zjans commented 2 years ago

@AyrtonB I'm getting the same error now, did you manage to solve it?

AyrtonB commented 2 years ago

Unfortunately not @zjans

rabernat commented 2 years ago

I'd like to tag @betolink in this issue. He knows quite a bit about both Xarray and Earthdata login. Maybe he can help us get to the bottom of these problems. Luis, any ideas?

betolink commented 2 years ago

This looks familiar. I'm going to take a look at this when I get home and will report back. @rabernat

betolink commented 2 years ago

Looks like the dataset got updated and when that happens NASA requires users to accept the end user license agreement (again). That's why the request ends up in a redirect. This EULA is also required the first time a user requests the data. Here are the instructions for accepting GESDISC EULA. https://disc.gsfc.nasa.gov/earthdata-login

After the GESDISC data archive app shows up in our authorized apps list, the code above works like a charm. I'll ask to see if there is a way to automate this @rabernat @zjans

zjans commented 2 years ago

@betolink Thanks for looking into this. GESDISC was already in my lists of accepted EULAs and authorized apps. I also deleted them and re-authorized, but no change. I still get the "302 The document has moved" message when trying to access the HDF datasets under https://gpm1.gesdisc.eosdis.nasa.gov/opendap/hyrax/GPM_L3/... with xr.backends.PydapDataStore and xr.open_dataset()

In the meantime, I changed my scripts to download the entire HDF files from https://gpm1.gesdisc.eosdis.nasa.gov/data/GPM_L3/... and open them locally with xarray (and do spatial subsetting etc) - which works fine but is not quite ideal.
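The two URL prefixes mentioned above differ only in their leading path, so the switch from OPeNDAP access to whole-file download could be sketched as a simple rewrite. This is an assumption-heavy illustration: `to_download_url` is a hypothetical helper, and the thread does not confirm that the remainder of the path is identical for every file.

```python
from urllib.parse import urlsplit, urlunsplit

OPENDAP_PREFIX = "/opendap/hyrax/"  # prefix from the OPeNDAP URL above
DOWNLOAD_PREFIX = "/data/"          # prefix from the download URL above


def to_download_url(opendap_url):
    """Map an OPeNDAP URL to the matching whole-file URL (assumed layout)."""
    parts = urlsplit(opendap_url)
    if not parts.path.startswith(OPENDAP_PREFIX):
        raise ValueError(f"not an OPeNDAP path: {parts.path}")
    path = DOWNLOAD_PREFIX + parts.path[len(OPENDAP_PREFIX):]
    # Drop any query or fragment: they would only apply to the OPeNDAP request.
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

The downloaded file would then be opened locally with xarray and subset there, as described above.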

betolink commented 2 years ago

Yeah, definitely not ideal. I'm going to test it again this evening with a new Earthdata user. I'll send you a binder link to a notebook to test it with both accounts.

rabernat commented 2 years ago

At what point do we escalate this issue to NASA? Is there a channel via which they can receive and respond to user feedback?

betolink commented 2 years ago

I just asked on Slack about how to check for these changes (if, in the end, this issue is indeed related to an updated EULA), and unfortunately there is no way around it other than doing what Jan did (and he still got the 302s). About feedback: yes, there are channels, but they are on a per-DAAC basis (cries). In this case that would be going to https://daac.gsfc.nasa.gov/ and clicking on the feedback button. I'll keep looking at this after the cloud hackathon today.

betolink commented 2 years ago

Quick update, MERRA2 worked as expected after accepting the EULA again. GPM_L3 redirects to an empty .dods file, I guess that's a bug. I'll ask the OpenDAP team tomorrow if they are aware of this behavior and what would be a workaround/solution.

rabernat commented 2 years ago

Just wanted to say how much I appreciate @betolink acting as a communication channel between Xarray and NASA. Users often end up on our issue tracker because Xarray raises errors whenever it can't read data. But the source of these problems is not with Xarray, it's with the upstream data provider.

This also happens all the time with xmitgcm, e.g. https://github.com/MITgcm/xmitgcm/issues/266

It would be great if NASA had a better way to respond to these issues which didn't require that you "know a guy".

betolink commented 2 years ago

I'm happy to help! @rabernat, Makhan @virdi from NASA Langley just reminded me the other day that there is a forum for NASA Earthdata users with a direct line to the program managers and scientists that may be the best place to ask data related questions. I think you only need to register with EDL.

https://forum.earthdata.nasa.gov/

rabernat commented 2 years ago

One solution to this problem might be the creation of a custom Xarray backend for NASA EarthData. This backend could manage authentication with EDL and have its own documentation. If this package were maintained by NASA, it would close the feedback loop more effectively.