pangaea-data-publisher / pangaeapy

PANGAEA Python Client
https://www.pangaea.de/
GNU General Public License v3.0

Nested datasets do not load #28

Closed: pgierz closed this issue 2 years ago

pgierz commented 2 years ago

Dear Pangaea Colleagues,

I've been experimenting with the pangaeapy interface to load data directly, without needing the web interface. Very nice! I've been running into some difficulty, however. It seems some datasets are "nested" and contain several child datasets. An example is the MARGO Sea Surface Temperature reconstruction for the Last Glacial Maximum (10.1594/PANGAEA.760904). If I try to access the child datasets, I am unable to get a pandas table back. Any hints?

Here is what I have tried:

>>> import pangaeapy as panpy
>>> pds = panpy.PanDataSet("10.1594/PANGAEA.760904")
>>> pds.title
'Various paleoclimate proxy parameters compiled within the MARGO project'
>>> pds.citation
'Barrows, Timothy T; Chen, Min-Te; de Vernal, Anne; Eynaud, Frédérique; Hillaire-Marcel, Claude; Kiefer, Thorsten; Lee, Kyung Eun; Marret, Fabienne; Henry, Maryse; Juggins, Stephen; Londeix, Laurent; Mangin, Sylvie; Matthiessen, Jens; Radi, Taoufik; Rochon, André; Solignac, Sandrine; Turon, Jean-Louis; Waelbroeck, Claire; Weinelt, Mara (2011): Various paleoclimate proxy parameters compiled within the MARGO project. PANGAEA, https://doi.org/10.1594/PANGAEA.760904'
>>> pds.children
['doi:10.1594/PANGAEA.227326','doi:10.1594/PANGAEA.127383','doi:10.1594/PANGAEA.227620','doi:10.1594/PANGAEA.227319','doi:10.1594/PANGAEA.227318','doi:10.1594/PANGAEA.103069','doi:10.1594/PANGAEA.103070']

So far, so good. Let's try to grab the last "sub-dataset":

>>> # Weinelt, M (2004): Compilation of global planktic foraminifera LGM SST data.
>>> pds_planktic_foraminifera = panpy.PanDataSet(pds.children[-1])
>>> # Above does not work, it is the same as:
>>> pds_planktic_foraminifera = panpy.PanDataSet("doi:10.1594/PANGAEA.103070")
>>> # Maybe remove doi, yet this does not work either:
>>> pds_planktic_foraminifera = panpy.PanDataSet("10.1594/PANGAEA.103070")
>>> # Maybe just the dataset ID
>>> pds_planktic_foraminifera = panpy.PanDataSet("103070")
>>> pds_planktic_foraminifera.title # <-- empty
>>> pds_planktic_foraminifera.abstract # <-- empty
>>> pds_planktic_foraminifera.data # <-- empty

If tracebacks are helpful, I can post those as well, but I guess this is something easy enough to reproduce locally without copy/pasting walls of error text...

Thanks, Paul
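
The reproduction above can be condensed into a loop that tries every child DOI and records what comes back. This is only a sketch under stated assumptions: `summarize_children` and its `loader` parameter are hypothetical names, not part of pangaeapy. The `loader` is expected to be something like `panpy.PanDataSet`, and only the `children`, `title` and `data` attributes seen in the session above are relied on.

```python
def summarize_children(parent, loader):
    """Try to load each child of `parent` and report what came back.

    `loader` is any callable that takes a DOI string and returns an object
    with `title` and `data` attributes (for example `panpy.PanDataSet`).
    Both names are illustrative and not part of pangaeapy itself.
    """
    report = []
    for doi in getattr(parent, "children", None) or []:
        try:
            child = loader(doi)
        except Exception as exc:
            # Keep going so that one failing child does not hide the others.
            report.append(
                {"doi": doi, "title": None, "rows": None, "error": repr(exc)}
            )
            continue
        data = getattr(child, "data", None)
        # len() works for a pandas DataFrame; a missing table counts as 0 rows.
        rows = len(data) if data is not None else 0
        report.append(
            {
                "doi": doi,
                "title": getattr(child, "title", None),
                "rows": rows,
                "error": None,
            }
        )
    return report
```

With pangaeapy installed, the call for the session above would presumably be `summarize_children(pds, panpy.PanDataSet)`. Children that raise end up with an `error` entry, and children that load but carry no table show `rows` equal to 0.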

huberrob commented 2 years ago

Thanks Paul!

I think you discovered a bug: it seems that pangaeapy does not properly identify datasets which are not in tabular form. For example, doi:10.1594/PANGAEA.103070 is a dataset which contains only a binary file that can be downloaded, but there is no data table. I will fix this in the next release.

Robert
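
A caller-side guard is one way to cope with datasets that carry no data table, such as the binary-only one mentioned above. The helper below is a hypothetical sketch, not pangaeapy API: it assumes only that a loaded dataset exposes a `data` attribute that is either missing, `None`, or a pandas-like object.

```python
def has_table(dataset):
    """Return True if `dataset.data` looks like a non-empty table.

    Hypothetical helper: it assumes only a `data` attribute, as used in
    the session earlier in this thread.
    """
    data = getattr(dataset, "data", None)
    if data is None:
        return False
    # pandas DataFrame and Series objects expose an `.empty` flag.
    empty = getattr(data, "empty", None)
    if empty is not None:
        return not bool(empty)
    # Fall back to len() for plain sequences; objects without a length
    # are treated as "no table".
    try:
        return len(data) > 0
    except TypeError:
        return False
```

A script could call `has_table(ds)` before touching `ds.data` and fall back to downloading the attached file by other means when it returns `False`.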

pgierz commented 2 years ago

Moin Robert,

I know that NetCDF files are sometimes also passed around within Pangaea. As a recommendation, I would return an xarray dataset in that case. I'm also happy to contribute, not just here but in general: I find a programmatic interface to data repositories a very useful thing to have, but it may take a while for me to get familiar with the code base.

Best PG
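
The suggestion to return an xarray dataset for NetCDF files could look roughly like the dispatch below. It is a sketch, not how pangaeapy works: `pick_reader`, `open_payload` and the suffix lists are hypothetical, the file is assumed to be downloaded to a local path already, and the tabular branch assumes a plain tab-separated file without a metadata header.

```python
from pathlib import Path

# Assumed suffix lists; adjust to whatever the repository actually serves.
NETCDF_SUFFIXES = {".nc", ".nc4", ".cdf"}
TABULAR_SUFFIXES = {".tab", ".tsv", ".txt"}


def pick_reader(filename):
    """Choose a reader name from the file suffix (case-insensitive)."""
    suffix = Path(filename).suffix.lower()
    if suffix in NETCDF_SUFFIXES:
        return "xarray"
    if suffix in TABULAR_SUFFIXES:
        return "pandas"
    return "raw"


def open_payload(path):
    """Open a downloaded file with the reader chosen by `pick_reader`."""
    kind = pick_reader(path)
    if kind == "xarray":
        import xarray as xr  # imported lazily so xarray stays optional

        return xr.open_dataset(path)
    if kind == "pandas":
        import pandas as pd  # imported lazily for the same reason

        return pd.read_csv(path, sep="\t")
    # Unknown formats are handed back as raw bytes for the caller to handle.
    return Path(path).read_bytes()
```

Keeping the imports inside the branches means a user who only needs tables does not have to install xarray.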