oceanmodeling / seaset

Sea relevant Observational sources Dataset

Efficiently parsing erddap server metadata #3

Open brey opened 2 years ago

brey commented 2 years ago

When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables. This results in a long wait depending on the volume of data.

There has to be a way to simplify/expedite this.

ocefpaf commented 2 years ago

When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables.

The get_info_url method gives you a URL that downloads only the metadata; it is something like:

info_url = e.get_info_url(dataset_id, response="csv")

info = pd.read_csv(info_url)

However, that is quite "low level", and ideally we should provide a "dataset-like" class that holds the metadata and loads the data lazily afterwards. We are working on a refactor to go in this direction.

With that said, I believe that libraries that build on top of erddapy should use the low-level interface. The high-level one is mostly for end users.
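For readers following along, here is a minimal sketch of the metadata-only path described above, written with the standard library so it runs offline. It assumes the usual ERDDAP info layout (`<server>/info/<dataset_id>/index.csv`, with `Row Type`, `Variable Name`, `Attribute Name`, `Data Type` and `Value` columns). The helper names, the `example.org` server, the `my_dataset` ID and the sample rows are all hypothetical and are not output from any real server.

```python
import csv
import io

# Illustrative sample of an ERDDAP "info" table. These rows are made up
# for the sketch; they are NOT real output from any server.
SAMPLE_INFO_CSV = """Row Type,Variable Name,Attribute Name,Data Type,Value
attribute,NC_GLOBAL,title,String,Example dataset
variable,time,,double,
attribute,time,units,String,seconds since 1970-01-01T00:00:00Z
variable,latitude,,float,
attribute,latitude,units,String,degrees_north
variable,StationName,,String,
"""


def info_url(server, dataset_id, response="csv"):
    """Build the metadata-only URL (hypothetical helper, no network access)."""
    return f"{server.rstrip('/')}/info/{dataset_id}/index.{response}"


def variable_names(info_csv_text):
    """Return variable names in order of appearance, skipping attribute rows."""
    reader = csv.DictReader(io.StringIO(info_csv_text))
    names = []
    for row in reader:
        if row["Row Type"] == "variable" and row["Variable Name"] not in names:
            names.append(row["Variable Name"])
    return names


names = variable_names(SAMPLE_INFO_CSV)
# Keep everything except the time axis, e.g. to request station metadata only.
non_time = [name for name in names if name != "time"]
print(info_url("https://example.org/erddap", "my_dataset"))
print(non_time)
```

Only the small info table is read here, so the cost does not depend on how many observations the dataset holds.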

brey commented 9 months ago

Hi @ocefpaf. I have finally come back to this issue. Thanks for the tip above, but my problem remains. Using get_info_url, I get all the variables/attributes, which is fine. But then I would like to retrieve a subset of these, and I can't see how to avoid the time parameter.

I am posting an example below, using the EMODNET server:

from erddapy import ERDDAP
import pandas as pd

e.response = "csv"

info_url = e.get_info_url(response='csv')
info = pd.read_csv(info_url)

info['Variable Name'].unique()

info['Attribute Name'].unique()

So far so good. However, what I need is the following

e.variables = [

If I use

df = e.to_pandas(low_memory=False)

I get all times. How can I get the above info without the time dimension?

pmav99 commented 9 months ago

@brey is this still an issue? I just tested it, and I don't see time in the returned results:

> df.head()
  StationName EP_PLATFORM_CODE EP_PLATFORM_TYPE                                   EP_PLATFORM_LINK StationCountry  longitude (degrees_east)  latitude (degrees_north)
0      Dalian           Dalian               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    121.68                     38.87
1      Kanmen           Kanmen               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    121.28                     28.08
2      Nansha           Nansha               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    112.88                      9.55
3       Xisha            Xisha               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    112.30                     16.80
4       Zhapo            Zhapo               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    111.81                     21.58

brey commented 9 months ago



You get for each station one entry per timestamp

pmav99 commented 9 months ago

All the rows are identical, are they not? Then maybe,


might be enough?

Or maybe even:

brey commented 9 months ago

I know, but that means that if another server has a longer time range, the amount of data you'll download will be quite large.

ocefpaf commented 6 months ago

Sorry, this one flew under the radar but I just found it. Maybe

df = e.to_pandas(distinct=True)

can help you there. That would return only unique values, filtered on the server side first. The result should be similar to calling the pandas unique method after downloading.
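To make the server-side filtering concrete, here is a rough sketch of the kind of tabledap request that `distinct=True` corresponds to: the ERDDAP tabledap documentation describes appending `&distinct()` to the query so that the server returns only unique rows. The helper below is hypothetical (it is not erddapy's API), and the server and dataset ID are placeholders; the variable names are taken from the table printed earlier in this thread.

```python
def tabledap_distinct_url(server, dataset_id, variables, response="csv"):
    """Build a tabledap URL asking the server for unique rows only.

    Hypothetical helper for illustration; it only formats a string and
    performs no network access.
    """
    # Requested variables are comma-separated; "&distinct()" tells ERDDAP
    # to drop duplicate rows before sending the response.
    query = ",".join(variables) + "&distinct()"
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.{response}?{query}"


# Placeholder server and dataset ID (assumptions, not the real EMODNET values).
url = tabledap_distinct_url(
    "https://example.org/erddap",
    "my_dataset",
    ["StationName", "StationCountry", "longitude", "latitude"],
)
print(url)
```

Because `time` is not among the requested variables and duplicates are removed on the server, the response should hold one row per station regardless of how long the time series is.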


pmav99 commented 6 months ago

Aha!

https://github.com/ioos/erddapy/blob/109ddec1efc223c2dfeea450efa2245b6ab9c5ef/erddapy/core/url.py#L61-L75C9
http://erddap.ioos.us/erddap/tabledap/documentation.html#distinct

Thank you Felipe.