oceanmodeling / seaset

Sea relevant Observational sources Dataset

Efficiently parsing erddap server metadata #3

Open brey opened 2 years ago

brey commented 2 years ago

When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables. This results in a long wait depending on the volume of data.

There has to be a way to simplify/expedite this.

ocefpaf commented 2 years ago

> When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables.

The info URL (from get_info_url) should download only the metadata; it is something like:

info_url = e.get_info_url(dataset_id, response="csv")

info = pd.read_csv(info_url)
info.head()

However, that is quite "low level". Ideally we should offer a "dataset-like" class that holds the metadata and loads the data lazily afterwards. We are working on a refactor to go in this direction.

With that said, I believe that libraries that build on top of erddapy should use the low-level interface. The high-level one is mostly for end users.
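
As a rough sketch of that low-level route: the info table has one row per variable or attribute, so variable names and their attributes can be pulled out with plain pandas and no data download. The inline CSV below is a made-up sample standing in for `pd.read_csv(info_url)`, the column layout ("Row Type", "Variable Name", "Attribute Name", "Data Type", "Value") is assumed to match what the server returns, and `variable_names` / `attributes_of` are hypothetical helper names, not part of erddapy.

```python
import io

import pandas as pd

# Made-up sample in the layout of an ERDDAP info response (assumption:
# columns "Row Type", "Variable Name", "Attribute Name", "Data Type", "Value").
SAMPLE_INFO_CSV = """Row Type,Variable Name,Attribute Name,Data Type,Value
attribute,NC_GLOBAL,title,String,Example tide gauge dataset
variable,time,,double,
attribute,time,units,String,seconds since 1970-01-01T00:00:00Z
variable,latitude,,float,
attribute,latitude,units,String,degrees_north
variable,longitude,,float,
attribute,longitude,units,String,degrees_east
"""

# In real use this would be: info = pd.read_csv(info_url)
info = pd.read_csv(io.StringIO(SAMPLE_INFO_CSV))


def variable_names(info: pd.DataFrame) -> list[str]:
    """Return the dataset's variable names, in the order they are listed."""
    is_variable = info["Row Type"] == "variable"
    return info.loc[is_variable, "Variable Name"].tolist()


def attributes_of(info: pd.DataFrame, name: str) -> dict[str, str]:
    """Return {attribute name: value} for one variable (or NC_GLOBAL)."""
    is_attribute = info["Row Type"] == "attribute"
    rows = info[is_attribute & (info["Variable Name"] == name)]
    return dict(zip(rows["Attribute Name"], rows["Value"]))


print(variable_names(info))
print(attributes_of(info, "latitude"))
```

Only the metadata table is touched here; nothing in this sketch requests the data variables themselves.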

brey commented 9 months ago

Hi @ocefpaf. I have finally come back to this issue. Thanks for the tip above, but my problem remains. Using get_info_url I get all the variables and attributes, which is fine. But then I would like to retrieve a subset of these, and I can't see how to avoid the time parameter.

I am posting an example below, using the EMODNET server:

from erddapy import ERDDAP
import pandas as pd

e = ERDDAP(
  server="https://erddap.emodnet-physics.eu/erddap",
  protocol="tabledap",
)
e.response = "csv"
e.dataset_id = "EMODPACE_NMDIS_PSMSL_L2A_SLEV_TG_TS"

info_url = e.get_info_url(response='csv')
info = pd.read_csv(info_url)

info['Variable Name'].unique()

info['Attribute Name'].unique()

So far so good. However, what I need is the following:


e.variables = [
    "StationName",
    "EP_PLATFORM_CODE",
    "EP_PLATFORM_TYPE",
    "EP_PLATFORM_LINK",
    "StationCountry",
    "longitude",
    "latitude",
]

If I use

df = e.to_pandas(low_memory=False)

I get one row per timestamp. How can I get the above info without the time dimension?

pmav99 commented 9 months ago

@brey is this still an issue? I just tested it, and I don't see time in the returned results:

> df.head()
  StationName EP_PLATFORM_CODE EP_PLATFORM_TYPE                                   EP_PLATFORM_LINK StationCountry  longitude (degrees_east)  latitude (degrees_north)
0      Dalian           Dalian               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    121.68                     38.87
1      Kanmen           Kanmen               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    121.28                     28.08
2      Nansha           Nansha               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    112.88                      9.55
3       Xisha            Xisha               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    112.30                     16.80
4       Zhapo            Zhapo               TG  https://www.emodnet-physics.eu/map/spi.aspx?id...             CN                    111.81                     21.58

brey commented 9 months ago

Try

df.loc[df.EP_PLATFORM_CODE=='Xisha']

For each station you get one entry per timestamp.

pmav99 commented 9 months ago

All the rows are identical, are they not? Then maybe,

df.loc[df.EP_PLATFORM_CODE=='Xisha'].iloc[0]

might be enough?

Or maybe even:

df.groupby(df.EP_PLATFORM_CODE).first()

brey commented 9 months ago

I know, but that means that if another server has a longer time range, the amount of data you'll download will be quite large.
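
A toy illustration of that cost, using a made-up frame in place of a real download: collapsing on the client keeps one row per station, but every per-timestamp row still has to be transferred first. The station values and `n_timestamps` are assumptions chosen for the example.

```python
import pandas as pd

# Assumed numbers for illustration: 2 stations, 1000 timestamps each.
n_timestamps = 1000
stations = pd.DataFrame(
    {
        "EP_PLATFORM_CODE": ["Xisha", "Zhapo"],
        "longitude": [112.30, 111.81],
        "latitude": [16.80, 21.58],
    }
)

# Stand-in for what a request without time constraints returns:
# the station metadata repeated once per timestamp.
downloaded = pd.concat([stations] * n_timestamps, ignore_index=True)

# Client-side collapse to one row per station.
unique_stations = downloaded.drop_duplicates().reset_index(drop=True)

print(f"rows downloaded: {len(downloaded)}, rows kept: {len(unique_stations)}")
```

The number of rows kept stays fixed while the number downloaded grows with the length of the time series.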

ocefpaf commented 6 months ago

Sorry, this one flew under the radar but I just found it. Maybe

df = e.to_pandas(distinct=True)

can help you there. That would return only unique rows, filtered on the server side first. The result should be similar to calling the pandas unique method after the download.
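
To make the server-side part concrete, here is a minimal sketch of the kind of tabledap request this corresponds to, based on the ERDDAP tabledap documentation: the variable list is followed by `&distinct()`. `build_distinct_url` is a hypothetical helper written only for illustration; it is not part of erddapy, and the exact URL erddapy emits may differ in encoding details.

```python
def build_distinct_url(
    server: str,
    dataset_id: str,
    variables: list[str],
    response: str = "csv",
) -> str:
    """Build a tabledap URL asking the server for unique rows only.

    Hypothetical helper; it mirrors the documented tabledap query form
    <server>/tabledap/<dataset_id>.<response>?<var1>,<var2>&distinct()
    """
    base = f"{server.rstrip('/')}/tabledap/{dataset_id}.{response}"
    query = ",".join(variables)
    return f"{base}?{query}&distinct()"


url = build_distinct_url(
    server="https://erddap.emodnet-physics.eu/erddap",
    dataset_id="EMODPACE_NMDIS_PSMSL_L2A_SLEV_TG_TS",
    variables=["StationName", "EP_PLATFORM_CODE", "longitude", "latitude"],
)
print(url)
```

Because the server deduplicates over the requested variables only, leaving time out of the variable list is what lets the per-timestamp rows collapse to one row per station.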


pmav99 commented 6 months ago

Aha!
https://github.com/ioos/erddapy/blob/109ddec1efc223c2dfeea450efa2245b6ab9c5ef/erddapy/core/url.py#L61-L75C9
http://erddap.ioos.us/erddap/tabledap/documentation.html#distinct

Thank you Felipe.