Open brey opened 2 years ago
When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables.
The get_info method should download only the metadata. It looks something like this:
info_url = e.get_info_url(dataset_id, response="csv")
info = pd.read_csv(info_url)
info.head()
However, that is quite "low level". Ideally we should offer a "dataset-like" class that holds the metadata and loads the data lazily afterwards. We are working on a refactor to move in this direction.
With that said, I believe that libraries built on top of erddapy should use the low-level interface. The high-level one is mostly for end users.
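A minimal sketch of the "dataset-like" idea mentioned above: metadata is fetched immediately, and the data only on first access. The `LazyDataset` class and the `fake_info`/`fake_data` loaders are hypothetical names made up for illustration. This is not the erddapy refactor and not part of its API.

```python
from typing import Any, Callable, Dict


class LazyDataset:
    """Hypothetical container: metadata up front, data only on demand."""

    def __init__(
        self,
        fetch_info: Callable[[], Dict[str, Any]],
        fetch_data: Callable[[], Any],
    ) -> None:
        self.info = fetch_info()       # small request, done immediately
        self._fetch_data = fetch_data  # large request, deferred
        self._data: Any = None
        self._loaded = False

    @property
    def data(self) -> Any:
        # Download on first access only, then reuse the cached result.
        if not self._loaded:
            self._data = self._fetch_data()
            self._loaded = True
        return self._data


# Stand-in loaders so the sketch runs offline (assumed, not real requests).
calls = []


def fake_info() -> Dict[str, Any]:
    calls.append("info")
    return {"latitude": {"units": "degrees_north"}}


def fake_data() -> Any:
    calls.append("data")
    return [("Dalian", 121.68, 38.87)]


ds = LazyDataset(fake_info, fake_data)
print(calls)           # only the metadata has been requested so far
rows = ds.data         # first access triggers the (fake) download
rows_again = ds.data   # second access reuses the cached rows
print(calls)
```

In real use, the two loaders could presumably wrap something like `pd.read_csv(e.get_info_url(response="csv"))` and `e.to_pandas`, but that wiring is an assumption and untested here.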
Hi @ocefpaf. I have finally come back to this issue. Thanks for the tip above, but my problem remains. Using get_info_url I get all the variables/attributes, which is fine. But then I would like to retrieve a subset of these, and I can't see how to avoid the time parameter.
Below is an example using the EMODNET server:
from erddapy import ERDDAP
import pandas as pd
e = ERDDAP(
    server="https://erddap.emodnet-physics.eu/erddap",
    protocol="tabledap",
)
e.response = "csv"
e.dataset_id = "EMODPACE_NMDIS_PSMSL_L2A_SLEV_TG_TS"
info_url = e.get_info_url(response='csv')
info = pd.read_csv(info_url)
info['Variable Name'].unique()
info['Attribute Name'].unique()
So far so good. However, what I need is the following:
e.variables = [
    "StationName",
    "EP_PLATFORM_CODE",
    "EP_PLATFORM_TYPE",
    "EP_PLATFORM_LINK",
    "StationCountry",
    "longitude",
    "latitude",
]
If I use
df = e.to_pandas(low_memory=False)
I get one row for every timestamp. How can I get the above info without the time dimension?
@brey is this still an issue? I just tested it, and I don't see time in the returned results:
> df.head()
   StationName  EP_PLATFORM_CODE  EP_PLATFORM_TYPE                                   EP_PLATFORM_LINK  StationCountry  longitude (degrees_east)  latitude (degrees_north)
0       Dalian            Dalian                TG  https://www.emodnet-physics.eu/map/spi.aspx?id...              CN                    121.68                     38.87
1       Kanmen            Kanmen                TG  https://www.emodnet-physics.eu/map/spi.aspx?id...              CN                    121.28                     28.08
2       Nansha            Nansha                TG  https://www.emodnet-physics.eu/map/spi.aspx?id...              CN                    112.88                      9.55
3        Xisha             Xisha                TG  https://www.emodnet-physics.eu/map/spi.aspx?id...              CN                    112.30                     16.80
4        Zhapo             Zhapo                TG  https://www.emodnet-physics.eu/map/spi.aspx?id...              CN                    111.81                     21.58
Try
df.loc[df.EP_PLATFORM_CODE=='Xisha']
You get one entry per timestamp for each station.
All the rows are identical, are they not? Then maybe,
df.loc[df.EP_PLATFORM_CODE=='Xisha'].iloc[0]
might be enough?
Or maybe even:
df.groupby(df.EP_PLATFORM_CODE).first()
I know, but that means that if another server has a longer time range, the amount of data you download will be quite large.
Sorry, this one flew under the radar but I just found it. Maybe
df = e.to_pandas(distinct=True)
can help you there. That would return only unique values, filtered on the server side first. The result should be similar to calling the pandas unique method after downloading.
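To illustrate why the distinct=True suggestion above avoids the large download, here is a rough sketch of the kind of tabledap request it corresponds to. ERDDAP's tabledap protocol supports a server-side `&distinct()` filter appended to the query. The helper name `tabledap_distinct_url` is made up for this example, and the URL is built by hand, so it may not match byte-for-byte what erddapy generates.

```python
from urllib.parse import quote


def tabledap_distinct_url(server, dataset_id, variables, response="csv"):
    """Build a tabledap URL that asks the server for unique rows only.

    Hypothetical helper for illustration; not part of erddapy.
    """
    # Requested variables go after '?' as a comma-separated list.
    query = ",".join(quote(v) for v in variables)
    # '&distinct()' tells ERDDAP to drop duplicate rows before responding,
    # so the per-timestamp repetition never leaves the server.
    return f"{server.rstrip('/')}/tabledap/{dataset_id}.{response}?{query}&distinct()"


url = tabledap_distinct_url(
    "https://erddap.emodnet-physics.eu/erddap",
    "EMODPACE_NMDIS_PSMSL_L2A_SLEV_TG_TS",
    ["StationName", "EP_PLATFORM_CODE", "longitude", "latitude"],
)
print(url)
```

Because time is not among the requested variables, the distinct rows should collapse to one per station, assuming the station attributes do not change over time.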
When using erddapy to retrieve the metadata, the full set of data is parsed, including data variables. This results in a long wait depending on the volume of data. There has to be a way to simplify/expedite this.