Open MathewBiddle opened 2 years ago
Need to check my code against @kbailey-noaa
Mine:
import pandas as pd
df_glider = pd.read_csv('https://gliders.ioos.us/erddap/tabledap/allDatasets.csvp?minTime%2CmaxTime%2CdatasetID')
df_glider.dropna(
axis=0,
inplace=True,
)
# drop delayed datasets
df_glider = df_glider[df_glider["datasetID"].str.contains("delayed")==False]
df_glider[['minTime (UTC)','maxTime (UTC)']] = df_glider[
['minTime (UTC)','maxTime (UTC)']
].apply(pd.to_datetime)
start_date = '2000-01-01'
end_date = '2022-12-31'
# find glider deployments between 10/01 and 12/31
glider_day_within = df_glider.loc[
(df_glider['minTime (UTC)'] > pd.to_datetime(start_date,utc=True)) &
(df_glider['maxTime (UTC)'] < pd.to_datetime(end_date,utc=True))
]
# gliders that start before 10/01 and end after 12/31
glider_day_outside = df_glider.loc[
(df_glider['minTime (UTC)'] < pd.to_datetime(start_date,utc=True)) &
(df_glider['maxTime (UTC)'] > pd.to_datetime(end_date,utc=True))
]
glider_day_outside.loc[:, 'maxTime (UTC)'] = pd.to_datetime(end_date, utc=True)
glider_day_outside.loc[:, 'minTime (UTC)'] = pd.to_datetime(start_date, utc=True)
# drop the ones from above as they will be duplicates in the next round of filtering
df_glider.drop(axis=0, index=glider_day_outside.index, inplace=True)
# Find gliders that start before 10/01 and end after 10/01
glider_day_lower = df_glider.loc[
(df_glider['minTime (UTC)'] < pd.to_datetime(start_date,utc=True)) &
(df_glider['maxTime (UTC)'] > pd.to_datetime(start_date,utc=True))
]
glider_day_lower.loc[:,'minTime (UTC)'] = pd.to_datetime(start_date, utc=True)
# Find gliders that start before 12/31 and end after 12/31.
glider_day_upper = df_glider.loc[
(df_glider['minTime (UTC)']<pd.to_datetime(end_date,utc=True)) &
(df_glider['maxTime (UTC)']>pd.to_datetime(end_date,utc=True))
]
glider_day_upper.loc[:,'maxTime (UTC)'] = pd.to_datetime(end_date, utc=True)
# Combine it all together into one DF.
glider_subset = pd.concat([glider_day_lower,
glider_day_within,
glider_day_upper,
glider_day_outside],
verify_integrity=True)
# Calculate the days between min time and max time.
glider_subset['glider_days'] = (glider_subset['maxTime (UTC)'] - glider_subset['minTime (UTC)']).dt.days
# Calculate total glider days.
glider_days = glider_subset['glider_days'].sum()
print("Glider days between %s and %s: %s" % (start_date,end_date,glider_days))
Kathy's:
# import ipydatetime
# from datetime import datetime
# from ipywidgets import VBox, HBox, Label, BoundedFloatText
# dt0 = ipydatetime.NaiveDatetimePicker(value=datetime(2000, 1, 1, 0, 0, 0), description='start time:')
# dt1 = ipydatetime.NaiveDatetimePicker(value=datetime(2022, 12, 31, 23, 59, 59), description='end time:')
# south = BoundedFloatText(
# value=-90,
# min=-90,
# max=90,
# description='min lat:',
# )
# north = BoundedFloatText(
# value=90,
# min=-90,
# max=90,
# description='max lat:',
# )
# west = BoundedFloatText(
# value=-180,
# min=-180,
# max=180,
# description='min lon:',
# )
# east = BoundedFloatText(
# value=180,
# min=-180,
# max=180,
# description='max lon:',
# )
# from ipywidgets import Layout, Button, Box
# box_layout = Layout(
# display="flex",
# flex_flow="column",
# align_items="center"
# )
# items = [
# VBox([dt0, dt1]),
# north,
# HBox([west, east]),
# south,
# ]
# box = Box(children=items, layout=box_layout)
# box
# dt0 = dt0.value
# dt1 = dt1.value
# south = south.value
# north = north.value
# west = west.value
# east = east.value
# params = {
# "min_time": dt0,
# "max_time": dt1,
# "min_lat": south,
# "max_lat": north,
# "min_lon": west,
# "max_lon": east,
# }
# from gdutils import GdacClient
# client = GdacClient()
# client.search_datasets(params=params, include_delayed_mode=False)
# client.datasets
# # Count the total number of deployments within the dt0:dt1 time window
# num_deployments = client.datasets.shape[0]
# # Count the number of glider days within the dt0:dt1 time window
# glider_days = client.glider_days_per_yyyymmdd.loc[dt0:dt1].sum()
# # count the number of profiles per dataset
# profile_count = client.profiles_per_yyyymmdd.loc[dt0:dt1].sum()
# datasets = client.datasets.copy()
# import warnings
# sea_names = []
# funding_sources = []
# for dataset_id, row in datasets.iterrows():
# # Fetch the dataset description from ERDDAP
# info = client.get_dataset_metadata(dataset_id)
# if info.empty:
# continue
# # Find all global NetCDF attributes
# nc_globals = info.loc[info["Variable Name"] == "NC_GLOBAL"]
# # Find the sea_name global attribute
# sea_name_attr = nc_globals.loc[nc_globals["Attribute Name"] == "sea_name"]
# sea_name = "unknown"
# if not sea_name_attr.empty:
# sea_name = sea_name_attr.Value.iloc[0]
# sea_name = sea_name or "unknown"
# else:
# warnings.warn(f"{dataset_id}: sea_name NC_GLOBAL not found")
# # Find all global attributes that begin with "acknowledg" as this attribute typically contains the funding sources
# funding_attr = nc_globals.loc[nc_globals["Attribute Name"].str.startswith("acknowledg")]
# funding = "unknown"
# if not funding_attr.empty:
# funding = funding_attr.Value.iloc[0]
# funding = funding or "unknown"
# else:
# warnings.warn(f"{dataset_id}: acknowledgment NC_GLOBAL not found")
# sea_names.append(sea_name)
# funding_sources.append(funding)
# # Count only the days and profiles that are dt0:dt1 inclusive
# days = client.datasets_days[dataset_id].loc[dt0:dt1].sum()
# profiles = client.datasets_profiles[dataset_id].loc[dt0:dt1].sum()
# datasets.loc[dataset_id, ("days", "num_profiles")] = [days, profiles]
# print(f"time ranged from {dt0} to {dt1}")
# print(f"""
# {num_deployments = }
# {glider_days = }
# {profile_count = }
# """
# )
Should be
Glider Days Reported to the IOOS Glider DAC
@ocefpaf can I get some help with this at some point? I could never get gdutils
installed and functioning (mainly due to lack of time - a simple conda install gdutils
would help) to actually compare the two.
cc: @kbailey-noaa
@MathewBiddle gdutils was never packaged. See https://github.com/kerfoot/gdutils/pull/2, https://github.com/kerfoot/gdutils/pull/3, and https://github.com/kerfoot/gdutils/pull/4.
However, I do have a lockfile and a local clone that could make it work. With that said, we should probably build the necessary functionalities you all need into gliderpy. Let's chat a bit on slack and I can try to get that going ASAP.
Can you check your functions against https://nbviewer.org/gist/ocefpaf/2f5a844d556a41b54f894e17541ec4db ? That is is "kind of" validated against gdutils
and we know what is causing the differences. Some of those differences we may want to fix, others... Maybe not.
using my script, which harvests dates from the global metadata on the glider erddap (https://gliders.ioos.us/erddap/tabledap/allDatasets.htmlTable?minTime%2CmaxTime%2CdatasetID):
Glider days between 2021-06-01 and 2021-08-10: 2187
In your gist:
print(f"We computed {metadata['days'].sum().days} in gdutils we get 2288.")
We computed 2280 in gdutils we get 2288.
print(f"We computed {metadata['num_profiles'].sum()} in gdutils we get 58978.")
We computed 58187 in gdutils we get 58978.
So, off by 100 days. That's a lot.
The off by 100 is in your script, right? My guess is that it is b/c your are rounding days using the default in pandas instead of using floor/ceil for each start/end date. In my code that is days = df.index[-1].ceil("D") - df.index[0].floor("D")
and we have to run it for each dataset_id
to get an accurate measure. Still, I don't think 100 is bad if you think that most of those are maybe minutes/hours in a single day that we are not counting. If we want to be "precise" and use days/hours/minutes, my rounded days would be closer to yours.
I don't think your estimate is bad if you look at it from the point of view that you requested the minimum amount of metadata/data possible from the allDatasets in ERDDAP. It is an elegant and fast way to get an approximate answer. However, if I understood what John used to do and what Kathy needs, we should consider a glider that was in the water just a minute past midnight, as a new glider day, right?
TL;DR IMO both are correct answers depending on the message we want to convey.
Similar to the GTS calculations, it would be useful to have gliderDAC quarterly metrics too.
https://github.com/MathewBiddle/ioos_by_the_numbers/tree/main/gts