ioos / ioos_metrics

Working on creating metrics for the IOOS by the numbers
https://ioos.github.io/ioos_metrics/

Add quarterly glider metrics #4

Open MathewBiddle opened 2 years ago

MathewBiddle commented 2 years ago

Similar to the GTS calculations, it would be useful to have quarterly Glider DAC metrics too.

https://github.com/MathewBiddle/ioos_by_the_numbers/tree/main/gts

MathewBiddle commented 1 year ago

Need to check my code against @kbailey-noaa's.

Mine:

import pandas as pd

# Pull the deployment time range for every dataset from the Glider DAC ERDDAP.
df_glider = pd.read_csv('https://gliders.ioos.us/erddap/tabledap/allDatasets.csvp?minTime%2CmaxTime%2CdatasetID')
df_glider.dropna(
    axis=0,
    inplace=True,
)

# drop delayed-mode datasets
df_glider = df_glider[~df_glider["datasetID"].str.contains("delayed")]

df_glider[['minTime (UTC)', 'maxTime (UTC)']] = df_glider[
    ['minTime (UTC)', 'maxTime (UTC)']
].apply(pd.to_datetime)

start_date = '2000-01-01'
end_date = '2022-12-31'

# find deployments that fall entirely within [start_date, end_date]
glider_day_within = df_glider.loc[
    (df_glider['minTime (UTC)'] > pd.to_datetime(start_date, utc=True)) &
    (df_glider['maxTime (UTC)'] < pd.to_datetime(end_date, utc=True))
]

# deployments that start before start_date and end after end_date,
# i.e. span the whole window; clip both endpoints to the window
glider_day_outside = df_glider.loc[
    (df_glider['minTime (UTC)'] < pd.to_datetime(start_date, utc=True)) &
    (df_glider['maxTime (UTC)'] > pd.to_datetime(end_date, utc=True))
].copy()

glider_day_outside.loc[:, 'maxTime (UTC)'] = pd.to_datetime(end_date, utc=True)
glider_day_outside.loc[:, 'minTime (UTC)'] = pd.to_datetime(start_date, utc=True)

# drop the ones from above as they will be duplicates in the next round of filtering
df_glider.drop(axis=0, index=glider_day_outside.index, inplace=True)

# Find deployments that start before start_date and end inside the window;
# clip minTime to start_date.
glider_day_lower = df_glider.loc[
    (df_glider['minTime (UTC)'] < pd.to_datetime(start_date, utc=True)) &
    (df_glider['maxTime (UTC)'] > pd.to_datetime(start_date, utc=True))
].copy()

glider_day_lower.loc[:, 'minTime (UTC)'] = pd.to_datetime(start_date, utc=True)

# Find deployments that start inside the window and end after end_date;
# clip maxTime to end_date.
glider_day_upper = df_glider.loc[
    (df_glider['minTime (UTC)'] < pd.to_datetime(end_date, utc=True)) &
    (df_glider['maxTime (UTC)'] > pd.to_datetime(end_date, utc=True))
].copy()

glider_day_upper.loc[:, 'maxTime (UTC)'] = pd.to_datetime(end_date, utc=True)

# Combine it all together into one DataFrame.
glider_subset = pd.concat(
    [glider_day_lower, glider_day_within, glider_day_upper, glider_day_outside],
    verify_integrity=True,
)

# Calculate whole days between min time and max time (truncates partial days).
glider_subset['glider_days'] = (glider_subset['maxTime (UTC)'] - glider_subset['minTime (UTC)']).dt.days

# Calculate total glider days.
glider_days = glider_subset['glider_days'].sum()

print("Glider days between %s and %s: %s" % (start_date,end_date,glider_days))

Kathy's:

import ipydatetime
from datetime import datetime
from ipywidgets import VBox, HBox, Label, BoundedFloatText

dt0 = ipydatetime.NaiveDatetimePicker(value=datetime(2000, 1, 1, 0, 0, 0), description='start time:')
dt1 = ipydatetime.NaiveDatetimePicker(value=datetime(2022, 12, 31, 23, 59, 59), description='end time:')

south = BoundedFloatText(
    value=-90,
    min=-90,
    max=90,
    description='min lat:',
)

north = BoundedFloatText(
    value=90,
    min=-90,
    max=90,
    description='max lat:',
)

west = BoundedFloatText(
    value=-180,
    min=-180,
    max=180,
    description='min lon:',
)

east = BoundedFloatText(
    value=180,
    min=-180,
    max=180,
    description='max lon:',
)

from ipywidgets import Layout, Button, Box

box_layout = Layout(
    display="flex",
    flex_flow="column",
    align_items="center",
)

items = [
    VBox([dt0, dt1]),
    north,
    HBox([west, east]),
    south,
]

# Display the widgets (notebook cell).
box = Box(children=items, layout=box_layout)
box

# Read the selected values back out of the widgets.
dt0 = dt0.value
dt1 = dt1.value
south = south.value
north = north.value
west = west.value
east = east.value

params = {
    "min_time": dt0,
    "max_time": dt1,
    "min_lat": south,
    "max_lat": north,
    "min_lon": west,
    "max_lon": east,
}

from gdutils import GdacClient

client = GdacClient()

client.search_datasets(params=params, include_delayed_mode=False)
client.datasets

# Count the total number of deployments within the dt0:dt1 time window
num_deployments = client.datasets.shape[0]

# Count the number of glider days within the dt0:dt1 time window
glider_days = client.glider_days_per_yyyymmdd.loc[dt0:dt1].sum()

# Count the number of profiles per dataset
profile_count = client.profiles_per_yyyymmdd.loc[dt0:dt1].sum()

datasets = client.datasets.copy()

import warnings

sea_names = []
funding_sources = []
for dataset_id, row in datasets.iterrows():

    # Fetch the dataset description from ERDDAP
    info = client.get_dataset_metadata(dataset_id)

    if info.empty:
        continue

    # Find all global NetCDF attributes
    nc_globals = info.loc[info["Variable Name"] == "NC_GLOBAL"]

    # Find the sea_name global attribute
    sea_name_attr = nc_globals.loc[nc_globals["Attribute Name"] == "sea_name"]
    sea_name = "unknown"
    if not sea_name_attr.empty:
        sea_name = sea_name_attr.Value.iloc[0]
        sea_name = sea_name or "unknown"
    else:
        warnings.warn(f"{dataset_id}: sea_name NC_GLOBAL not found")

    # Find all global attributes that begin with "acknowledg", as this
    # attribute typically contains the funding sources
    funding_attr = nc_globals.loc[nc_globals["Attribute Name"].str.startswith("acknowledg")]
    funding = "unknown"
    if not funding_attr.empty:
        funding = funding_attr.Value.iloc[0]
        funding = funding or "unknown"
    else:
        warnings.warn(f"{dataset_id}: acknowledgment NC_GLOBAL not found")

    sea_names.append(sea_name)
    funding_sources.append(funding)

    # Count only the days and profiles that are dt0:dt1 inclusive
    days = client.datasets_days[dataset_id].loc[dt0:dt1].sum()
    profiles = client.datasets_profiles[dataset_id].loc[dt0:dt1].sum()
    datasets.loc[dataset_id, ["days", "num_profiles"]] = [days, profiles]

print(f"time ranged from {dt0} to {dt1}")
print(f"""
{num_deployments = }
{glider_days = }
{profile_count = }
"""
)

MathewBiddle commented 1 year ago

Should be:

Glider Days Reported to the IOOS Glider DAC

MathewBiddle commented 1 year ago

@ocefpaf can I get some help with this at some point? I could never get gdutils installed and functioning well enough to actually compare the two (mainly due to lack of time; a simple conda install gdutils would help).

cc: @kbailey-noaa

ocefpaf commented 1 year ago

@MathewBiddle gdutils was never packaged. See https://github.com/kerfoot/gdutils/pull/2, https://github.com/kerfoot/gdutils/pull/3, and https://github.com/kerfoot/gdutils/pull/4.

However, I do have a lockfile and a local clone that could make it work. With that said, we should probably build the functionality you all need into gliderpy. Let's chat a bit on Slack and I can try to get that going ASAP.

ocefpaf commented 8 months ago

Can you check your functions against https://nbviewer.org/gist/ocefpaf/2f5a844d556a41b54f894e17541ec4db ? That one is "kind of" validated against gdutils, and we know what is causing the differences. Some of those differences we may want to fix; others, maybe not.

MathewBiddle commented 8 months ago

Using my script, which harvests dates from the global metadata on the glider ERDDAP (https://gliders.ioos.us/erddap/tabledap/allDatasets.htmlTable?minTime%2CmaxTime%2CdatasetID):

Glider days between 2021-06-01 and 2021-08-10: 2187

In your gist:

print(f"We computed {metadata['days'].sum().days} in gdutils we get 2288.")
We computed 2280 in gdutils we get 2288.
print(f"We computed {metadata['num_profiles'].sum()} in gdutils we get 58978.")
We computed 58187 in gdutils we get 58978.

So, off by about 100 days. That's a lot.

MathewBiddle commented 8 months ago

Here is my code https://gist.github.com/MathewBiddle/be67a380228a7da3d084e6e4da7786a0

ocefpaf commented 8 months ago

The off-by-100 is in your script, right? My guess is that it is because you are rounding days using the default in pandas instead of using floor/ceil for each start/end date. In my code that is days = df.index[-1].ceil("D") - df.index[0].floor("D"), and we have to run it for each dataset_id to get an accurate measure. Still, I don't think 100 is bad if you consider that most of those are maybe minutes/hours in a single day that we are not counting. If we want to be "precise" and use days/hours/minutes, my rounded days would be closer to yours.
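
For comparison, here is a sketch (not the gist's exact code) of applying that floor/ceil convention to the allDatasets frame from the script above, reusing its column names:

import pandas as pd

df = pd.read_csv('https://gliders.ioos.us/erddap/tabledap/allDatasets.csvp?minTime%2CmaxTime%2CdatasetID').dropna()
df[['minTime (UTC)', 'maxTime (UTC)']] = df[['minTime (UTC)', 'maxTime (UTC)']].apply(pd.to_datetime)

# Ceil each deployment's end and floor its start to midnight, so any partial
# day in the water counts as a full glider day.
glider_days = (
    df['maxTime (UTC)'].dt.ceil('D') - df['minTime (UTC)'].dt.floor('D')
).dt.days.sum()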

I don't think your estimate is bad if you look at it from the point of view that you requested the minimum amount of metadata possible from allDatasets in ERDDAP. It is an elegant and fast way to get an approximate answer. However, if I understood what John used to do and what Kathy needs, we should count a glider that was in the water just a minute past midnight as a new glider day, right?
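
Concretely, a deployment that crosses midnight by one minute shows how far apart the two conventions can be:

import pandas as pd

min_t = pd.Timestamp('2021-06-01 23:50', tz='UTC')
max_t = pd.Timestamp('2021-06-02 00:01', tz='UTC')

(max_t - min_t).days                       # 0 days: the 11-minute timedelta truncates
(max_t.ceil('D') - min_t.floor('D')).days  # 2 days: both calendar days count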

TL;DR: IMO both are correct answers, depending on the message we want to convey.