oceansites / dmt

Activities of the OceanSITES Data Management Team
http://www.oceansites.org/data

Metrics: Number of time series longer than X years #27

Open dpsnowden opened 8 years ago

dpsnowden commented 8 years ago

Long time series are one of the primary goals of OceanSITES. Can we calculate the number of products in the PRODUCT (or whatever replaces DATA_GRIDDED) directory that have length greater than 2, 5, 10, or 20 years?

@MBARIMike this is the application I was thinking of for the thredds_crawler script.
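The counting step itself is simple once start/end times are in hand; a minimal sketch, assuming the (start, end) pairs have already been harvested from each product file (e.g. by walking the THREDDS catalog with thredds_crawler and reading the files' `time_coverage_start`/`time_coverage_end` attributes). The function name and the example spans are hypothetical:

```python
from datetime import datetime

def count_longer_than(spans, thresholds_years=(2, 5, 10, 20)):
    """Count time series whose (start, end) span exceeds each threshold.

    spans: iterable of (start, end) datetime pairs, one per product file.
    Returns a dict mapping threshold (years) -> number of series longer
    than that many years.
    """
    lengths = [(end - start).days / 365.25 for start, end in spans]
    return {t: sum(1 for length in lengths if length > t)
            for t in thresholds_years}

# Example with three hypothetical products
spans = [
    (datetime(2000, 1, 1), datetime(2021, 6, 1)),   # ~21.4 yr
    (datetime(2010, 3, 1), datetime(2016, 3, 1)),   # ~6.0 yr
    (datetime(2018, 1, 1), datetime(2019, 1, 1)),   # ~1.0 yr
]
print(count_longer_than(spans))  # {2: 2, 5: 2, 10: 1, 20: 1}
```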

tcarval commented 5 years ago

Calculate these statistics on the long time series available in the DATA_GRIDDED directory and put the results on the web site.

nanderson123 commented 5 years ago

As of July 10, 2018, there are 5 folders in DATA_GRIDDED, with a files-per-folder breakdown of: KEO 9, PAPA 11, PIRATA 115, RAMA 126, TRITON 19.

Adding these up, there are 280 files according to the index file. The mode breakdown is: 181 delayed mode, 99 realtime

The breakdown of standard names is shown below. This is what's in the files, and I haven't checked whether each and every name is CF compliant. Note that a standard name can recur within one file (e.g. multiple depth axes).

height: 922
depth: 563
latitude: 351
longitude: 351
time: 280
sea_water_temperature: 153
surface_downwelling_shortwave_flux_in_air: 150
air_temperature: 146
relative_humidity: 146
wind_speed: 146
eastward_wind: 145
northward_wind: 145
wind_to_direction: 145
rainfall_rate: 141
sea_water_salinity: 139
sea_water_sigma_theta: 138
eastward_sea_water_velocity: 127
northward_sea_water_velocity: 127
direction_of_sea_water_velocity: 102
sea_water_speed: 102
air_pressure_at_sea_level: 81
surface_downwelling_shortwave_flux_in_air_standard_deviation: 53
surface_downwelling_longwave_flux_in_air: 42
sea_water_electrical_conductivity: 19
sea_water_sigma_t: 19
upward_sea_water_velocity: 19
sea_water_pressure: 3

Please let me know if additional, specific stats are desired, but this gives us a good idea of what's being measured. Next, I'll try to rework my script to give us an idea of the time spans of the files.
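The tallying itself reduces to a `Counter` over per-file name lists; a minimal sketch, where the per-file lists are hypothetical stand-ins for what a script would collect from each netCDF file (e.g. with netCDF4-python, `[v.standard_name for v in nc.variables.values() if hasattr(v, "standard_name")]`):

```python
from collections import Counter

def tally_standard_names(files_standard_names):
    """Tally standard_name occurrences across a set of files.

    files_standard_names: iterable of lists, one list per file, holding
    every standard_name found in that file (a name can repeat within a
    file, e.g. multiple depth axes).
    """
    totals = Counter()
    for names in files_standard_names:
        totals.update(names)
    return totals

# Example with two hypothetical files
files = [
    ["time", "depth", "depth", "sea_water_temperature"],
    ["time", "depth", "air_temperature"],
]
print(tally_standard_names(files).most_common())
```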

nanderson123 commented 5 years ago

Here's a quick summary of DATA_GRIDDED. Others may want to check the work (zipped .m file). This is based on the index file from NDBC's FTP site (attached), shortened to include only DATA_GRIDDED files. It does not guarantee that data are present (variables could be NaN-filled). The distribution appears trimodal, with concentrations of time series at roughly 1, 11, and 18 years, likely reflecting the operational lifetimes of the programs contained in DATA_GRIDDED.

Someone may want to write a script that actually opens each file and assesses the percentage of data present in each file. However, it's difficult to characterize gaps (instrument failure? depth mismatch between deployments requiring NaN-filling of a larger array? other?).

Basic statistics:
Min file length = 0.02 yr
Max file length = 24.47 yr
Mean file length = 8.53 yr
Median file length = 8.64 yr

data_gridded_histogram (attached image: histogram of DATA_GRIDDED time-series lengths)

index.txt - Truncated Index File

DATA.m.zip - Matlab Script

Regards, Nathan
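The length statistics above follow directly from timestamp pairs in the index file; a minimal sketch, assuming hypothetical (start, end) columns in GDAC-style YYYYMMDDHHMMSS form (the real index defines the actual column layout):

```python
from datetime import datetime
from statistics import mean, median

def span_years(start, end, fmt="%Y%m%d%H%M%S"):
    """Length in years between two YYYYMMDDHHMMSS timestamps."""
    t0, t1 = datetime.strptime(start, fmt), datetime.strptime(end, fmt)
    return (t1 - t0).days / 365.25

def length_stats(lengths):
    """Min/max/mean/median over a list of lengths in years."""
    return {"min": min(lengths), "max": max(lengths),
            "mean": mean(lengths), "median": median(lengths)}

# Hypothetical (start, end) pairs pulled from index-file columns
pairs = [("20000101000000", "20180701000000"),
         ("20100101000000", "20110101000000")]
lengths = [span_years(s, e) for s, e in pairs]
print(length_stats(lengths))
```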

ngalbraith commented 5 years ago

Why don't we calculate the number of years of data at all the sites, not just the ones that are currently presented as long time series files in the 'gridded' or 'product' directories? We have uploaded met data for Stratus and NTAS starting in 2000 and 2001, respectively. The time-merged versions of these will be submitted ... very soon, but the data is already there, in the 'data' directory.

nanderson123 commented 5 years ago

Here's a short summary of DATA. Again, interpret cautiously, and provide feedback. For example, a large number of 8 hr files biases the statistics. A few index-file entries were corrupt, or were not parsed correctly due to an incorrect number of fields (~1%).

data (attached image: histogram of DATA time-series lengths)

DATA_writeup.docx - Variable Breakdown

oceansites_index.txt - DATA Index File

DATA.m.zip - Matlab Script
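Skipping those corrupt entries comes down to checking the field count per line before use; a minimal sketch, where the field count and line format are hypothetical placeholders for whatever the real GDAC index defines:

```python
def parse_index(lines, n_fields=8):
    """Parse comma-separated index-file lines, skipping malformed entries.

    n_fields is a placeholder for the real index's column count.
    Returns (rows, n_skipped).
    """
    rows, skipped = [], 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank or header/comment line
        fields = line.split(",")
        if len(fields) != n_fields:
            skipped += 1  # corrupt or mis-delimited entry
            continue
        rows.append(fields)
    return rows, skipped

# Example: one good row, one corrupt row, one comment
sample = ["a,b,c,d,e,f,g,h", "bad,line", "# header", ""]
rows, skipped = parse_index(sample)
print(len(rows), skipped)  # 1 1
```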

MBARIMike commented 5 years ago

I'm curious about what techniques are used to combine multiple deployments into a single long time series file. In the past I've used Ferret because of its re-gridding functions. What is your tool of choice?

ngalbraith commented 5 years ago

We do this 'manually' in Matlab.

As part of the process, we turn some of the fields (sensor heights, water depth, range ring size, serial numbers, surface current velocity depth, instrument model, deployment and recovery cruises, etc.) into arrays. I'm sure Ferret could do that too, but I'm not sure whether our files are Ferret-friendly.

I don't do the actual merging, but I'm cc'ing the person who does, Kelan Huang, in case there's more to the process than just concatenating the data arrays.

Since we apply a single magnetic correction to each deployment, based on the center point of the deployment year, I suspect we have jumps in the values at redeployments. I don't think we address that, but ... maybe we do.

Regards - Nan
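The concatenation step described above can be sketched in a few lines; a minimal example with numpy, using hypothetical deployment arrays. Note it deliberately does nothing about per-deployment corrections, so any step change at a redeployment boundary (e.g. from the per-deployment magnetic correction) would carry through into the merged series:

```python
import numpy as np

def merge_deployments(deployments):
    """Concatenate per-deployment (time, value) arrays into one series.

    deployments: list of (time, value) 1-D array pairs, one pair per
    mooring deployment. Times are merged and sorted; no reconciliation
    of per-deployment corrections is attempted.
    """
    t = np.concatenate([d[0] for d in deployments])
    v = np.concatenate([d[1] for d in deployments])
    order = np.argsort(t)
    return t[order], v[order]

# Two hypothetical back-to-back deployments
d1 = (np.array([0.0, 1.0, 2.0]), np.array([10.0, 11.0, 12.0]))
d2 = (np.array([3.0, 4.0]), np.array([20.0, 21.0]))
t, v = merge_deployments([d1, d2])
print(t)  # [0. 1. 2. 3. 4.]
```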


MBARIMike commented 5 years ago

Thanks for sharing @ngalbraith !

This question seems to also apply to https://github.com/oceansites/dmt/issues/28 and https://github.com/oceansites/dmt/issues/46.