nanoos-pnw / NCEI-archiving

Code, documentation and issue tracking for NANOOS NCEI archiving
Apache License 2.0

CMOP final test files -- issue listing #1

Open emiliom opened 7 years ago

emiliom commented 7 years ago

Opening this issue to track the issues / bugs identified by Matt Biddle (or us) in his assessment of the first batch of "near final" files submitted by NANOOS to NCEI for final assessment. These files were made ready for NCEI on 2016-11-30 and are at http://data.nanoos.org/ncei/ohsucmop/

These tasks are all on @cseaton, unless we find an exception that's on @emiliom.

Updated 12/14/2016 08:30 PT

Variables

instrument variable

Global attributes

bagit files

Other

emiliom commented 7 years ago

Another possibly outstanding task we talked about (copying and pasting from Matt):

We did talk about creating a 'documentation-only' archival information package for this data set, which would contain information about the QA/QC processing. Here is an example of a 'documentation only' archive package, http://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0070493. This is a bare-bones example, but we can add more metadata to provide the appropriate references to other archival packages.

MathewBiddle commented 7 years ago

This all looks good. I'll add items as we see fit. Thanks!

MathewBiddle commented 7 years ago

In some of the bag-info.txt files, the bag-size attribute is as follows:

bag-size: 0.000000 MB

One example is the seahs/ package. I get the following:

seahs$ du -sh .
832K    .

Is this an expected result?

MathewBiddle commented 7 years ago

Should this always be the same (in the bag-info.txt file)?

bag-group-identifier: Center for Coastal Margin Observation and Prediction: SATURN

MathewBiddle commented 7 years ago

There are a few variables missing long_name attributes:

$ find . -type f -iname "*.nc" | while read i; do ncdump -h "$i" | grep "long_name ="; done | sort -n | uniq | grep "\"\""
                airtemp:long_name = "" ;
                clay:long_name = "" ;
                eastward_velocity:long_name = "" ;
                eastward_velocity_stderror:long_name = "" ;
                elevation:long_name = "" ;
                elev_stdev:long_name = "" ;
                heading_stderror:long_name = "" ;
                humidity:long_name = "" ;
                instrument_apna_mode:long_name = "" ;
                instrument_leak:long_name = "" ;
                instrument_nh4+:long_name = "" ;
                instrument_no2:long_name = "" ;
                instrument_nox:long_name = "" ;
                instrument_po43-:long_name = "" ;
                instrument_si:long_name = "" ;
                northward_velocity:long_name = "" ;
                northward_velocity_stderror:long_name = "" ;
                pitch_stderror:long_name = "" ;
                pressure:long_name = "" ;
                pressure_stderror:long_name = "" ;
                roll_stderror:long_name = "" ;
                sand:long_name = "" ;
                silt:long_name = "" ;
                speed_of_sound:long_name = "" ;
                sum:long_name = "" ;
                sumscat:long_name = "" ;
                tau:long_name = "" ;
                temperature:long_name = "" ;
                temperature_stderror:long_name = "" ;
                upward_velocity:long_name = "" ;
                upward_velocity_stderror:long_name = "" ;
                water_bottom:long_name = "" ;
                water_transcount:long_name = "" ;
                winddirection:long_name = "" ;
                windgust:long_name = "" ;
                windspeed:long_name = "" ;
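
The same check can be sketched in Python, operating on the text that `ncdump -h` prints. This is a minimal sketch, not part of the NCEI tooling: the function name `find_empty_attrs` and the regular expression are assumptions made for illustration, and the pattern only covers string attributes written on a single line.

```python
import re

# Matches CDL attribute lines such as:   airtemp:long_name = "" ;
# Group 1 is the variable name (empty for global attributes),
# group 2 the attribute name, group 3 the quoted string value.
ATTR_RE = re.compile(r'^\s*([^\s:]*):(\w+)\s*=\s*"(.*)"\s*;\s*$')


def find_empty_attrs(cdl_header, attr_name="long_name"):
    """Return sorted variable names whose `attr_name` is an empty string."""
    empty = set()
    for line in cdl_header.splitlines():
        match = ATTR_RE.match(line)
        if match and match.group(2) == attr_name and match.group(3).strip() == "":
            empty.add(match.group(1) or "<global>")
    return sorted(empty)
```

Feeding it the header text of each file gives the list of variables to fix per file, rather than one merged list for the whole collection.
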
emiliom commented 7 years ago

Regarding this:

In some of the bag-info.txt files, the bag-size attribute is as follows:

bag-size: 0.000000 MB

One example is the seahs/ package.

For easy reference, the file is here.

I assume bag-size is generated by the bagit script, not manually? If so, that seems very suspicious (and wrong, of course). @cseaton?

MathewBiddle commented 7 years ago

Looks like there are some files with an instrument:make_model set to "/":

./saturn09/data/saturn09.0.F.GPS/201604-2238.nc
                instrument:make_model = "/" ;
./saturn09/data/saturn09.0.F.GPS/201603-2238.nc
                instrument:make_model = "/" ;

Is this expected?

cseaton commented 7 years ago

bag-size is not part of the default information generated by bagit.py. It is part of the additional metadata requested by NCEI. I was generating the value by integer division, so values less than 1 MB rounded down to 0 MB.
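
One way to avoid the rounding problem is to use floating-point division and pick the unit from the size. The sketch below is hypothetical and is not the code used for these bags: it assumes 1 KB = 1024 bytes and that bag-size should count the files under data/, and the function names are made up for illustration.

```python
from pathlib import Path


def payload_bytes(bag_dir):
    """Total size in bytes of all files under the bag's data/ directory."""
    data_dir = Path(bag_dir) / "data"
    return sum(p.stat().st_size for p in data_dir.rglob("*") if p.is_file())


def format_bag_size(num_bytes):
    """Format a byte count using the largest unit that keeps the value >= 1."""
    units = ["B", "KB", "MB", "GB", "TB"]
    size = float(num_bytes)  # float, so small bags never round down to 0
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.1f} {unit}"
        size /= 1024
```

With this approach a bag of a few hundred kilobytes is reported in KB instead of as 0 MB.
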

cseaton commented 7 years ago

Not expected. There may be entries without make-model information (and GPSes would be a likely example), but in that case the attribute should be omitted rather than left blank, much less set to a slash.
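
A possible guard for this, sketched under the assumption that the attributes are collected in a plain dictionary before being written to the file (the helper name `drop_placeholder_attrs` is hypothetical):

```python
import string


def drop_placeholder_attrs(attrs):
    """Return a copy of `attrs` without blank or punctuation-only string values."""
    junk = string.punctuation + string.whitespace
    cleaned = {}
    for name, value in attrs.items():
        if isinstance(value, str) and value.strip(junk) == "":
            continue  # "", " " or "/" : omit the attribute instead of writing it
        cleaned[name] = value
    return cleaned
```

Non-string values pass through unchanged, so numeric attributes such as a zero offset are not dropped by mistake.
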

Thanks.

Charles

cseaton commented 7 years ago

I will add long names for these variables.

cseaton commented 7 years ago

From the email you sent requesting the additional bag-info: --bag-group-identifier

Unique identifier which identifies other bags to which it belongs, like a regional association. E.g. “Southern California Coastal Ocean Observing System”.  

All of the bags I'm providing are products of the Center for Coastal Margin Observation and Prediction and part of the SATURN collaboratory. I could see an argument for going for the even broader NANOOS identifier, but the CMOP: SATURN identifier matches the level at which the bag-count field operates.

I'm not clear what this field is used for (or the other bag-info fields, such as source-organization: should that be the publisher or the creator in the netCDF file attributes?), so I'm not sure what the best labels would be.

Charles

cseaton commented 7 years ago

Re: keywords, looking at the "gold standard" example files, it looks like the keywords field is specifically for the variable keywords. Is that correct?

Charles

----- Original Message ----- | From: "Emilio Mayorga" notifications@github.com | To: "nanoos-pnw/NCEI-archiving" NCEI-archiving@noreply.github.com | Cc: "cseaton" cseaton@stccmop.org, "Mention" mention@noreply.github.com | Sent: Friday, December 2, 2016 9:01:19 AM | Subject: [nanoos-pnw/NCEI-archiving] CMOP final test files -- issue listing (#1)

> These tasks are all on @cseaton, unless we find an exception that's on @emiliom
>
> - [ ] invalid CF standard names in "raw" variables: We have three standard_name attributes that are not compliant with CF: light status_flag, raw_mass_concentration_of_chlorophyll_in_sea_water, raw_sea_water_turbidity. Note: The raw variables can have the same CF standard name as other variables, the rest of the attributes for the variable should describe the difference between the two (particularly the long_name attribute). Recommendation is to:
>   - [ ] change the standard_name attribute: raw_mass_concentration_of_chlorophyll_in_sea_water to mass_concentration_of_chlorophyll_in_sea_water and raw_sea_water_turbidity to sea_water_turbidity
>   - [ ] remove the standard_name for light_qc, if possible
> - [ ] There are a few files with empty standard_name attributes. I recommend removing the standard_name attribute if you cannot find an applicable term. See the attached text file (in email thread) for a list of the files and their associated blank standard_name attributes.
> - [ ] global attribute for processing_level: `:processing_level = "%s has been subject to preliminary quality assessment.and values have been modified during the preliminary quality assessment process.";`
> - [ ] files are missing a "keywords" global attribute. You have the keywords_vocabulary attribute, but no keywords.

MathewBiddle commented 7 years ago

For bag-group-identifier, that's fine. I wasn't sure if SATURN was specific to one station (thus a bug) or applicable to all packages. It's applicable to all packages so I will treat it as such.

The keywords global attribute is a list of keywords from the vocabulary defined in keywords_vocabulary (see http://wiki.esipfed.org/index.php/Attribute_Convention_for_Data_Discovery#keywords for more information). Since the files set :keywords_vocabulary = "GCMD Keyword Version: 8.4.1", you should populate the keywords attribute with the appropriate keywords from http://gcmdservices.gsfc.nasa.gov/static/kms/sciencekeywords/sciencekeywords.csv?ed_wiki_keywords_page. There is also a RESTful API for GCMD here https://wiki.earthdata.nasa.gov/display/CMR/Keyword+FAQ#expand-HowDoIAccesstheGCMDKeywords under "How Do I Access the GCMD Keywords?"
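
As a rough illustration of populating the attribute, the sketch below builds a comma-separated keywords string from the standard_name values found in a file. The lookup table, its two entries, and the function name are hypothetical placeholders, not NCEI's mapping; any real table should be checked against the GCMD keyword list referenced by keywords_vocabulary.

```python
# Hypothetical lookup from CF standard_name to a GCMD science keyword path.
# The entries are illustrative only; verify them against the GCMD list in use.
STANDARD_NAME_TO_GCMD = {
    "sea_water_temperature": "Earth Science > Oceans > Ocean Temperature > Water Temperature",
    "sea_water_practical_salinity": "Earth Science > Oceans > Salinity/Density > Salinity",
}


def build_keywords(standard_names, lookup=STANDARD_NAME_TO_GCMD):
    """Return a comma-separated keywords string, skipping unmapped names and duplicates."""
    terms = []
    for name in standard_names:
        term = lookup.get(name)
        if term and term not in terms:
            terms.append(term)
    return ", ".join(terms)
```

Unmapped standard names are skipped silently here; in practice it may be more useful to log them so the table can be extended.
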

emiliom commented 7 years ago

@mbiddle-nodc, between these comments on the bag-info.txt attributes and nc file ACDD metadata, and the offline exchanges you and I are having about the ATRAC metadata, I'm seeing a good bit of redundant/repeated metadata populated in different places via different mechanisms. For example, it seems like many of the keyword choices @cseaton would have to make for the nc file ACDD metadata are essentially the same as what I was seeing on the ATRAC iso-19115-2, yet one has to populate the two in completely separate and parallel ways -- which is both substantial extra work and has the potential for inconsistencies. How do we handle this in a way that minimizes these issues? For example, one option could be to focus on putting in good ACDD and variable attributes for each station, then write some code to go through all station nc files, extract and compile relevant ACDD, then spit it out into a text file for loading into the ATRAC project iso 19115-2 in some way.
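
The "extract and compile" idea could look roughly like the sketch below. It assumes the global attributes of each nc file have already been read into a dictionary (for example with a netCDF library, which is not shown), and it only merges the ACDD attributes keywords, time_coverage_start and time_coverage_end. The function name and the output layout are assumptions; producing ISO 19115-2 from the result is a separate step.

```python
def compile_acdd(per_file_attrs):
    """Merge global attributes from many files into one collection-level summary."""
    keywords = set()
    starts, ends = [], []
    for attrs in per_file_attrs:
        for keyword in attrs.get("keywords", "").split(","):
            if keyword.strip():
                keywords.add(keyword.strip())
        if "time_coverage_start" in attrs:
            starts.append(attrs["time_coverage_start"])
        if "time_coverage_end" in attrs:
            ends.append(attrs["time_coverage_end"])
    # Assumes every timestamp uses the same ISO 8601 layout, so that plain
    # string comparison orders them chronologically.
    return {
        "keywords": sorted(keywords),
        "time_coverage_start": min(starts) if starts else None,
        "time_coverage_end": max(ends) if ends else None,
    }
```
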

In addition, I'm still not clear what is the intent of the bag-info.txt attributes. That adds a third source of metadata. Are the definitions of these attributes described somewhere, say in your NCEI archiving cookbook? Is there information about how NCEI uses each of these attributes, to help us decide whether the effort in populating any given attribute is really worthwhile?

These issues can't be unique to NANOOS and our data. How have other RA's addressed them?

I realize these comments are broader than the original focus of this github issue; maybe they're better discussed via email, or a separate issue?

Thanks!

MathewBiddle commented 7 years ago

@emiliom @cseaton, these are all valid questions and I will do my best to describe the nuances between all of the different metadata elements. One thing off the top: both the ISO record and the netCDF files recommend using GCMD vocabularies for discovery level metadata.

  1. The ISO Metadata Record in ATRAC. Don't feel the need to worry so much about this record; I have already populated it with the recommended metadata and was simply looking for a review that the information was accurate. As for how we use the ISO record, we might use this as a base collection level record for the automated archival process. Or, we might simply use it as a reference for the metadata for each archival information package. I have a process that takes the information from the ISO record and inserts it into an SOP for our IT personnel to implement for the archival process. So, I guess the best response is we simply use it as a way to collect more information about the archival process we are planning on establishing.

  2. The metadata in the netCDF file. First I want to answer @cseaton's question. Yes, the global keywords attribute would contain keywords for the data variables (or anything else you would like to include). The purpose of including these keywords is for you, the data provider, to have an opportunity to include keywords that might not get into the Archival Information Package (AIP) metadata through NCEI's mapping. We currently have a process that takes the standard_name attributes, and maps them to the appropriate GCMD term (through our Ocean Archive System datatypes table). We then use those GCMD terms to develop the metadata record for the Archival Information Package. So, if there are vocabulary items you would like to have available for discovery, you can use the keywords attribute to do that. Also, the keywords attribute gets used in the ncISO to populate the gmd:descriptiveKeywords section. So, if we need to get file level discovery, we can get those keywords from the netCDF file.

  3. As for the bag-info.txt file: this is really a supplemental metadata file to describe the package that you are submitting to us. I'm still trying to think of ways to use the metadata in the file to populate the Archival Information Package metadata, but it seems like (as you stated) most of the information is duplicative. Even though it has duplicative information, it is still something we would like to have to help the user understand the package that has been archived. Here is a link which describes the BagIt convention and how they define the metadata elements in bag-info.txt: https://tools.ietf.org/html/draft-kunze-bagit-14#section-2.2.2.

Here are my recommendations for the IOOS community.

--source-organization Organization which generated the package.
--organization-address Address of the organization which generated the package.
--contact-name Name of the person who created the package.
--contact-phone Phone number of the person creating the package.
--contact-email E-mail address of the person creating the package.
--external-description Short summary of the package. E.g. Oceanographic data from station X from Y to Z.
--external-identifier Human readable name of the station. E.g. “Santa Monica Pier”
--bag-size Total size of all the data files in the bag to be transferred, include units (KB, MB, GB, etc).
--bag-group-identifier Unique identifier which identifies other bags to which it belongs, like a regional association. E.g. “Southern California Coastal Ocean Observing System”.
--bag-count Number of bags to be submitted. In the format “# of #”, like “1 of 1”.
--internal-sender-identifier Any internal identification from your system to identify the package.
--internal-sender-description A description of the package that may contain specific language from your system.
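
For illustration, the fields above could be rendered into bag-info.txt as one "label: value" line each, matching the lowercase labels seen in the CMOP files earlier in this thread. This is a hypothetical sketch, not what bagit.py does internally; the function name is made up, and apart from the group identifier the example values are placeholders.

```python
def render_bag_info(fields):
    """Render bag-info.txt content as one 'label: value' line per field."""
    lines = []
    for label, value in fields.items():
        # Collapse whitespace so that every field stays on a single line.
        value = " ".join(str(value).split())
        lines.append(f"{label}: {value}")
    return "\n".join(lines) + "\n"


example = render_bag_info({
    "bag-group-identifier": "Center for Coastal Margin Observation and Prediction: SATURN",
    "bag-size": "832.0 KB",   # placeholder value
    "bag-count": "1 of 1",    # placeholder value
})
```
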

If the metadata in the bag-info.txt file is causing too much of a headache, we can work around it. I'm just trying to use as much of the convention as possible.

So, a short summary:

  1. Don't worry about the ISO Metadata record in ATRAC.
  2. Try to include additional metadata in the netCDF file that might not get included in the AIP metadata.
  3. Have a look at https://tools.ietf.org/html/draft-kunze-bagit-14#section-2.2.2 for more info on bagit metadata.

Sorry for the long-winded response...

emiliom commented 7 years ago

Thanks so much for -- as usual :) -- the very helpful reply! I'll digest it slowly, so I probably won't follow up until tomorrow.

MathewBiddle commented 7 years ago

Is the following long name correct?

/saturn01/data/saturn01.0.F.FLNTU/201108-1340.nc
                instrument:long_name = "(Chlorophyll fluorometer/ optical turbidity instrument" ;

Seems like the leading "(" shouldn't be there. This shows up in a couple files.

MathewBiddle commented 7 years ago

I found something interesting: it looks like the data in the kiviuq directory (http://data.nanoos.org/ncei/ohsucmop/kiviuq/data/kiviuq/) is for a kayak-based sensor system (http://www.ohsu.edu/xd/research/centers-institutes/environmental-health/news/kayka-sensors-estuary.cfm). Is this true? If so, it looks like some of the metadata is not describing the data correctly:

                :summary = "This file contains seawater data (phycoerythrin, pump, conductivity, salinity, temperature, dissolved oxygen, oxygen saturation, chlorophyll, turbidity, distance from sea bed and longitude) collected at Kiviuq Kayak, a fixed station by the Center for Coastal Margin Observation and Prediction (CMOP), and assembled by Northwest Association of Networked Ocean Observation Systems (NANOOS). This file contains data from 2012-09-06 to 2012-09-23 from a Kayak. This instrument was deployed on 2012-04-13 and was retrieved on 2012-10-01. Data from an additional deployment during this month is available in a separate file. The instrument depth was 0.6 m relative to the water surface. The measurements reflect a nominal 5.0 minute sampling interval. The mooring used was mobile fixed depth. " ;

If my assumption is correct, the data is not from a fixed station; it would be more of a trajectory dataset. Thus, the lon and lat variables should have associated coordinates for each time measurement (have a dimension of time).

Maybe we should discuss?

cseaton commented 7 years ago

That's correct. I see two options:

1) Fix the metadata issues, designate it as fixed station rather than trajectory
2) Omit it from the current data collection

Arguments for (1)

Arguments for (2)

Charles

emiliom commented 7 years ago

I would vote for (2), omitting it from the current data collection, for the reason Charles already mentioned. Plus to minimize new complexities at this stage.

MathewBiddle commented 7 years ago

@cseaton I think option 1) was meant to be " Fix the metadata issues designate it as trajectory rather than fixed station"

I would prefer to develop the archive process with all applicable caveats in place, instead of going back and re-working the procedure when something unexpectedly different shows up. So I guess my vote is for option 1.

But, we can do option 2 since it will require less development on your end.

Keep in mind. If there is something you don't want to be archived, you can simply not run the BagIt packaging on the dataset. Our process will only pull over packages that match the bag convention. That is: four standard files with the following names: bag-info.txt, bagit.txt, manifest-sha256.txt, and tagmanifest-sha256.txt as well as a data/ directory. If it's missing any of those files, the process shouldn't pick it up.
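
The convention described above can be checked before publishing a directory. The sketch below is an assumption about how such a check might look, not NCEI's actual harvesting code; it only tests that the four named files and the data/ directory exist and does not validate checksums.

```python
from pathlib import Path

REQUIRED_FILES = (
    "bag-info.txt",
    "bagit.txt",
    "manifest-sha256.txt",
    "tagmanifest-sha256.txt",
)


def missing_bag_parts(bag_dir):
    """Return the required bag files or directories that are absent from `bag_dir`."""
    bag = Path(bag_dir)
    missing = [name for name in REQUIRED_FILES if not (bag / name).is_file()]
    if not (bag / "data").is_dir():
        missing.append("data/")
    return missing
```

An empty list means the directory has the layout of a bag and would be expected to be picked up; a dataset that should be withheld simply should not pass this check.
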

emiliom commented 7 years ago

@mbiddle-nodc, I believe we're already withholding a data type from this archiving phase: depth profiles. @cseaton can correct me if I'm wrong, but that was my understanding.

I definitely plan to do my best to minimize substantial changes in future NANOOS archiving phases, both from CMOP and other groups. It doesn't help anyone. We have another major group of NANOOS depth profilers from the University of Washington that we intend to archive, probably in the 6-12 month horizon.

Personally I'd rather tackle smaller, more discrete, more achievable chunks at a time (if we can call this batch of CMOP data small :smiley:), and I'm really looking forward to being done with this! I have a hunch Charles is too.

As for this:

Keep in mind. If there is something you don't want to be archived, you can simply not run the BagIt packaging on the dataset. Our process will only pull over packages that match the bag convention. That is: four standard files with the following names: bag-info.txt, bagit.txt, manifest-sha256.txt, and tagmanifest-sha256.txt as well as a data/ directory. If it's missing any of those files, the process shouldn't pick it up.

Good point. I don't feel strongly either way. Charles and I will chat in a couple of hours, and we can make a decision on this then.

emiliom commented 7 years ago

@mbiddle-nodc, I've been wondering what's the best GCMD platform attribute for platforms that are attached/fixed to a structure near shore, as opposed to a moored/anchored buoy. I'm not finding an obvious candidate in the GCMD platforms.

It looks like @cseaton has designated all CMOP stations as one or both of these:

In Situ Ocean-based Platforms > BUOYS
In Situ Ocean-based Platforms > MOORINGS

(BTW, when dealing with bona fide buoys, is there a discovery advantage to always using both of those?)

But that doesn't seem appropriate. None of the alternatives I've found seem entirely satisfactory, or they don't have unambiguous definitions:

In Situ Ocean-based Platforms > OCEAN PLATFORM/OCEAN STATIONS   
In Situ Ocean-based Platforms > OCEAN PLATFORM/OCEAN STATIONS > OCEAN PLATFORMS
In Situ Land-based Platforms >  > FIXED OBSERVATION STATIONS
In Situ Land-based Platforms > OCEAN PLATFORM/OCEAN STATIONS
In Situ Land-based Platforms > OCEAN PLATFORM/OCEAN STATIONS > COASTAL STATIONS

(most of the CMOP stations are in the Columbia estuary, so there's already some ambiguity about whether they're In Situ Ocean-based Platforms or In Situ Land-based Platforms!)

Note that in IOOS we spent a good chunk of effort developing a platform vocabulary with definitions. We use those in our SOS services. From that vocabulary, the most appropriate terms could be one of these:

I also looked at the NCEI platform vocabulary, since it's already required in the platform variable. But I find it very difficult to navigate, and doesn't seem to have definitions. eg, what Charles is using, FIXED PLATFORM.

emiliom commented 7 years ago

Quick updates:

MathewBiddle commented 7 years ago

Yes, as you've clearly documented here, there are a lot of moving parts to all of the vocabularies available. Just remember what it is we are trying to do here, make the data discoverable. If you know of another vocabulary that is more applicable, feel free to add it in and just reference which vocabulary you used. I can add that information to the metadata records for discovery, as long as I know what I'm adding.

I don't think another FeatureType would break any of our processing, since most of it is metadata mapping. But I would need to allow for some flexibility in the descriptions (title and abstract) to document the difference. It's fine to exclude it for now.

MathewBiddle commented 7 years ago

Just in case someone stumbles on this thread. The recommendations I'm providing here are primarily for the NANOOS archival process at NCEI. While some of the information might be useful and applicable to other data sets, these are not blanket statements for all of NCEI's archival procedures.

emiliom commented 7 years ago

@mbiddle-nodc, here are some decisions/results relevant to you and this github issue, from my call with Charles today. Please note that I've kept updated the list of "TO DO's" at the start of this issue (the listing and description, but not necessarily the status):

emiliom commented 7 years ago

@cseaton, I just realized something important about depth profiles in your data files -- or at least in saturn01:

- In principle, if it really is a depth profile, :featureType cannot be "timeSeries", as it is now.
- While there is a depth variable, its only dimension is timeSeries, which has a length of 1.

So, this dataset is either not a depth profile at all, or its presentation is highly corrupted and must be changed or dropped!! Or, alternatively, I'm really missing something .... Please look into it (saturn01, other similar depth profilers, and possibly ADCP instruments as well), and let me know what you find -- either here or offline.

cseaton commented 7 years ago

If you look at winched profiler instruments (any of the ones with 'saturn01.0.F' in the name), depth has dimensions (timeSeries,time). I'm not sure which feature_type they should have if not timeSeries.

The ADP files (e.g. saturn01/data/saturn01.1950.A.ADP/201112-1374.nc) do have problems. I'll take a look at that. Currently, depth is just the depth of the instrument, so a single constant value. The velocity data is stored with dimensions (timeSeries, time, cell), with another variable storing the cell_size, and the long name for the cell variable explaining the calculation of the distance of each cell from the instrument. That should get converted into velocity data having dimensions (timeSeries, time, cell) and depth having dimensions (timeSeries, time, cell).
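
To make the proposed conversion concrete, here is a hypothetical sketch of expanding a single instrument depth into one depth per cell for every time step. The formula is an assumption: it treats the instrument as upward-looking by default, measures depth as positive down, and places each cell centre at blanking distance plus (index + 0.5) cell sizes from the instrument. The actual calculation is the one described in the long name of the cell variable, which is not reproduced in this thread.

```python
def cell_depths(instrument_depth, cell_size, n_cells, blanking=0.0, upward=True):
    """Depth (m, positive down) of each cell centre for one time step."""
    sign = -1.0 if upward else 1.0  # upward-looking: cells are shallower than the instrument
    return [
        instrument_depth + sign * (blanking + (i + 0.5) * cell_size)
        for i in range(n_cells)
    ]


def depth_grid(instrument_depths, cell_size, n_cells, blanking=0.0, upward=True):
    """Nested list with shape (time, cell): one row of cell depths per time step."""
    return [
        cell_depths(depth, cell_size, n_cells, blanking, upward)
        for depth in instrument_depths
    ]
```

With a constant instrument depth every row is identical, but the (time, cell) layout matches the dimensions of the velocity data.
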

There are also a couple of fixed depth non-profiling instruments at SATURN-01, such as saturn01/data/saturn01.1950.A.CT/201111-1375.nc. Those are just normal timeseries.

Charles

----- Original Message ----- | From: "Emilio Mayorga" notifications@github.com | To: "nanoos-pnw/NCEI-archiving" NCEI-archiving@noreply.github.com | Cc: "cseaton" cseaton@stccmop.org, "Mention" mention@noreply.github.com | Sent: Friday, December 9, 2016 12:40:23 PM | Subject: Re: [nanoos-pnw/NCEI-archiving] CMOP final test files -- issue listing (#1)

> @cseaton, I just **realized something important about depth profiles in your data files** -- or at least in saturn01:
>
> - In principle, if it really is a depth profile, :featureType can not be "timeSeries", as it is now.
> - While there is a depth variable, its only dimension is timeSeries, which has a length of 1.
>
> So, this dataset is either not a depth profile at all, or its presentation is highly corrupted and must be changed or dropped!! Or, alternatively, I'm really missing something .... Please look into it (saturn01, other similar depth profilers, and possibly ADCP instruments as well), and let me know what you find -- either here or offline.

emiliom commented 7 years ago

Thanks, @cseaton. I should've been clear that my saturn01 assessment was based on only one file from only the saturn01.250.A.CT instrument, which sounds like one of the non-profiling instruments! Thanks for the clarification.

Regarding the appropriate featureType, see the NCEI NetCDF Templates v2.0 -- Feature Type Templates and Examples (https://www.nodc.noaa.gov/data/formats/netcdf/v2.0/#templatesexamples). In principle, the featureType should be timeSeriesProfile, using one of the 4 templates listed. In practice, wrapping your head -- and your code -- around this feature type may be more work than you anticipated by Tuesday ... @mbiddle-nodc, can you take a look at one of these existing saturn01 "profile" files and let us know what you think? E.g., a file from data/saturn01.0.F.CT

Charles, keep us posted on the status of the ADP instrument files. Are the corrections you've identified something you expect to include in the Tuesday file batch?

FYI, I'm adding a task/check item about this to the master check list at the start of this issue.

cseaton commented 7 years ago

I don't think timeSeriesProfile works for the winched system. I think it is the correct type for the ADP data, and would be the correct type for the winched profiler if the data were binned by depth (as, for example, the ORCA buoy data is). But for an instrument moving up and down in the water over time, the data is not a 2-D array, it is just a 1-D timeseries. There is the issue that one of the coordinate variables, depth, becomes a timeseries instead of a single value, but that seems like it breaks the template instead of making it fit the timeSeriesProfile template.

Confusingly, looking at how glider data coming from the glider DAC is represented in NCEI, it looks like it is called a trajectoryProfile, but the actual data structure is 1-D timeseries of variables, rather than 2-D time,depth arrays.

I will fix the problems with ADP data representation, and I'm open to renaming the feature_type, but binning the profiler data into time, depth arrays is definitely out of scope.

Charles

cseaton commented 7 years ago

I wonder if trajectory would be the correct feature_type for SATURN-01 vertical profiler timeseries data, with the time dimension for the latitude and longitude variables being degenerate and therefore absent.

Charles

emiliom commented 7 years ago

Thanks, Charles. Go ahead with the ADP fix and keep the winched depth profiles as is, until we hear from Matt. It's already 5:30pm out East, so that won't be until Monday morning. We can then make decisions.

Thanks for comparing these representations to the Glider DAC data (I assume you were referring to CMOP glider data?). I'm no expert at all on trajectoryProfile, but I assume that the CMOP glider files passed muster and otherwise the Glider DAC would've rejected them ... I need to refocus on other things today, so I also won't be able to comment on whether trajectory is appropriate, until Monday.

cseaton commented 7 years ago

Feedback on the appropriate feature type value for SATURN-01 winched profiler data would be helpful.

My current interpretation is that the winched profiler data, which is a timeseries with constant latitude and longitude and time-varying depth, fits the 'trajectory' feature type, with latitude and longitude represented as the degenerate, constant value form of a time series.

The other option is to consider it as a 'timeSeries' feature type. The CF documentation describes using the timeSeries feature type for a buoy with latitude and longitude varying within the watch circle. In that example, both a constant value latitude and longitude and a pair of time varying variables are included in the file.

I think 'trajectory' better describes the data, because the depth variation is a primary feature of the data, rather than a detail as in the location variation of a buoy.

Charles

emiliom commented 7 years ago

I was hoping we'd hear from @mbiddle-nodc earlier today, but then it slipped my attention too. Sorry. Matt, if you have time, your input Tuesday morning would be really helpful!

I see your point about the inadequacy of timeSeries feature type in this situation. On the other hand, a "platform" attached to a fixed structure (e.g., a pier) and with a pressure sensor would record varying depths (pressure) with time, due to tides etc. This depth variation (relative to the water surface) is a fundamental feature of the data. A winched profiler may be seen as an extreme of that case, where the pressure variability with time is much more rapid -- but not so rapid as to be treated as identical times for a single "profile" and therefore qualify for a timeSeriesProfile (in addition to not having regular, binned depths, right?).

I don't see anything that's fundamentally incorrect about using trajectory, but it seems misleading or a very unusual and unexpected use of this feature type.

My conclusion: I don't know the best way forward! Here's the NCEI netcdf templates page, for reference: https://www.nodc.noaa.gov/data/formats/netcdf/v2.0/. How many winched profile platforms & instruments do you have?

cseaton commented 7 years ago

We have the one winched platform, with about 10 types of instrument that have been deployed on it.

Working on converting the ADP data to timeSeriesProfile feature type, I realized that another data type that we have that doesn't fit well in the CF feature type is not in the prototype set, but will be included in historical data: a side-looking ADP. This is essentially a timeSeriesProfile data type, except that the profile is horizontal and the extra dimension is distance from a lat-lon-depth location, rather than the extra dimension being depth. I think this one doesn't fit into any featureType, but timeSeriesProfile comes closest. However, I'm not sure if designating data as the closest featureType is beneficial, or if that actually causes more problems than it solves.

I don't work with (as a user) data in the CF compliant discrete sampling geometry feature types, so I'm not sure what tools actually understand the feature types. I think whether those tools would interpret the winched sample data more successfully as a timeSeries or a trajectory should probably determine which it gets designated as.

thanks,

Charles

MathewBiddle commented 7 years ago

Sorry for the delay. I'm really not sure which featureType would be best to represent this data set. One of my thoughts is to set the 'depth' variable as an auxiliary coordinate variable (http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#coordinate-system) which would allow you to continue using the timeSeries featureType.
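
As a rough illustration of the auxiliary-coordinate idea, the sketch below describes one file's variables as a plain Python dictionary and picks out the coordinates that vary along time. The structure, variable names, and attribute values are hypothetical and simplified; this is not the layout of the actual CMOP files or an NCEI template.

```python
# Hypothetical, simplified description of one file's variables.
# "dims" lists dimension names; "attrs" holds netCDF attributes.
variables = {
    "time": {"dims": ("time",), "attrs": {"standard_name": "time"}},
    "depth": {
        "dims": ("time",),
        "attrs": {"standard_name": "depth", "positive": "down"},
    },
    "lat": {"dims": (), "attrs": {"standard_name": "latitude"}},
    "lon": {"dims": (), "attrs": {"standard_name": "longitude"}},
    "temperature": {
        "dims": ("time",),
        "attrs": {"coordinates": "time depth lat lon"},
    },
}


def time_varying_aux_coords(variables, data_name):
    """Names listed in the data variable's `coordinates` attribute that share
    its dimensions but are not themselves named after a dimension."""
    data = variables[data_name]
    listed = data["attrs"].get("coordinates", "").split()
    return [
        name for name in listed
        if variables[name]["dims"] == data["dims"] and name not in data["dims"]
    ]


print(time_varying_aux_coords(variables, "temperature"))   # ['depth']
```

Here depth has the same time dimension as the data variable and is tied to it only through the `coordinates` attribute, while lat and lon stay as single values.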

The other thought is to split the dataset on 'profiles' and use the profile incomplete https://www.nodc.noaa.gov/data/formats/netcdf/v2.0/profileIncomplete.cdl template. Your profile dimension would be the number of profiles the instrument observed.
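
Splitting the record on 'profiles' could be sketched as below, assuming a cast ends wherever the instrument reverses vertical direction. The function name and sample depths are hypothetical, and real pressure records are noisy, so an actual implementation would likely need smoothing or a minimum-excursion threshold.

```python
def split_into_casts(depths):
    """Split a depth series into monotonic casts.

    Returns (start, stop) index pairs with stop exclusive. A new cast begins
    at the first sample that moves against the current direction.
    """
    if not depths:
        return []
    casts = []
    start = 0
    direction = 0          # +1 descending, -1 ascending, 0 not yet known
    for i in range(1, len(depths)):
        delta = depths[i] - depths[i - 1]
        step = (delta > 0) - (delta < 0)
        if step == 0:
            continue       # no vertical movement; stay in the current cast
        if direction == 0:
            direction = step
        elif step != direction:
            casts.append((start, i))
            start = i
            direction = step
    casts.append((start, len(depths)))
    return casts


depths = [1.0, 1.9, 3.1, 2.8, 2.0, 1.1]     # invented values, metres
print(split_into_casts(depths))             # [(0, 3), (3, 6)]
```

The number of pairs returned would then be the size of the profile dimension.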

I've forwarded on the question to our netCDF working group to see what they think the best approach is. Since we are getting to the holidays and AGU is this week, I don't expect a response very quickly.

For now, I say we move forward with what we have and re-evaluate later down the road. I don't see any major repercussions in changing the featureType to the archival process.

MathewBiddle commented 7 years ago

After talking this through, it's just a profile. A whole bunch of profiles one after another, but a set of profiles all in the same. I think we are trying to make this more complex than it needs to be.

emiliom commented 7 years ago

Thanks, @mbiddle-nodc ! Though in terms of moving quickly today, I'm not sure what you're suggesting in the end, with your last comment. Are you saying we use the profile feature type? But by definition all observations along a profile have to share a single timestamp, no? ("An ordered set of data points along a vertical line at a fixed horizontal position and fixed time.")

On the other hand, I'm leaning towards your conclusion an hour earlier:

For now, I say we move forward with what we have and re-evaluate later down the road. I don't see any major repercussions in changing the featureType to the archival process.

It seems disruptive and inelegant to change featureType on an archived dataset down the road, BUT if you think that's ok, I'm willing to go with that. Then revisit depth profiles in late January or February. We have another major, important set of platforms in NANOOS (at UW, not CMOP) that are winched profilers ("ORCA"), so this discussion will be very much relevant for that dataset, not just for archiving it, but for helping us organize it in a robust way that enables internal reuse within NANOOS and its users.

The only other alternative is to drop saturn01 profile-type instruments for now. It sounds like the use of trajectory didn't jump out at you as appropriate?

emiliom commented 7 years ago

@cseaton, you raised another issue about another data type:

I realized that another data type that we have that doesn't fit well in the CF feature type is not in the prototype set, but will be included in historical data: a side-looking ADP. This is essentially a timeSeriesProfile data type, except that the profile is horizontal and the extra dimension is distance from a lat-lon-depth location, rather than the extra dimension being depth. I think this one doesn't fit into any featureType, but timeSeriesProfile comes closest.

Again, I don't know what to say. If the "profile" part of a timeSeriesProfile were not explicitly (by definition) tied to the depth/z dimension, I would say you're right. But my read is that it is tied to depth/z. This, too, may require future revision.

MathewBiddle commented 7 years ago

trajectory didn't jump out to me because of the singular latitude/longitude coordinate pair.

I'm getting myself all confused here. The reason for profile was that these are basically profiles, but I didn't account for the time measurements. I guess in its most simplistic form, profile is for one profile at one instant. But, the data collected would have some time associated with each measurement. So, why doesn't timeSeriesProfile work? https://www.nodc.noaa.gov/data/formats/netcdf/v2.0/timeSeriesProfileIncomVOrthoT.cdl

MathewBiddle commented 7 years ago

Remember, there are variations on time and depth (orthogonal and incomplete) by which you can vary the templates. Maybe digging through this will help, http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#discrete-sampling-geometries. I'll get back to you.

emiliom commented 7 years ago

Our understanding (Charles' and mine) is that a "profile" in CDM terms, in both profile and timeSeriesProfile, is a collection of observations at a fixed x-y with varying depth AND at a fixed, single timestamp. That last bit is the deal breaker.

My understanding is that the only difference between profile and timeSeriesProfile is that the former is a singleton and the latter is a temporal collection of the former.

But I'd be thrilled to be told that this interpretation is wrong, and that timeSeriesProfile does allow for time varying with depth!

MathewBiddle commented 7 years ago

I'm unsure if what you've done is really not in compliance with the timeSeries featureType. You identified depth as an auxiliary coordinate variable, which is okay. After looking at all the variations of timeSeries, profile, and trajectory I don't think there is one that directly maps to what you are trying to do.

I'm nervous to explicitly state which template to use, since I'm having trouble wrapping my head around the topic, but I think timeSeries will work. Since, after all, each measurement is at a specific time, which corresponds to some depth...

Thinking about use cases, how would a typical user subset this data file?

cseaton commented 7 years ago

For use cases, a typical user would subset the data first on time, and then possibly on depth. They might also divide the data up into individual profiles and then bin the data by depth to produce a regular grid.
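
The two operations described above (subset on time, then bin by depth) might look roughly like this in plain Python. The function names, bin width, and sample values are all hypothetical, and averaging within each bin is just one possible choice of binning statistic.

```python
def subset_by_time(times, depths, values, t_start, t_end):
    """Keep only the samples with t_start <= time <= t_end."""
    keep = [i for i, t in enumerate(times) if t_start <= t <= t_end]
    return (
        [times[i] for i in keep],
        [depths[i] for i in keep],
        [values[i] for i in keep],
    )


def bin_by_depth(depths, values, bin_width):
    """Average values within depth bins; keys are the lower bin edges."""
    sums = {}
    for depth, value in zip(depths, values):
        index = int(depth // bin_width)
        total, count = sums.get(index, (0.0, 0))
        sums[index] = (total + value, count + 1)
    return {
        index * bin_width: total / count
        for index, (total, count) in sorted(sums.items())
    }


# Invented sample data: seconds, metres, degrees C.
times = [0, 10, 20, 30]
depths = [0.5, 0.7, 1.2, 2.4]
values = [10.0, 12.0, 8.0, 7.0]

t, d, v = subset_by_time(times, depths, values, 0, 20)
print(bin_by_depth(d, v, 1.0))   # {0.0: 11.0, 1.0: 8.0}
```

Both steps work directly on the raw 1-D series, so users can derive a gridded product from it when they need one.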

While a timeSeriesProfile product would be useful, the data that we'd like to submit for archiving is the raw data, with data values and depth varying with time.

Looking again at http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#discrete-sampling-geometries, it definitely fits the timeseries definition. The CF definition of a timeseries is that x and y are constant for each timeseries, and data varies with time. No specific requirement is placed on how depth is structured:

"Mandatory space-time coordinates for a collection of these features x(i) y(i) t(i,o)"
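
A small check that encodes this reading of the definition (constant x and y, time increasing, no condition on depth) could look like the sketch below. It is only an illustration of the interpretation stated above, not a CF conformance test; the function name and coordinate values are hypothetical.

```python
def fits_timeseries_reading(lats, lons, times):
    """True if position is constant and time strictly increases.

    Depth is deliberately not an argument: under this reading it is
    free to vary from sample to sample.
    """
    fixed_position = len(set(lats)) == 1 and len(set(lons)) == 1
    increasing_time = all(
        later > earlier for earlier, later in zip(times, times[1:])
    )
    return fixed_position and increasing_time


# Invented coordinates for a fixed station sampled three times.
print(fits_timeseries_reading([46.2] * 3, [-123.9] * 3, [0, 10, 20]))   # True
```

A record whose latitude or longitude changed between samples would fail the check, which is the case trajectory is meant for.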

I'll regenerate the saturn01 files as timeSeries feature types.

thanks,

Charles

emiliom commented 7 years ago

Thanks guys. BTW, when @cseaton says "I'll regenerate the saturn01 files as timeSeries feature types", that just means he'll do what he's already been doing! Which is a relief, b/c it means it can be completed today.

I still think we (NANOOS, ideally with input from NCEI netcdf experts) should revisit this topic of depth profiles later, after we've submitted the complete, initial data archive to NCEI (meaning sometime between January and March, I assume). At that time I can bring to bear the ORCA buoy data.

@cseaton, did you make any decisions about the side-looking ADP's? @mbiddle-nodc, did you have any comments about that? See my comments (quoting Charles) from earlier today.

MathewBiddle commented 7 years ago

I don't have any thoughts off the top of my head. If we could get an example of the data together to play with, maybe that would help the discussion. One thought is some sort of timeSeries Swath, which doesn't yet exist, but if there is a case for it we can work up something specific for a dataset like this.

Something I forgot to mention, our templates are examples of how to implement the standards/conventions as defined by CF and ACDD. We have the flexibility to work outside of those constructs if the dataset really necessitates something specific. We could even go as far as defining a new featureType and presenting it to the community. We don't need to force-fit a square peg into a round hole...

emiliom commented 7 years ago

Thanks, @mbiddle-nodc. I'd much rather not try to define a new featureType. But we definitely have a lot to learn about each featureType's constraints and flexibilities, and could benefit from expert input. The use cases we've discussed (winched depth profilers, side-looking ADP's) are not wildly exotic, and it'd be very disappointing if they haven't been addressed by others before. A swath definitely comes to mind for side-looking ADP's; in fact, more broadly, all ADP's are "remote sensors", unlike truly in-situ probes. But trying to figure out swaths and code for them by today or tomorrow is most likely out of scope.

MathewBiddle commented 7 years ago

I agree, maybe we should schedule to have you two on one of our netCDF meetings to discuss the two data types and see what the group consensus is.

emiliom commented 7 years ago

I agree, maybe we should schedule to have you two on one of our netCDF meetings to discuss the two data types and see what the group consensus is.

I'm happy to participate in that ... after December.