nanoos-pnw / NCEI-archiving

Code, documentation and issue tracking for NANOOS NCEI archiving
Apache License 2.0

Comments on 20161214 data files #6

Open MathewBiddle opened 7 years ago

MathewBiddle commented 7 years ago

Starting a new thread for the files generated and copied on 2016-12-14.

From a brief review, it looks like most of the previous comments have been addressed. I do have some concern about some of the variables though. For example, the file ./saturn03/data/saturn03.820.A.APNAraw/201011-1115.nc has some variables similar to what is listed below.

    int instrument_po43-(timeSeries, time) ;
        instrument_po43-:_FillValue = -2147483647 ;
        instrument_po43-:comment = "" ;
        instrument_po43-:path_length = "5" ;
        instrument_po43-:ancillary_variables = "platform instrument instrument_po43-_qc" ;
        instrument_po43-:precision = "" ;
        instrument_po43-:wavelength_caveat = "absorption" ;
        instrument_po43-:long_name = "" ;
        instrument_po43-:valid_min = "" ;
        instrument_po43-:cell_methods = "" ;
        instrument_po43-:references = "" ;
        instrument_po43-:wavelength = "820" ;
        instrument_po43-:ncei_name = "" ;
        instrument_po43-:missing_value = "" ;
        instrument_po43-:grid_mapping = "crs" ;
        instrument_po43-:source = "" ;
        instrument_po43-:resolution = "" ;
        instrument_po43-:coordinates = "time depth lat lon" ;
        instrument_po43-:platform = "platform" ;
        instrument_po43-:instrument = "instrument" ;
        instrument_po43-:units = "counts" ;
        instrument_po43-:coverage_content_type = "physicalMeasurement" ;
        instrument_po43-:valid_max = "" ;
        instrument_po43-:accuracy = "" ;
        instrument_po43-:data_max = 4135L ;
        instrument_po43-:data_min = 50L ;

The attributes in the file don't adequately describe what the variable contains, unless they do and I'm overlooking something. If a user were to download this file and try to do something with it (say in 50 years), I don't think they would be able to understand what instrument_po43- contains. It's a physical measurement of something with a wavelength_caveat of absorption(?) with units of 'counts' and has some wavelength and path_length...

There are some others like this, but I just wanted to provide an example. Including a descriptive long_name would really help here.
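As a rough illustration of how blank attributes like these could be flagged before submission, here is a minimal sketch. It assumes the variable's attributes have already been read into a plain dictionary; the helper name `empty_attributes` is hypothetical, and only a subset of the attributes from the listing above is shown.

```python
def empty_attributes(attrs):
    """Return the sorted names of attributes whose value is an empty or blank string."""
    return sorted(
        name
        for name, value in attrs.items()
        if isinstance(value, str) and value.strip() == ""
    )


# A subset of the instrument_po43- attributes from the listing above.
instrument_po43 = {
    "long_name": "",
    "comment": "",
    "units": "counts",
    "wavelength": "820",
    "path_length": "5",
    "wavelength_caveat": "absorption",
}

print(empty_attributes(instrument_po43))  # ['comment', 'long_name']
```

A check like this only finds empty values; whether a non-empty long_name is actually descriptive still needs a human review.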

I don't think this should halt our progress, but it should be something we look into for future submissions.

I'm not sure if you are aware, but we have established a process for creating revisions of previous submissions. So, if the data files from a previous submission need to be updated, we can update the appropriate AIP as necessary. In this case, we can update the AIPs with the data files that have more robust metadata as they become available.

emiliom commented 7 years ago

Thanks for starting a new issue, and for the comments and clarifications themselves.

cseaton commented 7 years ago

I agree that this is inadequately documented. This is the raw data from an APNA instrument, and working with this data would require extensive familiarity with the APNA instrument. The processed and scientifically meaningful data from this instrument does exist, but I don't currently have access to it. I can work on improving documentation for these data files, but I will also work on pursuing the processed and more useful form of the data.

MathewBiddle commented 7 years ago

Another thing to note: the global attributes which contain time should be in ISO-8601 format (for a pdf of the actual standard, if you're interested, check out p. 18 for the complete representation).

Here is a list of global attributes with time: time_coverage_start, time_coverage_end, date_created, date_modified, date_issued, date_metadata_modified

For example, :date_modified = "2016-12-13 18:54:10Z" should be :date_modified = "2016-12-13T18:54:10Z"

This doesn't break anything for the archive, but it is a slight deviation from ACDD recommendations.
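The reformatting in the example above could be done with something like the following sketch. It assumes the existing attribute values always follow the `YYYY-MM-DD HH:MM:SSZ` pattern shown; the function name `to_iso8601_utc` is hypothetical.

```python
from datetime import datetime


def to_iso8601_utc(value):
    """Convert 'YYYY-MM-DD HH:MM:SSZ' to 'YYYY-MM-DDTHH:MM:SSZ'.

    Raises ValueError if the input does not match the expected pattern.
    """
    parsed = datetime.strptime(value, "%Y-%m-%d %H:%M:%SZ")
    return parsed.strftime("%Y-%m-%dT%H:%M:%SZ")


print(to_iso8601_utc("2016-12-13 18:54:10Z"))  # 2016-12-13T18:54:10Z
```

The same function would apply to each of the time-related global attributes listed above.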

MathewBiddle commented 7 years ago

Looking at the hmndb/data/hmndb.0.A.Tide_Gauge/200603-311.nc data file, I see elevation:long_name = "Elevation above datum" ; are we sure there are no CF standard names which match this data type? Below are a few that might work:

    height_above_reference_ellipsoid
    sea_surface_height_above_geoid
    sea_surface_height_above_reference_ellipsoid

There are a few others, feel free to use http://cfconventions.org/Data/cf-standard-names/39/build/cf-standard-name-table.html and search for height to see if any fit your requirements. It's fine if we don't have a CF standard_name, I just want to verify.

MathewBiddle commented 7 years ago

Logging another comment.

It looks like there are a few files that have a long_name = "Water temperature" but do not include a standard_name attribute. After briefly reviewing the files, it looks like it's only occurring in a subset of the saturn03 and saturn04 packages. Below is a list of the files and their associated long_name attributes. While leaving this the way it is won't break anything, it would be good to apply the appropriate standard_name if possible. For now, I will map out the long_name = "Water temperature" to the appropriate metadata term.

./saturn03/data/saturn03.1300.R.CT/200804-699.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/200805-699.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/200807-706.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/200808-706.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201307-1937.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201308-1937.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201410-1942.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201411-1942.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201510-2176.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201511-2176.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201511-2191.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.CT/201512-2191.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.pH/201208-1564.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.pH/201209-1564.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.pH/201611-2603.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.1300.R.pH/201612-2603.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/200908-849.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/200909-849.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201307-1938.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201308-1938.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201410-1943.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201411-1943.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201510-2178.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201511-2178.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201511-2192.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.CT/201512-2192.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.pH/201208-1563.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.pH/201209-1563.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.pH/201611-2602.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.240.A.pH/201612-2602.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201107-1273.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201108-1273.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201307-1941.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201308-1941.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201410-1946.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201411-1946.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201510-2179.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201512-2195.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.CT/201601-2195.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.pH/201208-1565.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.pH/201209-1565.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.pH/201611-2604.nc
                temperature:long_name = "Water temperature" ;
./saturn03/data/saturn03.88888.A.pH/201612-2604.nc
                temperature:long_name = "Water temperature" ;
./saturn04/data/saturn04.30.F.pH/201206-1432.nc
                temperature:long_name = "Water temperature" ;
./saturn04/data/saturn04.30.F.pH/201207-1432.nc
                temperature:long_name = "Water temperature" ;
./saturn04/data/saturn04.860.R.pH/201206-1433.nc
                temperature:long_name = "Water temperature" ;
./saturn04/data/saturn04.860.R.pH/201207-1433.nc
                temperature:long_name = "Water temperature" ;
./saturn04/data/saturn04.88888.A.pH/201206-1441.nc
                temperature:long_name = "Water temperature" ;
./saturn04/data/saturn04.88888.A.pH/201207-1441.nc
                temperature:long_name = "Water temperature" ;
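A list like the one above could be produced with a check along these lines. This is only a sketch: it assumes each file's variable attributes have already been read into nested dictionaries, the function name `missing_standard_name` is hypothetical, and the second variable in the example is made up for contrast.

```python
def missing_standard_name(variables):
    """Return sorted names of variables that have a long_name but no standard_name."""
    return sorted(
        name
        for name, attrs in variables.items()
        if attrs.get("long_name") and not attrs.get("standard_name")
    )


variables = {
    # As found in the files listed above.
    "temperature": {"long_name": "Water temperature"},
    # Hypothetical variable, shown only for contrast.
    "temperature_insitu": {
        "long_name": "Water temperature",
        "standard_name": "sea_water_temperature",
    },
}

print(missing_standard_name(variables))  # ['temperature']
```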
cseaton commented 7 years ago

I think the correct standard name would be water_surface_height_above_reference_datum as NGVD29 is not a geoid or a reference_ellipsoid. I think that the associated variable water_surface_reference_datum_altitude would then contain the conversion from NGVD29 to NAVD88, but I haven't been able to find an example file that uses these standard names that has a valid value for water_surface_reference_datum_altitude (this one: https://www.ngdc.noaa.gov/docucomp/page?xml=NOAA/NESDIS/NGDC/MGG/Tides/iso/xml/9761115_waterlevel_1min.xml&view=getDataView&header=none has NaN for that variable).

For the files with water temperature that doesn't have a standard name of sea_water_temperature, those are water temperatures measured at an above-water instrument, where the water is pumped from depth to the instrument. The water temperature at the instrument can change by several degrees between the in situ conditions and the temperature at the instrument. The data is needed to reconstruct salinity from conductivity (or to correct for temperature effects on other variables). For each of those station-depth combinations, there is an additional instrument which is measuring sea_water_temperature in-situ.

If, given that description (and the comment attribute on the temperature variable), you feel that using the standard name sea_water_temperature will provide more benefit in data discovery than it will provide harm in incorrect interpretation of the data by future users, then I will add in the standard name.

emiliom commented 7 years ago

> For the files with water temperature that doesn't have a standard name of sea_water_temperature, those are water temperatures measured at an above-water instrument, where the water is pumped from depth to the instrument. The water temperature at the instrument can change by several degrees between the in situ conditions and the temperature at the instrument. The data is needed to reconstruct salinity from conductivity (or to correct for temperature effects on other variables). For each of those station-depth combinations, there is an additional instrument which is measuring sea_water_temperature in-situ.

That's an interesting edge case, Charles! I'll be curious to see Matt's opinion. Independent of the decision on the use of a standard name attribute, I hope the variable has one or more attributes that make that context clearer, at least to human eyeballs; it sounds like you do have a comment attribute that does that.

Matt, thanks for the comments you've been adding.

MathewBiddle commented 7 years ago

Gotcha. My thought is that even though the instrument is measuring the parameter at an above-water location, it's still measuring sea_water_temperature so the metadata should appropriately reflect that. Your comment attribute does clearly describe that the variable is a water sample which was pumped to a different location, so we are good on the description.

In this case, I don't see an incorrect interpretation of the data occurring if we use the standard_name attribute to describe it. The actual data in the file is sea_water_temperature measurements. If a data user needs more information about the specifics of how the data were collected, they can peruse the metadata in the file to gather that information.

Think of a bucket sample: if you measure the temperature of the water in the bucket, you're still measuring sea_water_temperature, right?

cseaton commented 7 years ago

I can see the validity of the argument that bucket sample temperature is still sea_water_temperature. I think I will strengthen the comment to specify: The sea_water_temperature from the in-situ temperature sensor should be used in preference to this data except for calculating salinity or correcting for temperature effects on other measurements.

thanks,

Charles

MathewBiddle commented 7 years ago

Okay, we're really close over here to moving the procedure into production. There is one interesting observation we found. For the package cbnke.260.A.CTD, the title attributes are as follows:

Oceanographic and surface meteorological data collected from Chinook River by Center for Coastal Margin Observation and Prediction (CMOP) and assembled by Northwest Association of Networked Ocean Observation Systems (NANOOS) in the Columbia River Estuary and North East Pacific Ocean from 2004-01-16 to 2004-02-01

Oceanographic and surface meteorological data collected from Chinook River by Center for Coastal Margin Observation and Prediction (CMOP) and assembled by Northwest Association of Networked Ocean Observation Systems (NANOOS) in the Columbia River Estuary and North East Pacific Ocean from 2004-02-01 to 2004-03-01

However, none of the data within those files actually contain "surface meteorological" data. Since we are using these titles as the titles for our Archival Information Packages (with some minor tweaks and additions), I am hesitant to provide a misleading title. Is this something you would be able to rectify? Otherwise, we can heuristically generate the title based on the information we collect from the data files and continue moving forward.

Thanks,

Matt

cseaton commented 7 years ago

Thanks Matt, I will fix this.

Charles

MathewBiddle commented 7 years ago

Okay, will you be regenerating all the files? Or should we put an ad-hoc procedure in place until this can get resolved?

Thanks.

cseaton commented 7 years ago

I will be regenerating all of the files before we move into production (and also generating more files than just the sample files generated so far).

Charles

MathewBiddle commented 7 years ago

Okay, well once the titles are adjusted we are ready to go into production. Let me know when we should do another pull.

cseaton commented 7 years ago

Thanks! I should be able to start generating the full set of files next week. From experience with the test batches, I'd expect to have them all generated by the following week, say Feb 10. I'll let you know when they are ready.

MathewBiddle commented 7 years ago

Okay, I've got a few other processes to develop. So, let me know when everything is ready.

emiliom commented 7 years ago

@mbiddle-nodc, I'll probably be the one to delay the process. Our web server isn't upgraded yet to handle the complete set of CMOP files that will be archived (apologies ...). I'll have a timeframe later this week.

MathewBiddle commented 7 years ago

Okay, just let me know when you're ready.

MathewBiddle commented 7 years ago

Any updates?

cseaton commented 7 years ago

I have been delayed on implementing this. My revised estimate is that I should have it completed late next week. Sorry for the delay.

Charles

emiliom commented 7 years ago

Actually, I thought I was the one delaying things! It turns out it'll take longer to upgrade our server to more disk space. Charles, let's move forward with your earlier (offline) suggestion to proceed with "tranches" -- say, divide the total archival submission into 2 batches, possibly 3. We can talk about this today or Monday.

MathewBiddle commented 7 years ago

Just something to keep in the back of your mind: if there are data files that are not following the BagIt convention, NCEI will not be picking them up. We will only pull files over that conform to the BagIt convention and validate against the manifest. This might make it a bit easier to work through multiple batches. Once a batch is ready, bag it up and put it on the WAF.
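For a sense of what validating against the manifest involves, here is a minimal sketch of a payload checksum check. It is not NCEI's actual procedure: it assumes a `manifest-sha256.txt` file at the top of the bag with one `checksum path` entry per line, and it skips the other parts of full BagIt validation (the `bagit.txt` declaration, tag manifests, and payload files missing from the manifest). The function name `validate_manifest` is hypothetical.

```python
import hashlib
import os
import tempfile


def validate_manifest(bag_dir, algorithm="sha256"):
    """Return the manifest paths whose file is missing or whose checksum differs."""
    manifest_path = os.path.join(bag_dir, f"manifest-{algorithm}.txt")
    failures = []
    with open(manifest_path, encoding="utf-8") as manifest:
        for line in manifest:
            line = line.strip()
            if not line:
                continue
            # Each entry is "<checksum> <path relative to the bag directory>".
            expected, rel_path = line.split(None, 1)
            file_path = os.path.join(bag_dir, rel_path)
            if not os.path.isfile(file_path):
                failures.append(rel_path)
                continue
            with open(file_path, "rb") as payload:
                actual = hashlib.new(algorithm, payload.read()).hexdigest()
            if actual != expected.lower():
                failures.append(rel_path)
    return failures


# Build a tiny throwaway bag, validate it, then modify the payload and re-check.
with tempfile.TemporaryDirectory() as bag:
    os.makedirs(os.path.join(bag, "data"))
    payload_file = os.path.join(bag, "data", "example.nc")
    with open(payload_file, "wb") as f:
        f.write(b"placeholder payload")
    digest = hashlib.sha256(b"placeholder payload").hexdigest()
    with open(os.path.join(bag, "manifest-sha256.txt"), "w", encoding="utf-8") as f:
        f.write(f"{digest}  data/example.nc\n")
    intact = validate_manifest(bag)
    with open(payload_file, "wb") as f:
        f.write(b"modified payload")
    tampered = validate_manifest(bag)

print(intact)    # []
print(tampered)  # ['data/example.nc']
```

Running a check like this on each batch before it goes on the WAF would catch files that changed after the bag was created.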

emiliom commented 7 years ago

Thanks for the reminder, @mbiddle-nodc. Yeah, that's our common assumption. If we divide the submission into 2-3 batches, every batch will be self-consistent as defined by the BagIt and manifest.

MathewBiddle commented 7 years ago

Just wanted to give everyone a heads up. Testing has been successful and we are ready to press go. There is one new hurdle I have to jump through at NCEI, but it should be a formality. Tuesday March 7th, I will present this automation for a DataSet Readiness Review. Once that gets approved we can press go. I apologize for this delay.

emiliom commented 7 years ago

Great, @mbiddle-nodc! Thanks for the update.

emiliom commented 7 years ago

> Tuesday March 7th, I will present this automation for a DataSet Readiness Review. Once that gets approved we can press go.

@mbiddle-nodc, how did this go? Assuming it went well, do you have a rough time frame or expectation for when it'll get approved? Thanks!

MathewBiddle commented 7 years ago

@emiliom, the Data Set Readiness Review Board approved this data set to go operational. We've done all the appropriate testing and everything is in a position for us to press go. I have passed it off to our IT group and they will implement it as soon as possible, given their workload. I would like to say this will be operational by the end of the week, but I'm not sure.

emiliom commented 7 years ago

Thanks. That sounds good. Just let us know once it's operational, and I'll break the champagne :smile_cat:

MathewBiddle commented 7 years ago

We are operational. I sent you an email with the details.

emiliom commented 7 years ago

Thanks, @mbiddle-nodc!! Thanks also for the formal email announcement that cc'd Jan, Derrick, etc.

On that email you said:

> The first data set archived under the new automated process is titled "Oceanographic data collected from Port of Alsea by Center for Coastal Margin Observation and Prediction (CMOP) and assembled by Northwest Association of Networked Ocean Observation Systems (NANOOS) in the Columbia River Estuary and North East Pacific Ocean from 2005-12-15 to 2006-04-18 (NCEI Accession 0161524)"

I assume the other datasets/sites are being processed as we speak, and will be completed by early next week?

We'll follow up with you on the monthly archiving. I think that before we get to that, Charles and I will submit the complete/long time series for the remaining sets of stations. But, we'll discuss that soon.

MathewBiddle commented 7 years ago

Yes, it should be continuing to publish new data sets in the near future. I think, to minimize impact on our servers, they wait until late night/early morning to do the batch processing. Expect to see a slew of e-mails sometime soon, one for each station.

Will you be at the spring IOOS meeting next week? We could discuss then...

emiliom commented 7 years ago

Cool, thanks. I'll be on the lookout for the email onslaught ;)

emiliom commented 7 years ago

Re: Dataset Citation co-authors pattern:
@mbiddle-nodc, looking at the first archived dataset you pointed us to: https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0161524 the Dataset Citation authors caught my attention. I had some memory of having discussed with you a different pattern for the authors, and I found the reference on our github exchanges from Dec 9. Specifically, @cseaton and I asked that the following template/pattern be used instead:

> we'd like to use a citation template that is station-specific and should be automatically constructed from the ACDD contributor_name entries (in the order included), plus my name as the last one. The total length of co-authors will then be 5-6 in these CMOP-NANOOS AIP's.

In your follow-up comments you said you'd make a request to enable this, and didn't think it'd be a problem. But I don't think any of us followed up to confirm, and we hadn't had the opportunity to see a dataset "in action" where we could see the citation.

We'd like to have this changed to what we requested. We feel this is a fairly important issue b/c it makes the attribution in citations more directly and visibly linked to PI's, who will care most about it and will likely be more supportive of our efforts this way. More granular attribution and roles are fairly clearly specified in the ACDD attributes in the netcdf files, so I think that already covers nicely other needs for attribution, appropriate contact information, etc.
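To make the requested template concrete, here is a minimal sketch of how the author list might be built from a single file. It assumes contributor_name is a comma-separated string; the function name `citation_authors` is hypothetical, and the example contributor list is only illustrative.

```python
def citation_authors(contributor_name, final_author):
    """Split a comma-separated contributor_name and place final_author last."""
    names = [name.strip() for name in contributor_name.split(",") if name.strip()]
    # Keep the contributors in their listed order; the final author always goes last.
    names = [name for name in names if name != final_author]
    names.append(final_author)
    return names


authors = citation_authors(
    "Antonio Baptista, Michael Wilkin, Charles Seaton, Sarah Riseman",
    "Emilio Mayorga",
)
print(authors)
# ['Antonio Baptista', 'Michael Wilkin', 'Charles Seaton', 'Sarah Riseman', 'Emilio Mayorga']
```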

Thanks!

MathewBiddle commented 7 years ago

Sorry, this was an oversight on my part. I just put the request in.

emiliom commented 7 years ago

Thanks! Let us know when the citation has been changed.

MathewBiddle commented 7 years ago

How should we handle a change in PI within the same station? Say the contributor list is "Antonio Baptista, Michael Wilkin, Charles Seaton, Sarah Riseman" ; then the next month it is "Michael Wilkin, Charles Seaton, Sarah Riseman" ; what should we do?

emiliom commented 7 years ago

> How should we handle a change in PI within the same station? Say the contributor list is "Antonio Baptista, Michael Wilkin, Charles Seaton, Sarah Riseman" ; then the next month it is "Michael Wilkin, Charles Seaton, Sarah Riseman" ; what should we do?

Hmm. @cseaton, we hadn't accounted for that, had we? What do you think? It seems like a change in the station PI is a possibility. The Dataset Citation is intended to pertain to the entire length of the station data.

cseaton commented 7 years ago

If the citation covers the entire extent of the station data, then I think the Data Citation should include everyone who has been listed on the contributor list of any of the individual files. If you are a contributor for any of the data, then you should be listed as a contributor for the complete data set.

Charles

emiliom commented 7 years ago

@cseaton, a potential problem with that approach is that it means the citation would need to change with time, if a new contributor is added in the future. And the "scheme" for selecting the authors is no longer simple and dependent on the ACDD attribute of a single file. Plus, if a PI leaves and a new PI takes over a station (I realize Antonio is the PI on nearly all stations, though ...), that would also involve a decision about what to do.

I don't have any bright ideas yet. But let's collect all our thoughts on this, including ideas for moving forward.

emiliom commented 7 years ago

Charles, don't forget to chime in on the citation questions.

Matt, any new suggestions, based on our comments so far?

MathewBiddle commented 7 years ago

I don't have any new suggestions.

cseaton commented 7 years ago

Here are some additional thoughts.

It seems to me there are a few options for the author list in the citation: (1) require it to be constant for all files associated with the data set; (2) construct it from the files, using either the complete set of authors listed in any files in the data set or the partial list of authors in all the files; or (3) establish it from the first set of files submitted for a data set, in which case new contributors in subsequent files would be associated with a particular piece of data through the file metadata, but would not be listed as authors for the larger data set.

In either of the options where the author list is fixed, if a new PI wanted the data collected after they became PI to have a different citation with a different author list, they would need to do something to signify that the subsequent data belonged to a new data set rather than being a continuation of the previous data set.

The author question seems the same for the original "[submitter] and [submitting institution]" form, where the submitter could change. Regional DMAC lead seems pretty stable, but it probably changes at least as often as PI (maybe less often than the secondary contributors).

Also, the "from [start date] to [end date]" part of the citation is not determined by a single file either and (for stations that are active) changes from month to month, so I'm not sure how that differs from the author list in the citation.

How does version number work in the citation? If one file out of a multi-file data set is modified, does the version number for the entire data-set (listed in the citation) increment?

Charles

MathewBiddle commented 7 years ago

I just had a conversation with some colleagues and we thought it would be good to mention that the citation for the archive package should reference the list of all parties associated with that asset. Whether they are still working with those assets or not is somewhat irrelevant to the data set in the archive. Remember, we are looking at the entire lifetime of that data asset, so the personnel will change, but, their association with the creation of that data is still applicable, even if it was years ago.

Therefore, my recommendation would be to have us use the contributors from the contributor_names global attribute. If we receive new names that haven't previously been included, then we append them to the list of authors. This way we have all of the people associated with the production of that data set appropriately identified, even if some of the personnel are no longer associated.
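As a rough illustration of the append-only rule described above, here is a minimal Python sketch. It is not NCEI's actual process: the function name `merge_contributors` is hypothetical, and it assumes each file's contributor_names attribute is a comma-separated string like the examples earlier in this thread.

```python
def merge_contributors(existing_authors, contributor_attrs):
    """Append names not yet in the author list, keeping first-seen order.

    existing_authors: names already in the citation (list of str).
    contributor_attrs: one comma-separated contributor string per file
    (assumed format, based on the examples in this thread).
    """
    authors = list(existing_authors)
    seen = {name.casefold() for name in authors}
    for attr in contributor_attrs:
        for name in attr.split(","):
            name = name.strip()
            if name and name.casefold() not in seen:
                authors.append(name)
                seen.add(name.casefold())
    return authors


month1 = "Antonio Baptista, Michael Wilkin, Charles Seaton, Sarah Riseman"
month2 = "Michael Wilkin, Charles Seaton, Sarah Riseman"

# A PI who drops off the later files stays in the collection-level list.
print(merge_contributors([], [month1, month2]))
```

Because names are only ever appended, someone who no longer appears in later files is never dropped from the collection-level author list.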

@cseaton Right now, we are looking at the collection-level citation. This is for all the data associated with one station, or an Archival Information Package (AIP). So, anything contained in that one Archival Information Package would have this one associated citation. As far as subsetting the author list down to the file level, that's where listing this information in the netCDF file helps. People should be using the citations you provide in those netCDF files as applicable.

For the version number in the citation: Check out https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0161524. The version number is listed right in the citation:

> Cite as: Baptista, A.; Wilkin, M.; Seaton, C.; Riseman, S.; Mayorga, E.; Northwest Association of Networked Ocean Observing Systems (2017). Oceanographic data collected from Port of Alsea by Center for Coastal Margin Observation and Prediction (CMOP) and assembled by Northwest Association of Networked Ocean Observation Systems (NANOOS) in the Columbia River Estuary and North East Pacific Ocean from 2005-12-15 to 2006-04-18 (NCEI Accession 0161524). Version 1.1. NOAA National Centers for Environmental Information. Dataset. [access date]

The way we've established this procedure is to create one AIP for each station. Each time the process finds a new station, a new AIP gets generated. The next time the process finds a station that already has an AIP, the AIP gets a major revision (updated version number) and the new data files get appended to the previous package. For an example of this have a look at the SCCOOS AIP. The metadata page can be found at https://data.nodc.noaa.gov/cgi-bin/iso?id=gov.noaa.nodc:0157036 and the AIP can be found at ftp://ftp.nodc.noaa.gov/nodc/archive/arc0100/0157036/.

The directories within the AIP are organized as follows:

1.1 - version 1.1, the initial submission of data.

2.2 - version 2.2. The next month we looked at their WAF and found new files, so we appended them to version 1.1, updated the metadata, and published version 2.2.

3.3 - version 3.3. The next month we looked at their WAF and found new files, so we appended them to version 2.2 (which contains version 1.1), updated the metadata, and published version 3.3.

And so on...

So, the largest version number should have all the associated data files for that station.
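For anyone scripting against an AIP, picking the directory with the largest version number could be sketched as below. This is an illustration only: `latest_version` is a hypothetical helper, and it assumes version directories are named `major.minor` as in the listing above, with anything else (such as other folders in the package) ignored.

```python
import re


def latest_version(dirnames):
    """Return the directory name with the largest major.minor version.

    Names that do not look like 'major.minor' are skipped. Versions are
    compared numerically, so '10.10' sorts after '9.9'.
    """
    versions = [d for d in dirnames if re.fullmatch(r"\d+\.\d+", d)]
    if not versions:
        raise ValueError("no version directories found")
    return max(versions, key=lambda d: tuple(int(p) for p in d.split(".")))


print(latest_version(["1.1", "2.2", "3.3"]))
```

Comparing the parts as integers rather than as strings matters once a station passes version 9.9, since a plain string sort would rank "9.9" above "10.10".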

As for the citation with regard to this versioning scheme, we include the version number, but getting down to the file level would be too granular for the purpose we are serving.

I hope I didn't get too far off topic and I hope some of this information actually helps.

cseaton commented 7 years ago

Okay, so any contributor listed in any of the files in an AIP will be an author in the citation list, and as new contributors get added over time, the version number of the AIP will be changing anyway, so the citation will be distinct anyway.

One not particularly relevant question about the version numbers: Why do you advance both the major and minor version number simultaneously, 1.1, 2.2, 3.3 etc, instead of 1.1, 2.1, 3.1? And is there anything that would advance just the major version number or just the minor version number?

MathewBiddle commented 7 years ago

This gets into how we define major vs minor revisions. A major revision is when we get new data from the provider, which requires us to update the files supplied by the provider (i.e., the 0-data/ directory), whereas a minor revision is when we get an update concerning the NCEI metadata in the package (i.e., the about/ directory).

Inherently, in receiving a new data file, we need to update the information in the about/ directory, so both a major and a minor revision occur (even though it's at the same time). If we just get an update that doesn't require changes to the data files, we will perform a minor revision. But those are typically done manually.
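One way to read this rule is sketched below in Python. It is my interpretation rather than NCEI's actual tooling: `bump_version` is a hypothetical function, and the assumption is that new data increments both numbers while a metadata-only update increments only the minor number.

```python
def bump_version(version, new_data):
    """Return the next 'major.minor' version string.

    new_data=True  -> data files changed, so major and minor both advance
                      (the about/ metadata is always updated with new data).
    new_data=False -> metadata-only update, so only minor advances.
    """
    major, minor = (int(part) for part in version.split("."))
    if new_data:
        major += 1
    minor += 1
    return f"{major}.{minor}"


print(bump_version("1.1", new_data=True))   # 2.2
print(bump_version("1.1", new_data=False))  # 1.2
```

Under this reading, a package that only ever receives new data goes 1.1, 2.2, 3.3, and the two numbers diverge only after a manual metadata-only revision.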

emiliom commented 7 years ago

Thanks so much, guys!! You've covered the issues of granularity and change in time that needed to be covered. I don't have time to get up to speed on all the discussions today, so that means I won't be able to follow up until two weeks from today. Feel free to keep discussing and refining! But I won't review this closely until just before our scheduled call on 3/30.

emiliom commented 7 years ago

FYI, I'll remove the http://data.nanoos.org/ncei/ohsucmop_test/ folder (the 2016-12-14 test files) very soon. I only kept it (under that changed name) just to be extra cautious in case there was a reference to a previous issue that we wanted to be able to fully track.

I'll remove it by the end of the day Friday unless I hear back from either of you that there is a good reason to keep it a little longer.

emiliom commented 7 years ago

I've removed the http://data.nanoos.org/ncei/ohsucmop_test/ folder.

emiliom commented 7 years ago

Regarding the dataset citations issues we discussed here, earlier (ie, updates over time): for reference, the near-term solution NCEI will implement is described in issue 7, here and here