Open MathewBiddle opened 1 year ago
The netCDF specification will be documented at https://ioos.github.io/ioos-atn-data/
We need to decide on a decimation strategy. The frequency of observations varies from 2 minutes to multiple days. Below are some examples of time differences between points in an example dataset:
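As a rough illustration of that spread, here is a minimal sketch that computes the gaps between consecutive positions. It uses only the five timestamps from the example table in this thread; the function name `time_gaps` is made up for this sketch and is not part of any ATN tooling.

```python
from datetime import datetime

# Timestamps copied from the first five rows of the example table in this thread.
times = [
    "2009-09-23 00:00:00",
    "2009-09-25 06:42:00",
    "2009-09-25 11:09:00",
    "2009-09-25 11:11:00",
    "2009-09-27 17:58:00",
]


def time_gaps(timestamps):
    """Return the timedelta between each pair of consecutive observations."""
    parsed = sorted(datetime.strptime(t, "%Y-%m-%d %H:%M:%S") for t in timestamps)
    return [later - earlier for earlier, later in zip(parsed, parsed[1:])]


gaps = time_gaps(times)
for gap in gaps:
    print(gap)
```

Even in these five rows the gaps range from two minutes to more than two days, which is why a fixed-interval decimation is worth considering.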
The decimation strategy that ETN and OTN are working on for acoustic telemetry data is down to a lot of hard work by Peter Desmet and Jonas Mortelmans, and is based on some of Peter's work on camtrap-dp and with other satellite-tagged animals. It employs an aggregation strategy of 'take the first detection/location per hour', with other Darwin Core fields like dataGeneralizations helping characterize the summarization by indicating how many detections have been obfuscated by the aggregation.
The benefit of using this method is that each detection is a real point in space and time at which the animal was observed, and it also puts a hard upper bound per tag on how many occurrences can be generated by a single individual/tag. There's a lot of background information and ancillary decisions about how to characterize things like coordinateUncertainty at https://github.com/inbo/etn/issues/256, and the logic for the decimation of the events themselves is here: https://github.com/inbo/etn/blob/main/inst/sql/dwc_occurrence.sql
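To make the 'first detection/location per hour' idea concrete, here is a minimal Python sketch. It is not the ETN SQL linked above, just an illustration under a few assumptions: records are dictionaries with `organismID` and an ISO 8601 `eventDate` in UTC, the organism identifier `example_shark` is a placeholder, and the `dataGeneralizations` wording follows the 'first of # records' pattern suggested later in this thread.

```python
from datetime import datetime


def decimate_first_per_hour(records):
    """Keep the first record per (organismID, hour) and note how many were folded in."""
    buckets = {}
    # ISO 8601 UTC strings sort chronologically, so a plain sort is enough here.
    for rec in sorted(records, key=lambda r: (r["organismID"], r["eventDate"])):
        ts = datetime.strptime(rec["eventDate"], "%Y-%m-%dT%H:%M:%SZ")
        key = (rec["organismID"], ts.strftime("%Y-%m-%dT%H"))
        if key not in buckets:
            buckets[key] = {"first": dict(rec), "n": 0}
        buckets[key]["n"] += 1

    decimated_rows = []
    for bucket in buckets.values():
        row = bucket["first"]
        row["dataGeneralizations"] = f"first of {bucket['n']} records"
        decimated_rows.append(row)
    return decimated_rows


# Hypothetical organismID; the timestamps come from the example table in this thread.
records = [
    {"organismID": "example_shark", "eventDate": d}
    for d in (
        "2009-09-23T00:00:00Z",
        "2009-09-25T06:42:00Z",
        "2009-09-25T11:09:00Z",
        "2009-09-25T11:11:00Z",
        "2009-09-27T17:58:00Z",
    )
]
decimated = decimate_first_per_hour(records)
```

Because at most one row survives per tag per hour, the number of occurrences for a tag can never exceed the number of hours it was deployed.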
I've got more code coming that deals with pulling together an Event Core version, with the Occurrences still being generated in a decimated way like this, but with tag attachment and listening station deployments being handled as Events and more things being reported as Extended Measurement or Facts.
I created an example DwC-A package in this PR https://github.com/ioos/ioos_code_lab/pull/13/commits/e58b2b5a340053ee82b0b4da532afc853b1182cf
The template still isn't finalized so I don't want to go too far down the road, but @albenson-usgs gave some great feedback on the initial package, to start addressing:
- `eventID` needs to be unique for each row in the event file. Right now it's a single `eventID` for all rows in the event file.
- `locationID` = Release - I'm not sure what that means. I'm confused why we decided to put that in that field and it doesn't seem like a good fit. Can you explain?
- `eventDate`, `decimalLatitude`, `decimalLongitude`, `geodeticDatum` can be dropped.
- `coordinateUncertaintyInMeters` belongs in the event file and hopefully it can be populated.
- `occurrenceID` seems strange to me. It is unique for each row but it's basically the `eventDate` with "_0_Species" after it. Maybe this is ok but it just strikes me as weird.
- `organismID` probably shouldn't have any spaces in it.
- `occurrenceStatus` is missing and is "present" for all rows.
- `sex`, `lifeStage`.

For reference, below is a table of the data available (dumped from the netCDF file), followed by the netCDF header of the metadata available. THESE ARE EXAMPLE DATA and therefore I have redacted some information about the PI.
I think we can address all of the comments above from the available data and metadata.
| obs | deploy_id | time | z | lat | lon | ptt | instrument | type | location_class | error_radius | semi_major_axis | semi_minor_axis | ellipse_orientation | offset | offset_orientation | gpe_msd | gpe_u | count | qartod_time_flag | qartod_speed_flag | qartod_location_flag | qartod_rollup_flag | crs | trajectory | animal_age | animal_life_stage | animal_sex | animal_weight | animal_length | animal_length_2 | animal | instrument_tag | instrument_location | taxon_name | taxon_lsid | comment |
|---:|:---|:---|---:|---:|---:|---:|:---|:---|:---|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|:---|---:|:---|:---|---:|---:|---:|---:|:---|:---|:---|:---|:---|
| 0 | 09_13-45866 | 2009-09-23 00:00:00 | 0 | 34.03 | -118.56 | 45866 | SPOT | User | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 2 | 1 | 1 | -2147483647 | 5f0668a86321be13bc7ef628 | nan | juvenile | male | nan | 213 | nan | 09_13 | Wildlife Computers SPOT5 | Wildlife Computers SPOT5 | Carcharodon carcharias | urn:lsid:marinespecies.org:taxname:105838 | |
| 1 | 09_13-45866 | 2009-09-25 06:42:00 | 0 | 23.59 | -166.18 | 45866 | SPOT | Argos | A | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 4 | 1 | 4 | -2147483647 | 5f0668a86321be13bc7ef628 | nan | juvenile | male | nan | 213 | nan | 09_13 | Wildlife Computers SPOT5 | Wildlife Computers SPOT5 | Carcharodon carcharias | urn:lsid:marinespecies.org:taxname:105838 | |
| 2 | 09_13-45866 | 2009-09-25 11:09:00 | 0 | 34.024 | -118.556 | 45866 | SPOT | Argos | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 4 | 1 | 4 | -2147483647 | 5f0668a86321be13bc7ef628 | nan | juvenile | male | nan | 213 | nan | 09_13 | Wildlife Computers SPOT5 | Wildlife Computers SPOT5 | Carcharodon carcharias | urn:lsid:marinespecies.org:taxname:105838 | |
| 3 | 09_13-45866 | 2009-09-25 11:11:00 | 0 | 34.035 | -118.549 | 45866 | SPOT | Argos | 0 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 4 | 1 | 4 | -2147483647 | 5f0668a86321be13bc7ef628 | nan | juvenile | male | nan | 213 | nan | 09_13 | Wildlife Computers SPOT5 | Wildlife Computers SPOT5 | Carcharodon carcharias | urn:lsid:marinespecies.org:taxname:105838 | |
| 4 | 09_13-45866 | 2009-09-27 17:58:00 | 0 | 34.033 | -118.547 | 45866 | SPOT | Argos | 1 | nan | nan | nan | nan | nan | nan | nan | nan | nan | 1 | 1 | 1 | 1 | -2147483647 | 5f0668a86321be13bc7ef628 | nan | juvenile | male | nan | 213 | nan | 09_13 | Wildlife Computers SPOT5 | Wildlife Computers SPOT5 | Carcharodon carcharias | urn:lsid:marinespecies.org:taxname:105838 | |
```
xarray.Dataset {
dimensions:
obs = 29 ;
variables:
object deploy_id() ;
deploy_id:long_name = id for this deployment. This is typically the tag ptt ;
deploy_id:comment = Friendly name given to the tag by the user. If no specific friendly name is given, this is the PTT id. ;
deploy_id:instrument = instrument_location ;
deploy_id:platform = animal ;
deploy_id:coverage_content_type = referenceInformation ;
datetime64[ns] time(obs) ;
time:standard_name = time ;
time:axis = T ;
time:_CoordinateAxisType = Time ;
time:long_name = Time of the measurement, in seconds since 1990-01-01 ;
time:actual_min = 2009-09-23T00:00:00Z ;
time:actual_max = 2009-11-23T05:12:00Z ;
time:ancillary_variables = qartod_time_flag qartod_rollup_flag qartod_speed_flag ;
time:instrument = instrument_location ;
time:platform = animal ;
time:coverage_content_type = coordinate ;
float64 z(obs) ;
z:axis = Z ;
z:long_name = depth of measurement ;
z:positive = down ;
z:standard_name = depth ;
z:units = m ;
z:actual_min = 0.0 ;
z:actual_max = 0.0 ;
z:instrument = ;
z:platform = animal ;
z:comment = This variable is synthetically generated to represent the depth of observations ;
z:coverage_content_type = coordinate ;
float64 lat(obs) ;
lat:axis = Y ;
lat:_CoordinateAxisType = Lat ;
lat:long_name = Latitude portion of location in decimal degrees North ;
lat:standard_name = latitude ;
lat:units = degrees_north ;
lat:valid_max = 90.0 ;
lat:valid_min = -90.0 ;
lat:actual_min = 23.59 ;
lat:actual_max = 34.045 ;
lat:ancillary_variables = qartod_location_flag qartod_rollup_flag qartod_speed_flag error_radius semi_major_axis semi_minor_axis ellipse_orientation offset offset_orientation ;
lat:instrument = instrument_location ;
lat:platform = animal ;
lat:coverage_content_type = coordinate ;
float64 lon(obs) ;
lon:axis = X ;
lon:_CoordinateAxisType = Lon ;
lon:long_name = Longitude portion of location in decimal degrees East ;
lon:standard_name = longitude ;
lon:units = degrees_east ;
lon:valid_max = 180.0 ;
lon:valid_min = -180.0 ;
lon:actual_min = -166.18 ;
lon:actual_max = -118.504 ;
lon:ancillary_variables = qartod_location_flag qartod_rollup_flag qartod_speed_flag error_radius semi_major_axis semi_minor_axis ellipse_orientation offset offset_orientation ;
lon:instrument = instrument_location ;
lon:platform = animal ;
lon:coverage_content_type = coordinate ;
float64 ptt(obs) ;
ptt:long_name = Platform Transmitter Terminal (PTT) id used for Argos transmissions ;
ptt:comment = PTT id for this deployment. PTT ids may be used on multiple deployments, but not concurrently. When combined with deployment dates, PTTs can uniquely identify a deployment. ;
ptt:coverage_content_type = referenceInformation ;
ptt:instrument = instrument_location ;
ptt:platform = animal ;
object instrument(obs) ;
instrument:comment = Wildlife Computers instrument family. Variable may report manufacturer default values (e.g., Mk10) and may not match correctly defined instrument_location or instrument_tag variables and attributes. ;
instrument:long_name = Instrument family ;
instrument:instrument = instrument_location ;
instrument:platform = animal ;
instrument:coverage_content_type = referenceInformation ;
object type(obs) ;
type:comment = Type of location: Argos, FastGPS or User ;
type:long_name = Type of location information - Argos, GPS satellite or user provided location ;
type:instrument = instrument_location ;
type:platform = animal ;
type:coverage_content_type = referenceInformation ;
object location_class(obs) ;
location_class:standard_name = quality_flag ;
location_class:comment = Quality codes from the ARGOS satellite (in meters): G,3,2,1,0,A,B,Z. See http://www.argos-system.org/manual/3-location/34_location_classes.htm ;
location_class:long_name = Location Quality Code from ARGOS satellite system ;
location_class:code_values = G,3,2,1,0,A,B,Z ;
location_class:code_meanings = estimated error less than 100m and 1+ messages received per satellite pass, estimated error less than 250m and 4+ messages received per satellite pass, estimated error between 250m and 500m and 4+ messages per satellite pass, estimated error between 500m and 1500m and 4+ messages per satellite pass, estimated error greater than 1500m and 4+ messages received per satellite pass, no least squares estimated error or unbounded kalman filter estimated error and 3 messages received per satellite pass, no least squares estimated error or unbounded kalman filter estimated error and 1 or 2 messages received per satellite pass, invalid location (available for Service Plus or Auxilliary Location Processing) ;
location_class:instrument = instrument_location ;
location_class:platform = animal ;
location_class:ancillary_variables = lat lon ;
location_class:coverage_content_type = qualityInformation ;
float64 error_radius(obs) ;
error_radius:long_name = Error radius ;
error_radius:units = m ;
error_radius:comment = If the position is best represented as a circle, this field gives the radius of that circle in meters. ;
error_radius:instrument = instrument_location ;
error_radius:platform = animal ;
error_radius:ancillary_variables = lat lon offset offset_orientation ;
error_radius:coverage_content_type = qualityInformation ;
float64 semi_major_axis(obs) ;
semi_major_axis:long_name = Error - ellipse semi-major axis ;
semi_major_axis:units = m ;
semi_major_axis:comment = If the estimated position error is best expressed as an ellipse, this field gives the length in meters of the semi-major elliptical axis (one half of the major axis). ;
semi_major_axis:instrument = instrument_location ;
semi_major_axis:platform = animal ;
semi_major_axis:ancillary_variables = lat lon ellipse_orientation offset offset_orientation ;
semi_major_axis:coverage_content_type = qualityInformation ;
float64 semi_minor_axis(obs) ;
semi_minor_axis:long_name = Error - ellipse semi-minor axis ;
semi_minor_axis:units = m ;
semi_minor_axis:comment = If the estimated position error is best expressed as an ellipse, this field gives the length in meters of the semi-minor elliptical axis (one half of the minor axis). ;
semi_minor_axis:instrument = instrument_location ;
semi_minor_axis:platform = animal ;
semi_minor_axis:ancillary_variables = lat lon ellipse_orientation offset offset_orientation ;
semi_minor_axis:coverage_content_type = qualityInformation ;
float64 ellipse_orientation(obs) ;
ellipse_orientation:long_name = Error - ellipse orientation in degrees clockwise from true north ;
ellipse_orientation:units = degrees ;
ellipse_orientation:comment = The angle in degrees of the ellipse from true north, proceeding clockwise (0 to 360). A blank field represents 0 degrees. ;
ellipse_orientation:instrument = instrument_location ;
ellipse_orientation:platform = animal ;
ellipse_orientation:ancillary_variables = lat lon semi_major_axis semi_minor_axis offset offset_orientation ;
ellipse_orientation:coverage_content_type = qualityInformation ;
float64 offset(obs) ;
offset:long_name = Error - offset in meters to center of error ellipse or circle ;
offset:units = m ;
offset:comment = This field is non-zero if the circle or ellipse are not centered on the (Latitude, Longitude) values on this row. "Offset" gives the distance in meters from (Latitude, Longitude) to the center of the ellipse. ;
offset:instrument = instrument_location ;
offset:platform = animal ;
offset:ancillary_variables = lat lon error_radius semi_major_axis semi_minor_axis offset_orientation ;
offset:coverage_content_type = qualityInformation ;
float64 offset_orientation(obs) ;
offset_orientation:long_name = Error - offset orientation angle to ellipse center ;
offset_orientation:units = degrees ;
offset_orientation:comment = If the "Offset" field is non-zero, this field is the angle in degrees from (Latitude, Longitude) to the center of the ellipse. Zero degrees is true north; a blank field represents 0 degrees. ;
offset_orientation:instrument = instrument_location ;
offset_orientation:platform = animal ;
offset_orientation:ancillary_variables = lat lon error_radius semi_major_axis semi_minor_axis offset ;
offset_orientation:coverage_content_type = qualityInformation ;
float64 gpe_msd(obs) ;
gpe_msd:comment = Historical. No longer applicable. ;
gpe_msd:long_name = ;
gpe_msd:units = ;
gpe_msd:instrument = instrument_location ;
gpe_msd:platform = animal ;
gpe_msd:coverage_content_type = auxillaryInformation ;
float64 gpe_u(obs) ;
gpe_u:comment = Historical. No longer applicable. ;
gpe_u:long_name = ;
gpe_u:units = ;
gpe_u:instrument = instrument_location ;
gpe_u:platform = animal ;
gpe_u:coverage_content_type = auxillaryInformation ;
float64 count(obs) ;
count:comment = Total number of times a particular data item was received, verified, and successfully decoded. ;
count:long_name = Count ;
count:units = count ;
count:instrument = instrument_location ;
count:platform = animal ;
count:coverage_content_type = auxillaryInformation ;
float32 qartod_time_flag(obs) ;
qartod_time_flag:standard_name = gross_range_test_quality_flag ;
qartod_time_flag:long_name = Time QC test - gross range test ;
qartod_time_flag:implementation = https://github.com/ioos/ioos_qc/ ;
qartod_time_flag:flag_meanings = PASS NOT_EVALUATED SUSPECT FAIL MISSING ;
qartod_time_flag:flag_values = [1 2 3 4 9] ;
qartod_time_flag:references = https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf ;
qartod_time_flag:coverage_content_type = qualityInformation ;
float32 qartod_speed_flag(obs) ;
qartod_speed_flag:standard_name = gross_range_test_quality_flag ;
qartod_speed_flag:long_name = Speed QC test - gross range test ;
qartod_speed_flag:references = https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf ;
qartod_speed_flag:implementation = https://github.com/ioos/ioos_qc/ ;
qartod_speed_flag:flag_meanings = PASS NOT_EVALUATED SUSPECT FAIL MISSING ;
qartod_speed_flag:flag_values = [1 2 3 4 9] ;
qartod_speed_flag:coverage_content_type = qualityInformation ;
float32 qartod_location_flag(obs) ;
qartod_location_flag:standard_name = location_test_quality_flag ;
qartod_location_flag:long_name = Location QC test - Location test ;
qartod_location_flag:implementation = https://github.com/ioos/ioos_qc/ ;
qartod_location_flag:flag_meanings = PASS NOT_EVALUATED SUSPECT FAIL MISSING ;
qartod_location_flag:flag_values = [1 2 3 4 9] ;
qartod_location_flag:references = https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf ;
qartod_location_flag:coverage_content_type = qualityInformation ;
float32 qartod_rollup_flag(obs) ;
qartod_rollup_flag:standard_name = aggregate_quality_flag ;
qartod_rollup_flag:long_name = Aggregate QC value ;
qartod_rollup_flag:implementation = https://github.com/ioos/ioos_qc/ ;
qartod_rollup_flag:flag_meanings = PASS NOT_EVALUATED SUSPECT FAIL MISSING ;
qartod_rollup_flag:flag_values = [1 2 3 4 9] ;
qartod_rollup_flag:references = https://cdn.ioos.noaa.gov/media/2020/03/QARTOD_TS_Manual_Update2_200324_final.pdf ;
qartod_rollup_flag:coverage_content_type = qualityInformation ;
int32 crs() ;
crs:epsg_code = EPSG:4326 ;
crs:grid_mapping_name = latitude_longitude ;
crs:inverse_flattening = 298.257223563 ;
crs:long_name = Coordinate Reference System - http://www.opengis.net/def/crs/EPSG/0/4326 ;
crs:semi_major_axis = 6378137.0 ;
crs:coverage_content_type = referenceInformation ;
object trajectory() ;
trajectory:cf_role = trajectory_id ;
trajectory:long_name = trajectory identifier ;
float64 animal_age() ;
animal_age:units = ;
animal_age:long_name = age of the animal as measured or estimated at deployment ;
animal_age:coverage_content_type = referenceInformation ;
animal_age:animal_age = Not provided ;
object animal_life_stage() ;
animal_life_stage:animal_life_stage = juvenile ;
animal_life_stage:long_name = Lifestage of the animal at time of deployment ;
animal_life_stage:coverage_content_type = referenceInformation ;
object animal_sex() ;
animal_sex:animal_sex = male ;
animal_sex:long_name = sex of the animal at time of tag deployment ;
animal_sex:coverage_content_type = referenceInformation ;
float32 animal_weight() ;
animal_weight:units = kg ;
animal_weight:long_name = mass of the animal as measured or estimated at deployment ;
animal_weight:animal_weight = Not provided ;
animal_weight:coverage_content_type = referenceInformation ;
float32 animal_length() ;
animal_length:animal_length_type = total length ;
animal_length:units = cm ;
animal_length:animal_length = 213.0 (cm) total length ;
animal_length:long_name = length of the animal as measured or estimated at deployment ;
animal_length:coverage_content_type = referenceInformation ;
float32 animal_length_2() ;
animal_length_2:animal_length_2_type = Not provided ;
animal_length_2:units = ;
animal_length_2:animal_length_2 = Not provided ;
animal_length_2:long_name = length of the animal as measured or estimated at deployment ;
animal_length_2:coverage_content_type = referenceInformation ;
object animal() ;
animal:suborder = ;
animal:infraorder = ;
animal:scientificname = Carcharodon carcharias ;
animal:long_name = tagged animal id ;
animal:superdomain = Biota ;
animal:order = Lamniformes ;
animal:authority = (Linnaeus, 1758) ;
animal:kingdom = Animalia ;
animal:species = Carcharodon carcharias ;
animal:genus = Carcharodon ;
animal:megaclass = ;
animal:family = Lamnidae ;
animal:taxonRankID = 220 ;
animal:class = Elasmobranchii ;
animal:cf_role = trajectory_id ;
animal:coverage_content_type = referenceInformation ;
animal:subphylum = Vertebrata ;
animal:phylum = Chordata ;
animal:AphiaID = 105838 ;
animal:valid_name = Carcharodon carcharias ;
animal:infraphylum = Gnathostomata ;
animal:subclass = Neoselachii ;
animal:rank = Species ;
object instrument_tag() ;
instrument_tag:manufacturer = Wildlife Computers ;
instrument_tag:make_model = SPOT5 ;
instrument_tag:serial_number = 07S0230 ;
instrument_tag:long_name = telemetry tag applied to animal ;
instrument_tag:coverage_content_type = referenceInformation ;
instrument_tag:calibration_date = Not Provided ;
object instrument_location() ;
instrument_location:manufacturer = Wildlife Computers ;
instrument_location:make_model = SPOT5 ;
instrument_location:serial_number = 07S0230 ;
instrument_location:long_name = Wildlife Computers SPOT5 ;
instrument_location:location_type = argos / modeled ;
instrument_location:comment = Location ;
instrument_location:coverage_content_type = referenceInformation ;
instrument_location:calibration_date = Not Provided ;
object taxon_name() ;
taxon_name:standard_name = biological_taxon_name ;
taxon_name:long_name = most precise taxonomic classification for the tagged animal ;
taxon_name:coverage_content_type = referenceInformation ;
taxon_name:source = Froese, R. and D. Pauly. Editors. (2023). FishBase. Carcharodon carcharias (Linnaeus, 1758). Accessed through: World Register of Marine Species at: https://www.marinespecies.org/aphia.php?p=taxdetails&id=105838 on 2023-08-16 ;
taxon_name:url = https://www.marinespecies.org/aphia.php?p=taxdetails&id=105838 ;
```
@albenson-usgs I'm poking around in this now.
For `locationID` I followed the guidance at https://github.com/tdwg/dwc-for-biologging/wiki/Acoustic-sensor-enabled-tracking-of-blue-sharks
But maybe that's only for the tagging event?
Now that I'm fiddling with the data more, I'm wondering if there should be two/three events.
cc @mmckinzie
Maybe https://github.com/tdwg/dwc-for-biologging/wiki/Movebank-GPS-data#darwin-core-recommendation is the right way?
This is what I understand from the text on movebank GPS data:
```mermaid
flowchart LR
    A([Deployment])
    B([Tag attachment])
    C([GPS positions])
    A --parentEventID--> B
    A --parentEventID--> C
    subgraph parent event
        A
    end
    subgraph child events
        B
        C
    end
```
I worked through some reorganizing after discussion on the Slack space. I think I have addressed most of the comments in https://github.com/ioos/bio_data_guide/issues/145#issuecomment-1692201277
It was decided to go with occurrence and emof (no event).
Here are the files and notebook for review:
`coordinateUncertaintyInMeters` is populated with fill values. Apparently this deployment doesn't have information about `error_radius`, `semi_major_axis`, `semi_minor_axis`, or `offset` to use for this entry. Is there something we can do when we don't have that information?

I am most curious about additional information we could be porting into the `occurrence` or `emof` record. For example, we have information about the `Instrument family` (e.g. SPOT), `Type of location: Argos, FastGPS or User`, `Location Quality Code from ARGOS satellite system`, `Platform Transmitter Terminal (PTT) id used for Argos transmissions`, `instrument_tag` (telemetry tag applied to animal, including serial number and make_model), and `instrument_location` (serial_number and make_model). Further information about each of those variables is included in the netCDF metadata in this comment: https://github.com/ioos/bio_data_guide/issues/145#issuecomment-1692211792

We also have a few flag variables (time, speed, location, and rollup) and a bunch of metadata that could be stuck somewhere.
ATN data are now being archived at NCEI. For the notebook I'm working on here, I would like to pull the source data from this archival information package. https://www.ncei.noaa.gov/archive/accession/0282699
@sformel-usgs will handle the next review on this. Also I know that @jdpye published some (lots?) of data to OBIS somewhat recently and might have some words of wisdom to share.
We did!
I looked over Mat's shoulder briefly at the IOOS DMAC but I would gently recommend we further align this to the standard that OTN and ETN had worked out for all our satellite and acoustic telemetry data publishing, if it's possible. Just a bit of summarization of the occurrences to keep the row count manageable when our datasets get included in general queries against OBIS in the future.
Here is the mapping table for the occurrence record:
DarwinCore | netCDF |
---|---|
basisOfRecord | data contained in the `type` variable, where a `type` of `User` = `HumanObservation` and `Argos` = `MachineObservation`. |
organismID | `platform_id` global attribute plus the `animal_common_name` global attribute. |
eventDate | data contained in the `time` variable, converted to ISO 8601. |
occurrenceID | `eventDate`, plus data contained in the `z` variable, plus the `animal_common_name` global attribute. |
decimalLatitude | data in the `lat` variable. |
decimalLongitude | data in the `lon` variable. |
geodeticDatum | attribute `epsg_code` in the `crs` variable. |
eventID | `animal_common_name` global attribute plus the `eventDate`. |
kingdom | `kingdom` attribute in the `animal` variable. |
taxonRank | `rank` attribute in the `animal` variable. |
occurrenceStatus | hardcoded to `present`. |
sex | data from the variable `animal_sex`. |
lifeStage | data from the variable `animal_life_stage`. |
scientificName | data from the variable `taxon_name`. |
scientificNameID | data from the variable `taxon_lsid`. |
coordinateUncertaintyInMeters | maximum value of the data from the variables `error_radius`, `semi_major_axis`, and `offset`. |
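Here is a minimal Python sketch of how that mapping could be applied to one row of the example table. It is only an illustration, not the notebook code: the `platform_id` and `animal_common_name` values are not shown in this thread, so `example_platform` and `white shark` are placeholders, and joining the identifier parts with underscores (and replacing spaces) is an assumption. `FastGPS` is not covered by the mapping above, so it falls through to `None` here.

```python
import math

# Mapping of the netCDF `type` values to basisOfRecord, per the table above.
BASIS_OF_RECORD = {"User": "HumanObservation", "Argos": "MachineObservation"}


def coordinate_uncertainty(row):
    """Return the largest of error_radius, semi_major_axis and offset, or None."""
    candidates = (row.get(k) for k in ("error_radius", "semi_major_axis", "offset"))
    present = [v for v in candidates if v is not None and not math.isnan(v)]
    return max(present) if present else None


def to_occurrence(row, platform_id, common_name, datum="EPSG:4326"):
    """Build a partial occurrence record from one netCDF observation row."""
    # Assumption: times are UTC and spaces in identifiers become underscores.
    event_date = row["time"].replace(" ", "T") + "Z"
    name = common_name.replace(" ", "_")
    return {
        "basisOfRecord": BASIS_OF_RECORD.get(row["type"]),
        "organismID": f"{platform_id}_{name}",
        "eventDate": event_date,
        "occurrenceID": f"{event_date}_{row['z']}_{name}",
        "eventID": f"{name}_{event_date}",
        "decimalLatitude": row["lat"],
        "decimalLongitude": row["lon"],
        "geodeticDatum": datum,
        "occurrenceStatus": "present",
        "sex": row["animal_sex"],
        "lifeStage": row["animal_life_stage"],
        "scientificName": row["taxon_name"],
        "scientificNameID": row["taxon_lsid"],
        "coordinateUncertaintyInMeters": coordinate_uncertainty(row),
    }


nan = float("nan")
# First row of the example table in this thread.
row = {
    "time": "2009-09-23 00:00:00", "z": 0, "lat": 34.03, "lon": -118.56,
    "type": "User", "error_radius": nan, "semi_major_axis": nan, "offset": nan,
    "animal_sex": "male", "animal_life_stage": "juvenile",
    "taxon_name": "Carcharodon carcharias",
    "taxon_lsid": "urn:lsid:marinespecies.org:taxname:105838",
}
occurrence = to_occurrence(row, platform_id="example_platform", common_name="white shark")
```

When none of the error variables are populated, as in this deployment, the sketch leaves `coordinateUncertaintyInMeters` as `None` rather than writing a fill value.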
And for the measurement or fact file
The measurementOrFact file will only contain information referencing the `basisOfRecord` = `HumanObservation`, as these observations were made when the animal was directly tagged, in person (i.e. when `basisOfRecord` == `HumanObservation`).
DarwinCore Term | Status | netCDF |
---|---|---|
organismID | | The `platform_id` global attribute plus the `animal_common_name` global attribute. |
occurrenceID | Required | `eventDate`, plus data contained in the `z` variable, plus the `animal_common_name` global attribute. |
measurementType | Required | `long_name` attribute of the `animal_weight`, `animal_length`, `animal_length_2` variables. |
measurementValue | Required | The data from the `animal_weight`, `animal_length`, `animal_length_2` variables. |
eventID | Strongly Recommended | `animal_common_name` global attribute plus the `eventDate`. |
measurementUnit | Strongly Recommended | `units` attribute of the `animal_weight`, `animal_length`, `animal_length_2` variables. |
measurementMethod | Strongly Recommended | `animal_weight`, `animal_length`, `animal_length_2` attributes of their respective variables. |
measurementTypeID | Strongly Recommended | mapping table somewhere? |
measurementMethodID | Strongly Recommended | mapping table somewhere? |
measurementUnitID | Strongly Recommended | mapping table somewhere? |
measurementAccuracy | Share if available | |
measurementDeterminedDate | Share if available | |
measurementDeterminedBy | Share if available | |
measurementRemarks | Share if available | |
measurementValueID | Share if available | |
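A minimal sketch of how the measurement rows could be assembled from the animal variables, assuming each variable is represented as a small dictionary (the `value`, `long_name`, `units`, and `method` keys are names invented for this example). The attribute values are copied from the netCDF header earlier in this thread, and measurements without a value are skipped so the file only contains data that were actually observed.

```python
import math


def to_emof(occurrence_id, measurements):
    """Build eMoF rows, skipping measurements that have no observed value."""
    rows = []
    for m in measurements:
        value = m["value"]
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue
        rows.append({
            "occurrenceID": occurrence_id,
            "measurementType": m["long_name"],
            "measurementValue": value,
            "measurementUnit": m["units"],
            "measurementMethod": m["method"],
        })
    return rows


nan = float("nan")
# Values taken from the animal_weight, animal_length and animal_length_2 variables.
measurements = [
    {"value": nan, "units": "kg", "method": "Not provided",
     "long_name": "mass of the animal as measured or estimated at deployment"},
    {"value": 213.0, "units": "cm", "method": "213.0 (cm) total length",
     "long_name": "length of the animal as measured or estimated at deployment"},
    {"value": nan, "units": "", "method": "Not provided",
     "long_name": "length of the animal as measured or estimated at deployment"},
]
# Hypothetical occurrenceID for the tagging (HumanObservation) record.
emof = to_emof("2009-09-23T00:00:00Z_0_white_shark", measurements)
```

For this deployment only the total length survives, so the eMoF has a single row tied to the tagging occurrence.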
@MathewBiddle I'm still getting up to speed on this. Does anything need review right now?
@jdpye From https://github.com/ioos/bio_data_guide/issues/145#issuecomment-1686715497, my understanding is the decimation strategy for these satellite telemetry observations should be:

> 'take the first detection/location per hour', with other Darwin Core fields like dataGeneralizations helping characterize the summarization by indicating how many detections have been obfuscated by the aggregation.

So, I will work on taking my occurrence table and decimating it to the first detection each hour. Does that sound reasonable?
@sformel-usgs Yes! If you don't mind taking a look at the csv files I reference in https://github.com/ioos/bio_data_guide/issues/145#issuecomment-1710385902, that will help us in the overarching organization of these data. I think the decimation strategy will simply limit the amount of rows from what we have above.
> @jdpye From #145 (comment), my understanding is the decimation strategy for these satellite telemetry observations should be:
>
> 'take the first detection/location per hour', with other Darwin Core fields like dataGeneralizations helping characterize the summarization by indicating how many detections have been obfuscated by the aggregation.
>
> So, I will work on taking my occurrence table and decimating it to the first detection each hour. Does that sound reasonable?

Yep! With this, you can add into dataGeneralizations a string like 'first of # records' to indicate there are more records in the raw dataset to be discovered by the super-curious.
I just finished prototyping a DwC archive to lonboard / Deck.gl vis tool, so I will attempt to eat your DwC archive with it when I get time!
Here's a stab at filtering the occurrence record down to the first occurrence per hour (in Python). https://gist.github.com/MathewBiddle/d434ac2b538b2728aa80c6a7945f94be
Now to write that in R...
Figured out how to do it in R (hacky but works for now):
```r
library(dplyr)      # provides %>%, arrange() and distinct()
library(lubridate)

# sort by date
occurrencedf <- occurrencedf %>% arrange(eventDate)
# create column of date to the hour which will be our decimation strategy
# (parse and format in UTC so the local time zone cannot shift the hour)
occurrencedf$eventDateHrs <- format(
  as.POSIXct(occurrencedf$eventDate, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"),
  "%Y-%m-%dT%H"
)
# filter table to only unique date + hour and pick the first row keeping all the columns
occurrencedf <- distinct(occurrencedf, eventDateHrs, .keep_all = TRUE)
# nuke the invented column
occurrencedf$eventDateHrs <- NULL
occurrencedf
```
In these data we also have additional information about the Location Quality Code from the ARGOS satellite system and the QARTOD tests. Below are the codes and their meanings.
code_values | code meanings |
---|---|
G | estimated error less than 100m and 1+ messages received per satellite pass |
3 | estimated error less than 250m and 4+ messages received per satellite pass |
2 | estimated error between 250m and 500m and 4+ messages per satellite pass |
1 | estimated error between 500m and 1500m and 4+ messages per satellite pass |
0 | estimated error greater than 1500m and 4+ messages received per satellite pass |
A | no least squares estimated error or unbounded kalman filter estimated error and 3 messages received per satellite pass |
B | no least squares estimated error or unbounded kalman filter estimated error and 1 or 2 messages received per satellite pass |
Z | invalid location (available for Service Plus or Auxilliary Location Processing) |
Since codes `A`, `B`, and `Z` are essentially bad values, I propose that we filter those out.
Also, create a mapping table for `coordinateUncertaintyInMeters` that corresponds to the ARGOS code maximum error as shown in the table below:
code | coordinateUncertaintyInMeters |
---|---|
G | 100 |
3 | 250 |
2 | 500 |
1 | 1500 |
0 | >1500 (not sure what would go there?) |
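A minimal sketch of that proposal: drop the `A`, `B`, and `Z` classes and look up `coordinateUncertaintyInMeters` from the table above. Leaving class `0` (and rows with no location class, such as the `User` row) blank is an assumption made here because no upper bound is agreed at this point; the function and constant names are made up for the example.

```python
# Maximum error per ARGOS location class, in meters, from the table above.
# Class 0 has no stated upper bound, so it is left blank (None) in this sketch.
ARGOS_UNCERTAINTY_M = {"G": 100, "3": 250, "2": 500, "1": 1500, "0": None}
BAD_CLASSES = {"A", "B", "Z"}


def apply_location_class(rows):
    """Drop rows with bad location classes and attach an uncertainty to the rest."""
    kept = []
    for row in rows:
        loc_class = str(row["location_class"])
        if loc_class in BAD_CLASSES:
            continue
        out = dict(row)
        # Unknown or missing classes (e.g. nan for User locations) map to None.
        out["coordinateUncertaintyInMeters"] = ARGOS_UNCERTAINTY_M.get(loc_class)
        kept.append(out)
    return kept


# obs and location_class values from the example table in this thread.
rows = [
    {"obs": 0, "location_class": float("nan")},
    {"obs": 1, "location_class": "A"},
    {"obs": 2, "location_class": "1"},
    {"obs": 3, "location_class": "0"},
    {"obs": 4, "location_class": "1"},
]
filtered = apply_location_class(rows)
```

If a defensible upper bound for class `0` is agreed on, it only needs to be changed in the lookup dictionary.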
The QARTOD flag values and their meanings are:
value | meaning |
---|---|
1 | PASS |
2 | NOT_EVALUATED |
3 | SUSPECT |
4 | FAIL |
9 | MISSING |
The QARTOD tests are:
variable | long_name |
---|---|
qartod_time_flag | Time QC test - gross range test |
qartod_speed_flag | Speed QC test - gross range test |
qartod_location_flag | Location QC test - Location test |
qartod_rollup_flag | Aggregate QC value |
I'm not sure what to do here. My preference would be to include all rows where `qartod_rollup_flag` == 1 and drop the rest. But I'm open to suggestions.
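For reference, that preference amounts to something like the following sketch, which keeps only rows whose rollup flag is PASS. The helper name `keep_passing` is invented for this example; the flag values come from the example table in this thread.

```python
QARTOD_PASS = 1  # flag value for PASS in the QARTOD flag table above


def keep_passing(rows, flag="qartod_rollup_flag"):
    """Return only the rows whose QARTOD flag equals PASS."""
    return [row for row in rows if row[flag] == QARTOD_PASS]


# obs and qartod_rollup_flag values from the example table in this thread.
rows = [
    {"obs": 0, "qartod_rollup_flag": 1},
    {"obs": 1, "qartod_rollup_flag": 4},
    {"obs": 2, "qartod_rollup_flag": 4},
    {"obs": 3, "qartod_rollup_flag": 4},
    {"obs": 4, "qartod_rollup_flag": 1},
]
passing = keep_passing(rows)
```

On the five example rows this keeps only observations 0 and 4, so it is worth checking how much of a real deployment survives before committing to it.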
@sformel-usgs @jdpye I've updated the notebook (and on nbviewer) to include this decimation strategy, as well as some initial filtering based on location class and the inclusion of `dataGeneralizations` in the occurrence record. I've filtered the emof down to only contain rows where data were actually observed.
If you don't mind taking a look when you get a chance, it would be much appreciated! I think there are some additional details we can add to the occurrence/emof from the netCDF files, I'm just not sure what.
@MathewBiddle here are some thoughts. I'm still feeling like I don't have a good grasp on all the moving parts, so please ping me here or in Slack if there is anything I didn't address specifically, no matter how small. I don't see any big issues, what you've derived works as a DwC-A. But I'm going to dig through the data a little more and see if there is anything else I think could be included.
I was able to work through most of the R notebook with no big issues. There are some spots where I think I could help make things more succinct and/or readable. I just forked the repo and will submit a PR with some suggestions. I'll try to do this tomorrow morning.
I couldn't quickly identify where to grab the file, atn_trajectory_template.nc, that is referenced in the EML building (cell 54).
coordinateUncertaintyInMeters needs to be an integer or blank. So, if you can't put a confident maximum boundary on > 1500, then you can leave it blank for unknown. I'll take a closer look at that data when I have some more time.
I understand that the QARTOD flags are for QC, but I don't know enough about them to say if they should be filtered out. If not, the flags could be included through eMOF (although that might be easy to overlook, and therefore a bad idea).
For the eMOF attributes that you've marked as "mapping table somewhere?". I'm not sure if this is what you're after, but I think these would need to be found on a case-by-case basis. But it should be easy to find some examples for measurementUnitID. The other two would depend on whether or not anyone has published the method and type in a database like NERC.
I think I can help find your P01 codes for the measurements, sorry, I didn't look at the emof file on the first pass.
I'll look at this today!
For the coordinateUncertaintyInMeters distance for Argos location class 0, this paper suggests an upper bound of ~10 km. https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0063051
From that paper, this quote:
In brief, “good” positions (location codes 3, 2, 1, A) are accurate to about 2 km, while 0 and B locations are accurate to about 5–10 km. However, due to the lognormal distribution of the errors, larger outliers are to be expected in all location codes and need to be accounted for in the user’s data processing.
does not fill my heart with joy, so the upper bound of the estimate is probably a safer value to include.
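Pulling the pieces together, the mapping from ARGOS location class to coordinateUncertaintyInMeters could be sketched as below. The values for G, 3, 2, and 1 come from the table earlier in the thread; class 0 is left blank by default (integer or blank, as noted above), with an optional upper bound such as the ~10 km figure from the paper. The function name and the `class0_bound_m` parameter are hypothetical.

```python
# Maximum error in meters per ARGOS location class, from the table above.
ARGOS_MAX_ERROR_M = {"G": 100, "3": 250, "2": 500, "1": 1500}


def coordinate_uncertainty(location_class, class0_bound_m=None):
    """Return an integer uncertainty in meters, or "" when it is unknown.

    class0_bound_m is an optional, user-chosen upper bound for class 0
    (for example 10000 if the ~10 km estimate is adopted).
    """
    code = str(location_class).strip().upper()
    if code == "0":
        return "" if class0_bound_m is None else int(class0_bound_m)
    value = ARGOS_MAX_ERROR_M.get(code)
    # Any class not in the table is treated as unknown and left blank.
    return "" if value is None else value


print(coordinate_uncertainty("3"))
print(coordinate_uncertainty(0))
print(coordinate_uncertainty(0, class0_bound_m=10000))
```

Keeping the class 0 bound as a parameter leaves the decision open until the group settles on a value.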
Thanks for taking a look! I should have mentioned the EML section of the notebook is a work in progress. It should reference the same netcdf file that is used to generate the dwc files (the one from NCEI). I just haven't updated it in a few months.
Something to discuss is whether generating the EML is even necessary. Would OBIS-USA generate the EML? Is there a way for a provider to upload an EML xml file? How should we deal with this with the expectation that we might want to automate the process?
If everyone has filled in their metadata for the NetCDF files in the same way, we should get a simple EML template for this flavour of data and map our incoming data to it, and submit that to your OBIS publication endpoint along with the data, as an initial pass of the metadata for the archive. Amendments can be made after the initial metadata harvest from the source NetCDF, but we should have a good start from there.
If we build a simple eml.xml and zip it up, the metadata pre-populates and will save your OBIS data manager a bit of headache :D
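A very rough sketch of that idea: map a few netCDF global attributes into a minimal eml.xml and zip it with the data files. Several things here are assumptions rather than anything confirmed in this thread: the attribute names (`title`, `summary`, `creator_name`, in the style of ACDD), the EML 2.1.1 namespace, the placeholder `packageId` and `system` values, and the helper names. The output has not been validated against the EML schema, and a complete DwC-A would also need a meta.xml.

```python
import io
import xml.etree.ElementTree as ET
import zipfile

# Assumed namespace for EML 2.1.1; adjust to the version the endpoint expects.
EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def build_eml(attrs, package_id):
    """Build a minimal, unvalidated EML document from netCDF-style attributes."""
    ET.register_namespace("eml", EML_NS)
    root = ET.Element(
        f"{{{EML_NS}}}eml",
        {"packageId": package_id, "system": "placeholder-system"},
    )
    dataset = ET.SubElement(root, "dataset")
    ET.SubElement(dataset, "title").text = attrs["title"]
    creator = ET.SubElement(dataset, "creator")
    name = ET.SubElement(creator, "individualName")
    # The whole name goes into surName here; splitting names is left out.
    ET.SubElement(name, "surName").text = attrs["creator_name"]
    abstract = ET.SubElement(dataset, "abstract")
    ET.SubElement(abstract, "para").text = attrs["summary"]
    return ET.tostring(root, encoding="unicode")


def build_archive(files):
    """Zip a mapping of file name to text content and return the bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_name, text in files.items():
            archive.writestr(file_name, text)
    return buffer.getvalue()


# Hypothetical global attributes standing in for those read from the netCDF file.
attrs = {
    "title": "Example ATN satellite telemetry deployment",
    "summary": "Example abstract text taken from the file metadata.",
    "creator_name": "Example Creator",
}
eml_text = build_eml(attrs, package_id="example-package-id")
archive_bytes = build_archive(
    {"eml.xml": eml_text, "occurrence.csv": "occurrenceID\n"}
)
print(len(archive_bytes) > 0)
```

The archive is built in memory so the sketch has no side effects; writing `archive_bytes` to a .zip file is the only extra step for a real run.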
@MathewBiddle the IPT is all fat fingers. So, the more EML you can generate programmatically, the less time it will take and the less chance of human error. But just do the easy stuff, don't worry about getting every detail.
Since these are satellite telemetry observations, our depth of measurement is always == 0, so minimumDepthInMeters and maximumDepthInMeters should be 0, correct? Does it cause an issue if they are the same value?
No, that's fine that they are the same value.
I have added in min/max depth to the occurrence file https://github.com/MathewBiddle/ioos_code_lab/blob/r_nc2dwc/jupyterbook/content/code_gallery/data_management_notebooks/atn_45866_occurrence.csv
I've merged the optimizations @sformel-usgs proposed and cleaned up some of the comments.
As far as the metadata goes, the source netCDF files are built via an automated pipeline, so we know what content is going where and how much (or little) it will be standardized. It's merely a mapping exercise to get the information into EML for the records. However, I am curious to get @mmckinzie to weigh in on the granularity of the "datasets" for OBIS. Right now, we are archiving at NCEI on a deployment by deployment basis, is that too granular for OBIS?
Obviously, it would be much simpler to have 1 ATN dataset that is updated with new deployments as they make it to NCEI. But, we lose some granularity in the credits at OBIS when we do that.
Some items to consider:
Maybe there's lessons learned from the CREMP datasets we should explore?
I think answering those questions will help us decide what needs to be mapped into the metadata record.
Should we also include samplingProtocol == satellite telemetry? Similar to https://github.com/inbo/etn/blob/abfe5b000913706f50a7563c92e9024f668046a1/inst/sql/dwc_occurrence.sql#L222C45-L222C61
@MathewBiddle sorry if I'm overlooking it in the above comments, could you point me to an example metadata record from ATN/NCEI? I don't have a sense of what is included, how many people are credited, and how often it's updated.
Here is the NCEI landing page for this dataset https://www.ncei.noaa.gov/archive/accession/0282699
That metadata record is built at NCEI directly from the netCDF file, plus any additional NCEI metadata. My hope would be that we would build the EML metadata directly from the netCDF file instead of harvesting from another source. But, I'm open to suggestions.
In a perfect world, these data wouldn't have updates. The archive packages will be updated only when there are additions of other observing methods, like profile observations or modeled tracks (foieGras analysis), which would be added in separate files. So, the satellite telemetry data files would be static. But, we all know that perfect worlds are hard to come by, so building in an update process would be wise of us.
As for the number of people credited, that could be anywhere from 1-n, some of these will be one PI, others could have ten, it's highly variable.
Note: ATN and NCEI are still working out the authorship and acknowledgements in the files and resultant NCEI metadata, as some pieces weren't mapped correctly. That should be addressed very soon.
I got confused with the files in different repos. So, I've added the mobilization notebook here as a PR and converted it to .Rmd
rmarkdown:::convert_ipynb('atn_satellite_telemetry_netCDF2DwC.ipynb',"atn_satellite_telemetry_netCDF2DwC.Rmd")
The .Rmd, source data, and resultant DwC can be found in this directory: https://github.com/MathewBiddle/bio_data_guide/tree/add_atnsat_telem/datasets/atn_satellite_telemetry
I like the samplingProtocol as 'satellite telemetry', we were talking with the rest of the tdwg MOBS group about deciding on a controlled or suggested vocabulary for samplingProtocol and any steps we take towards that will help us down the line.
I would argue strongly for creating granular datasets, first because attribution can be precise and comprehensive without overattributing researchers to unrelated tracks held at ATN, but also because that would allow individual researchers to revise/update/extend their program or their individual track data as needed without triggering a major update of some ATN-wide archive.
Is there a place in Darwin Core where we could have a link that goes to the NCEI archived raw data?
@laurabrenskelle was looking into this.
associatedMedia? associatedReferences?
We would also want to do this for passive acoustic data. Pointing back to the raw audio files at NCEI.
Created an issue to discuss this in the DwC Q&A repo: https://github.com/tdwg/dwc-qa/issues/207
dcterms:references is the term to use. We will just need to make sure there is a way to trace an occurrence from OBIS to a particular record in the ATN NCEI archive. Probably occurrenceID?
I agree, if the identifier for the observation record can stay consistent across service endpoints, that would be ideal, and occurrenceID would be the way to go.
@MathewBiddle for PAM and the raw audio, I think we should use associatedMedia to describe a single wav/flac file within an archive, since this is analogous to the raw sequences from DNA data that are referenced in associatedSequences. However, the entire archive of sound files could either be described with associatedMedia or references (maybe let the community guide us on this one).
I don't want to close this just yet.
TODO:
references with NCEI url
@laurabrenskelle can you take a look at the rmd and see how we can add the NCEI url into the dwc archive?
See https://github.com/ioos/bio_data_guide/tree/main/datasets/atn_satellite_telemetry/
And https://github.com/ioos/bio_data_guide/tree/main/datasets/atn_satellite_telemetry/data/dwc
@MathewBiddle Are we just wanting to add the link to the landing page to references or are we trying to be more granular than that?
Good question. By granular, what do you mean? I don't think we can get much more granular than that from NCEI. Unless we're talking about the specific url to the data file?
I think either including the url to the landing page (e.g. https://www.ncei.noaa.gov/archive/accession/0282699) or one of the identifiers from the landing page (screenshot below) would suffice.
Sorry, I guess because this is just one dataset from one shark's track, the landing page should suffice. Is that the case for all ATN data, or are they ever aggregated with data from multiple animals mixed together?
They will be archived on a deployment by deployment basis. So it should be one animal for each netCDF file.
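Since each netCDF file covers one deployment, attaching the landing page to every occurrence is a one-line mapping. The sketch below shows one way it might look; the URL pattern is taken from the landing page linked above, while the `add_references` helper and the occurrenceID values are hypothetical placeholders.

```python
# URL pattern taken from the NCEI landing page linked in this thread.
NCEI_LANDING = "https://www.ncei.noaa.gov/archive/accession/{accession}"


def add_references(occurrences, accession):
    """Return copies of the occurrence rows with a references URL added."""
    url = NCEI_LANDING.format(accession=accession)
    return [{**occurrence, "references": url} for occurrence in occurrences]


# Hypothetical occurrence rows; real occurrenceID values come from the notebook.
occurrences = [
    {"occurrenceID": "example-occurrence-1"},
    {"occurrenceID": "example-occurrence-2"},
]

with_refs = add_references(occurrences, accession="0282699")
print(with_refs[0]["references"])
```

Tracing an individual occurrence back to a specific record inside the archive would still rely on occurrenceID staying consistent, as discussed above.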
An "old" but still ATN-relevant conversation from the TDWG Darwin Core Q&A issues: https://github.com/tdwg/dwc-qa/issues/173 I thought it was worth dropping here for future reference.
Thanks, @laurabrenskelle! I think I forgot to include dwc:samplingProtocol as satellite telemetry per https://github.com/ioos/bio_data_guide/issues/145#issuecomment-1814954547
Contact details
mathew.biddle@noaa.gov
Dataset Title
ATN satellite telemetry data
Describe your dataset and any specific challenges or blockers you have or anticipate.
We are very close to a final netCDF template for ATN's satellite trajectory deployment files.
https://github.com/ioos/ioos-atn-data/blob/main/templates/atn_trajectory_template.cdl
Last year, I developed an R script to read in the template and start creating a DwC-A package. This year I'd like to finish that work, assuming we finish the template and create some example files.
https://github.com/MathewBiddle/ioos_code_lab/blob/r_nc2dwc/jupyterbook/content/code_gallery/data_management_notebooks/DRAFT-R-netCDF2DwC.ipynb
xref:
Link to "raw" Data Files.
https://github.com/ioos/ioos-atn-data/tree/main/data