tdwg / dwc-for-biologging

Darwin Core recommendations for biologging data
Creative Commons Attribution 4.0 International

RAATD dataset for use case #19

data-biodiversity-aq opened 4 years ago

data-biodiversity-aq commented 4 years ago

Hello,

@Antonarctica has a dataset from the Retrospective Analysis of Antarctic Tracking Data (RAATD) project which could be interesting for the use cases. The project uses a mix of sensors:

Global Location Sensors (GLS loggers or geolocators), satellite-relayed Platform Terminal Transmitters (PTTs), and Global Positioning System devices (GPS)

Thanks a lot!

jdpye commented 4 years ago

the RAATD data is semi-famous in my circles now, I know how much work has gone into straightening that data out and I would love to implement that example in this format!

jdpye commented 4 years ago

it's possible that there's some overlap with this data example I've solicited from @ianjonsen as well: https://github.com/ianjonsen/tdwg_imos/wiki/Argos-satellite-tracking-of-southern-elephant-seals but I know the full RAATD dataset will have more coincident deployments on it and I think that's a very useful example.

ianjonsen commented 4 years ago

It might be best to go with the RAATD data, depending on how much @Antonarctica has (probably all of it, Anton?). This would be a more comprehensive test case, with multiple species, deployment locations, tag types, etc. The data aren't in the public domain yet as the Scientific Data paper hasn't come out, but I expect this will happen quite soon (possibly in the next 1-2 months).

Antonarctica commented 4 years ago

Hi, I just got the last version back from the AADC, so it is all on my laptop. We should be able to standardise it to the Darwin Core Event format, and this would indeed be the right time to do it.

peggynewman commented 4 years ago

What's the best way to go about working on that? Like Movebank has, with a few examples on a spreadsheet in GitHub and the larger dataset sitting elsewhere?

Antonarctica commented 4 years ago

The original compiled datasets will reside at the Australian Antarctic Data Centre (this was decided a while ago; if making the decision now we might have gone with Zenodo).

The plan is to have a full copy published through the biodiversity.aq IPT in Darwin Core Event core (feeding into OBIS/GBIF). It would be good if that could also be linked to Movebank, but I'm not that knowledgeable on Movebank and how the flow would go best (also, some of the data might already be in there).

This is a good example to follow, I guess: https://zenodo.org/record/3541812#.XgZmXi2ZM0o although our dataflow would be different.

We have standardised data and filtered data

We have metadata on the deployment (see below)

For the standardised data: RAATD_ADPE_standardized.csv

For the filtered data, e.g.: ADPE_ssm_by_id.rds, ADPE_ssm_by_id.pdf, ADPE_ssm_by_id_predicted.csv

Metadata variables: dataset_identifier, file_name, individual_id, keepornot, scientific_name, common_name, abbreviated_name, sex, age_class, device_id, device_type, deployment_site, deployment_year, deployment_month, deployment_day, deployment_time, deployment_decimal_longitude, deployment_decimal_latitude, data_contact, contact_email, comments

msweetlove commented 4 years ago

I made a first attempt to format the RAATD data in the Darwin Core format, with an Event core and Occurrence extension. Before we push the whole dataset into this format, it might be useful for you guys to have a look at it and give some feedback on the approach taken. The most important formatting discussion points are written down in the README file.

The data, R-script to format it and README can be found here: https://github.com/msweetlove/dwc-for-biologging/tree/master/use-cases/RAATD-penguin-tracking-use-case-for-discussion

If possible, can someone with admin rights merge my fork of this repo with tdwg/dwc-for-biologging?

ianjonsen commented 4 years ago

This looks pretty good; the only issue I've noticed so far is that the variable "fieldnotes" contains the Argos location quality indices. These indices are essential for Argos location quality control and other movement modelling processes and should have a more informative variable name. Will the schema allow "location quality" to be used instead of "fieldnotes"? I would worry that anything named "fieldnotes" would be one of the first variables stripped by automated data processing workflows. Additionally, the values should simply be in the set {3,2,1,0,"A","B","Z"} or {3,2,1,0,-1,-2,-9}, rather than "location_quality= Z", etc.
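The clean-up suggested here could look something like the following sketch (the helper name and the exact prefix format are assumptions, since the raw values in the file aren't shown beyond "location_quality= Z"):

```python
import re

# Valid Argos location classes, per the comment above.
VALID_CLASSES = {"3", "2", "1", "0", "A", "B", "Z"}

def bare_location_class(value: str) -> str:
    """Strip a hypothetical 'location_quality=' prefix and whitespace,
    returning just the bare Argos location class."""
    cleaned = re.sub(r"^\s*location_quality\s*=\s*", "", value).strip()
    if cleaned not in VALID_CLASSES:
        raise ValueError(f"not a recognised Argos location class: {value!r}")
    return cleaned
```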

Antonarctica commented 4 years ago

"location quality" is not part of the Standard darwincore terms (an overview here: https://dwc.tdwg.org/terms/).

A couple of options come to mind

with option 2 maybe being e good compromise

1) There is a "location remarks" but that is as useful as "field notes".

2) Another one is dynamicPropoerties which is a gathering bin but one that can be structured. It is format like this: {"heightInMeters":1.5}, {"tragusLengthInMeters":0.014, "weightInGrams":120}, {"natureOfID":"expert identification", "identificationEvidence":"cytochrome B sequence"}.

3) Then there is measurement or fact (which should go in a separate extension normally). In measurement or fact you get measurementID measurementType measurementValue measurementAccuracy measurementUnit measurementDeterminedBy measurementDeterminedDate measurementMethod measurementRemarks

Not sure how other solved this.
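A minimal sketch of option 2, packing the Argos location class into a dynamicProperties JSON string; the key name "argosLocationClass" is an assumption, not an established vocabulary term:

```python
import json

def to_dynamic_properties(location_class: str) -> str:
    """Serialize a single Argos location class as a Darwin Core
    dynamicProperties JSON object (key name is hypothetical)."""
    return json.dumps({"argosLocationClass": location_class})
```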

jdpye commented 4 years ago

Thank you for pulling this together, Maxime! I'll see if I have the right permissions to merge your fork into a demo/example subfolder here.

I know that we're encouraged to keep dynamicProperties sparse if we can at all help it, but I agree with option 2, and can see the value in designating a transient variable that'd only be available in certain subclasses of biologging location data. Option 1 is inviting ourselves to repurpose locationRemarks as dynamicPropertiesAboutLocations; probably nobody will like us for doing that!

Short of translating Argos location qualities into coordinateUncertaintyInMeters, I don't know what else we'd do other than include something in dynamicProperties.

To completely convince myself, I'm going to poke around a few other example DwC occurrence archives in GBIF/OBIS that are using Argos for location data. So far the ones I've found have not included the quality info inline and have simply alluded to the fact that they 'filtered erroneous location data' in the archive-level metadata, so that's a bar that I think we can clear with your proposed solution.

ianjonsen commented 4 years ago

wrt Argos location data: option 3 has some merits, as there are now different "flavours" of Argos location data that could be captured in the "measurementDeterminedBy" variable: 1) locations based on CLS Argos' old Least-Squares algorithm; 2) locations based on their Kalman Filter algorithm; 3) locations based on their Kalman Filter & Smoother algorithms (users have to pay additional fees for this and it's only available in post-processing, so I'd guess it's relatively rare). "measurementMethod" could be used to identify the type of location data (Argos, GPS, GLS, ...), no?

In the case of "old" Least-Squares data, all you get is a "location quality" class for each observation. It is an index of accuracy, so it could be captured by "measurementAccuracy". The Kalman Filtered and Kalman Filtered & Smoothed flavours have "location quality" and error ellipse variables (Ellipse Semi-Major Axis, Ellipse Semi-Minor Axis, Ellipse Orientation). These are all important for modelling (location quality control and other applications).

I'd guess I'm preaching to the choir here, but... you would never want to archive/serve Argos data that had "erroneous location data" filtered or otherwise removed. I'd think you'd want to either provide filtered (or otherwise quality-controlled) location data as a separate, derived ("modelled" in the broadest sense) version of the data, or via a flag that indicates whether a record passed or failed the quality control process(es). I'd guess the metadata would have to capture the essentials of the quality control process applied. In the case of statistical quality control processes, e.g. state-space models, this is where coordinateUncertaintyInMeters can be used to capture the estimated location uncertainty.
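One possible translation from Argos location class to a nominal coordinateUncertaintyInMeters value, using the upper bounds of the nominal accuracy ranges from the Argos manual; the choice of upper bound (rather than, say, midpoint) and the decision to leave unbounded classes empty are assumptions:

```python
# Nominal accuracy upper bounds (metres) per Argos location class.
# Classes 0, A, B and Z carry no bounded accuracy estimate, so they
# map to None here and would leave coordinateUncertaintyInMeters blank.
ARGOS_CLASS_TO_UNCERTAINTY_M = {
    "3": 250,    # accuracy < 250 m
    "2": 500,    # accuracy 250-500 m
    "1": 1500,   # accuracy 500-1500 m
    "0": None,   # accuracy > 1500 m (unbounded)
    "A": None,   # no accuracy estimate
    "B": None,   # no accuracy estimate
    "Z": None,   # invalid location
}

def coordinate_uncertainty_m(location_class):
    """Return a nominal uncertainty in metres, or None if unbounded."""
    return ARGOS_CLASS_TO_UNCERTAINTY_M.get(str(location_class))
```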

Antonarctica commented 4 years ago

@jdpye if memory serves me right, the OBIS logic would be to throw it all in extended MeasurementOrFact (lat, long, location quality) and have a simplified track or range polygon at the event level.

@ianjonsen The standardised vs filtered discussion is one that comes back. Given that OBIS and GBIF mainly deal with primary observations, my feeling is that filtered data would be quite heavily processed and not really be the primary observation anymore (also, you ideally try to keep all of that close together).

I'm happy with option 2 as an intermediate for now @msweetlove, so we can do a first push. Based on how the discussions go, we can always redo the export to GBIF/OBIS.

In any case, with any approach, for me the data on OBIS and GBIF would be a lead into discovering more detailed information, which can be at Movebank, the AADC, or another online repository. For instance, for the Herring Gull data @peterdesmet used Zenodo.

ianjonsen commented 4 years ago

@Antonarctica yes, that makes sense - I knew I was wandering off into things beyond the primary observations

peggynewman commented 4 years ago

What about using some of the location class terms for the Argos location qualities? For example,

georeferenceProtocol and georeferenceVerificationStatus?

The latter recommends use of a controlled vocabulary, which the Argos location quality essentially is.

jdpye commented 4 years ago

The Argos LQs look to fit very neatly in those columns. We could set a good example with those.

Antonarctica commented 4 years ago

@peggynewman @jdpye @msweetlove Seems they would be a good fit, so georeferenceProtocol would be 'ArgosLocations'? And the controlled vocabulary: http://www.argos-system.org/manual/3-location/34_location_classes.htm

ianjonsen commented 4 years ago

Sounds like a great solution

peggynewman commented 4 years ago

Yes, something like that, although a sanity check from @peterdesmet would be appreciated. E.g. georeferenceProtocol is "Argos Location Class" plus a link to the 'vocab', and the values (0,1,2,3,A,B,Z) go in georeferenceVerificationStatus.

Movebank have added Argos terms to their vocabulary in NERC, but it only refers to the Argos 2011 manual and doesn't link to it. They have "Argos LC", which must be the label they use. In the absence of a proper vocabulary, a link out to the manual seems like the right thing to do.
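The mapping proposed above could be sketched like this; the exact wording of the protocol string is an assumption:

```python
# URL of the Argos location class documentation, from the thread above.
ARGOS_MANUAL_URL = (
    "http://www.argos-system.org/manual/3-location/34_location_classes.htm"
)

def georeference_fields(location_class):
    """Map an Argos location class onto the two Darwin Core
    georeference terms discussed above (protocol wording is a sketch)."""
    return {
        "georeferenceProtocol": f"Argos Location Class ({ARGOS_MANUAL_URL})",
        "georeferenceVerificationStatus": str(location_class),
    }
```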

albenson-usgs commented 4 years ago

Just throwing a comment here to see what still needs to happen :-)

Antonarctica commented 4 years ago

I hope nothing... it is public now: https://ipt.biodiversity.aq/resource?r=scar_raatd_trackingdata after a long time calculating on @msweetlove's computer and finding some small errors. I guess we'll register it next week....

jdpye commented 4 years ago

Yeah, the last open ticket we had about eventDates looks to be fixed up in that DwC-A, so I think this is good to go! Is this PR's branch up to date?

wardappeltans commented 4 years ago

And the RAATD dataset is also published in OBIS https://obis.org/dataset/48cb8624-a221-47ed-9a6d-b99b0bb394e0

jdpye commented 4 years ago

Looks like Mirounga leonina still needs a scientificNameID, but there aren't too many more i's to dot and t's to cross once we have the latest scriptlet and data example in the msweetlove:master branch.

msweetlove commented 4 years ago

@jdpye all scientificNameIDs were collected from WoRMS in an automated loop. If the field is blank for a species, it means it had no exact match with the WoRMS database or there were multiple matches that could not be resolved automatically. I'll clean up the R-script and put it online today.
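The disambiguation logic described here might look like the following sketch (the record field names "status" and "lsid" follow WoRMS conventions, but the helper itself is hypothetical, not the actual script):

```python
def resolve_scientific_name_id(candidates):
    """Given candidate taxon records from a name search, return the LSID
    only when exactly one record has status "accepted"; otherwise return
    None so the scientificNameID is left blank for manual review."""
    accepted = [c for c in candidates if c.get("status") == "accepted"]
    if len(accepted) == 1:
        return accepted[0].get("lsid")
    return None  # no match, or ambiguous: resolve by hand
```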

msweetlove commented 4 years ago

the R-script for formatting the RAATD data is available here

msweetlove commented 4 years ago

@jdpye I updated the occurrence file to add the scientificNameID of Mirounga leonina. The reason it was left blank was due to multiple matches that could not be resolved automatically.

jdpye commented 4 years ago

Thanks Max! I suspected it was something like that. I've sometimes had to parse the AcceptedStatus of the results to arrive at the one that's approved for my species. Other times there are still ambiguities and I have to do as you did. I'll review this now!

jdpye commented 4 years ago

Is the updated file and workflow in the msweetlove fork's master branch?

msweetlove commented 4 years ago

The updated occurrences file can be found here: https://ipt.biodiversity.aq/resource?r=scar_raatd_trackingdata. To do this step I used just two trivial lines of code, so I didn't update the script for that. It goes like this (with occurrences = the occurrence file):

condition <- occurrences$scientificName == "Mirounga leonina"
occurrences[condition, ]$scientificNameID <- "urn:lsid:marinespecies.org:taxname:231413"