iobis / gbif-marine


obis-env-datasets #1

Open wardappeltans opened 8 years ago

wardappeltans commented 8 years ago

Event Core - Occurrence extension - eMoF is currently only available in test-mode IPTs. This means that when OBIS nodes move their IPT to production mode (for harvesting by GBIF) they will not be able to use the eMoF.

eMoF (OBIS extended Measurement or Fact extension) http://tools.gbif.org/dwca-validator/extension.do?id=http://rs.obis.org/obis/terms/ExtendedMeasurementOrFact

NicBailly commented 8 years ago

After our discussion, we can ask GBIF where they stand on implementing that, most probably on the basis of experiments done in EU BON.

kbraak commented 8 years ago

It would be great to move OBIS' Extended Measurement or Facts Extension into production as soon as possible. Before we do that, however, I highly recommend that OBIS create a new version of this extension that addresses the list of issues and recommendations below. Documentation on how to create a new version of an extension can be found here.

Recently I investigated this sample event dataset hosted by MedOBIS. Version 1.1 of this dataset is missing data for fields fundamental to describing the sample event, namely sampleSizeValue and sampleSizeUnit. It would be great if OBIS can recommend a set of required and recommended DwC terms to its publishers and if its recommendations are in line with GBIF's. So that you are aware, the list of DwC terms that GBIF requires/recommends for occurrence data can be found here, and the list of DwC terms that GBIF requires/recommends for sample event data can be found here. Thanks.


kbraak commented 8 years ago

Another problem I spotted with the aforementioned dataset is the use of simple integers for the record-level identifiers eventID and occurrenceID. Therefore I'd also like to propose that OBIS publishers adopt a formula for creating record-level identifiers, which will ensure that they are near globally unique and remain stable over time. More information about GBIF's promotion of the occurrenceID as the unique identifier for occurrence records can be found in this old IPT blog post. Thanks.
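To illustrate the kind of formula meant here, a minimal Python sketch follows. The pattern `dataset_key:record_type:local_id` and the key `medobis-example` are assumptions made up for illustration; they are not a format prescribed by GBIF or OBIS.

```python
def make_record_id(dataset_key: str, record_type: str, local_id) -> str:
    """Namespace a local (e.g. integer) key so it stays unique outside the dataset.

    The colon-separated pattern is a hypothetical convention, not an official one.
    """
    if record_type not in {"event", "occurrence"}:
        raise ValueError(f"unsupported record type: {record_type!r}")
    local = str(local_id).strip()
    if not local:
        raise ValueError("local_id must not be empty")
    return f"{dataset_key}:{record_type}:{local}"


# The same local integer no longer collides between events and occurrences.
event_id = make_record_id("medobis-example", "event", 1)
occurrence_id = make_record_id("medobis-example", "occurrence", 1)
print(event_id)
print(occurrence_id)
```

The identifier only stays stable if the dataset key and the local key never change between versions of the dataset.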

NicBailly commented 8 years ago

Ok, we have such things, we need to change the mapping. I think it is our mistake there, but I will for OBIS in general. BW Nicolas.

kbraak commented 8 years ago

That sounds great Nicolas.

I'd also like to propose that OBIS publishers indicate at record level that WoRMS is the source where their names are defined (whenever appropriate). There is an attempt to do this in the aforementioned dataset by filling in scientificNameID, but it is more appropriate to use the DwC terms taxonID and nameAccordingTo instead.

For example, to indicate that the scientific name Heteromastus filiformis (Claparède, 1864) is according to WoRMS, you could fill in these terms as follows:

taxonID="urn:lsid:marinespecies.org:taxname:129884" nameAccordingTo="WoRMS, 2016"

The LSID above is the globally unique identifier WoRMS assigned to this taxon.
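As a rough sketch, the two terms could be filled programmatically like this. It assumes the trailing number of the LSID is the WoRMS record number for the taxon; the function name is hypothetical.

```python
def worms_taxon_terms(worms_record_number: int, year: int) -> dict:
    """Return the two DwC terms for a name taken from WoRMS.

    Assumes the LSID is the fixed prefix below plus the WoRMS record number.
    """
    return {
        "taxonID": f"urn:lsid:marinespecies.org:taxname:{worms_record_number}",
        "nameAccordingTo": f"WoRMS, {year}",
    }


# Heteromastus filiformis (Claparède, 1864), as in the example above.
terms = worms_taxon_terms(129884, 2016)
print(terms["taxonID"])
print(terms["nameAccordingTo"])
```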

wardappeltans commented 8 years ago

Dear Kyle, This is an OBIS convention and changing this requires consultation and approval by the OBIS Steering Group. See https://github.com/iobis/training/wiki/Darwin%20Core#taxonomy

Daphnisd commented 8 years ago

@kbraak

As part of an IODE project, OBIS-ENV-DATA, we are working on guidelines for the OBIS community on how to use DwC-A for combined (biotic and abiotic) datasets and are working on a scientific paper on this.

Placing the eMoF extension into production is too early at this point. We are still in a testing phase, where we are creating resources in the proposed OBIS-ENV-DATA format, which we developed during a workshop in October 2015. We have some pilot datasets available at http://ipt.vliz.be/obis-env/ (some may still need some tweaking here and there but the overall format is ok). We are still discussing whether an additional parameter called measurementQuality would be needed. But I understand from your documentation this would not be a problem, as it only requires a new version to be created.

We understand that you are proposing to make the following data fields mandatory: sampleSize, samplingProtocol and countryCode.

You mention that sampleSizeValue and sampleSizeUnit are fundamental to describing the sample event. Is the purpose of these fields the correct interpretation of the values filled out under organismQuantity and/or individualCount? How would GBIF handle a sample which was analyzed only for environmental characteristics? In oceanographic data you may sample using a trawl for 2 km and then take a sediment sample at the beginning or end of the trawl. Perhaps this sediment sample is only analyzed for abiotic measurements (as is the case for the dataset http://ipt.iobis.org/obis-env/resource?r=north_sea_hypbent_com). This sample is taken to be analyzed together with the trawl; however, the specific coordinates may be very relevant, as sediment conditions can be completely different a few meters away. On top of this, most physical readings in oceanography are meaningless if you do not also record the depth at which the measurement was made. As depths are stored in the Event Core, we need to create a separate event record to be able to record an abiotic measurement (e.g. temperature) at a given depth. Now, for an abiotic sample and a temperature reading the sampleSize is either irrelevant or not applicable.

OBIS-ENV-DATA is investigating a practice of storing all sampling descriptors as part of the eMoF (which would make the fields sampleSizeValue and sampleSizeUnit, samplingProtocol and samplingEffort obsolete) for the following reasons:

  1. It’s sometimes difficult to distinguish which parameter is the sample size. When you go trawling, you usually measure the distance trawled, the duration spent trawling, the vessel speed, the current direction and current speed, and the volume of water passing through the net. Additionally, the width and type of the trawl used are relevant to determine the catch. In some cases, the sample is subsampled (e.g. in nematode studies scientists may only identify the first 100 specimens and derive an abundance from the proportional abundance to the total, so you would need an additional descriptor providing the percentage of specimens analyzed). You can add such descriptors in the samplingEffort field; however, using this field would make it impossible for OBIS to automatically calculate the actual abundances.
  2. Other descriptors which are extremely relevant like mesh size of the sampling net and the mesh of the filter or sieve are impossible to capture semantically in the event core. This could possibly be included in the field samplingProtocol together with the instrument type and any other relevant descriptors mentioned above. The problem with this is that it is very difficult for an integrated database like OBIS to build a functionality using this information.
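A small Python sketch of why separate eMoF rows help: each descriptor becomes a machine-readable row, so an abundance can be scaled automatically. All values, the eventID and the measurementType labels are invented for illustration, and the helper names are hypothetical.

```python
# Sampling descriptors stored as eMoF-style rows (illustrative values only).
emof_rows = [
    {"eventID": "trawl-1", "measurementType": "distance trawled",
     "measurementValue": "2", "measurementUnit": "km"},
    {"eventID": "trawl-1", "measurementType": "net mesh size",
     "measurementValue": "0.5", "measurementUnit": "mm"},
    {"eventID": "trawl-1", "measurementType": "fraction of sample analysed",
     "measurementValue": "0.25", "measurementUnit": ""},
]


def descriptor(rows, event_id: str, measurement_type: str) -> float:
    """Look up one numeric descriptor for an event; fail loudly if it is missing."""
    for row in rows:
        if row["eventID"] == event_id and row["measurementType"] == measurement_type:
            return float(row["measurementValue"])
    raise KeyError(f"{measurement_type!r} not recorded for event {event_id!r}")


def estimated_total(count_in_subsample: int, rows, event_id: str) -> float:
    """Scale a subsample count up to the whole sample."""
    fraction = descriptor(rows, event_id, "fraction of sample analysed")
    return count_in_subsample / fraction


total = estimated_total(100, emof_rows, "trawl-1")
print(total)  # 400.0
```

With the same information packed into a free-text samplingEffort string, this calculation would first require parsing that string.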

Using the OBIS MoF we have identified the need for measurementTypeID, measurementValueID and measurementUnitID, which refer to an identifier for each of the respective fields. We would prefer the use of an id that links to an external, controlled, vocabulary instead of forcing an internal standardized vocabulary for the following reasons:

  1. Forcing a single vocabulary on these verbatim fields would mean we cannot keep the original terms used by the person who provided the data.
  2. Adding a reference ID makes it possible to precisely identify a parameter and give the user a specific definition.
  3. OBIS needs the flexibility to use different reference authorities that are relevant to ocean science and data.
  4. A reference ID also makes it possible to reference different authorities for parameters related to different disciplines. As there is a vast number of oceanographic parameters (BODC alone holds over 30,000: http://vocab.nerc.ac.uk/collection/P01/current), it is simply impractical to try to contain them in a single vocabulary.

I understand the confusion between measurementID and measurementValueID, as measurementID is missing from the slide in the presentation and the definition of measurementValueID alone may be confusing. However, the two are very different. The parameter measurementValueID provides more information about a specific fact stored in the column measurementValue. For example, if you store the fact “Van Veen grab” as a measurementValue, you could add the identifier http://vocab.nerc.ac.uk/collection/L22/current/TOOL0653/ as a measurementValueID, which provides more information about this instrument.
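A minimal sketch of such a row in Python, keeping the verbatim value next to its identifier. The class name, the eventID and the measurementType label are hypothetical, and measurementTypeID is deliberately left empty rather than guessing a vocabulary entry.

```python
from dataclasses import dataclass


@dataclass
class EmofRow:
    eventID: str
    measurementType: str        # verbatim term from the data provider
    measurementValue: str       # verbatim value from the data provider
    measurementTypeID: str = ""   # link to an external controlled vocabulary
    measurementValueID: str = ""  # link describing the value itself
    measurementUnit: str = ""
    measurementUnitID: str = ""


row = EmofRow(
    eventID="grab-1",                      # hypothetical event
    measurementType="sampling instrument",  # hypothetical label
    measurementValue="Van Veen grab",
    measurementValueID="http://vocab.nerc.ac.uk/collection/L22/current/TOOL0653/",
)


def has_controlled_value(r: EmofRow) -> bool:
    """True when the verbatim value is backed by an external identifier."""
    return r.measurementValueID.startswith("http")


print(row.measurementValue, has_controlled_value(row))
```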

Additionally, we noticed that it is proposed to make countryCode mandatory. The definition says: “The standard code for the country in which the Location occurs”. I guess this may make sense for terrestrial data, but it makes much less sense for marine data, as the vast majority of the ocean is not under the jurisdiction of any country. The Wikipedia page https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 lists a user-assigned code (XZ) for international waters, but it may not be an ISO 3166 code sensu stricto?

We're also a bit uncertain about the use of hierarchies. Some earlier documentation we found suggested the use of “nested” samples (http://www.gbif.org/sites/default/files/gbif_IPT-sample-data-primer_en.pdf). We’re now uncertain whether this practice is still supported in current guidelines. The example you provide at https://github.com/gbif/ipt/wiki/sampleEventData#exemplar-datasets, which describes different sections along a transect, seems an excellent candidate for a nested sample (as I would assume a scientist could be interested in the abundance of a butterfly species along the entire track as well as in each section)? However, it seems to us that parentEventID is used there as an identifier for the track? This is again very different from how we envision it. Based on the nested samples example, we would use parentEventID (1) to combine biotic and abiotic samples taken at the same event location and to be analyzed together (as discussed earlier, we sometimes need separate event records for abiotic samples), and (2) to define subevents of the deployment of a single sensor, such as the placement of a satellite tracker, the deployment of a CTD or a VPR, or the multitude of sensor readings when a sensor is attached to a sampling gear (e.g. a plankton net). The presentation you referred to provides some examples of that (for telemetry data see http://ipt.iobis.org/obis-env/resource?r=imosrealtimectd and http://ipt.iobis.org/obis-env/resource?r=gulltracking). For the butterfly exemplar dataset, using the event hierarchy as we envision it, you can calculate the abundance of the butterfly species per track with one single query. I attached an Excel file with a few records in each format to illustrate this point: Israeli Butterfly Monitoring Scheme (BMS-IL).xlsx.
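The per-track aggregation can be sketched in Python as follows. The records, identifiers and species labels are invented and do not come from the attached file; the function names are hypothetical.

```python
from collections import defaultdict

# Event Core: sections point to their track through parentEventID.
events = [
    {"eventID": "track-1", "parentEventID": None},
    {"eventID": "track-1:section-1", "parentEventID": "track-1"},
    {"eventID": "track-1:section-2", "parentEventID": "track-1"},
]

# Occurrence extension: counts are recorded at section level.
occurrences = [
    {"eventID": "track-1:section-1", "scientificName": "Species A", "individualCount": 3},
    {"eventID": "track-1:section-2", "scientificName": "Species A", "individualCount": 5},
    {"eventID": "track-1:section-2", "scientificName": "Species B", "individualCount": 1},
]


def root_event(event_id: str, parent_of: dict) -> str:
    """Walk up parentEventID links until the top-level event is reached."""
    seen = set()
    while parent_of.get(event_id) is not None:
        if event_id in seen:
            raise ValueError(f"cycle in event hierarchy at {event_id!r}")
        seen.add(event_id)
        event_id = parent_of[event_id]
    return event_id


def abundance_per_track(events, occurrences) -> dict:
    """Sum individualCount per (top-level event, scientificName)."""
    parent_of = {e["eventID"]: e["parentEventID"] for e in events}
    totals = defaultdict(int)
    for occ in occurrences:
        track = root_event(occ["eventID"], parent_of)
        totals[(track, occ["scientificName"])] += occ["individualCount"]
    return dict(totals)


totals = abundance_per_track(events, occurrences)
print(totals)
```

Because the section records stay in place, abundance per section remains available from the same data by grouping on eventID directly.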

When you use parentEventID in this manner (and create an event hierarchy), you could opt not to fill out the eventDate at the underlying records as this would reduce data duplication. For the butterfly exemplar dataset, it may be semantically more correct to use the eventTime only at the parent event record as (I understand) this interval refers to the entire track and not to the sections. On the other hand we understand that it would be prudent to duplicate this information at record level in order to prevent mistakes.

We look forward to hearing your opinion on how we could align both approaches and move forward.

dagendresen commented 8 years ago

Perhaps adding resourceID instead of adding occurrenceID? The resourceID could include both events and occurrences etc. The resources would need to have globally unique identifiers or at least an identifier unique to the Darwin Core archive across both events and occurrences.

wardappeltans commented 8 years ago

Dear Dag, why would you prefer this over occurrenceID? I do not see an immediate benefit. People will need to repeat eventID under resourceID and there is a risk that eventID and occurrenceID are not unique.

dagendresen commented 8 years ago

This came up in the European GBIF Nodes meeting in the context of sampling data, the new Event core, and datasets with measurement data for both events (plots) and occurrences. Nabil showed us the OBIS draft version of the MeasurementOrFact extension. Adding occurrenceID to the extension will solve only the issue of including measurement data on occurrences in an event core Darwin Core archive. However, one might have measurements or facts for other classes (taxonID, locationID, etc.). Adding a resourceID instead of the occurrenceID would thus be a more general solution. However, as you mention, the identifiers need to be unique at least within the dataset. When identifiers are not unique, events and occurrences would perhaps need to be published as separate Darwin Core archive datasets anyway.
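A quick way to test that precondition is to look for identifiers shared between the two record types. This is a minimal Python sketch with made-up identifiers; the function name is hypothetical.

```python
def shared_identifiers(event_ids, occurrence_ids) -> list:
    """Return identifiers used by both events and occurrences.

    Any value returned here would be ambiguous as a resourceID.
    """
    return sorted(set(event_ids) & set(occurrence_ids))


# Simple integers, as in the dataset discussed earlier, clash easily.
clashes = shared_identifiers(["1", "2", "3"], ["2", "3", "4"])
print(clashes)  # ['2', '3']
```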

Another thought in development would be the possible direction towards MeasurementOrFact as a new core type by itself. Mandating globally unique resource identifiers for the subject resource...

wardappeltans commented 8 years ago

Alright. Now I understand, and indeed it could make sense. Glad this gets wider attention/discussion. Thanks for sharing your thoughts!