iobis / obis-issues

Repository for all OBIS related issues and feature requests

between-field consistency checks #214

Open ymgan opened 1 month ago

ymgan commented 1 month ago

Hi, I was asked a question today: is there a way for OBIS nodes to spot between-field inconsistencies (e.g. to check that scientificName is consistent with scientificNameID) prior to data publication?

Example record: https://www.gbif.org/occurrence/4410956370 Either the:

@sformel-usgs For this specific example, it sounds like it would be easy enough to check with the worrms package. In general, I think these types of checks would need to be custom functions, which could potentially be added to obistools if that seems appropriate.

The function I mentioned above would look something like this:

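library(dplyr)  # provides tibble() and %>%

# Example record: a name paired with its WoRMS LSID
d <- tibble(scientificName = 'Pempheris schomburgki',
            scientificNameID = 'urn:lsid:marinespecies.org:taxname:277068')

# Extract the trailing AphiaID from the LSID (AphiaIDs are not always six digits long)
aphiaID <- d$scientificNameID %>% stringr::str_extract(pattern = '\\d+$') %>% as.integer()

# TRUE when scientificName matches the name WoRMS returns for that AphiaID
d$scientificName == worrms::wm_id2name(id = aphiaID)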
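For more than one record, the same comparison can be applied row by row. The following is only a minimal sketch, not an existing obistools or worrms function: `extract_aphia_id()`, `check_name_id()` and the `lookup` argument are hypothetical names. The lookup is passed in as a function so that `worrms::wm_id2name` (which queries the WoRMS web service for one AphiaID at a time) can be replaced by an offline stand-in. The names and IDs in the toy table are made up and are not real WoRMS content.

```r
# Hypothetical helper: pull the AphiaID out of a WoRMS LSID, NA otherwise.
extract_aphia_id <- function(lsid) {
  is_lsid <- grepl("^urn:lsid:marinespecies\\.org:taxname:[0-9]+$", lsid)
  id <- rep(NA_integer_, length(lsid))
  id[is_lsid] <- as.integer(sub("^.*:", "", lsid[is_lsid]))
  id
}

# Hypothetical helper: compare scientificName with the name that `lookup`
# returns for the AphiaID in scientificNameID.
# `lookup` is any function mapping one AphiaID to one name.
check_name_id <- function(d, lookup) {
  aphia_id <- extract_aphia_id(d$scientificNameID)
  expected <- vapply(aphia_id, function(id) {
    if (is.na(id)) return(NA_character_)
    name <- tryCatch(as.character(lookup(id)),
                     error = function(e) NA_character_)
    # Anything other than a single name is treated as unresolved
    if (length(name) == 1) name else NA_character_
  }, character(1))
  data.frame(
    scientificName = d$scientificName,
    scientificNameID = d$scientificNameID,
    expectedName = expected,
    # Missing or unresolved values count as not consistent
    consistent = (d$scientificName == expected) %in% TRUE,
    stringsAsFactors = FALSE
  )
}

# Offline stand-in for the WoRMS lookup (made-up names and IDs)
toy_names <- c("101" = "Aus bus", "102" = "Cus dus")
toy_lookup <- function(id) toy_names[[as.character(id)]]

d <- data.frame(
  scientificName = c("Aus bus", "Aus bus", "Eus fus"),
  scientificNameID = c("urn:lsid:marinespecies.org:taxname:101",
                       "urn:lsid:marinespecies.org:taxname:102",
                       "not-an-lsid"),
  stringsAsFactors = FALSE
)
result <- check_name_id(d, toy_lookup)
print(result)
```

With network access, `check_name_id(d, worrms::wm_id2name)` should run the same comparison against WoRMS. Exact string equality is a deliberate simplification: it would also flag records where the originator's spelling differs from the WoRMS name, so the output is probably best treated as a list of records to review rather than a list of errors.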

@ymgan Thank you so much, Steve! Yes, I think the question was about general practice at the OBIS nodes, and my interpretation is that the tools recommended in the OBIS manual don’t seem to perform this kind of check.

Based on my understanding, none of obistools, pyobistools or EMODnet BioCheck performs this kind of check. It seems to me that more focus was placed on whether the field is populated, or whether the ID resolves, than on whether scientificNameID and scientificName correspond to each other. On the other hand, the GBIF Data Validator returned inconsistent results when I tested it: “Taxon match none” for this record instead of “Scientific name and ID inconsistent” …

I guess the conclusion is that this check is not widely supported by the existing tools, so it requires a custom script written for this purpose.

JoBeja commented 1 month ago

Hi, as standard practice we don't correct the information in the scientificName field; it remains as the originator provided it. I would say the relevant field is scientificNameID, as that is the one that contains the standardised vocabulary term describing what the occurrence should be. Why would you need to check for this type of inconsistency? Indeed, the LifeWatch/EMODnet QC tool does not perform this check... I would say this is similar to the way we treat eMoFs: we keep the originator's parameter name, and the mapping to the vocabulary is done via the TypeID field, which is the standardised one.

rubenpp7 commented 1 month ago

Hi everyone,

Some time ago I actually added a check that looks at "broken one-to-one relationships" to the EMODnetBiocheck R package (see https://github.com/EMODnet/EMODnetBiocheck/blob/master/R/one_to_one_check.R ).

It does not check that the name belonging to the selected scientificNameID corresponds to the name in scientificName, but it does check that the two are at least paired consistently throughout the dataset.
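To illustrate what a "broken one-to-one relationship" means, here is a rough base R sketch. It is not the EMODnetBiocheck implementation (see the linked one_to_one_check.R for that): `find_broken_pairs()` is a hypothetical helper, and the names and IDs in the example are made up.

```r
# Hypothetical helper, not the EMODnetBiocheck code.
# Returns the distinct (x, y) combinations in which a value of x is paired
# with more than one value of y, or a value of y with more than one x.
# Missing values are treated like any other value.
find_broken_pairs <- function(d, x, y) {
  pairs <- unique(d[, c(x, y)])
  # A value is ambiguous if it appears in more than one distinct pair
  x_dup <- pairs[[x]] %in% pairs[[x]][duplicated(pairs[[x]])]
  y_dup <- pairs[[y]] %in% pairs[[y]][duplicated(pairs[[y]])]
  pairs[x_dup | y_dup, , drop = FALSE]
}

# Made-up example: "Cus dus" is paired with two different IDs
d <- data.frame(
  scientificName = c("Aus bus", "Aus bus", "Cus dus", "Cus dus", "Eus fus"),
  scientificNameID = c("id:1", "id:1", "id:2", "id:3", "id:4"),
  stringsAsFactors = FALSE
)
broken <- find_broken_pairs(d, "scientificName", "scientificNameID")
print(broken)
```

Because the column names are passed as arguments, the same sketch can be pointed at any other pair of columns in which a free-text field is expected to map to exactly one identifier.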

@ymgan I like your idea, and it would not be too hard to integrate into the Biocheck tool. I'll add it to my list.

rubenpp7 commented 1 month ago

For the moment, the one_to_one_check() function in the Biocheck tool looks at the following pairs of fields:

| field x          | field y            |
|------------------|--------------------|
| scientificName   | scientificNameID   |
| measurementType  | measurementTypeID  |
| measurementValue | measurementValueID |
| measurementUnit  | measurementUnitID  |

Let me know if more field pairs should be considered.