trias-project / occ-cube-alien

🗺 Occurrence cubes for non-native taxa in Belgium and Europe
MIT License

Additional checks for vague date ranges required? #23

Open sacrevert opened 4 years ago

sacrevert commented 4 years ago

Early records are less likely to be resolved to single years. For example, the first exemplar row here (https://zenodo.org/record/3635510#.Xj1LLWj7SHt), `1700 | 1kmE3802N3133 | 2287615 | 1 | 301`, apparently derives from the GBIF record here (https://www.gbif.org/occurrence/477065724), but this seems to misrepresent the original record (https://mczbase.mcz.harvard.edu/guid/MCZ:Mala:152567), which gives a collecting date of 1700-2009 (i.e. presumably unknown or not digitised?).

qgroom commented 4 years ago

I think there might be an issue already raised with GBIF related to this. Last time I checked they couldn't handle date ranges in the eventDate field.

sacrevert commented 4 years ago

The GBIF guidance suggests otherwise, unless you mean that there is currently a bug report open. https://www.gbif.org/data-quality-requirements-occurrences#dcEventDate

Couldn't your process do some error checking to compare the interpreted and original event dates? For example, in the case above, there is clearly an error in the interpretation of the originally supplied date. This seems like a fairly important issue for modelling trends in IAS, which presumably the aggregated dataset is going to be used for. The checking could be conditional on the record date being assigned to the 1st of January of any given year, as in the sketch below.
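As an illustration of such a conditional check, here is a minimal sketch in R (toy data; the data frame and column names are assumptions, not the project's actual pipeline):

```r
# Flag records whose GBIF-interpreted date falls on 1 January, since a
# truncated date range typically surfaces with exactly that pattern.
library(dplyr)

occ <- data.frame(
  gbifID = c(477065724, 123456789),
  year   = c(1700, 2015),
  month  = c(1, 6),
  day    = c(1, 3)
)

occ_flagged <- occ %>%
  mutate(possible_truncated_range = !is.na(month) & !is.na(day) &
                                    month == 1 & day == 1)
```

Records carrying the flag could then be checked manually or against the verbatim values.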

qgroom commented 4 years ago

We could, but I have a suspicion that the original data might not be available in rgbif.

qgroom commented 4 years ago

Closed in error

sacrevert commented 4 years ago

The simplest approach would be just to manually check any record resolved to a day where that day was Julian day 0 (i.e. 1 January). This would at least exclude the most egregious errors. Over the past 6 years at BRC I have never seen an automated pipeline that didn't benefit from some manual checks or intervention.

qgroom commented 4 years ago

I agree about manual checks, but we do need to keep this to a minimum for what we envision. In the case of Belgian data we are also the publishers of most of the data, so some problems can and should be fixed in the publication pipelines too.

sacrevert commented 4 years ago

Looks like the original data are available through the RESTful API at least: http://api.gbif.org/v1/occurrence/477065724/verbatim
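For a single record, the verbatim values could be pulled like this (a sketch only; the assumption that verbatim Darwin Core fields are keyed by their full term URIs should be checked against the live response):

```r
# Query the verbatim endpoint for the occurrence discussed in this thread
# and inspect the publisher-supplied eventDate.
library(jsonlite)

verbatim <- fromJSON("http://api.gbif.org/v1/occurrence/477065724/verbatim")

# Assumed key format: verbatim Darwin Core terms keyed by their full URI.
verbatim[["http://rs.tdwg.org/dwc/terms/eventDate"]]
```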

damianooldoni commented 4 years ago

Thanks @sacrevert for your observation. Screening observations by querying the verbatim API endpoint is practically impossible when processing millions of occurrences, as it would mean millions of queries.

About the parsing of eventDate: the link you sent (dcEventDate) mentions the following:

For the levels of information that are unknown, avoid padding and instead end the value, to limit ambiguity of interpretation. If, for example, only year and month are known, represent this as 2016-04, not as 2016-04-01.

The eventDate "1700-01-01/2009-02-04" is correct according to the ISO standard. There are still parsing issues on the GBIF side.

Once the GBIF issue is solved, we can think about assigning the occurrence randomly to a specific year and adding a column min_date_uncertainty, something very similar to the processing of the spatial information.
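A possible shape for that step, as a rough sketch (the year bounds come from the example record above; expressing the uncertainty in years is an assumption, not a decided format):

```r
# Pick a random year within the supplied range and keep the range width
# as an explicit uncertainty column, by analogy with how coordinate
# uncertainty is handled for the spatial dimension of the cube.
set.seed(2020)

year_min <- 1700
year_max <- 2009

year_assigned <- sample(seq(year_min, year_max), size = 1)
min_date_uncertainty <- year_max - year_min  # width of the range, in years
```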

peterdesmet commented 4 years ago

Last time I checked they couldn't handle date ranges in the eventDate field. ... there is clearly an error in the interpretation of the originally supplied date.

GBIF does now "handle" date ranges, by taking the first date of the range (see https://github.com/gbif/portal-feedback/issues/652#issuecomment-343407595). That is already an improvement over ignoring the date altogether, which was the case before.

sacrevert commented 4 years ago

It's up to you guys really, I was just pointing out that early dates resolved to single years are often wrong, and this was obvious within about 10 seconds of looking at your "occurrence cube". My personal opinion is that extremely vague dates should not be arbitrarily assigned to single years, particularly if one is ultimately going to be producing trends for policy or broader ecological use or interpretation. Either the records should be ignored, or presented with the full known range, so that later they can either be excluded or be known to fall within a particular date range for modelling.

I suppose randomly assigning a year is one potential solution, although I would personally choose to exclude such data points, as they don't add any information and are liable to be misinterpreted by any uninitiated users downstream. This would assume that the dates were missing completely at random (in the statistical jargon), which is also unlikely, as the missingness is probably correlated with the true date of collection.
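If exclusion is the preferred route, a filter along these lines could work (a sketch with toy data; `verbatim_eventDate` is a hypothetical column holding the publisher-supplied value):

```r
# Drop occurrences whose supplied eventDate is an ISO range (contains "/")
# before building the cube.
library(dplyr)
library(stringr)

occ <- data.frame(
  gbifID = c(477065724, 123456789),
  verbatim_eventDate = c("1700-01-01/2009-02-04", "2015-06-03")
)

occ_single_date <- occ %>%
  filter(!str_detect(verbatim_eventDate, "/"))
```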

peterdesmet commented 4 years ago

@sacrevert completely agree. I just suggested to GBIF to flag such records, so we can exclude them in the future: https://github.com/gbif/gbif-api/issues/4#issuecomment-584541663

damianooldoni commented 4 years ago

Thanks @sacrevert for your comment. If these data are not that useful, adding a temporal uncertainty column just makes the data processing more computationally demanding with no benefit for the researcher, and makes the output larger and less readable. As soon as we find a way to filter out these data quickly, I will be more than happy to make a new version of the occurrence cubes. I hope the flagging solution proposed by @peterdesmet will be accepted very soon.