timadriaens closed this issue 1 year ago
This is of the utmost importance! For the Walloon data, it has been the biggest part of Max's work. He made a big effort to convince the experts to validate the datasets before publication. We also chose to validate the data from some experts ourselves, for some taxonomic groups for which they had very good expertise. That's part of the reason why it took so long. It seems that Natagora followed the same process as Natuurpunt. I confirm what Tim just said above: only validated data can be used to run the indicators, to identify emerging species, and to run the models for risk mapping. I know that we potentially 'lose' a lot of data, but here quality MUST take precedence over quantity! I also include @amyjsdavis and @DiederikStrubbe here, as this discussion is relevant for them too.
The record in question is this one: https://www.gbif.org/occurrence/2631775528 (Natuurpunt:Waarnemingen:190863847).
Both Natagora and Natuurpunt have the field `identificationVerificationStatus` and both (see e.g. https://www.gbif.org/occurrence/2270408500) are publishing unverified records to GBIF (which is fine).
Since other datasets do not have this field, the only option I see is removing records that are explicitly marked as unverified, i.e.:
identificationVerificationStatus = "unverified"
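A minimal sketch of that rule, assuming occurrences are held as plain Python dictionaries keyed by Darwin Core term names. The helper name `drop_unverified` and the sample `gbifID` values are hypothetical. Only records explicitly marked as unverified are removed; records with an empty or missing status are kept.

```python
# Hypothetical helper: drop only records explicitly marked "unverified".
def drop_unverified(records, field="identificationVerificationStatus"):
    """Return the records whose status is not exactly 'unverified'.

    Records with a missing or empty status are kept, because many
    datasets do not publish this field at all.
    """
    return [r for r in records if (r.get(field) or "") != "unverified"]


# Illustrative records only, not real GBIF data.
occurrences = [
    {"gbifID": 1, "identificationVerificationStatus": "unverified"},
    {"gbifID": 2, "identificationVerificationStatus": "approved on expert judgement"},
    {"gbifID": 3, "identificationVerificationStatus": ""},
    {"gbifID": 4},  # dataset that does not have the field
]
kept = drop_unverified(occurrences)
print([r["gbifID"] for r in kept])
```

With this rule, a dataset that never fills the field loses no records, which is the trade-off discussed in this thread.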
Let's do this properly as it could have a big effect. Can we have a quick preview here of the number of records per `identificationVerificationStatus` and per `classis`, for instance? It might be that there are other relevant categories (and the verification types were adapted along the way after discussions with admins). There is also a category "pending", for example.
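Such a preview could be sketched as below, assuming the download has been read into a list of dictionaries. The function name and the sample records are hypothetical, and the `class` key stands in for whatever column holds the taxonomic class.

```python
from collections import Counter


def count_by_status_and_class(records):
    """Count records per (identificationVerificationStatus, class) pair,
    sorted by descending number of records."""
    counts = Counter(
        (r.get("identificationVerificationStatus", ""), r.get("class", ""))
        for r in records
    )
    return counts.most_common()


# Illustrative records only, not real GBIF data.
records = [
    {"identificationVerificationStatus": "unverified", "class": "Aves"},
    {"identificationVerificationStatus": "unverified", "class": "Aves"},
    {"identificationVerificationStatus": "pending", "class": "Insecta"},
    {"identificationVerificationStatus": "", "class": "Aves"},
]
for (status, klass), n in count_by_status_and_class(records):
    print(repr(status), klass, n)
```

Listing every distinct status this way would also surface categories such as "pending" before any filter is fixed.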
Currently our modelling workflow does not discriminate using identificationVerificationStatus. The many factors we already use to filter occurrence data reduce the data greatly, so I would hate to add another, especially if filling this attribute is not widely adopted. However, it seems most relevant to pay attention to the identification status when two species closely resemble each other (especially if one is alien/invasive and the other is native).
-- Dr. Amy J.S. Davis Data-driven solutions to invasive species and biodiversity conservation
Terrestrial Ecology Unit Department of Biology Ghent University K. L. Ledeganckstraat 35 B-9000 Ghent Belgium http://amyjsdavis.com
@timadriaens: as it seems important, I will not wait until I make a new cube to check this. I will try to find some time tomorrow or next week to tackle it.
The GBIF download used as the starting point for the occurrence cube published on Zenodo contains 2447 distinct values of `identificationVerificationStatus`. They are shown below in descending order of number of occurrences. As you can see, there are a lot of unverified occurrences: 6.652.040, almost 19% of the data. The filtering based on `issue` (coordinate issues) and `occurrenceStatus` (absences) removes "just" 165.653 occurrences. So, even if all of them were unverified, the number of unverified occurrences would still remain very high.
identificationVerificationStatus | n |
---|---|
"" | 15901818 |
"unverified" | 6652040 |
"approved on knowledge rules" | 6471644 |
"approved on expert judgement" | 3598234 |
"approved on photographic evidence" | 1273927 |
"verified" | 748793 |
"Validated on the basis of rules" | 60848 |
"Verified Observation" | 33048 |
"validated by PAULY A" | 27731 |
"validated by RASMONT P" | 22352 |
"approved on photographic evide" | 18643 |
"validated by LECLERCQ J" | 18533 |
"Validated without evidence (additional information provided, ...)" | 15668 |
"validated by D'Haeseleer J." | 12881 |
"validated without a document in support (expertise or additional informations)" | 10925 |
"validated by REMACLE A" | 8211 |
... | ... |
Interesting! It is not often appreciated that very common species don't need verifying, because even if the identification was wrong, there is a very good chance that that species is present within a grid cell anyway. On the other hand, for rare species the numbers of false identifications can far exceed the number of correct identifications. Therefore, you can happily accept the "unverified" records for common species, but where do you put the cut off?
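The asymmetry between common and rare species can be made concrete with a small calculation. This is only a sketch: the record counts and the 1% misidentification rate are hypothetical, and the model is deliberately simplified (one lookalike species, a single error rate).

```python
def share_correct(true_records, lookalike_records, misid_rate):
    """Fraction of the records under a species name that are correct.

    true_records: genuine records of the species itself
    lookalike_records: records of a similar species in the same area
    misid_rate: share of lookalike records wrongly reported as this species
    """
    false_records = lookalike_records * misid_rate
    return true_records / (true_records + false_records)


# Hypothetical numbers, for illustration only.
rare = share_correct(true_records=5, lookalike_records=10_000, misid_rate=0.01)
common = share_correct(true_records=10_000, lookalike_records=5, misid_rate=0.01)
print(f"rare species: {rare:.1%} of records correct")
print(f"common species: {common:.4%} of records correct")
```

Under these assumptions the rare species collects 100 false records against 5 genuine ones, while the common species is barely affected.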
This was just a relatively fast check. I will investigate further by:
Stay tuned :radio:
Ok, but I think removing records that are explicitly marked as unverified is indeed fine.
As promised, a little more insight about the 6.652.040 unverified data in our GBIF download (date of download: 28 Jan 2020) containing occurrences in BE.
Around 77% of the unverified data come from Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium. Almost all of the datasets are "Natuurpunt" related data. One comes from Wallonia: Observations.be - Non-native species occurrences in Wallonia, Belgium. There is also an INBO dataset: Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium.
title | n | datasetKey |
---|---|---|
Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium | 5137863 | e7cbb0ed-04c6-44ce-ac86-ebe49f4efb28 |
Waarnemingen.be - Plant occurrences in Flanders and the Brussels Capital Region, Belgium | 442505 | bfc6fe18-77c7-4ede-a555-9207d60d1d86 |
Waarnemingen.be - Butterfly occurrences in Flanders and the Brussels Capital Region, Belgium | 328363 | 1f968e89-ca96-4065-91a5-4858e736b5aa |
Waarnemingen.be - Non-native animal occurrences in Flanders and the Brussels Capital Region, Belgium | 281091 | 9a0b66df-7535-4f28-9f4e-5bc11b8b096c |
Waarnemingen.be - Hymenoptera occurrences in Flanders and the Brussels Capital Region, Belgium | 168301 | 71cfd412-6327-4ec7-8035-d8b2d0509ac5 |
Waarnemingen.be - Orthoptera occurrences in Flanders and the Brussels Capital Region, Belgium | 99233 | 958b1d2f-2d11-4e94-a828-c8e2d2c013ca |
Waarnemingen.be - Non-native plant occurrences in Flanders and the Brussels Capital Region, Belgium | 61194 | 7f5e4129-0717-428e-876a-464fbd5d9a47 |
Observations.be - Non-native species occurrences in Wallonia, Belgium | 44387 | 629befd5-fb45-4365-95c4-d07e72479b37 |
Waarnemingen.be - Hemiptera occurrences in Flanders and the Brussels Capital Region, Belgium | 43826 | 37e094f3-dcf2-469f-93a2-c4b9b5fa7275 |
Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium | 20478 | 7888f666-f59e-4534-8478-3a10a3bfee45 |
Waarnemingen.be - Fish occurrences in Flanders and the Brussels Capital Region, Belgium | 13963 | 8124cd73-ac84-43d2-ab39-1d80dc346525 |
Waarnemingen.be - Other insect occurrences in Flanders and the Brussels Capital Region, Belgium | 10836 | 27e9e069-2862-4183-bcec-1e1a7f74d3e7 |
The distribution of unverified occurrences at class level is shown below, ordered by `n`, the number of occurrences. An empty `class` value means occurrences of taxa that don't belong to any class.
class | kingdom | n |
---|---|---|
Aves | Animalia | 5371281 |
Insecta | Animalia | 684639 |
Magnoliopsida | Plantae | 355435 |
Liliopsida | Plantae | 136518 |
Mammalia | Animalia | 51388 |
Actinopterygii | Animalia | 17016 |
Polypodiopsida | Plantae | 9383 |
(empty) | Plantae | 4767 |
Pinopsida | Plantae | 4084 |
Reptilia | Animalia | 3711 |
Amphibia | Animalia | 2973 |
Gastropoda | Animalia | 2732 |
Bryopsida | Plantae | 2033 |
Bivalvia | Animalia | 1401 |
Malacostraca | Animalia | 1211 |
(empty) | Animalia | 1201 |
Elasmobranchii | Animalia | 626 |
Jungermanniopsida | Plantae | 530 |
Leotiomycetes | Fungi | 293 |
(empty) | incertae sedis | 173 |
Phaeophyceae | Chromista | 136 |
Maxillopoda | Animalia | 94 |
Tentaculata | Animalia | 89 |
Lycopodiopsida | Plantae | 82 |
Ascidiacea | Animalia | 58 |
Arachnida | Animalia | 47 |
Cephalaspidomorphi | Animalia | 30 |
Hydrozoa | Animalia | 20 |
Polychaeta | Animalia | 19 |
Ginkgoopsida | Plantae | 15 |
Florideophyceae | Plantae | 13 |
Demospongiae | Animalia | 10 |
Gymnolaemata | Animalia | 9 |
Chilopoda | Animalia | 6 |
Anthozoa | Animalia | 5 |
Agaricomycetes | Fungi | 4 |
Phylactolaemata | Animalia | 4 |
Cephalopoda | Animalia | 2 |
Clitellata | Animalia | 1 |
Leptocardii | Animalia | 1 |
Distribution over the years, in a plot (from 1980) and in a table where years are listed in descending order of number of occurrences, `n`. As the GBIF download was triggered on 2020-01-28, it does not yet contain 2020 data from waarnemingen.be, which are updated monthly, and so there are no unverified occurrences for 2020. There are also far fewer unverified data for 2019, due to the typical publishing delay, which is longer than 28 days. Both are expected.
year | n |
---|---|
2018 | 934689 |
2017 | 720012 |
2016 | 663146 |
2015 | 574543 |
2014 | 508016 |
2013 | 467497 |
2012 | 437100 |
2011 | 432699 |
2010 | 422923 |
2009 | 347533 |
2008 | 162985 |
2007 | 81114 |
2005 | 65306 |
2006 | 64999 |
2004 | 53171 |
2019 | 51730 |
1996 | 43995 |
2003 | 39278 |
2002 | 38855 |
1999 | 38227 |
1997 | 36425 |
2000 | 36116 |
1998 | 35821 |
1995 | 35511 |
2001 | 34515 |
1994 | 32038 |
1993 | 30831 |
1992 | 25246 |
1991 | 23535 |
1987 | 20743 |
1984 | 20421 |
1986 | 19869 |
1990 | 16730 |
1985 | 16152 |
1988 | 15827 |
1981 | 13706 |
1989 | 13307 |
1982 | 10714 |
1983 | 10649 |
1980 | 8561 |
1979 | 6545 |
1978 | 5408 |
1974 | 4790 |
1975 | 4563 |
1973 | 4532 |
1976 | 3675 |
1972 | 3459 |
1977 | 3453 |
1971 | 1746 |
1968 | 957 |
1969 | 728 |
1958 | 648 |
1959 | 534 |
1967 | 504 |
1970 | 482 |
1966 | 476 |
1964 | 452 |
1962 | 420 |
1965 | 387 |
1963 | 371 |
1961 | 363 |
1960 | 347 |
1957 | 345 |
1956 | 311 |
1955 | 306 |
1954 | 128 |
1919 | 123 |
1948 | 108 |
... | < 100 |
I hope this first analysis gives you all more elements to discuss. I would remove these data. I think that data quality is as important as transparency in science.
Indeed @damianooldoni, this is as expected. obs.be/wnm.be have a well-established validation flow and therefore have the field `identificationVerificationStatus` filled. Data from the vlinderdatabank are high-quality atlas data and not very relevant to TrIAS (except for the cube used for survey effort correction, but I guess for that we don't need to exclude unverified records, as it is all about the effort) since they contain no non-native species. The distribution looks like it follows the same trend as the total number of observations.
@damianooldoni @peterdesmet @qgroom @SoVDH @amyjsdavis @DiederikStrubbe we exclude the unverified records from the occurrence-based indicators. But do we keep all records to build the cube, assuming that even an unverified record represents survey effort? Or how do we deal with this?
At the risk of sounding like an extremist, I'd rule them out.
I agree with @SoVDH for two reasons:
- a minimum of data quality (= validation) is extremely important, no matter the goal the data are used for
- the indicators are built upon the cube, where data are already aggregated per year, taxon and grid cell. Making a distinction between verified and unverified data means adding an extra column validation (TRUE or FALSE) to keep things tidy. It would make the occurrence cube more difficult to understand and I don't think it's worth it.

I agree that we should not make a distinction between verified and unverified data, but I am not sure that we should exclude unverified data from the cube. But if this is what you want to do (exclude unverified data), we need to decide quickly, because at present these data are being included in the risk models.
ok of course, but not sure @amyjsdavis will like it?
@DiederikStrubbe and I discussed this and now I have a better understanding. I am ok with you excluding the unverified data and I don't think this will substantially change the risk models.
If we want to exclude records that are marked as `unvalidated` (I'm fine with that), I suggest doing that for all processing (alien cube + all cube) and all datasets. It is clearer to explain.
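One way to read this suggestion is that a single filter-then-aggregate step is shared by every cube. The sketch below assumes records as dictionaries; the key names `year`, `taxonKey`, `cellCode` and `isAlien`, the function name `build_cube`, and the sample data are all hypothetical and are not taken from the occ-cube code.

```python
from collections import Counter

# Labels mentioned in this thread; a real list may need more entries.
EXCLUDED = {"unverified", "unvalidated"}


def build_cube(records, excluded=EXCLUDED):
    """Count occurrences per (year, taxonKey, cellCode), skipping records
    whose identificationVerificationStatus is in the excluded set."""
    cube = Counter()
    for r in records:
        if r.get("identificationVerificationStatus", "") in excluded:
            continue
        cube[(r["year"], r["taxonKey"], r["cellCode"])] += 1
    return cube


# Illustrative records only.
records = [
    {"year": 2018, "taxonKey": 1, "cellCode": "A", "isAlien": True,
     "identificationVerificationStatus": "unverified"},
    {"year": 2018, "taxonKey": 1, "cellCode": "A", "isAlien": True,
     "identificationVerificationStatus": "approved on expert judgement"},
    {"year": 2018, "taxonKey": 2, "cellCode": "A", "isAlien": False,
     "identificationVerificationStatus": ""},
    {"year": 2018, "taxonKey": 2, "cellCode": "A", "isAlien": False,
     "identificationVerificationStatus": "unvalidated"},
]

# The same function serves both cubes, so the exclusion rule cannot diverge.
all_cube = build_cube(records)
alien_cube = build_cube(r for r in records if r["isAlien"])
print(dict(all_cube))
print(dict(alien_cube))
```

Because the filter runs before aggregation, the cube needs no extra validation column.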
@SoVDH : I have 17 out of 19 plant species SDM models for the risk assessment completed. These of course include the "unvalidated" or "unverified" label. Is it your preference that I run them again with these data excluded or do you want the maps now? It will take a few days, but it can be done.
Yes, I think that would be better.
@amyjsdavis: I thought you were using the cube for Europe I made for your SDM (eu_modellingtaxa_cube.csv, metadata: eu_modellingtaxa_info.csv). In this cube there is no way to exclude unverified occurrences, so I wonder which occurrence data you are using.
By the way, I will try to make a new version of the cubes before end of June.
@damianooldoni: These are a different set of species (the plant species on the Species selection for PRA-ing list on Google Drive). They are different from the modellingtaxa list, and thus there is no European cube for them. I realized that in the future there will likely not be a cube for every species that is to be evaluated, so my modelling flow has the option to use the cube as an input if it already exists, or to process data downloaded directly from GBIF. As you may recall, I have to download global data for each species anyway for the models.
@amyjsdavis it's good to keep the options open, but there is a cube for every species on the unified and in fact on every spp.
Are your Belgian maps just crops of an EU/global risk map, or how does that work?
@timadriaens : indeed, there is a cube for every species, but only for Belgium. The risk maps for Belgium are essentially a crop of a European risk model.
Are we planning to do anything with the European maps? There is certainly interest, cf. Crassula helmsii and Muntiacus reevesi.
I would stop this interesting discussion here as it has nothing to do with verification anymore. I started a new one here: https://github.com/trias-project/occ-cube-alien/issues/25
I have also seen "unvalidated" as a value for identificationVerificationStatus. Should those data also be excluded?
@amyjsdavis That was the term we were discussing or am I missing something?
Oh, you mean “unvalidated” in addition to “unverified”. Yes, those should ideally be removed as well. Which dataset did you find those in?
Yes, I found it in my global download for the plants for the risk assessment. I just happened to notice it for Symphyotrichum lanceolatum. The dataset provider is urn:lsid:swedishlifewatch.se:DataProvider:1, the dataset name is Artportalen (Swedish Species Observation System).
My global download dataset is here: https://doi.org/10.15468/dl.ruaasw
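Since providers use different labels and capitalisation, the exclusion test could compare a normalised status against a small set of labels. This is a sketch under assumptions: the set contains only the two labels named in this thread, and the names `EXCLUDED_STATUSES` and `is_excluded` are hypothetical.

```python
# Only the two labels discussed in this thread; other providers may use more.
EXCLUDED_STATUSES = frozenset({"unverified", "unvalidated"})


def is_excluded(status):
    """True if the status, ignoring case and surrounding whitespace,
    is one of the explicitly excluded labels."""
    return (status or "").strip().lower() in EXCLUDED_STATUSES


for value in ["unverified", "Unvalidated ", "verified", "", None]:
    print(repr(value), is_excluded(value))
```

Matching the whole normalised string, rather than searching for a substring, avoids catching "verified" inside "unverified" or the reverse.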
This issue can be closed as we filter out unvalidated data, see https://github.com/trias-project/occ-cube/blob/master/src/2_create_db.Rmd#L252-L260 and https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L288-L296 for the names of the issue whose occurrences we filter out.
Hi, this came up when checking an unverified, false record of a supposedly new alien species for Belgium in the wnm.be data (Vespa orientalis). Waarnemingen and observations publish all records with identificationVerificationStatus on GBIF (which is ok!). However, for the pipeline, the models etc. it is imperative that we only use validated occurrences. Therefore: the pipeline needs a line to subset data based on identificationVerificationStatus.
Perhaps we can do some sort of sensitivity analysis to see how this impacts (I'm sure there is no time)...
Question to @damianooldoni @peterdesmet @qgroom @SoVDH, anticipating that perhaps many datasets/records on GBIF do not even have an identificationVerificationStatus: what do we do if that field is not filled?