Use only verified data for pipeline

timadriaens commented 4 years ago

Hi, this came up when checking an unverified, false record of a supposedly new alien species for Belgium in the wnm.be data (Vespa orientalis). Waarnemingen and observations publish all records with identificationVerificationStatus on GBIF (which is ok!). However, for the pipeline, the models etc., it is imperative that we only use validated occurrences. Therefore, the pipeline needs a line to subset the data based on identificationVerificationStatus:

approved on expert judgement
approved on photographic evidence
approved on knowledge rules

Perhaps we can do some sort of sensitivity analysis to see how this impacts (I'm sure there is no time)...

Question to @damianooldoni @peterdesmet @qgroom @SoVDH, anticipating that perhaps many datasets/records on GBIF do not even have an identificationVerificationStatus: what do we do if that field is not filled?

SoVDH commented 4 years ago

This is of utmost importance! For the Walloon data, it is partly the biggest part of Max's work. He made a big effort to convince the experts to validate the datasets before publication. We also chose to validate ourselves the data from some experts for some taxonomic groups for which they had a very good expertise. That's part of the reason why it took so long. It seems that Natagora followed the same process as Natuurpunt. I confirm what Tim just said above, only validated data can be used to run the indicators, to identify emerging species, to run the models for risk mapping. I know that we potentially 'lose' a lot of data, but here quality MUST take precedence over quantity! I also include here @amyjsdavis and @DiederikStrubbe as this discussion is relevant for them too.

peterdesmet commented 4 years ago

The record in question is this one: https://www.gbif.org/occurrence/2631775528 (Natuurpunt:Waarnemingen:190863847).

Both Natagora and Natuurpunt have the field identificationVerificationStatus and both (see e.g. https://www.gbif.org/occurrence/2270408500) are publishing unverified records to GBIF (which is fine).

Since other datasets do not have this field, the only option I see is removing records that are explicitly marked as unverified, i.e.:

identificationVerificationStatus = "unverified"
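A minimal sketch of this rule in Python, assuming occurrences are available as dictionaries keyed by Darwin Core term names. The `drop_unverified` helper and the sample records are hypothetical and not taken from the pipeline; comparing the status case-insensitively is also an assumption. Records with an empty or missing status are kept, since many datasets do not fill the field.

```python
def drop_unverified(records):
    """Keep every record except those explicitly marked as unverified."""
    kept = []
    for record in records:
        # A missing or empty status counts as "unknown", not as unverified.
        status = (record.get("identificationVerificationStatus") or "").strip().lower()
        if status != "unverified":
            kept.append(record)
    return kept


# Hypothetical sample records, for illustration only.
sample = [
    {"species": "Vespa orientalis", "identificationVerificationStatus": "unverified"},
    {"species": "Vespa velutina", "identificationVerificationStatus": "approved on photographic evidence"},
    {"species": "Harmonia axyridis", "identificationVerificationStatus": ""},
    {"species": "Crassula helmsii"},  # field absent in this dataset
]

filtered = drop_unverified(sample)
print([r["species"] for r in filtered])
```

Only the first sample record is removed; the records without a status pass through unchanged.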

timadriaens commented 4 years ago

Let's do this properly as it could have a big effect. Can we have a quick preview here of the number of records per identificationVerificationStatus and per class, for instance? There may be other relevant categories (and the verification types were adapted along the way after discussions with admins). There is also a category "pending", for example.

amyjsdavis commented 4 years ago

Currently our modelling workflow does not discriminate using identificationVerificationStatus. Based on the many factors we use to filter occurrence data, the data are already greatly reduced, so I would hate to add another, especially if filling this attribute is not widely adopted. However, it seems most relevant to pay attention to the identification status when two species closely resemble each other (especially if one is alien/invasive and the other is native).

-- Dr. Amy J.S. Davis Data-driven solutions to invasive species and biodiversity conservation

Terrestrial Ecology Unit Department of Biology Ghent University K. L. Ledeganckstraat 35 B-9000 Ghent Belgium http://amyjsdavis.com

damianooldoni commented 4 years ago

@timadriaens: as it seems important, I will not wait to check it while making a new cube. I try to find some time tomorrow or next week to tackle this.

damianooldoni commented 4 years ago

The GBIF download used as the starting point for the occurrence cube published on Zenodo contains 2447 distinct values of identificationVerificationStatus. They are shown below by number of occurrences, in descending order. As you can see, there are a lot of unverified occurrences: 6.652.040, almost 19% of the data. The filtering based on issue (coordinate issues) and occurrenceStatus (absences) removes "just" 165.653 occurrences. So, even if all of those were unverified, the number of unverified occurrences would still remain very high.

identificationVerificationStatus	n
""	15901818
"unverified"	6652040
"approved on knowledge rules"	6471644
"approved on expert judgement"	3598234
"approved on photographic evidence"	1273927
"verified"	748793
"Validated on the basis of rules"	60848
"Verified Observation"	33048
"validated by PAULY A"	27731
"validated by RASMONT P"	22352
"approved on photographic evide"	18643
"validated by LECLERCQ J"	18533
"Validated without evidence (additional information provided, ...)"	15668
"validated by D'Haeseleer J."	12881
"validated without a document in support (expertise or additional informations)"	10925
"validated by REMACLE A"	8211
...	...
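The worst-case reasoning above can be checked with a short calculation. This is only a sketch of the arithmetic, using the two counts quoted in this comment; the variable names are illustrative.

```python
unverified = 6_652_040          # occurrences marked "unverified" in the download
removed_by_filters = 165_653    # removed by the issue and occurrenceStatus filters

# Worst case: every occurrence removed by the existing filters was unverified.
remaining_unverified_min = unverified - removed_by_filters
print(remaining_unverified_min)  # 6486387
```

So at least about 6.5 million unverified occurrences would remain after the existing filters.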

qgroom commented 4 years ago

Interesting! It is not often appreciated that very common species don't need verifying, because even if the identification was wrong, there is a very good chance that that species is present within a grid cell anyway. On the other hand, for rare species the numbers of false identifications can far exceed the number of correct identifications. Therefore, you can happily accept the "unverified" records for common species, but where do you put the cut off?

damianooldoni commented 4 years ago

This was just a relatively fast check. I will investigate further by:

searching for the datasets the unverified obs come from. Waarnemingen (Natuurpunt data) for sure, maybe other ones?
grouping them by class as asked by @timadriaens
grouping them by year (maybe most of them are "too" recent data from actual year? Then impact on our analysis is very limited)
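The three groupings listed above could be sketched as follows. This is an illustration in Python rather than the code actually used for the analysis; `count_unverified_by` and the sample records are hypothetical, and the field names follow the Darwin Core terms in the download.

```python
from collections import Counter


def count_unverified_by(records, field):
    """Count explicitly unverified records per value of `field`, most frequent first."""
    counts = Counter(
        record.get(field, "")
        for record in records
        if record.get("identificationVerificationStatus") == "unverified"
    )
    return counts.most_common()


# Hypothetical sample records, for illustration only.
sample = [
    {"datasetKey": "A", "class": "Aves", "year": 2018, "identificationVerificationStatus": "unverified"},
    {"datasetKey": "A", "class": "Aves", "year": 2018, "identificationVerificationStatus": "unverified"},
    {"datasetKey": "B", "class": "Insecta", "year": 2017, "identificationVerificationStatus": "unverified"},
    {"datasetKey": "B", "class": "Insecta", "year": 2018, "identificationVerificationStatus": "verified"},
]

for field in ("datasetKey", "class", "year"):
    print(field, count_unverified_by(sample, field))
```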

Stay tuned :radio:

timadriaens commented 4 years ago

Ok, but I think removing records that are explicitly marked as unverified is indeed fine.

damianooldoni commented 4 years ago

As promised, a little more insight about the 6.652.040 unverified data in our GBIF download (date of download: 28 Jan 2020) containing occurrences in BE.

Datasets

Around 77% of the unverified data come from Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium. Almost all of the datasets are "Natuurpunt" related data. One comes from Wallonia: Observations.be - Non-native species occurrences in Wallonia, Belgium. There is also an INBO dataset: Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium.

title	n	datasetKey
Waarnemingen.be - Bird occurrences in Flanders and the Brussels Capital Region, Belgium	5137863	e7cbb0ed-04c6-44ce-ac86-ebe49f4efb28
Waarnemingen.be - Plant occurrences in Flanders and the Brussels Capital Region, Belgium	442505	bfc6fe18-77c7-4ede-a555-9207d60d1d86
Waarnemingen.be - Butterfly occurrences in Flanders and the Brussels Capital Region, Belgium	328363	1f968e89-ca96-4065-91a5-4858e736b5aa
Waarnemingen.be - Non-native animal occurrences in Flanders and the Brussels Capital Region, Belgium	281091	9a0b66df-7535-4f28-9f4e-5bc11b8b096c
Waarnemingen.be - Hymenoptera occurrences in Flanders and the Brussels Capital Region, Belgium	168301	71cfd412-6327-4ec7-8035-d8b2d0509ac5
Waarnemingen.be - Orthoptera occurrences in Flanders and the Brussels Capital Region, Belgium	99233	958b1d2f-2d11-4e94-a828-c8e2d2c013ca
Waarnemingen.be - Non-native plant occurrences in Flanders and the Brussels Capital Region, Belgium	61194	7f5e4129-0717-428e-876a-464fbd5d9a47
Observations.be - Non-native species occurrences in Wallonia, Belgium	44387	629befd5-fb45-4365-95c4-d07e72479b37
Waarnemingen.be - Hemiptera occurrences in Flanders and the Brussels Capital Region, Belgium	43826	37e094f3-dcf2-469f-93a2-c4b9b5fa7275
Vlinderdatabank - Butterflies in Flanders and the Brussels Capital Region, Belgium	20478	7888f666-f59e-4534-8478-3a10a3bfee45
Waarnemingen.be - Fish occurrences in Flanders and the Brussels Capital Region, Belgium	13963	8124cd73-ac84-43d2-ab39-1d80dc346525
Waarnemingen.be - Other insect occurrences in Flanders and the Brussels Capital Region, Belgium	10836	27e9e069-2862-4183-bcec-1e1a7f74d3e7

Classes

Here below the distribution of unverified occurrences at class level, ordered by n, number of occs. Empty class value = occs of taxa which don't belong to any class.

class	kingdom	n
Aves	Animalia	5371281
Insecta	Animalia	684639
Magnoliopsida	Plantae	355435
Liliopsida	Plantae	136518
Mammalia	Animalia	51388
Actinopterygii	Animalia	17016
Polypodiopsida	Plantae	9383
‎	Plantae	4767
Pinopsida	Plantae	4084
Reptilia	Animalia	3711
Amphibia	Animalia	2973
Gastropoda	Animalia	2732
Bryopsida	Plantae	2033
Bivalvia	Animalia	1401
Malacostraca	Animalia	1211
‎	Animalia	1201
Elasmobranchii	Animalia	626
Jungermanniopsida	Plantae	530
Leotiomycetes	Fungi	293
‎	incertae sedis	173
Phaeophyceae	Chromista	136
Maxillopoda	Animalia	94
Tentaculata	Animalia	89
Lycopodiopsida	Plantae	82
Ascidiacea	Animalia	58
Arachnida	Animalia	47
Cephalaspidomorphi	Animalia	30
Hydrozoa	Animalia	20
Polychaeta	Animalia	19
Ginkgoopsida	Plantae	15
Florideophyceae	Plantae	13
Demospongiae	Animalia	10
Gymnolaemata	Animalia	9
Chilopoda	Animalia	6
Anthozoa	Animalia	5
Agaricomycetes	Fungi	4
Phylactolaemata	Animalia	4
Cephalopoda	Animalia	2
Clitellata	Animalia	1
Leptocardii	Animalia	1

Years

Distribution over the years, in a plot (from 1980) and in a table where years are listed in descending order of number of occurrences, n. As the GBIF download was triggered on 2020-01-28, there are not yet any 2020 data from waarnemingen.be, which is updated monthly, and so no unverified occs for 2020. There are also far fewer unverified data for 2019, due to the typical publishing delay, which is longer than 28 days. Both are expected.

year	n
2018	934689
2017	720012
2016	663146
2015	574543
2014	508016
2013	467497
2012	437100
2011	432699
2010	422923
2009	347533
2008	162985
2007	81114
2005	65306
2006	64999
2004	53171
2019	51730
1996	43995
2003	39278
2002	38855
1999	38227
1997	36425
2000	36116
1998	35821
1995	35511
2001	34515
1994	32038
1993	30831
1992	25246
1991	23535
1987	20743
1984	20421
1986	19869
1990	16730
1985	16152
1988	15827
1981	13706
1989	13307
1982	10714
1983	10649
1980	8561
1979	6545
1978	5408
1974	4790
1975	4563
1973	4532
1976	3675
1972	3459
1977	3453
1971	1746
1968	957
1969	728
1958	648
1959	534
1967	504
1970	482
1966	476
1964	452
1962	420
1965	387
1963	371
1961	363
1960	347
1957	345
1956	311
1955	306
1954	128
1919	123
1948	108
...	< 100

I hope this first analysis gives you all more elements to discuss. I would remove these data. I think that data quality is as important as transparency in science.

timadriaens commented 4 years ago

Indeed @damianooldoni this is as expected. obs.be/wnm.be have a well established validation flow and therefore have that field identificationVerificationStatus filled. Data from the vlinderdatabank are high quality atlas data and not very relevant to TrIAS (unless for the cube used for survey effort correction but I guess for that we don't need to exclude unverified as it is all about the effort) since they contain no non-native species. The distribution looks like it follows the same trend as the total number of observations.

@damianooldoni @peterdesmet @qgroom @SoVDH @amyjsdavis @DiederikStrubbe we exclude the unverified records from the occurrence-based indicators. But do we keep all records to build the cube, assuming that even an unverified record represents survey effort? Or how do we deal with this?

SoVDH commented 4 years ago

At the risk of sounding like an extremist, I'd rule them out.

damianooldoni commented 4 years ago

I agree with @SoVDH for two reasons:

a minimum of data quality (= validation) is extremely important, no matter the goal the data are used for
the indicators are built upon the cube, where data are already aggregated per year, taxon and grid cell. Making a distinction between verified and unverified data means adding an extra column validation (TRUE or FALSE) to keep things tidy. It would make the occurrence cube more difficult to understand and I don't think it's worth it.
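To illustrate the second point, here is a minimal sketch of the aggregation in Python. It is not the actual cube code: `build_cube`, the key names `taxonKey` and `cellCode`, and the sample records are hypothetical, and treating every status other than "unverified" as validated is an assumption made only for this example.

```python
from collections import Counter


def build_cube(records, split_by_validation=False):
    """Aggregate occurrences to counts per (year, taxon, grid cell).

    With split_by_validation=True an extra TRUE/FALSE value is appended
    to each key, which corresponds to the extra column discussed above.
    """
    cube = Counter()
    for record in records:
        key = (record["year"], record["taxonKey"], record["cellCode"])
        if split_by_validation:
            # Assumption: anything not explicitly "unverified" counts as validated.
            validated = record["identificationVerificationStatus"] != "unverified"
            key = key + (validated,)
        cube[key] += 1
    return dict(cube)


# Hypothetical occurrences: same year, taxon and cell, one of them unverified.
records = [
    {"year": 2018, "taxonKey": 1, "cellCode": "cell_A", "identificationVerificationStatus": "verified"},
    {"year": 2018, "taxonKey": 1, "cellCode": "cell_A", "identificationVerificationStatus": ""},
    {"year": 2018, "taxonKey": 1, "cellCode": "cell_A", "identificationVerificationStatus": "unverified"},
]

verified_only = [r for r in records if r["identificationVerificationStatus"] != "unverified"]

print(build_cube(verified_only))                      # filter first, then aggregate
print(build_cube(records, split_by_validation=True))  # keep both, with an extra column
```

Filtering before aggregating keeps one row per year, taxon and cell, while the split variant produces two rows for the same combination.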

amyjsdavis commented 4 years ago

I agree that we should not make a distinction between verified and unverified data, but I am not sure that we should exclude unverified data from the cube. But if this is what you want to do (exclude unverified data), we need to decide quickly, because at present these data are being included in the risk models.

timadriaens commented 4 years ago

ok of course, but not sure @amyjsdavis will like it?

amyjsdavis commented 4 years ago

@DiederikStrubbe and I discussed this and now I have a better understanding. I am ok with you excluding the unverified data and I don't think this will substantially change the risk models.

peterdesmet commented 4 years ago

If we want to exclude records that are marked as unvalidated (I'm fine with that), I suggest to do that for all processing (alien cube + all cube) and all datasets. It is clearer to explain.

amyjsdavis commented 4 years ago

@SoVDH : I have 17 out of 19 plant species SDM models for the risk assessment completed. These of course include the "unvalidated" or "unverified" label. Is it your preference that I run them again with these data excluded or do you want the maps now? It will take a few days, but it can be done.

timadriaens commented 4 years ago

Yes think that would be better.

damianooldoni commented 4 years ago

@amyjsdavis: I thought you were using the cube for Europe I made for your SDM (eu_modellingtaxa_cube.csv, metadata: eu_modellingtaxa_info.csv). And in this cube there is no way to exclude unverified taxa. So, I wonder which occurrence data you are using.

By the way, I will try to make a new version of the cubes before end of June.

amyjsdavis commented 4 years ago

@damianooldoni : These are a different set of species (the plant species on the Species selection for PRA-ing list on google drive). They are different from the modellingtaxa list, and thus there is no European cube for them. I realized that in the future there will likely not be a cube for every species that is to be evaluated, so my modelling flow has the option to use the cube as an input if it already exists, or to process data downloaded directly from GBIF. As you may recall, I have to download global data for each species anyway for the models.

timadriaens commented 4 years ago

@amyjsdavis it's good to keep the options open, but there is a cube for every species on the unified and in fact on every spp.

timadriaens commented 4 years ago

Are your Belgian maps just crops of an EU/global risk map, or how does that work?

amyjsdavis commented 4 years ago

@timadriaens : indeed, there is a cube for every species, but only for Belgium. The risk maps for Belgium are essentially a crop of a European risk model.

timadriaens commented 4 years ago

Are we planning to do anything with the European maps? There is certainly interest, cf. Crassula helmsii and Muntiacus reevesi.

damianooldoni commented 4 years ago

I would stop this interesting discussion here as it has nothing to do with verification anymore. I started a new one here: https://github.com/trias-project/occ-cube-alien/issues/25

amyjsdavis commented 4 years ago

I have also seen "unvalidated" as a value for identificationVerificationStatus. Should those data also be excluded?

peterdesmet commented 4 years ago

@amyjsdavis That was the term we were discussing or am I missing something?

peterdesmet commented 4 years ago

Oh, you mean “unvalidated” in addition to “unverified”. Yes, those should ideally be removed as well. Which dataset did you find those in?
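A sketch of what excluding both values could look like, again in Python and under assumptions: `EXCLUDED_STATUSES`, `split_by_status` and the sample records are hypothetical, and the case-insensitive comparison is a choice made for this example, not something agreed in this thread. Returning a count per removed status makes it easy to report how much each value costs.

```python
from collections import Counter

# Values discussed in this thread; any further value would be an assumption.
EXCLUDED_STATUSES = frozenset({"unverified", "unvalidated"})


def split_by_status(records, excluded=EXCLUDED_STATUSES):
    """Return the kept records and the number of records dropped per status."""
    kept = []
    dropped = Counter()
    for record in records:
        status = (record.get("identificationVerificationStatus") or "").strip().lower()
        if status in excluded:
            dropped[status] += 1
        else:
            kept.append(record)
    return kept, dict(dropped)


# Hypothetical sample records, for illustration only.
sample = [
    {"id": 1, "identificationVerificationStatus": "unverified"},
    {"id": 2, "identificationVerificationStatus": "Unvalidated"},
    {"id": 3, "identificationVerificationStatus": "verified"},
    {"id": 4},  # field absent
]

kept, dropped = split_by_status(sample)
print([r["id"] for r in kept], dropped)
```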

amyjsdavis commented 4 years ago

Yes, I found it in my global download for the plants for the risk assessment. I just happened to notice it for Symphyotrichum lanceolatum. The dataset provider is urn:lsid:swedishlifewatch.se:DataProvider:1, the dataset name is Artportalen (Swedish Species Observation System).

amyjsdavis commented 4 years ago

My global download dataset is here: https://doi.org/10.15468/dl.ruaasw

damianooldoni commented 1 year ago

This issue can be closed as we filter out unvalidated data, see https://github.com/trias-project/occ-cube/blob/master/src/2_create_db.Rmd#L252-L260 and https://github.com/trias-project/occ-cube-alien/blob/master/src/belgium/2_create_db.Rmd#L288-L296 for the names of the issue whose occurrences we filter out.

trias-project / indicators