From occurrences and AreaOfOccupancy to an emerging species decision rule at year level

damianooldoni commented 5 years ago

This issue describes a part of the general workflow for assessing the emerging status of alien species, as discussed on Friday, 15 Feb 2019 by @damianooldoni , @timadriaens and @ToonVanDaele .

Input data

We start from cube_belgium.csv, the output of the occ-processing repository, as mentioned in https://github.com/trias-project/occ-processing/issues/3. This file contains occurrences with (at least) the following key columns:

taxonKey
speciesKey
kingdomKey
year
CELLCODE (grid cell id from the European Environment Agency reference grid)

Grouping by speciesKey and year, we get the number of occurrences per year (x: year, y: n_occs). We work at year level; no finer temporal information is used. The research effort bias of the area of occupancy (AOO) is already corrected at this stage (for details about research effort bias correction, see #46). Working at species level may not always be possible; this is discussed separately (see https://github.com/trias-project/unified-checklist/issues/35).

AOO and occurrences are time series (x: year, y: occurrences or y: AOO). Although we could have data before 1950, we start analysis from 1950, the birth date of invasion ecology (cit. @timadriaens :smiley: ).
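The aggregation described above can be sketched as follows. This is a minimal Python illustration, not the project's code. It assumes each row of the cube is a dict with the columns listed above plus a hypothetical count column `n` (taken as 1 when absent), and it assumes AOO is the number of distinct `CELLCODE` values per species and year, before any effort correction.

```python
from collections import defaultdict

FIRST_YEAR = 1950  # analysis start year stated in the issue


def yearly_series(rows, first_year=FIRST_YEAR):
    """Return {(speciesKey, year): {"n_occs": int, "aoo": int}}.

    Assumptions (not from the issue): `n` is an optional per-row count,
    and AOO is the number of distinct grid cells occupied in that year.
    """
    occs = defaultdict(int)
    cells = defaultdict(set)
    for row in rows:
        year = int(row["year"])
        if year < first_year:
            continue  # data before 1950 are left out of the analysis
        key = (row["speciesKey"], year)
        occs[key] += int(row.get("n", 1))
        cells[key].add(row["CELLCODE"])
    return {k: {"n_occs": occs[k], "aoo": len(cells[k])} for k in occs}


# Made-up rows, for illustration only
rows = [
    {"speciesKey": 1, "year": 2015, "CELLCODE": "cell_a"},
    {"speciesKey": 1, "year": 2015, "CELLCODE": "cell_a"},
    {"speciesKey": 1, "year": 2015, "CELLCODE": "cell_b"},
    {"speciesKey": 1, "year": 1949, "CELLCODE": "cell_a"},
]
series = yearly_series(rows)
```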

Limit cases

No occurrences for a species: no analysis possible. A very unlikely situation, but still possible. However, these species MUST be in the final list for Risk Assessment.
Occurrences only in the last valid year (due to delay in data publishing; see #48): insufficient data for analysis. However, these species MUST be in the final list for Risk Assessment.
Occurrences only in one of the very last years: insufficient data for analysis. However, these species MUST be in the final list for Risk Assessment. How many years we should consider is still not clear. However, it should not be too far in the past, as the absence of recent observations can simply mean that such a species is not emerging at all.
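A possible way to flag these limit cases before any modelling is sketched below. The function name and the `recent_window` parameter are hypothetical; in particular, the issue leaves open how many recent years should count, so the default of 3 is only a placeholder.

```python
def limit_case(years_with_occs, last_valid_year, recent_window=3):
    """Return a limit-case label, or None if the series can be analysed.

    `recent_window` is an assumption: the number of "very last years"
    is not settled in the discussion.
    """
    years = sorted(set(years_with_occs))
    if not years:
        return "no occurrences"
    if years == [last_valid_year]:
        return "only last valid year"
    if len(years) == 1 and years[0] > last_valid_year - recent_window:
        return "only one recent year"
    return None
```

Species for which a label is returned would skip the trend analysis but still be kept in the final list for Risk Assessment.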

Segmented regression

After setting aside the limit cases, we set occ and AOO to zero for years with no occurrences, since the cube contains only years with occurrences. Segmented regression is then applied to the AOO and occ time series separately. For each of the two time series and for each year, the slope of the last segment, together with its confidence interval, is translated into a categorical variable. There are three possible situations:

Slope is positive: occ/AOO is increasing.
Slope is zero (zero is within slope's confidence interval): occ/AOO is stable.
Slope is negative: occ/AOO is decreasing.
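The zero-filling step and the three-way classification can be illustrated with a short sketch. It does not fit the segmented regression itself: it assumes the confidence interval of the last segment's slope has already been obtained from whatever fitting routine is used. Function names are hypothetical.

```python
def fill_missing_years(series, first_year, last_year):
    """Add explicit zeros for years that are absent from the cube."""
    return {y: series.get(y, 0) for y in range(first_year, last_year + 1)}


def classify_slope(ci_lower, ci_upper):
    """Map the slope's confidence interval to a trend category."""
    if ci_lower > 0:
        return "increase"  # whole interval above zero
    if ci_upper < 0:
        return "decrease"  # whole interval below zero
    return "stable"        # zero lies within the interval


filled = fill_missing_years({2015: 3, 2017: 1}, 2014, 2017)
```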

Emerging decision table at year level

For each year and species we can then apply a decision table to define the emerging status of the species:

AOO	n. occurrences	emerging status
decrease	decrease	not emerging
decrease	stable	not emerging
decrease	increase	potentially emerging
stable	decrease	not emerging
stable	stable	not emerging
stable	increase	potentially emerging
increase	decrease	potentially emerging
increase	stable	potentially emerging
increase	increase	emerging
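The decision table translates directly into a lookup. A minimal sketch, with a hypothetical function name:

```python
# Keys are (AOO trend, occurrence trend), values are the emerging status,
# copied row by row from the table above.
DECISION_TABLE = {
    ("decrease", "decrease"): "not emerging",
    ("decrease", "stable"): "not emerging",
    ("decrease", "increase"): "potentially emerging",
    ("stable", "decrease"): "not emerging",
    ("stable", "stable"): "not emerging",
    ("stable", "increase"): "potentially emerging",
    ("increase", "decrease"): "potentially emerging",
    ("increase", "stable"): "potentially emerging",
    ("increase", "increase"): "emerging",
}


def emerging_status(aoo_trend, occ_trend):
    """Look up the status for one species in one year."""
    return DECISION_TABLE[(aoo_trend, occ_trend)]
```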

This will end up in an output like this:

species	year	emerging status
A	1950	potentially emerging
...	...	...
A	2012	not emerging
A	2013	pot. emerging
A	2014	pot. emerging
A	2015	pot. emerging
A	2016	emerging
...	...	...
B	2012	not emerging
B	2013	pot. emerging
B	2014	emerging
B	2015	emerging
B	2016	not emerging

Next steps: how to aggregate these emerging labels in order to estimate the general emerging status of a species? My two cents: as our analysis is future oriented, the emerging status in the recent past should definitely weigh more in the final decision than the status in the far past.

@ToonVanDaele , @timadriaens : please comment if you think I missed something or you have new thoughts about it.

timadriaens commented 5 years ago

Here is the figure we discussed, for the record

20190215_121111

damianooldoni commented 5 years ago

During the meeting of July 8, 2019, we decided/thought the following:

baseline for AOO (occupancy) and number of occurrences based on class
we go further by using GAM, which extends the idea of piecewise regression fairly well and is much more flexible for model tuning (see package gratia; inspired by Harrison et al. 2014, JAPE)
we define the following categories:
- emerging
- potentially emerging
- unclear (cases of too many fluctuations or too few data; GAM cannot provide significant results at p-value 0.8)
- not emerging
Appearing or reappearing species. We add a new section based on decision rules for species with very few observations (sequence of 0s and 1s). If 1 is the maximum number of squares/occurrences in the last 5 years, then the taxon is considered (re)appearing and we don't apply any GAM model to it, as it would not be meaningful. Still, more work should be done to take very extreme cases into account.
We should also add a kind of warning/label whether an alien species is found in protected areas.
The envisaged output is a ranking as we are working with categorical variables, i.e. it is impossible to weight along years (linear weighting? exponential weighting? ...).
We will have to choose between (at least) two different ranking rules. See below.
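The (re)appearing rule from the list above can be sketched as follows. The function name is hypothetical, and the series is assumed to map years to the number of squares or occurrences, with missing years counting as zero.

```python
def is_reappearing(series, last_year, window=5):
    """True if the maximum over the last `window` years is exactly 1."""
    recent = [series.get(y, 0) for y in range(last_year - window + 1, last_year + 1)]
    return max(recent) == 1
```

Taxa flagged this way would be handled by decision rules only and skipped by the GAM step.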

1st ranking strategy

highest category in 2017 (last year)
If same category in 2017: rank by highest category in 2016 (last year - 1)
...
If same category in 2014: rank by highest category in 2013 (last year - 4)

This ends up with a ranking for each indicator, which will be shown as a synoptic table. Here below an example:

species	rank	2013	2014	2015	2016	2017
D	1	emerging	potentially emerging	emerging	emerging	emerging
B	2	emerging	not emerging	emerging	emerging	emerging
H	3	emerging	emerging	unclear	emerging	emerging
W	4	emerging	emerging	not emerging	emerging	emerging
Z	5	emerging	emerging	emerging	potentially emerging	emerging
Y	6	emerging	emerging	emerging	emerging	unclear
S	7	emerging	emerging	emerging	emerging	not emerging
S	7	emerging	potentially emerging	emerging	emerging	not emerging
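The first ranking strategy amounts to a lexicographic sort on the yearly categories, most recent year first. A minimal sketch, assuming the category order emerging > potentially emerging > unclear > not emerging (the numeric scores are an illustration, only their order matters). The data are the first seven rows of the example table, entered in scrambled order.

```python
SCORE = {"emerging": 3, "potentially emerging": 2, "unclear": 1, "not emerging": 0}
YEARS = [2013, 2014, 2015, 2016, 2017]

E, P, U, N = "emerging", "potentially emerging", "unclear", "not emerging"
TABLE = {
    "S": [E, E, E, E, N],
    "Y": [E, E, E, E, U],
    "Z": [E, E, E, P, E],
    "W": [E, E, N, E, E],
    "H": [E, E, U, E, E],
    "B": [E, N, E, E, E],
    "D": [E, P, E, E, E],
}


def rank_species(table, years=YEARS):
    """Return species sorted best-first, comparing the latest year first."""
    def key(species):
        labels = dict(zip(years, table[species]))
        return tuple(SCORE[labels[y]] for y in sorted(years, reverse=True))
    return sorted(table, key=key, reverse=True)


ranking = rank_species(TABLE)
```

With these inputs the sort reproduces the order of the example table (D, B, H, W, Z, Y, S).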

This way we account for the most recent year being more important in assessing the emerging character of a species. As we work with rankings, merging the indicators means merging their rankings, thus producing a new ranking. A strategy could be to calculate the final ranking from the sum of the two rankings, where the minimum wins.
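Merging two rankings by rank sum could look like the sketch below. The alphabetical tie-break is an assumption; the discussion does not say how equal sums should be handled.

```python
def merge_rankings(rank_aoo, rank_occ):
    """Combine two {species: rank} dicts; the lowest rank sum wins."""
    total = {sp: rank_aoo[sp] + rank_occ[sp] for sp in rank_aoo}
    # Tie-break on species name (assumption, for a deterministic result)
    return sorted(total, key=lambda sp: (total[sp], sp))


merged = merge_rankings({"A": 1, "B": 2, "C": 3}, {"A": 2, "B": 3, "C": 1})
```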

2nd ranking strategy

In line with previous thoughts, we want to combine both rankings in one general indicator. Again, the reason to work with these two indicators is that a species can increase its AOO without a noticeable increase in occurrences and vice versa.

We start from this table:

species	year	AOO	occurrence
A	2017	emerging	emerging
A	2016	emerging	potentially emerging
A	2014	unclear	potentially emerging
A	2015	not emerging	potentially emerging
A	2015	not emerging	potentially emerging
B	2017	emerging	emerging
B	2016	emerging	unclear
B	2015	potentially emerging	unclear
B	2014	not emerging	potentially emerging
B	2013	not emerging	potentially emerging

We define ranking based on the labels in the most recent year (2017), then, if same labels occur, we evaluate the second most recent year (2016), etc.

In this way we get just one (!) ranking and we do not need to manipulate rankings. At the same time, we can still provide partial rankings based on each indicator as ancillary information. If I had to vote now, I would choose this strategy.
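A sketch of this second strategy follows. How the two labels of one year are compared is not specified in the discussion; summing their scores is an assumption made here only to get a single sortable value per year. Only the 2017 and 2016 rows of the table above are used.

```python
SCORE = {"emerging": 3, "potentially emerging": 2, "unclear": 1, "not emerging": 0}


def combined_rank(table):
    """table: {species: {year: (aoo_label, occ_label)}}; best species first."""
    def key(species):
        by_year = table[species]
        years = sorted(by_year, reverse=True)  # most recent year first
        # Assumption: a year's value is the sum of both indicator scores
        return tuple(SCORE[by_year[y][0]] + SCORE[by_year[y][1]] for y in years)
    return sorted(table, key=key, reverse=True)


table = {
    "B": {2017: ("emerging", "emerging"), 2016: ("emerging", "unclear")},
    "A": {2017: ("emerging", "emerging"), 2016: ("emerging", "potentially emerging")},
}
order = combined_rank(table)
```

Both species tie in 2017, so the 2016 labels decide and A comes before B.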

qgroom commented 5 years ago

I think we can only really evaluate these strategies with real data from known species. If we did this with data from the 20th century can we see what we have now?

How does this approach resolve the issue of slowly mobilized data?

damianooldoni commented 5 years ago

@qgroom : sure. Thanks to @ToonVanDaele we started to work on real data, which let us think further. We are working on it and we hope to have some results to discuss with you all in September. We limit the analysis to 2017 (next year we will take 2018 too) to avoid the drop in the most recent years. Correcting by dividing occurrences and AOO by a baseline at class level will be taken into account. In this way, we hope to correct research effort bias as well. The next step, at least for me, is to include @ToonVanDaele 's code in the trias package.

timadriaens commented 5 years ago

@qgroom some graph output we used for the discussion yesterday is available on the TrIAS folder but needs updating (e.g. including sampling bias correction). Some species I would suspect of being "emerging" are indeed flagged :-)

2882849_Vaccinium corymbosum_DT_3_GAM_3

What do you mean by "how is this to solve the issue of slowly mobilized data"?

timadriaens commented 5 years ago

@damianooldoni @ToonVanDaele re "5. We should also add a kind of warning/label whether an alien species is found in protected areas." This can indeed be done and will be available through another occurrence indicator (see this issue). As we discussed, there are several options:

to run the same procedure on a selection of occurrences in the protected areas only
to include it in the model to see whether occurrences in protected areas explain any of the variation
to simply flag if a species occurs (a lot) in protected areas

I don't think we decided how to proceed with that. Occurrence in protected areas is probably very linked to what is outside those areas.

timadriaens commented 5 years ago

Here are two other usual suspects that are indeed flagged as emerging (but based on the last 3 years)

7501634_Rosa multiflora_DT_3_GAM_3 3084015_Phytolacca americana_DT_3_GAM_3

qgroom commented 5 years ago

How does this approach resolve the issue of slowly mobilized data?

I'm forgetting that you are going to correct by class.

ToonVanDaele commented 5 years ago

To avoid confounding effects, the observations by class should not contain observations of the potentially invasive species. The observations of the invasive species are subtracted from the class observations in the pre-processing phase.
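The subtraction described here is a per-year difference. A minimal sketch with a hypothetical function name, assuming both inputs map years to observation counts:

```python
def class_baseline(class_obs, taxon_obs):
    """Class-level observations per year, minus those of the taxon itself."""
    return {year: n - taxon_obs.get(year, 0) for year, n in class_obs.items()}


baseline = class_baseline({2016: 100, 2017: 120}, {2017: 20})
```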

damianooldoni commented 5 years ago

Code for ranking (2nd strategy deployed) and snapshot based on 19 species (test data from @ToonVanDaele): https://github.com/ToonVanDaele/trias-test/issues/4.

damianooldoni commented 5 years ago

For tracking purposes. As shown during TrIAS meeting in October, this is the general workflow diagram we are working with: workflow_emerging_status_indicator

damianooldoni commented 5 years ago

Emerging status scores assigned by GAM:

Emerging
Potentially emerging
Unclear
Not emerging

In case of score "Unclear" (too few data points for applying GAM, 0 within the confidence interval of the 1st and 2nd derivative), we use a set of decision rules. Based on them, the possible outcomes are:

Emerging
Potentially emerging
Not emerging
Appearing/reappearing

We don't include taxa with score "appearing/reappearing" in the ranking, as they are not comparable with other taxa. After discussing bilaterally with @ToonVanDaele and @timadriaens, we thought of putting them in a separate table. The three of us will discuss this further during our internal meeting on 18 Nov. Results will be discussed with the core team on Monday 25 Nov.

damianooldoni commented 4 years ago

Based on meeting with @timadriaens and @ToonVanDaele today:

Using observations of native species within the same class as a covariate could compensate not only for research effort bias, but also for publishing delay. If our model can already assess that a species is emerging in 2018 or 2019, we should allow it! The opposite does not hold: if a species seems not to be emerging in 2018 or 2019, it doesn't mean that it is not emerging, only that its situation cannot be assessed yet. In this way we increase True Positives without adding False Negatives.
Archaeophytes (occurring before 1500) should be removed from the emerging species list. Another way is to filter by taking into account only species introduced since 1950.
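The asymmetric treatment of the most recent years could be expressed as below. The names are hypothetical, and the choice to leave "potentially emerging" and "unclear" untouched is an assumption; the discussion only covers the "emerging" and "not emerging" cases.

```python
INCOMPLETE_YEARS = {2018, 2019}  # years affected by publishing delay


def reported_status(year, model_status):
    """Keep positive signals in incomplete years, but do not report
    'not emerging' there, since missing data may explain it."""
    if year in INCOMPLETE_YEARS and model_status == "not emerging":
        return "cannot be assessed"
    return model_status
```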

Ideas to improve ranking, especially the ranking of the group of the "most" emerging species:

compare with 5 and 10 years baseline trend
detect the doubling time, i.e. the number of years needed to double the number of observations/cells. The shorter the doubling time, the steeper the growth.
the 1st derivative confidence interval's minimum (= lowest guaranteed growth in # cells/obs per year). This is meaningful only if this minimum is positive, which holds for emerging species, as we classify as emerging only species whose lower bounds of the 1st and 2nd derivatives are clearly above zero. For communication, we could give the full confidence interval, but for ranking we could use the minimum only. The problem with this method is that not all species can be modelled by GAM because of too few data, while the doubling time can always be retrieved in some way.
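One simple way to retrieve a doubling time from raw counts is sketched here. The definition used (years elapsed since the count was last at or below half of its current value) is an assumption; the discussion does not fix how the doubling time would be computed.

```python
def doubling_time(series, year):
    """Years since the count was last at or below half its value in `year`.

    Returns None if the current count is zero or no such earlier year exists.
    """
    current = series.get(year, 0)
    if current <= 0:
        return None
    for earlier in range(year - 1, min(series) - 1, -1):
        if series.get(earlier, 0) <= current / 2:
            return year - earlier
    return None


# Made-up counts, for illustration only
counts = {2012: 2, 2013: 3, 2014: 5, 2015: 7, 2016: 9, 2017: 10}
t_double = doubling_time(counts, 2017)
```

Because it works on raw counts, this kind of measure stays available for species with too few data for a GAM.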

damianooldoni commented 4 years ago

During a meeting I had with @ToonVanDaele, Hans Van Calster suggested running GAM just once and taking the output for the years of interest (in our case 2016, 2017, 2018) instead of running GAM three times: first on the time series up to 2016, then on the time series up to 2017 and finally on the time series up to 2018.

If I understood correctly, the reason is that the GAM outputs are not statistically independent, so there is no reason to make the model so complex and computationally demanding. @ToonVanDaele agreed and he and I immediately implemented this "easier" approach. Still, @ToonVanDaele, we need your help to formulate this concept better in the near future.

Another update about the emerging status assessment: we use the lower bound of the confidence interval of the growth (1st derivative) of occupancy in 2018 to rank the species considered emerging. As this value is continuous, ties are resolved. @ToonVanDaele : I am very curious to see the output list. Meanwhile, I am making progress in polishing your code in this repo (not in master, working on branch occurrence-indicators).
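Ranking the emerging species by this lower bound is a plain descending sort. A minimal sketch with placeholder taxa and made-up values:

```python
def rank_by_lower_bound(lower_bounds):
    """lower_bounds: {species: lower CI bound of the 1st derivative}.

    Returns species sorted from largest to smallest guaranteed growth.
    """
    return sorted(lower_bounds, key=lower_bounds.get, reverse=True)


# Hypothetical values, for illustration only
order = rank_by_lower_bound({"sp1": 0.4, "sp2": 2.1, "sp3": 1.3})
```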

damianooldoni commented 4 years ago

The entire workflow for occurrence indicators is now online! :tada: Partial emerging statuses based on GAM and decision rules (link) and final ranking (link) have been added. More details about the changes and the work still to be done are in #70.

I thought about an alternative ranking strategy which is less strict (less hierarchical) as it is point-based, although it follows same criteria discussed in this issue. You can find the table of the weights (gain factors) in the pipeline at section 4.2. So, please, @timadriaens: up to you to see if this second strategy is preferable.

We use the last three full years (2018, 2017, 2016) as evaluation years, so the pipeline detecting (re)appearing taxa has been slightly changed and takes into account only occurrences of the current year.

@timadriaens , @ToonVanDaele: we should also sit together to discuss the results and improve the GAM plots without making them too complex. GAM uses the values of native species within the same class as a covariate, so the smoother in the plots doesn't follow the real occurrence/occupancy, as part of the growth/decrease is due to the covariate. So the GAM results may look bad at first sight, but they are actually good.

@ToonVanDaele : could you please use the functions apply_gam and em_status_dr and review them? In the meantime I will move them to the trias package, adding documentation and rigorous unit testing. At this stage of development your suggestions are more than welcome. I simplified the decision rules/tree too.

decision_tree

I kept the original numbering of @ToonVanDaele to ease the review process. I will rename the rules if you think they are good.

damianooldoni commented 4 years ago

https://trias-project.github.io/indicators/ranking_emerging_status.html#4_ranking_by_emerging_status

damianooldoni commented 4 years ago

See also two different ranking methods, the hierarchical one (which works like in Olympic games) and the point-based system where each partial evaluation contributes (differently) to final score.
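The difference between the two methods can be seen in a small sketch. The points per category and the weights per year are hypothetical; the actual gain factors are in the pipeline and are not reproduced here.

```python
POINTS = {"emerging": 3, "potentially emerging": 2, "unclear": 1, "not emerging": 0}
YEAR_WEIGHT = {2018: 3, 2017: 2, 2016: 1}  # hypothetical gain factors


def hierarchical_key(by_year):
    """Most recent year dominates, like gold medals in a medal table."""
    return tuple(POINTS[by_year[y]] for y in sorted(by_year, reverse=True))


def point_score(by_year):
    """Every evaluation year contributes, weighted by recency."""
    return sum(YEAR_WEIGHT[y] * POINTS[s] for y, s in by_year.items())


taxa = {
    "X": {2018: "emerging", 2017: "not emerging", 2016: "not emerging"},
    "Y": {2018: "potentially emerging", 2017: "emerging", 2016: "emerging"},
}
hierarchical = sorted(taxa, key=lambda t: hierarchical_key(taxa[t]), reverse=True)
point_based = sorted(taxa, key=lambda t: point_score(taxa[t]), reverse=True)
```

Under these made-up weights the hierarchical method puts X first because of its 2018 label, while the point-based method puts Y first because its earlier years also count.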

damianooldoni commented 4 years ago

Today, based on a parliamentary question and through @timadriaens, I found that we don't provide graphs of observations and occupancy when GAM cannot be used. This is a pity, as we have these data and we should show them even if we cannot add the GAM prediction as an additional layer. As I am working on adding apply_gam to the trias package, I will solve this at function level.

timadriaens commented 4 years ago

We had already agreed on that (raw data always needed for interpretation) at the previous meeting see #53 but just did not get to it yet (and did not put it on github perhaps). But indeed, the plan is to show at least the following graphs:

(1) raw trend in occupancy
modeled trend (GAM) in occupancy (corrected for bias)
(2) raw trend in number of observations
modeled trend (GAM) in number of observations (corrected for bias)
these graphs we need with all squares and only Natura2000 squares (for this, we decided to put ALL three options on one graph: SAC, SPA and all Natura2000)

We need this for all species, not only the emerging ones.

It would also be good, per species to have a small data frame with the "emergence" indicators so this is available and can be shown per species. This should make clear whether it's based on decision rules or on GAM.

Other ideas for visualization/graphs/output tables are welcome @damianooldoni @ToonVanDaele

@ToonVanDaele has made some more attractive graphs for the emerging species series of TrIAS Aware in Natuur.focus. Perhaps we can base the layout on those graphs and use his code to that end. Bearing in mind that we are going to put that information on a website for the public, I feel they should look smashing.

timadriaens commented 4 years ago

examples pulled together with @damianooldoni for Dama dama

GAM_observations_correct_baseline_5220136_Dama dama_Natura2000

timadriaens commented 4 years ago

and some graphs @ToonVanDaele prepared

rosa_multi1

nf_sp

damianooldoni commented 4 years ago

@timadriaens , @ToonVanDaele: interesting. As already said, we need to sit together for half a day and produce graphs so we can find the right output and the most smashing style :+1: At the same time we also have to solve this visualization issue: https://github.com/ToonVanDaele/trias-test/issues/10. I will check your agendas online and make a proposal.

damianooldoni commented 4 years ago

Meanwhile, I will correct baseline data. Up to now we used number of native species instead of using ALL data at class level minus the obs of taxon under examination.

damianooldoni commented 4 years ago

This long issue has been tackled and can be closed. :+1:
