nicholasjclark / phylo_func_trends

Using phylogenetic and functional relationships to inform nonlinear trend estimates from long-term biodiversity data

For candidate datasets, determine appropriate steps for cleaning and preparing data. We don't want too many shortcuts here (i.e. blindly aggregating with no justification for this), it would be better to think through the data generating process for each dataset #5

Open nicholasjclark opened 3 months ago

AdamCSmithCWS commented 3 months ago

Hi Folks. What do we want to consider a zero-count in the BBS (and any other dataset we use)? In the BBS dataset, only the positive counts for a given species and survey-event are stored, so we need to decide what defines a valid zero. Of course we could do a full join of every survey-event with the entire list of positive counts, but this would generate a dataset with 91 million rows (745 species * 123,000 surveys). For most trend analyses of the BBS, zero values are added for a given route (survey location) and species if, at some point in the history of the BBS, that species has been observed on that route. In effect, BBS routes at which a species has never been observed are treated as if they are outside the species' range, so for that species and route there are no data (no zeroes and no positive counts). This approach makes for a much more manageable dataset, but it also means that the list of species with data varies among routes. How will this affect the way the models work? Can they easily handle varying species lists, or should we stratify so that we can fix the species list for some intermediate spatial extent (e.g., a consistent list within a given BCR, so that we add the zeros for all species that occur on any route in that BCR)?
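The route-level zero-filling rule described above can be sketched roughly as follows. This is a minimal illustration in Python with hypothetical column structure, not the real BBS schema or any project code:

```python
# Sketch of route-level zero-filling: for each route, add zero counts only
# for species that have been observed on that route at least once in the
# BBS history. Field layout ("route", "year", "species", "count") is a
# hypothetical placeholder, not the actual BBS release format.
from collections import defaultdict

def fill_route_level_zeros(positive_counts, surveys):
    """positive_counts: iterable of (route, year, species, count), count > 0.
    surveys: iterable of (route, year) survey events that actually ran.
    Returns a list of (route, year, species, count) rows including zeros."""
    # Species ever seen on a route define that route's species list
    route_lists = defaultdict(set)
    observed = {}
    for route, year, species, count in positive_counts:
        route_lists[route].add(species)
        observed[(route, year, species)] = count

    rows = []
    for route, year in surveys:
        for species in sorted(route_lists[route]):
            # Missing combinations within the route's list become zeros
            rows.append((route, year, species,
                         observed.get((route, year, species), 0)))
    return rows

# Toy example: species "B" was never seen on route 2, so route 2 gets no
# rows for "B" (treated as outside that species' range).
pos = [(1, 2001, "A", 3), (1, 2002, "B", 1), (2, 2001, "A", 5)]
surveys = [(1, 2001), (1, 2002), (2, 2001), (2, 2002)]
rows = fill_route_level_zeros(pos, surveys)
```

The full-join alternative discussed above would simply replace `route_lists[route]` with the complete 745-species list, which is what blows the dataset out to 91 million rows.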

nicholasjclark commented 3 months ago

Hi @AdamCSmithCWS, this is a good question that deserves some thought.

On the one hand, I'd prefer to add zeros for all missing combinations, as we would expect a species to be detectable in a region if it occurred there, so the fact that it wasn't recorded by survey teams should count as a "true" zero. I haven't tested this explicitly, but my gut instinct is that these zeros are very useful for estimating variation in species' spatial fields.

But on the other hand, the additional complexity is enormous, as you have recognised. In the latest model I've tried, with 83 species over the 37 BCRs, I have just over 89,000 observations. This model takes ~50 minutes to complete on my brand-new i9 processor. So it isn't difficult to imagine that we'll run into situations where models can't be fit on personal machines.

Perhaps it would be better to use some sort of threshold for deciding which species to include? But I'm happy to try different things.

AdamCSmithCWS commented 3 months ago

Yes, I think you're right: the zeros (and yes, I agree, they are true zeros given the BBS field methods) would help estimate the species' spatial fields. But would they help, or hinder, the estimation of the temporal smooths? If a species is missing from a region, I have a feeling that species' time series of zeros (i.e., a constant population with trend = 0) will influence the estimation of the trend for closely related species and other species in the region. For example, datasets for regions with relatively few species (northern regions with lower species richness) will have many rows of data representing species with stable populations (stable at 0)... Would it make sense to stratify the data preparation, such that the species lists would be fixed within a given spatial stratum (e.g., a BCR) and the zeros would be filled on all routes and all years for that stratum-specific list of species? In each stratum, we could constrain the species list to those species that occur above some minimum threshold (e.g., on a minimum of 3 BBS routes and across 5 years).

If helpful, I can draft up some data-preparation scripts to support these alternatives.
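As a rough sketch of what such a data-preparation script might do, here is the stratum-specific species-list step with the example thresholds above. The stratum/column structure is a hypothetical illustration, not the actual BBS workflow:

```python
# Sketch of the stratified alternative: fix the species list within a
# stratum (e.g. a BCR), keeping only species that pass minimum-occurrence
# thresholds, before zero-filling every route and year in that stratum.
# Thresholds (3 routes, 5 years) follow the example in the comment above.
from collections import defaultdict

def stratum_species_list(positive_counts, min_routes=3, min_years=5):
    """positive_counts: iterable of (stratum, route, year, species, count).
    Returns {stratum: set of species seen on >= min_routes routes and in
    >= min_years distinct years within that stratum}."""
    routes_seen = defaultdict(set)   # (stratum, species) -> distinct routes
    years_seen = defaultdict(set)    # (stratum, species) -> distinct years
    for stratum, route, year, species, _count in positive_counts:
        routes_seen[(stratum, species)].add(route)
        years_seen[(stratum, species)].add(year)

    out = defaultdict(set)
    for (stratum, species), routes in routes_seen.items():
        if (len(routes) >= min_routes
                and len(years_seen[(stratum, species)]) >= min_years):
            out[stratum].add(species)
    return dict(out)

# Toy example: "A" occurs on 3 routes across 5 years and is kept;
# "B" occurs on a single route in a single year and is dropped.
pos = [("BCR1", r, y, "A", 1) for r in (1, 2, 3) for y in range(2000, 2005)]
pos += [("BCR1", 1, 2000, "B", 2)]
lists = stratum_species_list(pos)
```

Zero-filling would then run over all routes and years in the stratum for exactly this fixed list, so every route in a BCR shares the same species.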

AdamCSmithCWS commented 3 months ago

There are two potentially important components of variation in the BBS data that we may want to consider including:

  1. Variation in counts among routes
  2. Variation in counts among observers.

Both of these are known to affect BBS counts. The observer variation is clearly nuisance variation, while the route variation could be considered either nuisance (sampling noise reflecting random variation among relatively few routes within a bird conservation region) or biological (variation in abundance within relatively large strata that captures variation in landcover, elevation, and latitude). Both sources of variation will also vary in time, with turnover in the observer pool and variation in which routes get surveyed in a given year. Either would require significant computational resources to incorporate, given the large number of routes and/or observers, but either could also be modeled as simple zero-mean random effects. Their addition would also significantly increase the size of the dataset, so that there would be one row for each species, region, year, and route (potentially increasing the dataset by a factor of ~5000, given that there are about 5000 routes).
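To make the "simple zero-mean random effects" idea concrete, here is a toy simulation of counts with route and observer intercepts on the log scale. This is purely illustrative; the standard deviations, Poisson likelihood, and dimensions are assumptions, not the project's actual model:

```python
# Toy generative sketch: route and observer effects entering a log-link
# count model as zero-mean random intercepts. All parameter values are
# made up for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_routes, n_observers, n_obs = 50, 30, 500

route_eff = rng.normal(0.0, 0.4, n_routes)     # assumed sigma_route = 0.4
obs_eff = rng.normal(0.0, 0.2, n_observers)    # assumed sigma_observer = 0.2
route_eff -= route_eff.mean()  # centre so both are strictly mean-zero
obs_eff -= obs_eff.mean()

baseline_log_abundance = np.log(10.0)          # hypothetical species mean
route = rng.integers(0, n_routes, n_obs)       # which route each count is from
observer = rng.integers(0, n_observers, n_obs) # which observer ran it

# Expected count combines the baseline with both random intercepts
mu = np.exp(baseline_log_abundance + route_eff[route] + obs_eff[observer])
counts = rng.poisson(mu)
```

Fitting the reverse direction (estimating those effects from data) is where the ~5000-route, large-observer-pool dimensionality becomes the computational burden described above.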

So, if route and observer variation need to be ignored for practical reasons (in favour of counts aggregated across all routes and observers), I think we could still argue the models are valuable for demonstrating the phylogenetic and trait-based patterns. But it will limit the immediate applicability of these models, at least for many of the species status assessment uses of the BBS.

nicholasjclark commented 3 months ago

Thanks @AdamCSmithCWS. Yes, I agree completely; in fact my early attempts used route-level counts so I could incorporate both of these sources. But as you alluded to, the size of the dataset and the complexity of the model became outright impossible for me to handle. At present my process has been to aggregate at the polygon level, but to use the number of routes that provided information for that polygon in that year as an offset. This of course treats all routes equally and ignores observer effects entirely, which is a shame. We can certainly argue our way out of it, though we might not want to.
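That aggregation-with-offset step could be sketched like this. It is a toy illustration with made-up column structure; the real workflow presumably differs:

```python
# Sketch of polygon-level aggregation with a log(number of routes) offset:
# counts are summed within polygon x year x species, and the offset records
# how many distinct routes contributed data to that polygon-year.
from collections import defaultdict
import math

def aggregate_to_polygon(route_counts):
    """route_counts: iterable of (polygon, year, species, route, count).
    Returns {(polygon, year, species): (total_count, log_offset)}."""
    totals = defaultdict(int)
    routes = defaultdict(set)
    for polygon, year, species, route, count in route_counts:
        totals[(polygon, year, species)] += count
        routes[(polygon, year)].add(route)  # effort shared across species
    return {key: (tot, math.log(len(routes[(key[0], key[1])])))
            for key, tot in totals.items()}

# Toy example: two routes surveyed polygon P1 in 2000, so the offset is
# log(2) for every species in that polygon-year.
rows = [("P1", 2000, "A", 10, 3), ("P1", 2000, "A", 11, 2),
        ("P1", 2000, "B", 10, 1)]
agg = aggregate_to_polygon(rows)
```

The offset lets the model interpret the summed count as count-per-unit-effort on the log scale, which is exactly where the "treats all routes equally" caveat comes from: every route contributes the same unit of effort regardless of its observer or habitat.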

But one option that deserves some thought is to chunk the data into well-defined spatial units and fit separate models within each unit, allowing us to use models that do go into point-level detail. This would let us use geostatistical spatial models, and it might make more sense anyway given that these are the scales at which we would expect phylogenetic (and especially functional) relationships to be particularly important. The question is whether we would lose some important information: if species 1 has a broad range and is declining in areas outside of our defined unit, we'd lose that context when estimating that species' trend within the unit. But maybe that doesn't matter if the units are big enough and we have enough data? There are always endless compromises!

drhammed commented 3 months ago

Interesting discussions here! The full CSV file of the BBS that contains the 50-stop counts includes (what I also agree can be referred to as) "true zeros." One thing I have done in the past is to include the route variation (at least), as I believe it has ecological implications. So, to reduce the size of this dataset, I have some potential suggestions too (in addition to the stratification plans, which are very good, btw):

  1. I believe that when working with BBS data, a single species count for an entire route/year should be sufficient. In that case, we could sum all 50 stop counts within a particular route/state/country (and create an ID column for that).
  2. The complete 50-stop counts for the BBS started in 1997 (which means data prior to 1997 will have incomplete stops)! As such, we could use data from 1997 to date. One caveat is that I have no expectation or evidence for why/how the incomplete stops would bias our results, but (in addition to reducing data size) we could base the justification on incomplete point sampling and on having equal coverage!
  3. Another thing is that, since BBS data are collected during daytime (if I'm not wrong), we could exclude species that are not well sampled by the survey, for example nocturnal and aquatic species. With that, waterbirds, shorebirds, owls, etc. can all be excluded.
  4. Lastly (which, of course, should be standard), we can exclude surveys that do not meet the BBS data quality standards!

If we do all that, we will have one row for each species, route ID, year and count, which should be < 10 million rows!
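The four steps above could be combined into a single filtering pass, sketched here with hypothetical field names and an assumed exclusion list (the real BBS column names and quality flags differ):

```python
# Sketch of the proposed reduction pipeline. "group", "quality_ok", and the
# excluded groups are illustrative placeholders, not real BBS fields.
def prepare_bbs(rows, excluded_groups=("owls", "waterbirds", "shorebirds")):
    """rows: iterable of dicts with keys 'species', 'group', 'route',
    'year', 'stop_counts' (list of 50 stop counts), 'quality_ok' (bool).
    Returns one summed count per species x route x year."""
    out = []
    for r in rows:
        if r["year"] < 1997:               # step 2: complete 50-stop era only
            continue
        if r["group"] in excluded_groups:  # step 3: poorly sampled taxa
            continue
        if not r["quality_ok"]:            # step 4: BBS quality standards
            continue
        out.append({"species": r["species"], "route": r["route"],
                    "year": r["year"],
                    "count": sum(r["stop_counts"])})  # step 1: route total
    return out

# Toy example: the owl record and the pre-1997 record are both dropped.
rows = [
    {"species": "A", "group": "passerines", "route": 1, "year": 1998,
     "stop_counts": [1] * 50, "quality_ok": True},
    {"species": "B", "group": "owls", "route": 1, "year": 1998,
     "stop_counts": [1] * 50, "quality_ok": True},
    {"species": "A", "group": "passerines", "route": 1, "year": 1990,
     "stop_counts": [1] * 50, "quality_ok": True},
]
clean = prepare_bbs(rows)
```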

On a last note (regarding computing facilities), I wonder how fast our analysis would be if we decided to run it in the cloud. We could set up an instance on Amazon EC2 using their Tensor Core GPUs (e.g., Amazon EC2 P3), as I believe that would be much faster, or even on OCI. Of course, we'd still use R and would only have to transfer everything to the cloud for efficiency.

guifandos commented 3 months ago

It's probably not the first option for developing the models, but it might be something to consider for later stages. Another potential (but private) dataset is the Swiss breeding bird survey. The survey data are collected within the MHB (the Swiss common breeding bird survey), which covers 267 sampling sites (1 km2 cells) every year. See here for an example where we developed dynamic occupancy models to evaluate range dynamics.

nicholasjclark commented 3 months ago

Thanks @guifandos. Yes I've seen a lot of the work that Marc Kéry has done with the Swiss breeding bird survey. It certainly seems like a suitable dataset to explore. Would you know what procedures we'd need to go through to gain access to the data?

GitTFJ commented 2 months ago

I wonder if the zeros dilemma links back to the issue focussed on the theoretical predictions. I can't think of many real examples of this, but I could imagine that a given species may only exist in a location if a competing species is absent. For instance, in the UK, the only places we see native red squirrels are locations with very low densities of grey squirrels. So the zeros could be very informative across space/phylogeny. 91 million observations sounds nasty though...