what datasets?

lizzieinvancouver commented 1 year ago

Please share ideas on datasets. I was going to suggest lynx/hare as a classic but I think folks will think more mechanistically on that one (or creatively at least), and I think we want one where people will gravitate to some classic statistical model and then we push them to think more generatively ....

I can offer messy observational time-series of phenology data (I tossed in some examples in data/synchrony) or phenology from full factorial experiments, but I don't think these are great either.

lizzieinvancouver commented 1 year ago

@aammd suggested:

Perhaps a dataset on multi-species tree diameter (DBH) over time in permanent plots? This is a kind of bread-and-butter dataset for a lot of ecologists, and is full of realistic devils in the details: measurement error, time series, missing data, hierarchical structure in space, etc

hneyster commented 1 year ago

I agree @lizzieinvancouver -- nice to start with a dataset that doesn't have a well-known mechanistic/generative model behind it.

I like @aammd's suggestion of a DBH dataset. Here's the Harvard Forest tree data: https://harvardforest1.fas.harvard.edu/exist/apps/datasets/showData.html?id=hf264

A lot of my work is on birds. Probably the most widely used bird dataset is from the Swiss Breeding Bird Survey (here: https://www.vogelwarte.ch/en/projects/monitoring/monitoring-common-breeding-birds); the 2014 data is easily accessible in this R package: https://search.r-project.org/CRAN/refmans/AHMbook/html/data_MHB2014.html

Another neat dataset is this one on Golden-winged Warblers, showing that males and females use different habitats: https://osf.io/txy7w/ Could imagine a fun exercise starting with the full dataset and then noticing that the data are not all generated by the same process, and then adding parameters to reflect this.

A lot of my recent work has been with bird data on transects or point counts, with habitat covariates measured at multiple scales. These data are fun because they bring up lots of interesting spatial scale/scale of effect challenges, as well as lots of collinear predictors. Happy to share these if of interest.

aammd commented 1 year ago

Hi everyone! I always love and dread this stage of any course planning -- choosing a dataset. I have a suspicion that what makes something good for teaching is generality, but what makes ecological data interesting is all the specific details. Maybe it's impossible to do both -- or maybe we do one of each. That is, we might have some topics which we illustrate with a large, bland ecological dataset, and then dive into one specific dataset to show some specific things which are interesting, connecting them back to the bigger idea.

aammd commented 1 year ago

At any rate, here is some brainstorming

palmer penguins small, manageable, not THAT much to see, realistic
vegan::mite We used the mite data (species abundance data on different sites with space information and environmental predictors) and the penguin data in a course recently, and it was OK.
look at all this free data on animal traits! again, there's not that many variables to connect together but it's huge and fascinating
a friend recommended the Forest Inventory and Analysis (FIA) dataset. I like the Harvard Forest dataset suggested by @hneyster, however it only covers 3 consecutive years which might not be a lot for thinking about growth.

aammd commented 1 year ago

I asked about this on Twitter and got this absolute GOLDMINE of a list:

https://twitter.com/SMWadgymar/status/1699960393446076489

my favourites are the phenology dataset https://esajournals.onlinelibrary.wiley.com/doi/10.1002/ecy.3705

and the thermal tolerances dataset https://www.nature.com/articles/sdata201822

lizzieinvancouver commented 1 year ago

Thank you for this @aammd and @hneyster ! I blocked off this morning to work through potential datasets and am -- as forever -- behind. So I will just plop down some notes ....

I asked about how to download FIA data ... info below, but I think doing ONE plot, such as the Harvard Forest data linked above, would be better (I am also waiting to hear on the Mount Rainier data, where we might have some local knowledge of how the data were collected, for the data generating process).

I've used this site recently to download FIA data (https://apps.fs.usda.gov/fia/datamart/datamart.html). You can download data for individual states or the database for the entire US, but be forewarned the entire database is ~6 GB zipped. As you can imagine the entire database is kind of unwieldy and probably includes a lot of information you wouldn't need. In the work I've done previously, I've only ever needed the "PLOT" and "TREE" files - the former has plot location data (lat, long, elevation, etc.) and the latter has individual tree records within those plots.

Depending on what you need from the data, you might look at the rFIA package (https://rfia.netlify.app/), which has some useful functions to calculate some derived metrics (e.g., recruitment, growth, mortality, carbon stocks, etc.). The FIESTA package also has ways of extracting FIA data, but I find it much less user friendly (https://cran.r-project.org/web/packages/FIESTA/index.html).

If you need something straightforward (e.g., tree richness/diversity in plots, species lists for plots, individual species occurrences, etc.), I probably already have the raw data downloaded, and could provide it to you, just to save you the time and effort. Just let me know.

Regarding other public datasets....NEON comes to mind, and would have similar types of information as FIA, and might be more manageable. I haven't used it much in the past, but I know there are R packages to access it (https://cran.r-project.org/web/packages/neonUtilities/index.html).

lizzieinvancouver commented 1 year ago

... and thoughts in no real order:

Phenology of course would be easy for me, but I love the idea of tree growth stuff as I am trying to get into that some, and it does seem very bread and butter -- growth seems very basic, like something we would all be able to think of modeling.
We should yay or nay a meta-analysis as datasource (like the thermal tolerances). Feelings?
@hneyster @aammd Any bird or animal datasets that would include growth perhaps? Then we could have plant and animal growth....

With the penguin dataset, I just have to toss this in: https://www.science.org/content/article/flipper-bands-harm-penguins

betanalpha commented 1 year ago

Personally I would nay any meta-analysis data: the corresponding data generating processes are a bit awkward and typically less intuitive than direct measurements of some ecological process.

Phenology can be tricky depending on how complex we might want the modeling to be (survival modeling may be a bit much for an introductory workshop) but tree or animal growth could be interesting. I think the most important factor is whether or not anyone is familiar with the actual provenance of the data so that we can understand any substantial artifacts.

From the statistical perspective we’re looking for a data set along with an understanding of:

1) What is the structure of the data?
2) What process is the measurement targeting?
3) What other undesired processes influence the data?
4) How was the measurement supposed to be implemented?
5) How was the measurement actually implemented?

With the penguin dataset, I just have to toss this https://www.science.org/content/article/flipper-bands-harm-penguins in.

Measurement effects like these would be ideal to really emphasize that the data generating process models the entire measurement and not just the underlying ecological process. Capture-recapture is great for this as simple selection effects are often straightforward to incorporate into the underlying model.
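To make the capture-recapture point concrete, here is a minimal simulation sketch. It assumes a closed population, two capture occasions, and a single detection probability shared by all individuals; the population size, detection probability, and function names are made up for illustration and do not come from any dataset discussed in this thread. It uses the standard Lincoln-Petersen estimator to show how modeling the detection step recovers something the raw count misses.

```python
import random


def simulate_two_occasions(n_true, p_detect, rng):
    """Simulate two capture occasions for a closed population.

    Each individual is caught independently on each occasion
    with probability p_detect (an illustrative assumption).
    """
    first = [rng.random() < p_detect for _ in range(n_true)]
    second = [rng.random() < p_detect for _ in range(n_true)]
    n1 = sum(first)                                   # caught on occasion 1
    n2 = sum(second)                                  # caught on occasion 2
    m2 = sum(a and b for a, b in zip(first, second))  # caught on both
    return n1, n2, m2


def lincoln_petersen(n1, n2, m2):
    """Lincoln-Petersen abundance estimate from two occasions."""
    if m2 == 0:
        raise ValueError("no recaptures; estimator undefined")
    return n1 * n2 / m2


rng = random.Random(1)
N_TRUE = 1000      # hypothetical true population size
P_DETECT = 0.5     # hypothetical detection probability

n1, n2, m2 = simulate_two_occasions(N_TRUE, P_DETECT, rng)
seen_at_least_once = n1 + n2 - m2   # what a naive count would report
estimate = lincoln_petersen(n1, n2, m2)

print("true size:", N_TRUE)
print("individuals ever seen:", seen_at_least_once)
print("Lincoln-Petersen estimate:", round(estimate, 1))
```

The naive count can never exceed the true size and is typically well below it, while the estimate that accounts for imperfect detection should land much closer. A real analysis would need to check the closed-population and equal-detection assumptions against how the survey was actually run.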

lizzieinvancouver commented 1 year ago

@betanalpha I say we do animal or plant growth! But with the caveat that we might not get all the provenance info ... more and more ecological datasets are just posted online, and some of the provenance is in there, but not as much as when someone who collected the data is sitting there (but this is a reality of how the science is changing that I think we should accept). I love the penguin example, but selfishly working on the tree growth data would be way better for me. I will follow up with a colleague about one dataset for that ... and maybe we can also track down the penguin data as an alternative example?

betanalpha commented 1 year ago

Sure let’s go with tree growth data.

The ideal data would be as raw as possible to give us an opportunity to find artifacts and then adapt the model to accommodate them, since such artifacts are not uncommon. As much provenance as possible to understand those artifacts is ideal but we’ll work with whatever we can get.

lizzieinvancouver commented 1 year ago

@betanalpha Just a heads-up that I am working on this. I tracked down the data to here: https://pnwpsp.forestry.oregonstate.edu/data

My goal is to start us with ONE site of data across years -- we'll have repeat measurements of the same tree's diameter over years, with many trees measured in a stand (and multiple species). And I will try to get some climate data too. We can add other stands if we want. They also have summary data or mortality, but I think the simple growth question will get us closest to the RAW data, can easily be SUPER complex, and will also feel like something lots of folks can connect to. I'll be hoovering up 'metadata' info so we also better understand how the measurements were really taken. Comments on this approach/plan welcome!

I need some help on what best to pull ... I am trying to schedule a meeting with someone ASAP and then will report back.
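As a way to think generatively about the plan above before the real data arrive, here is a rough simulation sketch of repeated diameter censuses in one stand. Everything in it is an assumption for illustration: the census years, the growth and measurement-error values, the missed-census probability, and the function names are invented, not taken from the PSP data. It separates the underlying process (true diameter that never shrinks) from the measurement (noisy, sometimes missing).

```python
import random


def simulate_stand(n_trees, years, rng, growth_mean=0.2, growth_sd=0.1,
                   meas_sd=0.5, p_missing=0.1):
    """Simulate repeated DBH censuses (cm) for one stand.

    All default parameter values are made-up illustrations.
    """
    records = []
    for tree_id in range(n_trees):
        true_dbh = rng.uniform(10.0, 60.0)   # diameter at first census
        # Tree-level growth rate (cm/yr), floored at zero so the
        # true diameter never decreases.
        rate = max(0.0, rng.gauss(growth_mean, growth_sd))
        prev_year = years[0]
        for year in years:
            true_dbh += rate * (year - prev_year)
            prev_year = year
            if rng.random() < p_missing:
                observed = None              # tree missed in this census
            else:
                observed = true_dbh + rng.gauss(0.0, meas_sd)
            records.append({"tree": tree_id, "year": year,
                            "true_dbh": true_dbh, "obs_dbh": observed})
    return records


def count_apparent_shrinkage(records):
    """Count consecutive observed pairs where a tree appears to shrink."""
    last_obs = {}
    shrink = 0
    for rec in records:          # records are ordered by tree, then year
        obs = rec["obs_dbh"]
        if obs is None:
            continue
        prev = last_obs.get(rec["tree"])
        if prev is not None and obs < prev:
            shrink += 1
        last_obs[rec["tree"]] = obs
    return shrink


rng = random.Random(42)
census_years = [1980, 1985, 1990, 1995, 2000]   # hypothetical census years
records = simulate_stand(50, census_years, rng)
n_shrink = count_apparent_shrinkage(records)
print("records:", len(records), "apparent shrinkage events:", n_shrink)
```

Any apparent shrinkage in the simulated observations comes purely from measurement error, since the simulated true diameters never decrease. Real census data could have other sources as well, which is where the metadata on how measurements were taken would matter.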

lizzieinvancouver commented 1 year ago

Still working on this, some notes from chatting with Ailene today:

BAI is slightly less biased to size of tree ….

AE10 is high elevation site — climate data from Paradise (fairly close) and is fairly species poor

Lower in elevation (more species) is Longmire climate data (Ranger station) but climate data may not be as good …. Try AV06

Ailene has core data from some of these trees.
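For reference, a small sketch of the calculation, assuming BAI in the notes above means basal area increment and that diameters are DBH in cm. The function names and the example diameters are hypothetical. Basal area is treated as the area of a circle with the measured diameter, and the increment is the change in that area per year between two censuses.

```python
import math


def basal_area(dbh_cm):
    """Cross-sectional stem area (cm^2), treating the stem as a circle."""
    return math.pi * (dbh_cm / 2.0) ** 2


def basal_area_increment(dbh_start_cm, dbh_end_cm, n_years):
    """Mean annual basal area increment (cm^2 per year) between two censuses."""
    return (basal_area(dbh_end_cm) - basal_area(dbh_start_cm)) / n_years


# Two hypothetical trees with the same 1 cm diameter increment over 5 years.
small_tree_bai = basal_area_increment(10.0, 11.0, 5)
large_tree_bai = basal_area_increment(50.0, 51.0, 5)

print("small tree BAI (cm^2/yr):", round(small_tree_bai, 2))
print("large tree BAI (cm^2/yr):", round(large_tree_bai, 2))
```

The same diameter increment corresponds to a much larger area increment on the bigger tree, so diameter growth and BAI scale differently with tree size, which is worth keeping in mind when choosing the response variable.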

lizzieinvancouver commented 1 year ago

Okay! I suggest we use the data I just posted -- regularly censused tree data (mostly tree diameter) at Mount Rainier. Here's the acknowledgement to remember:

"Data [and/or facilities] were provided by the H.J. Andrews Experimental Forest and Long Term Ecological Research (LTER) program, administered cooperatively by Oregon State University, the USDA Forest Service Pacific Northwest Research Station, and the Willamette National Forest. This material is based upon work supported by the National Science Foundation under the grant LTER8 DEB-2025755."

Please include a data citation (including DOI) in your references. A data citation is provided on web pages for each of our datasets

And the data are uploaded here.

lizzieinvancouver commented 1 year ago

One more note for myself, from Janneke:

Alana worked with the PSP data at Mt. Rainier (published a paper on it too), so if you have any questions about the data she might also be an easier person to email. I'd also be curious to see what you all come up with, it's an amazing data set and there is a lot left to do with it...

lizzieinvancouver / PSPmountrainier

what datasets? #1