Estimated "true" repertoire as a predictor

tbendixen commented 1 year ago

Thanks for yet another instructive case study, Richard.

I was wondering whether there's a way to extend the example so that the estimated "true" repertoire size is used as a predictor in another model.

For instance, we might imagine a dataset composed of species (instead of individuals). Say we're interested in repertoire size and brain size at the level of species. However, the observed behavioral repertoire of any given species is always an imperfect measure; ideally, we'd want to estimate the "true" repertoire size (perhaps as a function of other variables, e.g. observed repertoire size, research effort, phylogeny, habitat, etc.) and plug that estimate (and its uncertainty) into our outcome model predicting brain size.

It seems conceptually linked to the measurement error and missing data models that are so neatly explained in Statistical Rethinking, although repertoire size is an integer and therefore would require a different computational approach from Gaussian variables.

Best wishes

rmcelreath commented 1 year ago

Yeah, it could be done. The trick would be to marginalize over the unknown repertoire size in the likelihood of the second model. This is like how population size models work.

One thing I worry about in generalizing to many species is the open-endedness of repertoire size. Would need to think carefully about a good prior family there, something like a Pitman–Yor process.
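
To make the marginalization idea concrete, here is a minimal sketch for a single species, not a worked-out model. Everything in it is an assumption for illustration: the true repertoire size N gets a Poisson prior (a placeholder, not a recommendation given the open-endedness concern), each behavior is detected independently with probability `p_detect`, brain size is Gaussian with mean `a + b * N`, and the sum over N is truncated at a hypothetical `n_max`.

```python
import math

def log_binom_pmf(k, n, p):
    """Log of Binomial(k | n, p), for 0 < p < 1."""
    if k < 0 or k > n:
        return -math.inf
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)

def log_poisson_pmf(n, lam):
    """Log of Poisson(n | lam)."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)

def log_normal_pdf(x, mu, sigma):
    """Log density of Normal(x | mu, sigma)."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(x - m) for x in xs))

def marginal_loglik(y_obs, brain, p_detect, lam, a, b, sigma, n_max=200):
    """log p(y_obs, brain | parameters) with the true repertoire size N summed out.

    N cannot be smaller than the observed repertoire, so the sum starts at y_obs.
    n_max is an assumed truncation point for the otherwise open-ended sum.
    """
    terms = [
        log_poisson_pmf(n, lam)                  # prior on true size N (assumed)
        + log_binom_pmf(y_obs, n, p_detect)      # measurement model
        + log_normal_pdf(brain, a + b * n, sigma)  # outcome model uses N as predictor
        for n in range(y_obs, n_max + 1)
    ]
    return logsumexp(terms)
```

Summing these per-species terms would give a log-likelihood that a sampler can work with, since no discrete parameter remains. The individual terms, once normalized, also give the posterior over N for each species.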

simeonqs commented 1 year ago

I have been working on something similar (though much simpler at the moment). There is a large literature on innovation frequency, which is the number of innovations observed in a taxon/species. The generative process is slightly different, but the modelling is probably very similar. I think there is a structural equation model solution, where the 'true' repertoire size in the first model is the predictor or response in the second model. The focus in my simulation so far is more on controlling for research effort though: https://github.com/simeonqs/research_effort. If anybody is interested, I really need to start documenting it. But I would also be very happy if Richard made my paper completely unnecessary!

tbendixen commented 1 year ago

@rmcelreath Thanks! I see, good point on the open-endedness. I wonder whether one work-around is to assume a maximum number of possible behaviors in the repertoire of a taxon (say, a literature review returned n identified hunting/foraging behaviors) and treat that as the number of "trials" in a binomial model (e.g., a species might have k out of n identified behaviors). Incidentally, that'd link to my question here on incorporating phylogeny in a binomial model: https://github.com/rmcelreath/stat_rethinking_2023/issues/7
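
As a rough sketch of that binomial work-around: k of n catalogued behaviors are recorded for a species, with the probability of recording a behavior on the logit scale. The predictor `effort` and the parameters `alpha` and `beta` are hypothetical names for illustration, and no phylogenetic structure is included.

```python
import math

def inv_logit(x):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def binomial_loglik(k, n, alpha, beta, effort):
    """Log-likelihood of k recorded behaviors out of n catalogued ones.

    n is the assumed fixed maximum from the literature review;
    effort is a hypothetical research-effort covariate.
    """
    p = inv_logit(alpha + beta * effort)
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)
```

Fixing n avoids the open-ended sum, at the cost of assuming that the literature review has already found every behavior that exists.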

@simeonqs Thanks for the link to your repo, I'll check it out!

rmcelreath commented 1 year ago

The issue that concerns me with a prior on behavioral repertoire is that the distribution is typically highly imbalanced: lots of rare behavior types. So those are hard to count. And then once you start worrying about sampling effort, the upper bound on the maximum repertoire needs some prior that knows about the imbalanced distribution. Does that make sense?

Very similar problem in estimating number of species in a community. So e.g.: https://doi.org/10.1034/j.1600-0706.2000.890320.x
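
A small sketch of why the imbalance matters for counting. It assumes, purely for illustration, that behavior frequencies follow a Zipf-like law with exponent `s` and that observations are independent draws. The expected number of distinct behaviors seen after `n_obs` observations is then the sum over types of 1 - (1 - p_i)^n_obs. The function names are made up for this example.

```python
def zipf_frequencies(n_types, s):
    """Relative frequencies proportional to 1 / rank**s (s = 0 gives a uniform repertoire)."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def expected_observed(n_types, n_obs, s):
    """Expected number of distinct behavior types seen in n_obs independent observations."""
    freqs = zipf_frequencies(n_types, s)
    # A type with frequency p is missed in all n_obs draws with probability (1 - p)**n_obs.
    return sum(1.0 - (1.0 - p) ** n_obs for p in freqs)

# Same true repertoire (30 types) and same effort (50 observations),
# balanced versus imbalanced frequencies.
balanced = expected_observed(30, 50, 0.0)
imbalanced = expected_observed(30, 50, 1.5)
```

With the same true repertoire and the same effort, the imbalanced case reveals fewer types in expectation, so the gap between observed and true size depends on the shape of the frequency distribution as well as on effort.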

tbendixen commented 1 year ago

Yeah, it does make sense -- even if it introduces fresh problems! For instance, something like Simpson's diversity index seems apt, but ideally we'd probably want to adjust for something like habitat, body size, etc. too, since some species are just easier to observe than others.
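
For reference, a minimal sketch of Simpson's index in its Gini-Simpson form, 1 - sum(p_i^2), computed from raw counts of each behavior type. It uses observed counts only and makes none of the adjustments for observability mentioned above; the function name is illustrative.

```python
def gini_simpson(counts):
    """Gini-Simpson diversity: probability that two random observations differ in type."""
    total = sum(counts)
    if total == 0:
        raise ValueError("need at least one observation")
    return 1.0 - sum((c / total) ** 2 for c in counts)
```

A repertoire dominated by one common behavior scores close to 0 even if it contains many rare types, which is the same imbalance problem seen from another angle.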

Anyway, thanks so much for taking the time, Richard!
