rsbivand / nem24_talk

Nordic Econometric Meeting 2024 talk 3 June

Comments on draft up-coming talk slides welcome! #1

Open rsbivand opened 5 months ago

rsbivand commented 5 months ago

@mikemahoney218 Could I ask whether you could give me a view on whether the Ames data set examples make any sense in this draft talk (I'm presenting this on Monday afternoon my time, so a bit late to elicit input now)? I've tried to be fair wrt waywiser and spatialsample, but as far as I can see, although spatialsample mitigates information flows between data subsets, I don't think we know how to improve the specification of spatial models, or predict for test/validation sets.

Are there other arxiv papers I've missed that you know of? The rendered version of the beamer slides is at: https://rsbivand.github.io/nem24_talk/. Any comments very welcome!

Might the 4-way Moran split be useful in waywiser in its current restricted form (row-standardised weights)? Using a permutation test is a bit trivial, but at least it's something.
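For readers unfamiliar with the permutation test mentioned here, a minimal sketch follows. It covers only global Moran's I with row-standardised weights, not the 4-way split itself (that is defined in the slides). The function names, the 20-site chain and the trend data are all illustrative assumptions, not spdep or waywiser code.

```python
import random

def row_standardise(neighbours):
    """Turn neighbour lists into row-standardised weights (each row sums to 1)."""
    return [{j: 1.0 / len(nb) for j in nb} for nb in neighbours]

def morans_i(x, weights):
    """Global Moran's I: (n / S0) * sum_ij w_ij z_i z_j / sum_i z_i^2."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    s0 = sum(w for row in weights for w in row.values())
    num = sum(w * z[i] * z[j] for i, row in enumerate(weights) for j, w in row.items())
    den = sum(v * v for v in z)
    return (n / s0) * num / den

def permutation_test(x, weights, n_sim=999, seed=1):
    """One-sided test: share of random relabellings with I >= the observed I."""
    rng = random.Random(seed)
    observed = morans_i(x, weights)
    shuffled = list(x)
    n_ge = 0
    for _ in range(n_sim):
        rng.shuffle(shuffled)
        if morans_i(shuffled, weights) >= observed:
            n_ge += 1
    return observed, (n_ge + 1) / (n_sim + 1)

# Hypothetical data: 20 sites on a line, each neighbouring the sites next to it,
# with a smooth trend in the values (strong positive autocorrelation).
n_sites = 20
neighbours = [[j for j in (i - 1, i + 1) if 0 <= j < n_sites] for i in range(n_sites)]
weights = row_standardise(neighbours)
values = [float(i) for i in range(n_sites)]
observed, p_value = permutation_test(values, weights)
print(round(observed, 3), p_value)
```

The pseudo p-value uses the usual (count + 1) / (n_sim + 1) form, so it can never be exactly zero.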

mikemahoney218 commented 5 months ago

On a first read-through, I think this is a very fair example! I think the quote from the tidymodels book is trying to distinguish the Ames data from data where observations are remeasured, because rsample provides group_* resampling methods for those situations. That said, while I don't think they were trying to deny any spatial autocorrelation in measured values, a straight reading definitely gives that impression.

The only notes I'd have regarding spatialsample is that I tend to have a strong predictive modeling focus, where I'm much more interested in predictive accuracy than in inference or model coefficients. This is why I tend to focus on machine learning approaches, and why our paper focuses on RMSE estimates (though Mila et al 2022 does a much better job of this). In that mindset, spatial CV is mostly useful to assess models that will need to extrapolate outside of the area where data was collected (or more abstractly, will need to predict on new data that wasn't generated by the same spatially autocorrelated process). The argument then is that spatial CV will give a better estimate of model performance than a single hold-out set could, as it's impossible to know if performance on a hold-out set is actually representative of general performance if we're assuming that our model is misspecified to the extent that we have spatially autocorrelated residuals. I think the ideal is that you then follow up with a map accuracy assessment, using an independent probability sample to assess mapped values, so hopefully your spatial CV estimates are useful guidance through the model development process, before your final assessment stage.
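To make the idea of spatial CV concrete, here is a minimal sketch of block-based fold assignment: points falling in the same square block always land in the same fold, so near neighbours are not split between training and assessment sets. The helper names `block_folds` and `splits`, the grid of points and the block size are assumptions for illustration; this is not spatialsample's API or its exact algorithm.

```python
import random

def block_id(x, y, block_size):
    """Index of the square block containing the point (x, y)."""
    return (int(x // block_size), int(y // block_size))

def block_folds(coords, block_size, n_folds, seed=1):
    """Assign each point to a fold according to the block it falls in."""
    blocks = sorted({block_id(x, y, block_size) for x, y in coords})
    rng = random.Random(seed)
    rng.shuffle(blocks)  # randomise which block goes to which fold
    fold_of_block = {b: k % n_folds for k, b in enumerate(blocks)}
    return [fold_of_block[block_id(x, y, block_size)] for x, y in coords]

def splits(folds, n_folds):
    """Yield (train_idx, test_idx) index lists, one pair per fold."""
    for k in range(n_folds):
        test = [i for i, f in enumerate(folds) if f == k]
        train = [i for i, f in enumerate(folds) if f != k]
        yield train, test

# Hypothetical data: a 10 x 10 grid of points, cut into four 5 x 5 blocks.
coords = [(float(i), float(j)) for i in range(10) for j in range(10)]
folds = block_folds(coords, block_size=5.0, n_folds=4)
fold_splits = list(splits(folds, n_folds=4))
for train_idx, test_idx in fold_splits:
    print(len(train_idx), len(test_idx))
```

The only tuning parameter in this sketch is `block_size`; larger blocks put more distance between training and assessment points but leave fewer distinct folds.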

I've actually found waywiser -- and specifically the autocorrelation metrics we wrap from spdep -- to be more useful for improving model specification. A big part of our model development workflow is mapping local spatial autocorrelation estimates, and using that visual guide to help identify areas that are "weird" for one reason or another. We then use our knowledge of the study area to try and guess what variables might explain the difference; as ecologists, we're often able to look at the local autocorrelation map and say "that looks like elevation" or "that looks like rainfall" (or, once, "that looks like where the National Forest comes over the border, I bet those trees are larger than the surrounding area"). It's not a useful approach if you're interested in inference, but for predictive modeling this can be a very useful workflow!
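As a rough illustration of that workflow, the sketch below computes one common form of local Moran's I on model residuals, I_i = (z_i / m2) * sum_j w_ij z_j with m2 = sum_i z_i^2 / n, and picks out the site where similar residuals cluster most strongly. The residuals, the chain of sites and the function name are made up for the example; in practice these values would be mapped rather than printed, and this is not the waywiser or spdep implementation.

```python
def local_morans_i(x, neighbours):
    """Local Moran's I per site, using row-standardised weights."""
    n = len(x)
    mean = sum(x) / n
    z = [v - mean for v in x]
    m2 = sum(v * v for v in z) / n
    local = []
    for i, nb in enumerate(neighbours):
        lag = sum(z[j] for j in nb) / len(nb)  # row-standardised spatial lag
        local.append(z[i] * lag / m2)
    return local

# Hypothetical residuals along a line of 12 sites: the model under-predicts
# in a small cluster (sites 4 to 6), as if a covariate were missing there.
residuals = [0.0, 0.0, 0.0, 0.0, 3.0, 3.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]
n_sites = len(residuals)
neighbours = [[j for j in (i - 1, i + 1) if 0 <= j < n_sites] for i in range(n_sites)]

local_i = local_morans_i(residuals, neighbours)
hotspot = max(range(n_sites), key=lambda i: local_i[i])
print(hotspot, [round(v, 2) for v in local_i])
```

Sites on the edge of the cluster get negative values (they differ from their neighbours), while the centre of the cluster gets the largest positive value, which is the kind of pattern a map would make visible.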

I'm not aware of any preprints in this space that would be useful; I honestly haven't kept up with the space recently while I've been focused on other projects.

The 4-way split is definitely interesting! I'm wondering how it might help in sizing blocks for spatialsample -- a common complaint about our approach to spatial CV is that there's no clear guidance on how big blocks should be. I wonder if there'd be a way to dynamically determine block size, so that you use the largest blocks that minimize trte?

Small note on slide "spatial diff-in-diff" -- small typo in bullet 2: "alaysis"

Thanks for looping me in -- is the talk being recorded? Would love to watch later if so :smile:

rsbivand commented 5 months ago

@mikemahoney218 Thanks for such a rapid and helpful response! I'll ask whether recording is possible, I don't think streaming is planned. I'll try to get back to you as the meeting starts on Sunday.

rsbivand commented 5 months ago

@mikemahoney218 I'm told that recording may be possible, I'll try tomorrow and see.

A similar case to your "identifiable spatial patterns" cases might be from geographically weighted regression on p.440 in https://doi.org/10.1111/1467-9884.00145 - the urban-rural variable's geographically-varying coefficient signalling a missing variable representing rural areas with a history of coal-mining.

I think that the need for row-standardisation isn't absolute, if all the components are multiplied by (n/S_0). Potentially, I_i could be partitioned too.
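For reference, the scaling argument can be written out as follows. This is a sketch under the assumption that the components $C_k$ of the split partition the cross-product sum in the numerator of Moran's I; the notation ($z_i$, $w_{ij}$, $C_k$, $m_2$) is introduced here and is not taken from the slides.

```latex
% Global Moran's I for deviations z_i = x_i - \bar{x} and weights w_{ij}
\[
  I = \frac{n}{S_0}\,
      \frac{\sum_{i}\sum_{j} w_{ij}\, z_i z_j}{\sum_{i} z_i^2},
  \qquad
  S_0 = \sum_{i}\sum_{j} w_{ij}.
\]
% Row-standardised weights give S_0 = n, so the factor n / S_0 equals 1.
% Assumed partition of the numerator into components C_k:
\[
  \sum_{i}\sum_{j} w_{ij}\, z_i z_j = \sum_{k} C_k
  \quad\Longrightarrow\quad
  I = \sum_{k} \frac{n}{S_0}\,\frac{C_k}{\sum_{i} z_i^2}.
\]
% One common form of local Moran's I, and its link to the global statistic:
\[
  I_i = \frac{z_i}{m_2} \sum_{j} w_{ij}\, z_j,
  \qquad
  m_2 = \frac{1}{n}\sum_{i} z_i^2,
  \qquad
  \sum_{i} I_i = S_0\, I.
\]
```

With any weights for which $S_0 > 0$, the scaled components still sum to $I$, which is why row-standardisation is a convenience rather than a requirement.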

More after I talk tomorrow; I'll let you know if the recording succeeds. Is an audio recording as a fallback any use?

rsbivand commented 5 months ago

This is the link to the recording: https://nhh.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=d26410ee-6243-48ce-96dd-b18400beb764

topepo commented 5 months ago

> I think the quote from the tidymodels book is trying to distinguish the Ames data from data where observations are remeasured, because rsample provides group_* resampling methods for those situations. That said, while I don't think they were trying to deny any spatial autocorrelation in measured values, a straight reading definitely gives that impression.

That's correct. We are trying to convey that we know of no non-spatial correlations. Spatial stats are not my forte, so I acquiesce to you folks on those matters.

I also give a warning in the new book where we use dissimilarity sampling to create a data set.

rsbivand commented 5 months ago

@topepo thanks for your understanding - in the recording I try to set the record straight(er). In taking this forward, using splits by design or natural splits to avoid spillover from training sets to sets intended to be independent, spatialsample and similar seem to work adequately. The other questions - about whether a spatially explicit model fitting engine might be justified to achieve a "better" fit on training data and about using diagnostics to point towards missing covariates in spatially aware feature engineering - are very likely to attract attention going forward, I hope.