Spatial Neighbours #65

Closed SamPassmore closed 2 years ago

SamPassmore commented 2 years ago

In our last meeting @QuentinAtkinson @blasid and @SimonGreenhill suggested we should look at how many neighbors each language would have within a radius of approximately 1000 km. I will display these results below, alongside the number of neighbors with a non-zero covariance. The results are histograms showing the number of languages that have N neighbors. Note that this is a count of neighbors within the sample of languages used in the analyses. Also note that the covariance matrix doesn't hit approximately zero at exactly 1000 km, hence the slight difference between the two sets of summary statistics.

Number of neighbors within 1000 km: [histogram screenshot]

Summary statistics:

> summary(counts$count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00   21.50   44.00   60.24   94.00  159.00 

From the earlier conversation, it was the lower end of the extremes that people wanted to investigate, so here is the count of languages that have 0 to 10 neighbors:

Neighbors:  0   1   2   3   4   5   6   7   8   9  10
Languages: 14   7  10  12  25  16  20  16  20  18  12

Number of neighbors with covariance greater than 0.01: [histogram screenshot]

> summary(counts$count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00   18.50   40.00   55.17   87.00  150.00 

Number of languages with 1 to 10 neighbors, based on covariance:

Neighbors:  1   2   3   4   5   6   7   8   9  10
Languages: 24  22  13  15  32  18  37  22  18  15

Here is a script to reproduce these results, which can be run from the R_Grambank directory of this repository:

## Load necessary information (packages, plus the sigma, kappa and lgs_in_analysis objects used below)
source("requirements.R")
source("spatiophylogenetic_modelling/analysis/INLA_parameters.R")
source("spatiophylogenetic_modelling/analysis/functions/varcov_spatial.R")

# Language locations, subset to languages that are in the tree and in the cropped GB data
locations_df = read.delim('output/non_GB_datasets/glottolog-cldf_wide_df.tsv', sep = "\t") %>%
  inner_join(lgs_in_analysis, by = "Language_ID")

### Covariance counts
# Build the spatial covariance matrix from the language coordinates
spatial_covar_mat = varcov.spatial(locations_df[,c("Longitude", "Latitude")],
                                   cov.pars = sigma,
                                   kappa = kappa)$varcov

# How many languages have a covariance value greater than tol?
tol = 0.01

# Convert the covariance matrix to long format: one row per (row, col) pair
long_lower = cbind(which(!is.na(spatial_covar_mat), arr.ind = TRUE),
                   na.omit(as.vector(spatial_covar_mat)))
colnames(long_lower) = c("row", "col", "covariance")
long_lower = data.frame(long_lower)

# How many pairs of languages have covariances greater than tol?
# Note: self-pairs (the diagonal) are not filtered out here, unlike in the
# distance counts below, so every language counts itself once.
counts = long_lower %>% 
  group_by(col) %>% 
  summarise(count = sum(covariance > tol))

hist(counts$count, breaks = 20)
table(counts$count)
summary(counts$count)

## Distance counts
# Build the pairwise Haversine distance matrix, converted from metres to km
spatial_dist_mat = distm(locations_df[,c("Longitude", "Latitude")], fun = distHaversine) / 1000

# How many languages are within a radius of dist_tol?
dist_tol = 1000 # radius in km

# Convert the distance matrix to long format: one row per (row, col) pair
dist_lower = cbind(which(!is.na(spatial_dist_mat), arr.ind = TRUE),
                   na.omit(as.vector(spatial_dist_mat)))
colnames(dist_lower) = c("row", "col", "distance")
dist_lower = data.frame(dist_lower)

# Count each language's neighbors within dist_tol, excluding the language itself
counts = dist_lower %>% 
  filter(col != row) %>% 
  group_by(col) %>% 
  summarise(count = sum(distance < dist_tol))

hist(counts$count, breaks = 20)
table(counts$count)
summary(counts$count)
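
On the note above that the covariance doesn't hit zero at exactly 1000 km, here is a minimal sketch of how the decay distance could be inspected. It assumes the sourced varcov.spatial uses geoR's Matérn kernel with the same cov.pars/kappa parameterisation, and that the kernel's distance unit is km; geoR::cov.spatial evaluates the covariance function at given distances:

library(geoR)

# Evaluate the covariance kernel on a grid of distances and find where it
# first drops below tol = 0.01. Distance units assumed to be km.
dists = seq(0, 2000, by = 10)
covs = cov.spatial(dists, cov.model = "matern", cov.pars = sigma, kappa = kappa)
dists[min(which(covs < 0.01))]  # first grid distance with covariance below tol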
SamPassmore commented 2 years ago

also pinging @HedvigS and @RustyGray

QuentinAtkinson commented 2 years ago

Well, it confirms what we suspected - it varies a lot. That's not a problem in itself, of course. I think this was suggested as a first step to evaluating whether we really ought to be using x nearest neighbours as the spatial covariance predictor. I think, based on this, that is still an option. What should x be? I don't know. 10? 20? 50?
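
A minimal sketch of what an x-nearest-neighbours definition could look like, reusing spatial_dist_mat from the script above (x and nearest_x are hypothetical names, not part of the existing analysis):

x = 10  # candidate neighbourhood size; 10, 20 and 50 are the values floated above

# order() sorts each row of pairwise distances; position 1 is the language
# itself (distance 0, assuming no duplicate coordinates), so positions
# 2..(x + 1) index the x nearest neighbours.
nearest_x = apply(spatial_dist_mat, 1, function(d) order(d)[2:(x + 1)])

# nearest_x is an x-by-N matrix: column j holds the row indices of
# language j's x nearest neighbours.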

SimonGreenhill commented 2 years ago

Thanks Sam -- I was worried that we would be averaging n=1 or something embarrassing. I think I prefer the k approach to n neighbors, as it's more elegant.

RustyGray commented 2 years ago

I agree. R.

QuentinAtkinson commented 2 years ago

What is the k approach? Is that what we've been doing?

SimonGreenhill commented 2 years ago

yep. Sorry, should have been clearer