r-spatial / gstat

Spatial and spatio-temporal geostatistical modelling, prediction and simulation
http://r-spatial.github.io/gstat/
GNU General Public License v2.0

krige() NA predictions increasing with larger nmax #50

Closed felipefmoreira closed 4 years ago

felipefmoreira commented 5 years ago

Hello, I've been playing around with a King County house dataset posted on Kaggle (link: https://www.kaggle.com/harlfoxem/housesalesprediction) and tried to predict house prices with kriging. The full code is shown below:

library(gstat)
library(sp)
library(ggplot2)
setwd("~/Documentos/R/Kriging")
dados = read.csv('kc_house_data.csv')

dados$price_sqft = dados$price / dados$sqft_living

ggplot(dados, aes(x = long, y = lat, size = price_sqft)) + 
  geom_point(alpha = 0.2) + scale_size_continuous(range = c(0.5, 5))

set.seed(1337)

treino_ind <- sample(seq_len(nrow(dados)), size = 0.80*nrow(dados))
dados_treino = dados[treino_ind,]
dados_teste = dados[-treino_ind,]

coordinates(dados_treino)  = ~ long + lat
coordinates(dados_teste) = ~ long + lat

vrg = variogram(log(price_sqft)~1, (dados_treino))

vrg

vrg.fit = fit.variogram(vrg, model = vgm(1, "Sph", 1, 1))
vrg.fit

plot(vrg, vrg.fit)

vrg.krige = krige(log(price_sqft)~1, dados_treino, dados_teste, 
      model = vrg.fit, nmax = 1)

At first, I ran the code without the nmax = 1 part, but I saw some posts advising a restriction on the number of points used in each prediction, so I added it. The problem is that as I increase the nmax parameter, the krige() function returns more and more NAs. For instance, with nmax = 3 there were 216 NAs; with nmax = 10, 766 NAs; with nmax = 50, 2568 NAs (to give a little perspective, the entire test set has about 4300 observations).

My question is: does that actually make sense, or is it a bug?

crabel99 commented 4 years ago

To answer your question: it makes sense. Here is an example to illustrate why (the first item is likely the solution you are looking for):

I've been building a large-area model of stratigraphic structures using several thousand cores covering a 2,500 square kilometre area. There were three issues I came across that needed to be resolved.

  1. First, I had to eliminate redundant data points to prevent the singularities that caused the NAs:

    library(sp)
    # Remove duplicates
    zd <- zerodist(data.elev)
    data.elev <- data.elev[-zd[,2], ]
    rm(zd)
  2. I had to subset the data to observations with values for each of the three structures I am modeling: when building the semi-variograms for co-kriging, gstat cannot handle missing data. I thought it had pseudo variogram capability, but unfortunately it does not. This turned out to be useful, because it divided my data into two roughly equal parts, one for fitting the model and one for testing; as a result, I was able to validate the models and select the one with the lowest BIC. The symptom of this issue was gstat simply aborting and kicking back an error.

  3. I had to implement local kriging. To implement it accurately, I needed a metric to assess which data points were statistically significant. I did this by saying that all points within 95% of the semi-variance were statistically significant. This allowed me to back out a "local" radius as a function of the variogram model's range; for a spherical model like mine, this ended up being 19/30 of the range. When implemented, it looks something like this for universal co-kriging. This problem didn't cause an error per se; it just meant that by doing non-local kriging I was unnecessarily increasing computation time. I'm modeling my structures at 1-arc-second resolution, so for my area I have north of 36 million points at which to estimate...

    
    library(gstat)
    # Estimate the best fit variogram model
    v.ml <- variogram( f.ml, data.elev )
    m.ml <- fit.variogram( v.ml, vgm("Sph") )
    plot(v.ml, m.ml)

Build the gstat structures for kriging:

    g <- gstat( NULL, "ml", f.ml, data.elev, nmin = 15, maxdist = m.ml$range[2] * 19/30, force = TRUE )
    g <- gstat( g, "nc", f.nc, data.elev, nmin = 15, maxdist = m.ml$range[2] * 19/30, force = TRUE )
    g <- gstat( g, "bc", f.bc, data.elev, nmin = 15, maxdist = m.ml$range[2] * 19/30, force = TRUE )
    v <- variogram( g )
    g <- gstat( g, id = "vgm", model = m.ml, fill.all = TRUE )
    g <- fit.lmc( v, g, fit.method = 6, correct.diagonal = 1.01 )
    plot( variogram(g), model = g$model )

Krige:

    k.c <- predict( g, point.est )
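As a hedged side note on the 19/30 factor (assuming the standard no-nugget spherical model): 19/30 is exactly 0.95/1.5, i.e. the lag at which the linear term of the spherical formula alone reaches 95% of the sill. Solving the full spherical formula numerically gives a somewhat larger fraction of the range:

```r
# Spherical variogram, no nugget: gamma(h) = c * (1.5*(h/a) - 0.5*(h/a)^3), h <= a
# Find x = h/a at which gamma reaches 95% of the sill c.
f <- function(x) 1.5 * x - 0.5 * x^3 - 0.95
x95 <- uniroot(f, c(0, 1))$root
x95         # ~0.81 of the range (full spherical formula)
0.95 / 1.5  # = 19/30 ~ 0.633 (linear-term-only approximation)
```

Which fraction to use depends on how conservative a search radius you want; both are just cutoffs on the same fitted model.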

edzer commented 4 years ago

These are probably cases where you have more than one observation at a single location; when such duplicates are present, this happens more often in prediction as the neighbourhood size is increased. In the absence of a reproducible example, I'm closing this.
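A minimal sketch of this effect, on synthetic data (all values hypothetical): with a no-nugget model, a duplicated observation location makes the local covariance matrix singular as soon as nmax is large enough to put both copies in the same neighbourhood, and gstat then skips the location with an NA, which is why the NA count grows with nmax:

```r
library(sp)
library(gstat)

set.seed(1)
obs <- data.frame(x = runif(20), y = runif(20), z = rnorm(20))
obs <- rbind(obs, obs[1, ])   # duplicate one observation location
coordinates(obs) <- ~ x + y

new <- data.frame(x = 0.5, y = 0.5)
coordinates(new) <- ~ x + y

m <- vgm(psill = 1, model = "Sph", range = 0.5)  # note: no nugget

# nmax = 1: the duplicate pair never enters the same neighbourhood -> valid prediction
k1 <- krige(z ~ 1, obs, new, model = m, nmax = 1)
# nmax = 21: both copies enter the covariance matrix -> singular -> NA, with a
# "Covariance matrix singular at location ...: skipping" warning
k21 <- krige(z ~ 1, obs, new, model = m, nmax = 21)
```

Removing the duplicates first with zerodist(), as shown earlier in the thread, avoids the singularity.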

MichaelGAttia commented 2 years ago

I have the same problem with gstat in R. I checked whether there is any duplication in my data but didn't find any; however, I still get this worrying warning:

".......: In predict.gstat(g, newdata = newdata, block = block, ... : Covariance matrix singular at location [427455,1.07679e+006,0]: skipping..."

and NA predictions for a lot of points. So I wonder whether there are any other solutions to get all the values.
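One thing worth checking in this situation (a hedged sketch on synthetic data): coordinates that are close but not byte-identical are not flagged by zerodist() with default settings, yet can still make the covariance matrix numerically singular. The zero argument of zerodist() sets a distance tolerance:

```r
library(sp)

set.seed(42)
pts <- data.frame(x = runif(50), y = runif(50), z = rnorm(50))
# add a near-duplicate: same location up to a 1e-9 offset in x
pts <- rbind(pts, data.frame(x = pts$x[1] + 1e-9, y = pts$y[1], z = 0))
coordinates(pts) <- ~ x + y

zerodist(pts)               # exact duplicates only: finds nothing
zerodist(pts, zero = 1e-6)  # with tolerance: flags the near-duplicate pair
```

Points flagged this way can then be removed the same way as exact duplicates. Adding a small nugget to the fitted variogram model sometimes avoids the numerical singularity as well, but removing near-duplicates is the cleaner fix.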