Closed ailich closed 12 months ago
Here's an example of how it could be used
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62
rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE, maskCheck = TRUE, ...){
if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
ellip <- list(...)
if(nComp > nlyr(x)) nComp <- nlyr(x)
if(!is.null(nSamples)){
trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
model <- princomp(trainData, scores = FALSE, cor = spca)
} else {
if(maskCheck) {
totalMask <- !sum(app(x, is.na))
if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
}
covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
model <- princomp(covmat = covMat[[1]], cor=spca)
model$center <- diag(covMat$mean)
model$n.obs <- global(!any(is.na(x)), sum)$sum
if(spca) {
## Calculate scale as population sd like in in princomp
S <- diag(covMat$covariance)
model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
}
}
## Predict
out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
names(out) <- paste0("PC", 1:nComp)
return(structure(list(call = match.call(), model = model, map = out)))
}
#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's
plot(r)
PCA<- rasterPCA(r, spca = TRUE)
summary(PCA$model)
#> Importance of components:
#> Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
#> Standard deviation 1.065035 1.060924 1.0353062 1.0232500 1.0026015
#> Proportion of Variance 0.113430 0.112556 0.1071859 0.1047041 0.1005210
#> Cumulative Proportion 0.113430 0.225986 0.3331719 0.4378759 0.5383969
#> Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
#> Standard deviation 0.9866337 0.98439181 0.96219138 0.95073612 0.91861112
#> Proportion of Variance 0.0973446 0.09690272 0.09258123 0.09038992 0.08438464
#> Cumulative Proportion 0.6357415 0.73264422 0.82522545 0.91561536 1.00000000
plot(PCA$map)
Created on 2023-11-30 with reprex v2.0.2
I am not opposed but not convinced either given the simplicity of this example in ?terra::predict
library(terra)
logo <- rast(system.file("ex/logo.tif", package="terra"))
### principal components of a SpatRaster
pca <- princomp(logo)
# or use sampling if you have a large raster
# and cannot process all cell values
sr <- spatSample(logo, 100000, "regular")
pca <- prcomp(sr)
x <- predict(logo, pca)
plot(x)
@rhijmans, I think this method doesn't require sampling for large rasters because it calls layerCor
and then works off the covariance matrix. That is unless layerCor
needs to read in the whole dataset to memory.
I added a princomp<SpatRaster>
method. Please have a look. I think it is best to have model building and prediction in separate steps as one might want to inspect the model and because this is standard R practice.
Thanks @rhijmans, I think splitting it up into model and prediction makes sense and like the idea of adding a dedicated princomp
method in terra
. I do have a few comments though.
stats::princomp
needs to have cor=cor
to link to what the user specified, and it should have ...
for other arguments that can be passed to it. Since currently that line only says model <- princomp(covmat = xcov$covariance)
, cor
and other user specified arguments will not be passed along.NA
values need to be masked from all layers prior to calculating the covariance matrix with LayerCor
. In RSToolbox the comment said "NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite"$n.obs
in model output (should be number of non-NA cells)library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62
rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE, maskCheck = TRUE, ...){
if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
ellip <- list(...)
if(nComp > nlyr(x)) nComp <- nlyr(x)
if(!is.null(nSamples)){
trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
model <- princomp(trainData, scores = FALSE, cor = spca)
} else {
if(maskCheck) {
totalMask <- !sum(app(x, is.na))
if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
}
covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
model <- princomp(covmat = covMat[[1]], cor=spca)
model$center <- diag(covMat$mean)
model$n.obs <- global(!any(is.na(x)), sum)$sum
if(spca) {
## Calculate scale as population sd like in in princomp
S <- diag(covMat$covariance)
model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
}
}
## Predict
out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
names(out) <- paste0("PC", 1:nComp)
return(structure(list(call = match.call(), model = model, map = out)))
}
princomp_edit<- function(x, cor=FALSE, ...) {
stopifnot(nlyr(x) > 1)
xcov <- layerCor(x, fun="cov", na.rm=TRUE)
model <- princomp(covmat = xcov$covariance, cor=cor, ...)
model$center <- diag(xcov$mean)
if (cor) {
## Calculate scale as population sd like in in princomp
S <- diag(xcov$covariance)
n <- diag(xcov$n)
model$scale <- sqrt(S * (n-1) / n)
}
model
}
#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's
r_df<- as.data.frame(r)
r_df<- na.omit(r_df) # If you do not omit NA's from the dataframe, you get Error in cov.wt(z) : 'x' must contain finite values only
# PCA cor=FALSE
m1<- princomp(r_df) # This is on a dataframe so it serves as a reference to the correct output
m1$n.obs
#> [1] 859
m1$sdev
#> Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
#> 1.0826671 1.0704703 1.0429269 1.0255518 1.0028366 0.9855091 0.9820035 0.9674348
#> Comp.9 Comp.10
#> 0.9487587 0.9176803
m2<- rasterPCA(r)$model
m2$n.obs
#> [1] 859
m2$sdev
#> Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
#> 1.0832979 1.0710939 1.0435344 1.0261493 1.0034208 0.9860832 0.9825756 0.9679984
#> Comp.9 Comp.10
#> 0.9493114 0.9182149
m3<- terra::princomp(r)
m3$n.obs
#> [1] NA
m3$sdev
#> Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
#> 1.0745181 1.0494377 1.0409172 1.0199349 1.0037053 1.0022756 0.9830014 0.9800315
#> Comp.9 Comp.10
#> 0.9710888 0.9527138
# PCA cor=TRUE
m4<- terra::princomp(r, cor=TRUE)
m4$sdev==m3$sdev #cor= TRUE did not change result
#> Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
#> TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
m5<- princomp_edit(r, cor=TRUE)
m5$sdev==m3$sdev #Edited version properly passes along cor argument
#> Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
#> FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
Created on 2023-12-01 with reprex v2.0.2
I think the [call needs to stats::princomp] needs to have cor=cor
Yes, thanks, fixed that.
it should have ... for other arguments
The only other meaningful argument is fix_sign
, so I have added that explicitly.
I believe NA values need to be masked from all layers prior to calculating the covariance matrix with LayerCor.
I do not think that is strictly necessary. Our best estimate of the covariance matrix would use all pairs of data that are not NA. We can add a note in the manual that you can do mask(x, anyNA(x), maskvalue=TRUE)
if you want the same number of observations for all layers.
It may affect the computation of the scale (if cov=TRUE
) as that uses (n-1) / n
. In most cases n is very large with raster data so I am not worried about that. But again that can be pointed out.
Assign $n.obs in model output (should be number of non-NA cells)
I put it back in. However, there is not one n
as it can vary between layers. So you now get a "dist" object that shows that.
Thank you!
@rhijmans, although RSToolbox
used the population standard deviation to calculate scale, on further investigation that may have not been the correct choice (See example and discussion here). That being said though, as you mentioned rasters tend to have a lot of observations, so it likely doesn't make much difference.
I would prefer to use model$scale <- sqrt(S)
over model$scale <- sqrt(S * (n-1) / n)
because the value of n
is ambiguous if the number of non-NA cells varies.
The population covariance can be computed with asSample=FALSE
.
xcov <- layerCor(x, fun="cov", na.rm=TRUE, asSample=FALSE)
I think it generally makes more sense to to treat raster data as the population and not as a sample. It could also be an option that may lead to oh so small differences for small rasters.
Anyway, this is confusing stuff, and I would appreciate your guidance.
@rhijmans, since the "scale" should be the standard deviation and the S (the diagonal of the covariance matrix) is the variance, the square root of that would be appropriate. The pairwise n's must still be used to calculate the covariance matrix so it's not really circumventing that, but we it does avoid multiplying by (n-1)/n
which was just there to convert the sample standard deviation to the population standard deviation since
$$\text{Cov Population (x,y)}=\frac{\sum{(x{i}-\bar{x})(y{i}-\bar{y}})}{N}$$
$$\text{Cov Sample(x,y)}=\frac{\sum{(x{i}-\bar{x})(y{i}-\bar{y}})}{N-1}$$
That being said, I think it is slightly better as you suggested to use asSample=FALSE
and scale=sqrt(S)
since the covariance matrix and the scale are both calculated with the population formula rather than having a covariance matrix based on the sample formula and a scale based on the population formula.
I am unsure however why results of layerCor(asSample=TRUE)
multiplied by (n-1)/n
doesn't more closely match the results of layerCor(asSample=FALSE)
library(terra)
#> terra 1.7.62
cov_pop <- function(data) {
# Calculate the means of each variable
means <- colMeans(data)
# Number of observations
n_obs <- nrow(data)
# Number of variables
n_vars <- ncol(data)
# Initialize an empty covariance matrix
covariance_matrix <- matrix(0, nrow = n_vars, ncol = n_vars)
# Calculate the covariance for each pair of variables
for (i in 1:n_vars) {
for (j in 1:n_vars) {
covariance_matrix[i, j] <- sum((data[, i] - means[i]) * (data[, j] - means[j])) / n_obs
}
}
return(covariance_matrix)
}
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
r_df<- as.data.frame(r)
n<- ncell(r) #No NA's in this example
cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSamle=TRUE)
cov_pop_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSamle=FALSE)
cov_samp_R<- cov(r_df)
cov_pop_R<- cov_pop(r_df)
# Terra has differences on order of e-05
max((cov_samp_terra$covariance*(n-1)/n)-cov_pop_terra$covariance)
#> [1] 3.426207e-05
# R has differences on order of e-16
max((cov_samp_R*(n-1)/n)-cov_pop_R)
#> [1] 2.220446e-16
Created on 2023-12-06 with reprex v2.0.2
b <- rast(system.file("ex/logo.tif", package="terra"))
layerCor(b, "pearson")
x <- layerCor(b, "cov")
y <- layerCor(b, "cov", asSample=F)
x[["covariance"]] * (x[["n"]]-1)/x[["n"]] == y[["covariance"]]
# red green blue
#red TRUE TRUE TRUE
#green TRUE TRUE TRUE
#blue TRUE TRUE TRUE
or with your approach using asSample
(and not asSamle
;)
cov_pop_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=FALSE)
cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=TRUE)
max((cov_samp_terra$covariance*(n-1)/n)-cov_pop_terra$covariance)
#[1] 2.220446e-16
but it does avoid multiplying by (n-1)/n
Yes, that is the n
I was thinking about. In the covariance computation, n
is computed for each pair.
Oh that was dumb of me 🤦♂️🤦♂️🤦♂️. Running it again, it all looks good! Since you created the n
matrix they're both pairwise and the scale will come out the same. I think it's cleaner and more consistent though to use asSample=FALSE
and $scale<- sqrt(S)
as you mentioned.
library(terra)
#> terra 1.7.62
#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's
cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=TRUE)
cov_pop_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=FALSE)
n_mat<- cov_samp_terra$n
n<- diag(n_mat)
max((cov_samp_terra$covariance*(n_mat-1)/n_mat) - cov_pop_terra$covariance)
#> [1] 6.938894e-18
n # n is a vector of pairwise sample sizes corresponding to the length of S
#> [1] 1497 1497 1509 1496 1493 1489 1506 1505 1507 1501
S1<- diag(cov_samp_terra$covariance)
scale1<- sqrt(S1*(n-1)/n)
S2<- diag(cov_pop_terra$covariance)
scale2<- sqrt(S2)
scale1==scale2
#> lyr.1 lyr.2 lyr.3 lyr.4 lyr.5 lyr.6 lyr.7 lyr.8 lyr.9 lyr.10
#> TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Created on 2023-12-06 with reprex v2.0.2
Thanks so much for adding this!
Thank you!
I have changed argument na.rm=TRUE/FALSE
to use="everything"/"complete.obs"/ "pairwise.complete.obs"/"masked.obs"
@rhijmans, an added bonus is I believe this also fixed a bug in layerCor
when calculating "cov" with maxcell
specified. In the latest commit I get approximately the same results whereas in the CRAN and a few commits ago I get half the covariance if sampling half the cells so it seems like the number of cells rather than the sample size was used.
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62
r<- rast(nrow=40, ncol=40, nlyrs=2)
set.seed(5)
values(r[[1]])<- rnorm(ncell(r))
values(r[[2]])<- values(r[[1]])*2+rnorm(ncell(r))
cov_full<- layerCor(r, fun="cov", maxcell=Inf)
cov_half<- layerCor(r, fun="cov", maxcell=ncell(r)/2)
cov_half$covariance
#> lyr.1 lyr.2
#> lyr.1 0.541334 1.058303
#> lyr.2 1.058303 2.621351
cov_full$covariance
#> lyr.1 lyr.2
#> lyr.1 1.019472 2.029023
#> lyr.2 2.029023 5.057638
cov_half$covariance/cov_full$covariance
#> lyr.1 lyr.2
#> lyr.1 0.5309946 0.5215826
#> lyr.2 0.5215826 0.5182955
df_half<- spatSample(r, 800, method="regular")
cov(df_half)
#> lyr.1 lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929
Created on 2023-12-07 with reprex v2.0.2
library(terra)
#> terra 1.7.62
r<- rast(nrow=40, ncol=40, nlyrs=2)
set.seed(5)
values(r[[1]])<- rnorm(ncell(r))
values(r[[2]])<- values(r[[1]])*2+rnorm(ncell(r))
cov_full<- layerCor(r, fun="cov", maxcell=Inf)
cov_half<- layerCor(r, fun="cov", maxcell=ncell(r)/2)
cov_half$covariance
#> lyr.1 lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929
cov_full$covariance
#> lyr.1 lyr.2
#> lyr.1 1.019472 2.029023
#> lyr.2 2.029023 5.057638
cov_half$covariance/cov_full$covariance
#> lyr.1 lyr.2
#> lyr.1 1.0107862 0.9928697
#> lyr.2 0.9928697 0.9866125
df_half<- spatSample(r, 800, method="regular")
cov(df_half)
#> lyr.1 lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929
Created on 2023-12-07 with reprex v2.0.2
Since "cov" and maxcell
seems to be working, the only extra thing I'd add to this function is allowing a specification of maxcell
in princomp
. By default that would be maxcell=Inf
but could be changed. The only thing that would need to be done is adding a maxcell
argument to princomp
and passing that to layerCor
via model <- princomp(covmat=xcov$covariance, cor=cor, fix_sign=fix_sign, maxcell=maxcell)
.
I did not include the maxcell
argument because I figured that you might as well use princomp<data.frame>
at that point; but I have added it now anyway.
It is interesting to see how few you cells you need with this (unrealistically simple) example
library(terra)
f <- system.file("ex/logo.tif", package = "terra")
r <- rast(f)
pca <- princomp(r)
pca100 <- princomp(r, maxcell=100)
plot(predict(r, pca), predict(r, pca100))
@rhijmans, that for adding this function -- I think it will be helpful for many.
Do you also plan to add some updates to prcomp
? Currently, princomp
works for data with NAs, but prcomp
not:
library(terra)
#> terra 1.7.65
f <- system.file("ex/logo.tif", package = "terra")
r <- rast(f)
r[100] <- NA
# works
pca <- princomp(r)
# fails (due to NA's)
pca2 <- prcomp(r)
#> Error in svd(x, nu = 0, nv = k): infinite or missing values in 'x'
From the princomp help file (?princomp
), it appears that the prcomp
should be preferred, in general:
"The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp."
@Nowosad, the point about prcomp
vs princomp
was raised in bleutner/RSToolbox#49 where it was stated that "prcomp
is not a suitable choice for large remote sensing data sets as it operates on X as opposed to princomp
which operates on the correlation or covariance matrix, both of which can be computed in a piecewise fashion. I.e. for prcomp you'd have to load all data into memory at once, while for princomp you can first compute the covariance matrix and the run eigen on it, which is not only faster, but more importantly memory-safe. In the end the results of both methods are equal. While numerical differences can exist, with SVD being more accurate, this is generally negligible compared to the measurement uncertainty inherent in typical remote sensing data."
If there are no NAs in SpatRaster x
you can do
prcomp(x)
Otherwise you need something like
maxcell <- Inf
v <- na.omit(spatSample(x, maxcell, "regular"))
prcomp(v)
I could add a prcomp
method that implements that to avoid that people trip over the NA issue.
@rhijmans @Nowosad, it could be useful to add a prcomp
method to make it a bit smoother for those who want to use that version of PCA. Seems like SpatRasters work with prcomp
by default since objects are automatically coerced to a matrix in prcomp
.
@ailich thanks for the explanation -- I was suspecting something like this, but was unable to find any discussion on the topic. And yes -- my thinking was to make prcomp
and princomp
more consistent (e.g., both allowing for NA values in input rasters and both having the maxcell argument). I believe this was done a few hours ago by @rhijmans -- thank you a lot. I plan to test it in the coming days.
I used to use
RSToolbox
for conducting PCA's on raster data, but it relies on theraster
package and is no longer actively maintained (I actually can't get it to install anymore). Since PCA is a common procedure it might be nice to have directly interra
like how kmeans was recently added. I've taken the code from their function and converted it toterra
code and it appears to work with the development version.