isciences / exactextractr

R package for fast and accurate raster zonal statistics
https://isciences.gitlab.io/exactextractr/
274 stars 26 forks source link

inconsistent behavior in exact_extract() #49

Closed dimitrijoe closed 3 years ago

dimitrijoe commented 3 years ago

I was exploring the function exact_extract() and its arguments, when I noticed a strange behavior by switching on the argument include_cell:

library(raster)
library(sf)
library(exactextractr)

set.seed(1)

# sample raster
r <- raster(xmn=0, xmx=10, ymn=0, ymx=10, ncols=10, nrows=10, vals = sample(1:3, size = 10*10, replace = T))

# sample feature collection
g <- st_make_grid(
        x = st_as_sfc(st_bbox(r)),
        cellsize = 2,
        what = "polygons",
        square = T
     )

# proportion of cells with 1's inside each feature with include_cell = F
res1 = exact_extract(r, g, include_cell = F, fun = function(value, fraction) {
  sum(fraction[value == 1]) / sum(fraction) }
)

# proportion of cell with 1's inside each feature with include_cell = T
res2 = exact_extract(r, g, include_cell = T, fun = function(value, fraction) {
  sum(fraction[value == 1]) / sum(fraction) }
)

# res1 & res2 should be equal, and indeed they are expcet for the weird NA in row 21
cbind(res1, res2)
> cbind(res1, res2)
      res1 res2
 [1,] 0.50 0.50
 [2,] 0.25 0.25
 [3,] 0.25 0.25
 [4,] 0.25 0.25
 [5,] 0.25 0.25
 [6,] 0.75 0.75
 [7,] 0.25 0.25
 [8,] 0.00 0.00
 [9,] 0.50 0.50
[10,] 0.50 0.50
[11,] 0.25 0.25
[12,] 0.00 0.00
[13,] 0.75 0.75
[14,] 0.00 0.00
[15,] 0.00 0.00
[16,] 0.50 0.50
[17,] 0.75 0.75
[18,] 0.25 0.25
[19,] 0.50 0.50
[20,] 0.00 0.00
[21,] 0.50   NA
[22,] 0.75 0.75
[23,] 0.25 0.25
[24,] 0.00 0.00
[25,] 0.25 0.25
dbaston commented 3 years ago

When you specify include_cell = TRUE, then values becomes a data.frame with columns value and cell. In your code, sum(fraction[value == 1]) will be true if the cell value OR the cell number is equal to one. And when the cell number is equals to one there is no corresponding entry in fraction, hence the NA.

I think this would be much less confusing if the summary function just accepted a single data frame, but it doesn't because I was originally borrowing the signature of the built-in weighted.mean.

dimitrijoe commented 3 years ago

I don't fully follow. When cell number is 1, I'm seeing fraction equals 1 (no surprise since every cell in my raster is, by construction, covered by my feature collection). The plot below shows this and (more importantly), the 1st row in data frame in the 21st element of the full object below also shows this. What is "special" about that 1st row is that both cell number AND value are equal to 1. As per your statement, value == 1 should evaluate to TRUE, no?

I might be missing something in your explanation, but I guess the point is that it's hard to expect this behavior from the existing documentation in R.

# plot    
plot(r, )
plot(st_geometry(g), border = "blue", add = T)
text(as(g, "Spatial"), 1:25)

image

# proportion of cell with 1's inside each feature with include_cell = T
full = exact_extract(r, g, include_cell = T)
> full
[[1]]
  value cell coverage_fraction
1     1   81                 1
2     1   82                 1
3     2   91                 1
4     2   92                 1

[[2]]
  value cell coverage_fraction
1     1   83                 1
2     3   84                 1
3     2   93                 1
4     3   94                 1

[[3]]
  value cell coverage_fraction
1     2   85                 1
2     1   86                 1
3     2   95                 1
4     2   96                 1

[[4]]
  value cell coverage_fraction
1     1   87                 1
2     3   88                 1
3     3   97                 1
4     3   98                 1

[[5]]
  value cell coverage_fraction
1     3   89                 1
2     3   90                 1
3     3   99                 1
4     1  100                 1

[[6]]
  value cell coverage_fraction
1     1   61                 1
2     3   62                 1
3     1   71                 1
4     1   72                 1

[[7]]
  value cell coverage_fraction
1     3   63                 1
2     2   64                 1
3     1   73                 1
4     3   74                 1

[[8]]
  value cell coverage_fraction
1     3   65                 1
2     3   66                 1
3     2   75                 1
4     3   76                 1

[[9]]
  value cell coverage_fraction
1     2   67                 1
2     3   68                 1
3     1   77                 1
4     1   78                 1

[[10]]
  value cell coverage_fraction
1     3   69                 1
2     1   70                 1
3     2   79                 1
4     1   80                 1

[[11]]
  value cell coverage_fraction
1     2   41                 1
2     1   42                 1
3     3   51                 1
4     2   52                 1

[[12]]
  value cell coverage_fraction
1     3   43                 1
2     2   44                 1
3     2   53                 1
4     2   54                 1

[[13]]
  value cell coverage_fraction
1     1   45                 1
2     1   46                 1
3     2   55                 1
4     1   56                 1

[[14]]
  value cell coverage_fraction
1     3   47                 1
2     2   48                 1
3     2   57                 1
4     2   58                 1

[[15]]
  value cell coverage_fraction
1     2   49                 1
2     3   50                 1
3     2   59                 1
4     2   60                 1

[[16]]
  value cell coverage_fraction
1     3   21                 1
2     1   22                 1
3     2   31                 1
4     1   32                 1

[[17]]
  value cell coverage_fraction
1     1   23                 1
2     1   24                 1
3     3   33                 1
4     1   34                 1

[[18]]
  value cell coverage_fraction
1     1   25                 1
2     2   26                 1
3     3   35                 1
4     2   36                 1

[[19]]
  value cell coverage_fraction
1     1   27                 1
2     1   28                 1
3     2   37                 1
4     2   38                 1

[[20]]
  value cell coverage_fraction
1     2   29                 1
2     2   30                 1
3     2   39                 1
4     3   40                 1

[[21]]
  value cell coverage_fraction
1     1    1                 1
2     3    2                 1
3     3   11                 1
4     1   12                 1

[[22]]
  value cell coverage_fraction
1     1    3                 1
2     2    4                 1
3     1   13                 1
4     1   14                 1

[[23]]
  value cell coverage_fraction
1     1    5                 1
2     3    6                 1
3     2   15                 1
4     2   16                 1

[[24]]
  value cell coverage_fraction
1     3    7                 1
2     2    8                 1
3     2   17                 1
4     2   18                 1

[[25]]
  value cell coverage_fraction
1     2    9                 1
2     3   10                 1
3     3   19                 1
4     1   20                 1
dbaston commented 3 years ago

Here's what your summary function for res2 looks like in the debugger for the offending geometry:

As I mentioned, value is a two-column data frame:

Browse[1]> value
  value cell
1     1    1
2     3    2
3     3   11
4     1   12

And fraction is a vector:

Browse[1]> fraction
[1] 1 1 1 1

Despite value being a data frame, R is allowing you to treat it like a vector, implicitly combining the columns end-to-end.

Browse[1]> which(value == 1)
[1] 1 4 5

So 1 and 4 come from value$value == 1, and 5 comes from value$cell == 1. This is fairly meaningless.

And when you index into fraction with value 5, there is no corresponding entry, so R returns `NA.

Browse[1]> fraction[value == 1]
[1]  1  1 NA

To back up a bit, when you write the code below, what are you expecting include_cell = TRUE to do?

# proportion of cell with 1's inside each feature with include_cell = T
res2 = exact_extract(r, g, include_cell = T, fun = function(value, fraction) {
  sum(fraction[value == 1]) / sum(fraction) }
)
dimitrijoe commented 3 years ago

Got it. Makes more sense now. This reminds me that I need to learn how to debug properly.

What was I expecting include_cell = TRUE to do? As I said in my original post, I was exploring the function a bit since it's the first time I try to use it. But since you asked, I was trying to get a solution you posted to work for a similar problem I have (I still don't understand why it doesn't work, and I was planning to post that on SO, but here it goes):

> library(raster)
> library(sf)
> library(exactextractr)
> 
> set.seed(1)
> 
> # set up
> r <- raster(xmn=0, xmx=10, ymn=0, ymx=10, ncols=10, nrows=10, vals = sample(1:3, size = 10*10, replace = T))
> g <- st_make_grid(
+   x = st_as_sfc(st_bbox(r)),
+   cellsize = 2,
+   what = "polygons",
+   square = T
+ )
> 
> # from S0
> library(dplyr)
> library(purrr)
> library(tidyr)
> 
> freqs <- exact_extract(r, g, function(value, coverage_fraction) {
+   data.frame(value = value,
+              frac = coverage_fraction / sum(coverage_fraction)) %>%
+     group_by(value) %>%
+     summarize(freq = sum(frac))
+ }) %>%
+   apply(2, function(x) pivot_wider(data.frame(x), 
+                                    names_from = value,
+                                    values_from = freq)) %>%
+   bind_rows()
  |================================================================================================================================================================================================| 100%
 Error: Can't subset columns that don't exist.
x Column `value` doesn't exist.
Run `rlang::last_error()` to see where the error occurred. > 
> freqs[is.na(freqs)] <- 0    
Error in freqs[is.na(freqs)] <- 0 : object 'freqs' not found
> 

Back to my original post. My point was that it's even hard to know what to expect from include_cell = TRUE. The documentation does not tell us what exactly is passed to fun. All I could find in the documentation was "If TRUE and fun is not NULL, add cell to the data frame passed to fun for each feature." But I don't know the name of the data frame, the name of other columns, and if there's anything else other than this data frame being passed to fun. So again, I was trying stuff out, hoping I could learn how to use the function. I noticed that the results didn't change by include_cell = TRUE, except for one weird NA. That's all.

Great job on the package, by the way. Once I learn how to use it properly, I'm sure I'll get stuff done more quickly and more accurately.

dbaston commented 3 years ago

This reminds me that I need to learn how to debug properly.

Not sure my technique is the best one, but you can just drop a call to browser() in your code where you want to break.

I was trying to get a solution you posted to work for a similar problem I have

I've updated that solution to work for the current package version.

But I don't know the name of the data frame, the name of other columns, and if there's anything else other than this data frame being passed to fun

Agreed, this could use some clarification. It's useful to hear where the stumbling points are, so I appreciate you taking the time to write them up clearly.

dimitrijoe commented 3 years ago

Great, thank you so much for the clarifications.

What I take away from this is that the documentation for the case that fun is a custom summary function could be improved. As is, it is hard for users to write their own functions since they don't know/control what are the arguments that are being passed to fun.