r-lidar / lidR

Airborne LiDAR data manipulation and visualisation for forestry application
https://CRAN.R-project.org/package=lidR
GNU General Public License v3.0
582 stars 130 forks source link

two metrics can't have the same value unless certain container types are used #714

Closed twest820 closed 11 months ago

twest820 commented 11 months ago

metrics_template.R has a couple checks of the form

  if (any(duplicated(metrics[[1]])))
    stop("Duplicated elements found. At least one of the metrics was not a number. Each metric should be a single number.", call. = FALSE)

at least one of which is easily triggered when two or more metrics returned by a user provided metrics function have the same value.

Some review of the sources and a bit of testing suggests this is an area where reliability, performance, and documentation improvements could be made. It's unclear why lidR 4.0.3 performs the duplicated([[1]]) checks but it seems likely they're not functioning as intended since it's frequently expected two or metrics will likely have the same value in at least some areas of a LiDAR tile. Since the repeated value's most likely a scalar number the error message raised is also likely to be incorrect. Additionally, the failure seems to be slow in the sense logging shows the metric function can be called and return duplicate values many times before the duplicated() check fires. This is inefficient since, if tile processing is going going to fail, it might as well fail when duplicate values are first encountered rather than continuing for some time and then failing.

Repex is trivial:

metrics_zero = function()
{
  return(c(value1 = 0, value2 = 0)) # list(value1 = 0, value2 = 0) also breaks
}

pixel_metrics(someLidarTile, ~metrics_zero())

c() and list() are natural choices when writing metrics functions as the stdmetrics implementations use those to return their values (suggesting duplicated() may not be getting called consistently across all metrics outputs). By inference from the lidR source's use of data.table() I've found data.table(value1 = 0, value2 = 0) provides a workaround—tibble(value1 = 0, value2 = 0) does too but is 3.8x slower—which raises questions over whether the apparent requirement metrics functions return data.table()s imposes a compute penalty over alternative packages such as c().

Jean-Romain commented 11 months ago

I can't see any problem here, the code behaves as expected. If you return c(0,0) you return a vector not a list and the doc explicitly mention you must return a list. The function throws the correct error:

Error : Duplicated elements found. At least one of the metrics was not a number. Each metric should be a single number.

If you return a list list(0,0,) it works

twest820 commented 11 months ago

If you return a list list(0,0,) it works

Nope. See repex above.

What the documentation states is

f = function(x) {list(mean = mean(x), max = max(x))}

correctly formed. It does not state use of a list is required nor, as noted above, does lidR necessarily use lists internally.

Pragmatically speaking, even if the docs did say a list was required it wouldn't matter because lidR's current code sometimes requires a list not be used.

Jean-Romain commented 11 months ago

Nope. See repex above.

Your code is not a reprex. Anyway I ran it and it works with a list

It does not state use of a list is required nor, as noted above,

It does

Parameter func

The function to be applied to each cell is a classical function (see examples) that returns a labelled list of metrics. For example, the following function f is correctly formed. f = function(x) {list(mean = mean(x), max = max(x))}

Pragmatically speaking, even if the docs did say a list was required it wouldn't matter because lidR's current code sometimes requires a list not be used.

Show me an example. There is indeed one case when a single number is returned. You do not need to wrap a single number in a list. Yet, if you want your raster to get layer name you must use a named list.

twest820 commented 11 months ago

if you want your raster to get layer name you must use a named list.

Nope, only containers I've tried so far which work are data.table() or tibble(). As indicated, the duplicated() checks fail out a named list.

Show me an example.

See repex.

Jean-Romain commented 11 months ago

A data.frame in R is simply a list. So, it might work with a single line data.frame (thus also data.table, and tibble). If you have multiple lines it should fail. I did not test actually. And your reprex (which is not a reprex) does work. Show me a real minimal reproducible example.

library(lidR)
LASfile <- system.file("extdata", "Megaplot.laz", package="lidR")
las = readLAS(LASfile)
metrics_zero = function()
{
  return(list(value1 = 0, value2 = 0)) 
}
pixel_metrics(las, ~metrics_zero())
#> class       : SpatRaster 
#> dimensions  : 13, 12, 2  (nrow, ncol, nlyr)
#> resolution  : 20, 20  (x, y)
#> extent      : 684760, 685000, 5017760, 5018020  (xmin, xmax, ymin, ymax)
#> coord. ref. : NAD83 / UTM zone 17N (EPSG:26917) 
#> source(s)   : memory
#> names       : value1, value2 
#> min values  :      0,      0 
#> max values  :      0,      0
library(lidR)
LASfile <- system.file("extdata", "Megaplot.laz", package="lidR")
las = readLAS(LASfile)
metrics_zero = function()
{
  return(data.frame(value1 = 0, value2 = 0)) 
}
pixel_metrics(las, ~metrics_zero())
#> class       : SpatRaster 
#> dimensions  : 13, 12, 2  (nrow, ncol, nlyr)
#> resolution  : 20, 20  (x, y)
#> extent      : 684760, 685000, 5017760, 5018020  (xmin, xmax, ymin, ymax)
#> coord. ref. : NAD83 / UTM zone 17N (EPSG:26917) 
#> source(s)   : memory
#> names       : value1, value2 
#> min values  :      0,      0 
#> max values  :      0,      0
library(lidR)
LASfile <- system.file("extdata", "Megaplot.laz", package="lidR")
las = readLAS(LASfile)
metrics_zero = function()
{
  return(data.frame(value1 = c(0,0), value2 = 0)) 
}
pixel_metrics(las, ~metrics_zero())
#> Error: Duplicated elements found. At least one of the metrics was not a number. Each metric should be a single number.

Everything looks good.