ptompalski / LidarSpeedTests

Evaluating the performance of various point cloud processing tools.
https://ptompalski.github.io/LidarSpeedTests/
GNU General Public License v3.0

results for tree detection erroneous? #1

Open wiesehahn opened 3 months ago

wiesehahn commented 3 months ago

Hey Piotr, great you are putting this together!

Regarding https://github.com/ptompalski/LidarSpeedTests?tab=readme-ov-file#detecting-treetops I am wondering if there is some mistake? Here lidR appears to be orders of magnitude faster than lasR, however as mentioned by @Jean-Romain in https://github.com/r-lidar/lasR/discussions/38 they should be more or less on par. It also looks strange that processing time goes up with an increasing number of cores at 20 pts/m², while this is not the case at lower densities.

(https://github.com/ptompalski/LidarSpeedTests?tab=readme-ov-file#multiple-metrics looks suspicious to me as well)

Jean-Romain commented 3 months ago

Several results are either erroneous or should be reported urgently while lasR is in development. For example, those 4 tests are not comparable, so the differences are meaningless. This is especially true since the benchmarks were conducted with v0.5.1, and even more so with v0.7.0, where I simply deactivated the parallelization of stages with injected R code.

```
lidR::pixel_metrics(..., mean(Z))
lasR::rasterize(mean(Z))
lasR::rasterize("z_mean")
lascanopy -i *.las -step 20 -avg -odir output -otif -cores 4 -buffered 20
```
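For illustration only, a more directly comparable set of calls might look like the sketch below. The 20 m resolution is taken from the lascanopy command; wiring the lasR stages through exec() and passing the resolution as the first argument of rasterize() are assumptions and may not match the benchmark code exactly.

```r
library(lidR)
library(lasR)

# hypothetical input: a folder of .laz tiles
files <- list.files("las_folder", pattern = "\\.laz$", full.names = TRUE)

# lidR: metric computed with injected R code, results held in memory
ctg  <- readLAScatalog(files)
lidr <- pixel_metrics(ctg, ~mean(Z), res = 20)

# lasR: the same metric via injected R code (the path whose parallelization
# was deactivated in v0.7.0)...
lasr_r <- exec(rasterize(20, mean(Z)), on = files)

# ...versus the native C++ "z_mean" operator
lasr_native <- exec(rasterize(20, "z_mean"), on = files)
```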

If the results of normalization and Detecting treetops are correct and reproducible, there is an urgent need to report them. However, the normalization result is not reproduced in Influence of drive/storage type, which looks fine. There is something suspicious.

Jean-Romain commented 3 months ago

Quick test on 4 LAZ files at 12 pts/m².

```r
library(lidR)
library(lasR)

f = c("/home/jr/Documents/Ulaval/ALS data/BCTS//092L072224_BCTS_2.laz",
      "/home/jr/Documents/Ulaval/ALS data/BCTS//092L072222_BCTS_2.laz",
      "/home/jr/Documents/Ulaval/ALS data/BCTS//092L073111_BCTS_2.laz",
      "/home/jr/Documents/Ulaval/ALS data/BCTS//092L073113_BCTS_2.laz")

ctg = readLAScatalog(f)
ctg
#> class       : LAScatalog (v1.2 format 1)
#> extent      : 885293, 888882.9, 632916.1, 635780 (xmin, xmax, ymin, ymax)
#> coord. ref. : NAD83 / BC Albers
#> area        : 10.28 km²
#> points      : 127.34 million points
#> density     : 12.4 points/m²
#> density     : 8.9 pulses/m²
#> num. files  : 4
```

Single core:

```r
t0 = Sys.time()
lidr = locate_trees(ctg, lmf(5))
t1 = Sys.time()
lasr = exec(local_maximum(5), on = f, progress = T)
t2 = Sys.time()

difftime(t1, t0)
#> Time difference of 1.35 mins
difftime(t2, t1)
#> Time difference of 1.68 mins
```

With 4 cores:

```r
t0 = Sys.time()
future::plan(future::multisession(workers = 4))
lidr = locate_trees(ctg, lmf(5))
t1 = Sys.time()
lasr = exec(local_maximum(5), on = f, progress = T, ncores = 4)
t2 = Sys.time()

difftime(t1, t0)
#> Time difference of 56 secs
difftime(t2, t1)
#> Time difference of 66.6 secs
```

lasR is 20% slower, but we are not actually comparing the same tasks. lidR holds everything in memory while lasR stores results on disk and reads them back. With this code, lidR would not be able to handle thousands of km².

The following is closer to the actual lasR processing:

```r
t0 = Sys.time()
future::plan(future::multisession(workers = 4))
opt_output_files(ctg) = paste0(tempdir(), "/*_ttops")
lidr = locate_trees(ctg, lmf(5))
ans = lapply(lidr, sf::st_read)
u = Reduce(rbind, ans)
t1 = Sys.time()

difftime(t1, t0)
#> Time difference of 57.4 secs
```

The timing is not significantly different and lidR is still 20% faster. But lasR and lidR are still not performing the same tasks.

lidR produced 4 independent shapefiles. In each shapefile the trees are labelled with an ID from 1 to n. Labels are duplicated because the files are processed independently, with no way to know the latest ID used in the other files in order to assign consecutive numbers. Moreover, in this specific collection the files overlap a bit, which means that in addition to duplicated IDs we have duplicated trees in the overlaps.

lasR produced one single GeoPackage file in which each tree is guaranteed to get a unique ID, even when processing in parallel, and duplicated trees in overlaps are guaranteed to be handled properly and removed.

In summary, the lidR output must be post-processed. It is not hard if the dataset is not too big, but for very large datasets it is harder. lasR is designed for processing massive datasets and you get a nice and clean output out of the box. This explains at least part of the 20% difference.
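To make concrete what that post-processing involves, here is a minimal sketch (not the benchmark code): it merges the per-file shapefiles written by locate_trees, renumbers the tree IDs globally, and drops near-duplicate treetops in file overlaps. The 0.5 m duplicate threshold is an arbitrary assumption.

```r
library(sf)

# merge the shapefiles written by locate_trees() to tempdir()
files <- list.files(tempdir(), pattern = "_ttops\\.shp$", full.names = TRUE)
ttops <- do.call(rbind, lapply(files, st_read, quiet = TRUE))

# IDs restart at 1 in every file, so renumber them globally
ttops$treeID <- seq_len(nrow(ttops))

# drop duplicated treetops in the overlaps: for every pair of points
# closer than 0.5 m, keep only the first one
near  <- st_is_within_distance(ttops, dist = 0.5)
dup   <- vapply(seq_along(near), function(i) any(near[[i]] < i), logical(1))
ttops <- ttops[!dup, ]
```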

Comparing lasR and lidR at strictly equal capacities is actually very hard.

Jean-Romain commented 3 months ago

@ptompalski I'd appreciate it if you did not publicly post non-validated benchmarks. Several of your tests are either suspicious or invalid. And if the benchmarks are validated, you should include an explanation of why they behave the way they do.

For example, yes, local_maximum is slower (by 20%, not by an order of magnitude), but there are good reasons for it to be slower. If some results are consistently weird, please report them on the lasR repo. The local maximum result is weird. Maybe the behavior diverges with 100 files instead of 4 and a bigger dataset? In that case it should be reported.

ptompalski commented 3 months ago

Thank you both! I agree - the results are suspicious; they were generated a while ago (lasR v0.5.1) and need to be reviewed. I will run the benchmarks again (for lasR only) using the newest version.

What I do and what I post in this repo is a work in progress. My intention is to help @Jean-Romain with lasR development by performing a series of long tests that allow testing the behavior of various lidar processing tools when a larger dataset is used, when e.g. 32 cores are used, or when the density is higher than typical (although 50 pts/m² data are not yet included).

What I will do:

There are also some tasks that I still haven't tested, especially the ones that combine multiple tasks together. This is where lasR should shine.
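For reference, a combined pipeline of the kind meant here might look roughly like the sketch below. It only chains stages already mentioned in this thread; the 20 m resolution and the assumption that rasterize()'s first argument is the resolution are mine, not taken from the planned benchmark.

```r
library(lasR)

# hypothetical input: a folder of .laz tiles
files <- list.files("las_folder", pattern = "\\.laz$", full.names = TRUE)

# chain several stages into one pipeline so the point cloud is read only once
pipeline <- rasterize(20, "z_mean") + local_maximum(5)

ans <- exec(pipeline, on = files, ncores = 4, progress = TRUE)
```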

Jean-Romain commented 3 months ago

And I appreciate your help. Sounds like a plan. Thank you.

Jean-Romain commented 3 months ago

Just putting this here for your information: local_maximum is slower mainly because my code for writing the GeoPackage file is slow. This is an improvement I should make.