r-lidar / lidR

Airborne LiDAR data manipulation and visualisation for forestry application
https://CRAN.R-project.org/package=lidR
GNU General Public License v3.0

Contributing a CBH (crown base height) estimation algorithm (Luo et al. 2018). #506

Closed: mavavilj closed this issue 2 years ago

mavavilj commented 2 years ago

Contributing a CBH (crown base height) estimation algorithm (Luo et al. 2018, https://www.osapublishing.org/oe/fulltext.cfm?uri=oe-26-10-A562&id=385987).

May be further developed based on user experience, but is usable currently, given a suitable point cloud.

Jean-Romain commented 2 years ago

I really appreciate your attempt to include a new algorithm in lidR and I'm keen to accept it. But introducing new code into an R package does not consist only of putting a standalone R script in a new file. It must be integrated, documented, validated, tested, and must follow R package rules, which is not the case for your code. Moreover I see

library(TreeLS)
library(pracma)
library(equate)

Which means that 3 more dependencies must be added to lidR, which is a definitive no on my side. Can you write the code without these packages? If not, you will have to create a new package on your own, but you can put it in lidRplugins if you want.

mavavilj commented 2 years ago

REGARDING STRUCTURE:

I thought it would satisfy requirements for a "one function file":

https://r-pkgs.org/r.html#code-organising


REGARDING DELIVERY:

I was comparing it to this one in the repo:

https://github.com/r-lidar/lidR/blob/master/R/concaveman.R


REGARDING REQUIRED PACKAGES:

Maybe, maybe not. I thought TreeLS was closely related enough to lidR that I could use it:

https://github.com/tiagodc/TreeLS/blob/9033693503ee71ace2335a6b8fa2dce864df37cd/DESCRIPTION

being fully compatible with the 'LAS' infrastructure of 'lidR'

The functions I use are ones that might see use elsewhere in lidR anyway, like finding zeros of functions (from pracma).

The equate package is used for the percentile rank (function px), which cannot be replaced directly, but could perhaps be rewritten.
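For reference, a percentile rank in the spirit of equate's px can be sketched in base R; the exact tie and boundary conventions of px would need to be checked against the package before swapping this in.

```r
# Midpoint percentile rank: the fraction of values below x plus half
# the fraction equal to x. Ties get the average rank.
percent_rank <- function(x) {
  (rank(x, ties.method = "average") - 0.5) / length(x)
}

percent_rank(c(10, 20, 20, 30))
# [1] 0.125 0.500 0.500 0.875
```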


I can put it into lidRplugins if it's better.

Jean-Romain commented 2 years ago

I thought it would satisfy requirements for a "one function file":

This is not about being in one file. It's about your function not being documented, not being exported, using statements that are invalid in the code of an R package (use of library() or the ::: operator), not being unit-tested, not calling functions by namespace, and so on.

Maybe, maybe not. I thought TreeLS is closely related to lidR, so that I could use it:

TreeLS is not on CRAN and thus cannot be a dependency of lidR. And even if it were, TreeLS depends on lidR, so lidR cannot depend on TreeLS (cyclic dependency).

I can put it into lidRplugins if it's better.

The problems I'm highlighting must be resolved first.

mavavilj commented 2 years ago

I don't know if you've discussed this with the TreeLS author, though, because these projects seem to have quite a bit of overlap, with TreeLS being more geared towards individual trees (?), which might mean it's actually the better place for this function. But lidR says it's also for forestry applications.

Jean-Romain commented 2 years ago

I personally know the guy behind TreeLS. I have no news from him, but the last time we exchanged he said he was no longer working with lidar. His package was removed from CRAN a year ago because he never fixed the problems reported by CRAN. There is no chance the package will come back, imo. He hasn't answered a single issue on his GitHub for months. The project is abandoned. Here are your 3 best options:

mavavilj commented 2 years ago

Okay, I will see what I can do. I'm aiming for the first one, but I'm not sure unit testing is trivial for such a function (do you have any ideas how to unit test on point clouds?). It's not possible to test it on all kinds of inputs anyway, so verifying that the results are reasonable should be left up to the user.

bi0m3trics commented 2 years ago

I just glanced at the code (ok, looked line by line) and failed to see a TreeLS function (I'm pretty familiar with that code and package), so I must have missed it. What TreeLS function(s) are you using?

Jean-Romain commented 2 years ago

(do you have any ideas how to unit test on point clouds?)

Run it on example data where you are sure it works as expected. Test how many trees are found. Test if errors are properly handled. Test if weird parameters trigger a warning or an error. Take inspiration from https://github.com/r-lidar/lidR/blob/devel/tests/testthat/test-segment_trees.R

If your function is a tree segmentation function it must be compatible with segment_trees. I can help for this step.
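As a sketch of the test shapes being described, using a stand-in pure function rather than the actual CBH code (the real tests would run on a bundled example LAS file, as in test-segment_trees.R):

```r
library(testthat)

# Stand-in for the CBH estimator: any deterministic function of the
# height vector of one tree suffices to illustrate the test pattern.
cbh_stub <- function(z) {
  if (length(z) == 0) stop("empty point cloud")
  if (any(z < 0)) warning("negative heights")
  stats::quantile(z, 0.25, names = FALSE)
}

test_that("output is reproducible for a fixed input", {
  z <- c(1, 2, 5, 8, 12)
  expect_equal(cbh_stub(z), 2)
})

test_that("edge cases throw the documented conditions", {
  expect_error(cbh_stub(numeric(0)), "empty")
  expect_warning(cbh_stub(c(-1, 3, 4)))
})
```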

Jean-Romain commented 2 years ago

Last but not least, I'm only accepting algorithms that are decently fast (and published in a peer-reviewed paper, but that one is :heavy_check_mark:)

mavavilj commented 2 years ago

I just glanced at the code (ok, looked line by line) and failed to see a TreeLS function (I'm pretty familiar with that code and package), so I must have missed it. What TreeLS function(s) are you using?

https://www.rdocumentation.org/packages/TreeLS/versions/2.0.2/topics/readTLS

(do you have any ideas how to unit test on point clouds?)

Run it on example data where you are sure it works as expected. Test how many trees are found. Test if errors are properly handled. Test if weird parameters trigger a warning or an error. Take inspiration from https://github.com/r-lidar/lidR/blob/devel/tests/testthat/test-segment_trees.R

If your function is a tree segmentation function it must be compatible with segment_trees. I can help for this step.

Did you even read what the code does? It operates on the point cloud of a single tree (e.g. a single segment). But based on my experiments it can produce wildly varying results depending on the quality of the input (quality of segmentation, quality of separation from understory vegetation, point density, ...).

The reason I asked is that testing the "robustness" of this is non-trivial. But I've already tried to cover the error cases. I've used this function successfully on 18 trees, and on these it has not produced any kind of error at critical points of the derivatives or the like.

Jean-Romain commented 2 years ago

TreeLS::readTLS is only a wrapper around lidR::readLAS plus a wrapper around data.table::fread and lidR::LAS. Yet I can't see any occurrence of readTLS in your code.

Did you even read what the code does?

No, I've only seen that in many places the code is not valid R code for an R package according to CRAN policy. Actually, it is very hard to understand what it does just by reading the code on GitHub. I need to read the article first. Remember that your code is not documented, so I was not able to refer to the docs or to a reproducible example. So I don't even know what it does.

The reason I asked is that testing "robustness" of this is non-trivial.

The goal is not to test the robustness of the tree detection method. I presume you ran that testing for your article. The goal is to test that the output is always the same for a given input and that code modifications in lidR do not break anything, that the function does not crash unexpectedly, especially in edge cases, that the function returns the errors it is expected to throw, and so on...

mavavilj commented 2 years ago

readTLS is used in the example for reading in data (Line 21):

https://github.com/r-lidar/lidR/pull/506/files#diff-c247f3a6f16f12ee6eb80aaea88009eba95d3c194f7c2e30d55b43f92f9313bcR21

Yes, but I find it's also non-trivial to know all the cases where the algorithm could fail.

I've recognized some of them, but I'm not entirely sure I've recognized all of them. The paper does not explain the exceptions that might occur in a programmatic implementation. Some of them are related to e.g. "0 point intervals" that can occur where the point cloud is sparse in places. But it could also fail e.g. inside some external function used to process derivatives, or e.g. in the construction of the splines, unless these are fully tested as well.

I will try to make these changes though.

mavavilj commented 2 years ago

Can you clarify which of the CRAN conventions apply to this one function? I assume I'm not required to produce the equivalent of a "full package", since I only have one function.

So I would guess it's enough to do:

https://r-pkgs.org/man.html#man-functions

But do I e.g. need package metadata?

https://r-pkgs.org/description.html

Jean-Romain commented 2 years ago

First link. The DESCRIPTION file is the metadata of the package; you should never touch it. Your function must be associated with roxygen comments to generate the package manual. You can look at the other R files of the package for examples. If it is not perfect, that is ok, I can fix some errors. But the parameters and the goal of the function must be well documented. I can fix stuff, but I can't write the doc.

Also, before you start work, I think we should talk about your plan. Remember what I said: among other things, your function must be compatible with segment_trees() if it is a segmentation function.
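A minimal roxygen skeleton of the expected shape, with an illustrative name and a placeholder body (not the actual algorithm or its real parameters):

```r
#' Crown base height of a single tree
#'
#' Estimates the crown base height from the heights of the points of one
#' segmented tree, following Luo et al. (2018).
#'
#' @param z numeric. Height values of the points of one tree.
#' @param dh numeric. Width of the vertical bins in meters.
#'
#' @return A numeric scalar: the estimated crown base height in meters.
#' @references Luo et al. (2018), Optics Express 26(10), A562.
#' @export
cbh_luo2018 <- function(z, dh = 0.5) {
  stopifnot(is.numeric(z), length(z) > 0, dh > 0)
  # ... real algorithm goes here; placeholder so the skeleton runs:
  min(z)
}
```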

mavavilj commented 2 years ago

What should I do with findzeros() from pracma? How am I supposed to replace it without relying on a new dependency?

https://github.com/cran/pracma/blob/c1688b374d201c13fb40b4dda2d2a89e34b94ec6/R/findzeros.R#L6

It seems that I could simply copy that code from there into my own file, but I wonder if that is a desirable way to include it?

Jean-Romain commented 2 years ago

Copy-pasting the code from pracma is not a desirable way, for sure. I quickly looked into the paper, especially figure 4, and the problem to solve is easy. It's a 10-lines-of-code problem and I think using findzeros is overkill. In particular, the following in your code is overkill and unnecessarily slow:

pred <- function(x) stats:::predict.smooth.spline(fit, x)$y
pred1stderiv <- function(x)  stats:::predict.smooth.spline(fit, x, deriv=1)$y
pred2ndderiv <- function(x)  stats:::predict.smooth.spline(fit, x, deriv=2)$y

ddzeros = findzeros(pred2ndderiv,0,slice_max,n=(slice_max)/dh)
d_at_zeros = pred1stderiv(ddzeros)

It could become something like

x <- seq(0, 30, 0.5)
pred         <- stats::predict(fit, x)$y
pred1stderiv <- stats::predict(fit, x, deriv = 1)$y
pred2ndderiv <- stats::predict(fit, x, deriv = 2)$y

zeros <- numeric(0)
for (i in 2:length(pred2ndderiv)) {
  # There is a zero approximately at (x[i-1] + x[i])/2
  if (sign(pred2ndderiv[i - 1]) != sign(pred2ndderiv[i]))
    zeros <- c(zeros, (x[i - 1] + x[i]) / 2)
}

After looking at the paper, I now understand that we are talking about a metric derived for each tree after tree segmentation. Consequently your code should be compliant with the "metrics" concept of lidR and should work somehow like this:

tree_metrics(las, func = ~cbh_lu02008(X,Y,Z))
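For what it's worth, the sign-change scan sketched above can also be vectorized; a self-contained toy in which -sin(x) stands in for the sampled second derivative of the spline:

```r
# Midpoints of the intervals where a sampled function changes sign
# approximate its zeros; here -sin(x) plays the role of pred2ndderiv.
x  <- seq(0.1, 6, by = 0.01)
d2 <- -sin(x)

i     <- which(diff(sign(d2)) != 0)  # sign flips between x[i] and x[i+1]
zeros <- (x[i] + x[i + 1]) / 2       # one approximate zero per flip
# The only zero of -sin on (0.1, 6) is at pi, and zeros recovers it
# to within the grid spacing.
```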

mavavilj commented 2 years ago

Actually, I'm not sure anymore whether I want to do the work for this, since I've been thinking that Nim (https://nim-lang.org/) would be a better technology for point cloud analysis. I read a paper arguing that lidR will always be considerably slower than more native languages, which will presumably be a bottleneck for very large data sets. Therefore I was interested in migrating everything there instead.

The code is not very hard to edit though if there's still interest in it.

All of the libraries can be replaced with alternative implementations or something already in lidR.

hriihimaki commented 2 years ago

Before doing that you should ask yourself a question: who (else) in the point cloud industry uses Nim?

If you want to go native why not C++?

Just my two cents ...

mavavilj commented 2 years ago

Before doing that you should ask yourself a question: who (else) in the point cloud industry uses Nim?

If you want to go native why not C++?

Just my two cents ...

A way to make it appear viable is to contribute towards it. I find it a great improvement over C++. In particular: as fast as C/C++, as easy to write and as clean as Python, and as robust as Java.

Other questions to ask are:

What's the use of R when the data set is too large?

What's the use of C++ when the language is a horrible mess to use?

Jean-Romain commented 2 years ago

I read a paper arguing that lidR will always be considerably slower than more native languages, which will presumably be a bottleneck for very large data sets.

References please? It is not completely wrong btw.

mavavilj commented 2 years ago

I read a paper arguing that lidR will always be considerably slower than more native languages, which will presumably be a bottleneck for very large data sets.

References please? It is not completely wrong btw.

ROUSSEL, Jean-Romain, et al. lidR: An R package for analysis of Airborne Laser Scanning (ALS) data. Remote Sensing of Environment, 2020, 251: 112061.

https://reader.elsevier.com/reader/sd/pii/S0034425720304314?token=02ED1D1CA25007E10028213AF0DF7D1AA1040772C6A17A71CBEF4E1ABF7631A26BB8996D586123C6C06F445DBCC73749&originRegion=eu-west-1&originCreation=20220203172918

page 11

Or also (page 12):

For example, the algorithm for tree segmentation developed and published by Li et al. (2012) has a quadratic complexity, meaning that the computation time is multiplied by four when the number of points is doubled. It can therefore not really be performed on point clouds larger than a few hectares.

Now, nothing prevents adding this algorithm to lidR, but these limitations have motivated me to look at developing equivalent Nim implementations instead.

bi0m3trics commented 2 years ago

LOL... you realize that's @Jean-Romain et al.'s paper (mine too), right? He and I did the benchmarks and I think it does quite well; besides, anything with a scripting component (i.e., passed to and not run in native C/C++ or Fortran or Java) will be slower than most lower-level programming languages. However, I've been a lidR user since the beginning, I've tried almost everything else, and the functionality and scalability of lidR consistently impress even me!

Also, we saw no hindrances using lidR for estimating canopy fuel metrics: Caden P. Chamberlain, Andrew J. Sánchez Meador, Andrea E. Thode. 2021. Airborne lidar provides reliable estimates of canopy base height and canopy bulk density in southwestern ponderosa pine forests. Forest Ecology and Management 481: 118695. https://doi.org/10.1016/j.foreco.2020.118695

Jean-Romain commented 2 years ago

Well, citing the paper we wrote to argue that lidR is considerably slower than other options is... audacious... at least. Moreover, referring to one specific algorithm from the peer-reviewed literature, of which none of us are the authors, is kind of unfair cherry-picking. As @bi0m3trics said, he ran the benchmarks for the paper and lidR is very decent, even if it has some intrinsic limitations compared to native C++ tools such as LAStools. I'm the first to acknowledge that, and I do not take it personally that the only paper against lidR is mine :wink:

Anyway, it is your choice to move somewhere else and I won't argue against that. I appreciate that you thought of lidR for sharing your code, even if you did not finalize the implementation, partly because of the constraints I put on it.