r-lidar / lidR

Airborne LiDAR data manipulation and visualisation for forestry applications
https://CRAN.R-project.org/package=lidR
GNU General Public License v3.0

pixel_metrics() runs single threaded with plan(multisession, workers > 1) and readLAS() #715

Closed · twest820 closed this issue 11 months ago

twest820 commented 11 months ago

From #523 it looks like pixel_metrics() in lidR 4.0.4 should support parallel execution. However, I'm only seeing a single core go active.

```r
library(lidR)
library(future)

tile = readLAS(filePath)  # filePath: path to a single .las tile
plan(multisession, workers = 16)
metrics = pixel_metrics(tile, .stdmetrics)
```

I'm guessing this is because #523 enabled parallelism at the inter-tile (catalog) level but not within individual tiles. It'd be nice to also have lower level parallelism.

For example, at the moment I have some 650 tiles to process with a metrics function that takes about 20 thread-minutes and 20–30 GB of DDR per tile. Given 128 GB of DDR, four or five jobs can load tiles in parallel without exhausting physical memory to the point of excessive swapping. Given a 16 core (all hyperthreaded) processor, each job can be allocated 4–8 threads, for a nominal runtime of something like 20 thread-minutes/tile × 650 tiles / (5 jobs × 6 threads/job) ≈ 7.2 hours. Since pixel_metrics() currently runs single threaded in this use case, the achievable divisor is 1 thread/job and the runtime goes up to close to two days.
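For concreteness, here's the arithmetic above as a back-of-envelope script (all numbers are the ones quoted in this comment):

```r
# Back-of-envelope runtimes from the figures above (source: this comment).
thread_minutes_per_tile = 20
tiles = 650
jobs = 5
threads_per_job = 6

thread_minutes_per_tile * tiles / (jobs * threads_per_job) / 60  # ~7.2 hours, parallel
thread_minutes_per_tile * tiles / (jobs * 1) / 60                # ~43 hours, close to two days
```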

A partial workaround is presumably to build a catalog for each job (sketched below) but, since the catalog-level metrics then need to be sharded back into rasters corresponding to the original tiles, it'd be simpler if pixel_metrics() could just call the metrics function in parallel over a given tile. I'm unsure how easy R makes this to implement in practice, but I didn't come up with a search hit for an issue tracking intra-tile parallelism.
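For reference, a minimal sketch of that catalog-level workaround, assuming the tiles sit in a single directory (the path and resolution are placeholders); lidR's catalog engine then parallelizes across tiles according to the future plan:

```r
library(lidR)
library(future)

plan(multisession, workers = 5)  # ~5 concurrent tiles fit in 128 GB here

ctg = readLAScatalog("tiles/")   # hypothetical directory of .las tiles
metrics = pixel_metrics(ctg, .stdmetrics, res = 20)
```

The catch, as noted above, is that this yields one merged raster rather than rasters matching the original tile boundaries.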

Jean-Romain commented 11 months ago

Parallelism within an individual point cloud is not straightforward:

- The algorithm may not be parallelisable at all.
- The algorithm may call code from a third party package that is not parallelized.
- We must recall that R is single threaded and some limitations apply.

What can be parallelized is already parallelized in lidR. In the specific case of pixel_metrics, the function relies on data.table. data.table is already the world's fastest and already makes use of parallelism in key locations. There is nothing I can do to speed up pixel_metrics. At least not in lidR.

twest820 commented 11 months ago

> The algorithm may not be parallelisable at all.

That's not the case for pixel_metrics(), though, since it's a for loop over the cells in a raster. It's also not the case for .stdmetrics and the like, since the calculations for each cell are independent of all the others and thus parallelize trivially.
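To illustrate the independence claim, here's a rough sketch of intra-tile parallelism done by hand: split one tile into quadrants, raster each quadrant in a separate process, and mosaic the pieces. clip_rectangle(), pixel_metrics(), and terra's wrap()/unwrap() are real APIs; filePath, the resolution, and the quadrant split are illustrative, and the seams have to land on raster cell boundaries for the mosaic to match a single-pass result:

```r
library(lidR)
library(future.apply)
library(terra)

plan(multisession, workers = 4)

res  = 20                                # illustrative cell size
tile = readLAS(filePath)                 # filePath as in the repro above
bb   = sf::st_bbox(tile)

# Snap split lines to multiples of res so no raster cell straddles a seam.
xm = res * round(mean(bb[c("xmin", "xmax")]) / res)
ym = res * round(mean(bb[c("ymin", "ymax")]) / res)

quads = list(
  clip_rectangle(tile, bb["xmin"], bb["ymin"], xm,         ym),
  clip_rectangle(tile, xm,         bb["ymin"], bb["xmax"], ym),
  clip_rectangle(tile, bb["xmin"], ym,         xm,         bb["ymax"]),
  clip_rectangle(tile, xm,         ym,         bb["xmax"], bb["ymax"])
)

# A SpatRaster wraps a C++ pointer, so wrap() it before crossing processes.
parts   = future_lapply(quads, function(q) wrap(pixel_metrics(q, .stdmetrics, res = res)))
metrics = do.call(merge, lapply(parts, unwrap))
```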

> The algorithm may call code from a third party package that is not parallelized.

Sure, but 1) that's not the case for pixel_metrics() itself, and 2) it's hard to see why someone somewhere potentially passing a thread-unsafe function to pixel_metrics() should force everyone else's thread-safe code to run more slowly.

> We must recall that R is single threaded and some limitations apply.

Which is presumably why lidR already composes with future and uses OpenMP.
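For context, both knobs already exist in lidR's public API: set_lidr_threads() for its OpenMP code and future::plan() for the catalog engine. A minimal sketch:

```r
library(lidR)
library(future)

set_lidr_threads(4)              # OpenMP threads used by lidR's C++ routines
plan(multisession, workers = 4)  # worker processes used by the catalog engine
```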

> There is nothing I can do to speed up pixel_metrics.

Calling a function pointer within a parallel for loop isn't difficult from C++. Since lidR already implements such loops and has C++ metrics functions, it's unclear this statement holds, particularly since stdmetrics-style operations are pretty stable and could also be moved into C++.

The part I'm not entirely sure about is calling back into user-provided R functions. Since calling R from C++ is supported, there's no fundamental obstruction for most such functions but, as you've pointed out, R sometimes lacks routine basic thread safety. So it's possible internal R issues might prevent use of thread-safe user functions, in which case an R core issue can be opened for a fix.

R core would probably want a reprex for that, though, and pixel_metrics()'s current state prevents attempting to write one.

Jean-Romain commented 11 months ago

In theory you are correct. In practice you are wrong. All *_metrics functions are based on data.table aggregation, which means they all run through a single main function in lidR. Internally they rely on super efficient and safe code from data.table. There is one exception, however: point_metrics() uses parallelized code and pointers to R functions exactly the way you are describing. And trust me, it is complex, and it partially relies on undocumented parts of R's poorly documented C API. However, my code is not as safe as the data.table code, because nothing replaces decades of development by world class programmers.

I don't want to write custom C++ code with calls to R function pointers for every *_metrics function. And even if I tried, it is not guaranteed I could beat data.table. And even if I beat data.table, I'd run into so many potential troubles on CRAN.

Now if you think you know R, lidR and R's C API better than me (which is possible, I do not question that, I've still a lot to learn), feel free to make a pull request with a parallelized version of pixel_metrics() and a lot of unit tests.

twest820 commented 11 months ago

> feel free to make a pull request with a parallelized version of pixel_metrics()

That would be some years out, I'm afraid. Basically we'd need a letter of support from you for a grant application and enough off-GitHub discussion to have reasonable assurance that a fundable range of PRs would be accepted. Prerequisite to that is probably some discussion with R core, which I'm unfortunately not going to have cycles to drive anytime soon.

If I can find the time I'll try to throw together a point solution and put it under the GPL.

Jean-Romain commented 11 months ago

Hey, honestly if you are not happy with lidR you can just stop using it. I do not force anybody and I won't take it personally.

twest820 commented 10 months ago

> If I can find the time I'll try to throw together a point solution and put it under the GPL.

Initial implementation (no effort at optimization): 2.0 hours, 1–2 threads at ~80% clock, ~4.0 GB DDR, IO bound at 260 MB/s on 1.9 TB of .las files.

Corresponding lidR extrapolation (from ~40 hours of runtime): 117 days, 16 cores, ~40 GB DDR, occasional IO pulses to ~65 MB/s.

A 5950X thus appears plausibly capable of saturating around ~3 GB/s of IO, but I don't have the NVMe capacity to test whether runtimes below 15 minutes are actually possible in this case.

Jean-Romain commented 10 months ago

Implementation of what? A function equivalent to pixel_metrics() capable of safely mapping any user-defined function into a raster? Which user-defined function did you use? Did you include the time to read the LAS/LAZ files? What are you benchmarking? I don't even know what we are talking about. What are you comparing?

A 1500 to 6000x speed-up seems very unlikely. Had you come with a 10-fold speed-up, I could have taken it seriously.

By the way, you may be interested to know that I'm working on a new package. This package will be much more efficient than lidR (yet it will be less versatile). My benchmark for 2 LAZ files, 30 million points, code in C with R's C API, no parallelization, no data.table, no Rcpp, is roughly 25 seconds in total for the two files.

The amount of time dedicated to applying mean(Z) per pixel was approximately 5 seconds, which means reading the 120 MB of LAZ files took ~20 seconds. For 5 GB of LAZ data, it takes ~10 minutes to read the data. If I extrapolate to 100 GB, it should take ~16 hours.

In LAS format you can divide by ~2, so ~8 hours for 100 GB. JUST TO READ THE DATA! And you claim that you read AND processed 2 TB in 2 hours?

If you claim that you have done something extraordinary, at least show some reliable benchmarks and methodology to support your claim.
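For reference, the mean(Z)-per-pixel operation benchmarked above corresponds to roughly the following lidR call (the file name and resolution here are placeholders):

```r
library(lidR)

las = readLAS("tile.laz")                       # placeholder file
zmean = pixel_metrics(las, ~mean(Z), res = 20)  # mean elevation per pixel
```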