jblindsay / whitebox-tools

An advanced geospatial data analysis platform
https://www.whiteboxgeo.com/
MIT License

Mosaic a larger-than-memory LiDAR set #161

Open openSourcerer9000 opened 3 years ago

openSourcerer9000 commented 3 years ago

I want to say thank you for providing these Whitebox tools. I’ve been trying to wean myself off of ArcMap but had yet to find dedicated open-source LiDAR tools.

I’m trying to mosaic some LiDAR tiles that are larger than memory, so I was looking into this recipe:

https://jblindsay.github.io/wbt_book/tutorials/mosaic.html

The explanation says that this script mosaics tiles in batches before tying them all into one larger raster; however, it seems to simply run Mosaic once, naively, on the provided datasets. Thinking the tool may have been updated to include this functionality, I tried it on a folder of larger-than-memory tiles and got this result:

*********************
* Welcome to Mosaic *
*********************
Number of tiles: 3774
Reading data...
Progress: 0%
Progress: 1%
…
Progress: 99%
Progress: 100%
Output image size: (461190 x 418640)
memory allocation of 1544580652800 bytes failed

The tiles all together come to ~50GB, not 1.5TB. Do you have this implementation available somewhere? I would imagine it would have to be implemented in Rust rather than abstracted to Python: if the final step is running the naïve mosaic tool to tie all the sub-mosaics together, we’re still attempting to load the whole dataset into memory at once. I would think it would need to process lazily, by chunk/seamline, at the lowest level in the first place to work. I would just be scared to dive into Rust since I haven't gotten my tetanus shot...
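
For what it's worth, the failed allocation in the log above is consistent with a single buffer covering the whole output grid at 8 bytes per cell (the size of a 64-bit float). That per-cell size is an inference from the numbers, not something confirmed from the Mosaic source. A minimal Python check, with `full_grid_bytes` as a hypothetical helper name:

```python
def full_grid_bytes(columns: int, rows: int, bytes_per_cell: int = 8) -> int:
    """Bytes needed to hold every cell of the output grid in memory at once.

    bytes_per_cell=8 is an assumption (64-bit values), inferred from the log.
    """
    return columns * rows * bytes_per_cell


# "Output image size: (461190 x 418640)" from the log above.
# Which number is rows and which is columns does not matter for the product.
needed = full_grid_bytes(461190, 418640)
print(needed)                   # 1544580652800, the size of the failed allocation
print(round(needed / 1e12, 2))  # 1.54 (TB)
```

If that reading is right, the output grid is sized to the bounding rectangle of all tiles, so empty corners and any compression in the source files are paid for in full, which would explain how ~50GB on disk turns into 1.5TB in memory.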

Afrancioni commented 3 years ago

Hello,

You are correct: the script shown at the link in the manual is not the batch-mosaic script it describes. I have attached the correct script below. This script performs a batch mosaic of 250 tiles at a time. Hopefully you find it useful and are able to modify it further to suit your needs.

mosaic.py.zip
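
The contents of the attached script are not reproduced in this thread. As a rough illustration of the batching idea only, here is a Python sketch that splits a tile list into groups of at most 250. `batched_tiles` and the tile file names are hypothetical, and the per-batch mosaic call is deliberately left as a comment.

```python
from typing import Iterator, List, Sequence


def batched_tiles(tiles: Sequence[str], batch_size: int = 250) -> Iterator[List[str]]:
    """Yield consecutive groups of at most batch_size tile paths."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(tiles), batch_size):
        yield list(tiles[start:start + batch_size])


# Hypothetical file names standing in for the 3774 tiles from the log.
tiles = [f"tile_{i:04d}.tif" for i in range(3774)]
batches = list(batched_tiles(tiles))

for number, batch in enumerate(batches, start=1):
    # A real script would mosaic `batch` into an intermediate raster here.
    pass

print(len(batches), len(batches[-1]))  # 16 24
```

Batching of this kind limits how many input tiles are handled in one call. It does not by itself reduce the size of the final output grid.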

Regards

openSourcerer9000 commented 3 years ago

OK, thanks for that script. Unfortunately, it doesn't solve the problem that the underlying mosaic process attempts to read everything into memory; this could only be fixed within Rust. At some point it needs to mosaic the intermediate mosaicked rasters together into a file that's larger than memory.

On top of this, it's bloating the incoming data. I ran a test trying to mosaic two TIFs, of sizes 3.4GB and 2.2GB, on a machine with 32GB of RAM. It begins to fill up the machine's memory, then fails with the following error. Somehow things get more than 40X larger and the 5.6GB of input files become 240GB. I'm not sure if its calculations are off, but it does fill up the working memory with that operation.

*********************
* Welcome to Mosaic *
*********************
Number of tiles: 2   
Reading data...
Progress: 0%
Progress: 100%
Output image size: (191876 x 157466)
memory allocation of 241711569728 bytes failed
Processing mosaic 1; num. files = 2

To reproduce the bloating phenomenon, just try to mosaic any 2 rasters of a few GB each.

To reproduce the in-memory issue, download the rasters listed below, convert them to TIF (the total size should be ~50GB, which I assume will be larger than your RAM), and feed them to Mosaic, either with the above script (setting the read and write paths to the same directory to ensure it eventually mosaics everything together) or with a single mosaic operation.

The fix may involve coupling the mosaic processing with writing the output file (at least when files are larger than memory). It could write the output file chunk by chunk as it's mosaicking. The ability to read rasters lazily would be ideal as well; otherwise it may write rasters larger than it can read.
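
One way to picture the chunk-by-chunk idea is to walk the output grid in horizontal strips and, for each strip, touch only the tiles that overlap it. The sketch below is plain Python that plans that work without any raster I/O. The `Tile` type, `plan_strips`, and the example layout are all hypothetical, and it says nothing about how Mosaic is actually implemented.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple


@dataclass(frozen=True)
class Tile:
    name: str
    first_row: int  # row offset of the tile within the output grid
    rows: int       # tile height in cells


def plan_strips(
    tiles: Sequence[Tile], total_rows: int, strip_rows: int
) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (start_row, end_row, tile names) for each output strip.

    end_row is exclusive. Only the named tiles would need to be read
    while that strip is assembled and written out.
    """
    for start in range(0, total_rows, strip_rows):
        end = min(start + strip_rows, total_rows)
        overlapping = [
            t.name for t in tiles
            if t.first_row < end and t.first_row + t.rows > start
        ]
        yield start, end, overlapping


# Example layout: three tiles stacked with some vertical overlap.
tiles = [Tile("a.tif", 0, 400), Tile("b.tif", 300, 400), Tile("c.tif", 700, 300)]
for start, end, names in plan_strips(tiles, total_rows=1000, strip_rows=250):
    print(start, end, names)
```

With a plan like this, peak memory would scale with the strip height and the tiles overlapping one strip, rather than with the full output extent.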

openSourcerer9000 commented 3 years ago

Links to the rasters needed to reproduce the issue; these can be downloaded in bulk with uGet or the Python requests module:

DL_rasters.zip

tombe-nm commented 7 months ago

Anyone found a workaround for this?

jblindsay commented 7 months ago

You might consider using the mosaic tool available in Whitebox Workflows (WbW) instead. It has an improved raster memory model over WhiteboxTools, which often means that rasters require half as much system memory, so in this case you should be able to mosaic larger rasters.

tombe-nm commented 7 months ago

Great, I will give it a go.

And thank you for these excellent tools!

tombe-nm commented 7 months ago

Hi @jblindsay

I notice the documentation for the WbW mosaic function reads "Note that when the inputs parameter is left unspecified, the tool will use all of the .tif, .tiff, .rdc, .flt, .sdat, and .dep files located in the working directory."

However, when I leave the 'images' argument out, I get the following error:

Traceback (most recent call last):
  File "c:\tom_bennett\Development\DTM Creator\large_dataset_mosaic_test.py", line 57, in <module>
    mosaic_rasters()
  File "c:\tom_bennett\Development\DTM Creator\large_dataset_mosaic_test.py", line 42, in mosaic_rasters
    wbe.mosaic(
TypeError: WbEnvironmentBase.mosaic() missing 1 required positional argument: 'images'

And when I pass None as the argument I get:

Traceback (most recent call last):
  File "c:\tom_bennett\Development\DTM Creator\large_dataset_mosaic_test.py", line 57, in <module>
    mosaic_rasters()
  File "c:\tom_bennett\Development\DTM Creator\large_dataset_mosaic_test.py", line 42, in mosaic_rasters
    wbe.mosaic(
TypeError: argument 'images': 'NoneType' object cannot be converted to 'PyList'

What should I pass to 'images' to search the working directory?
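
In the meantime, one possible workaround is to build the list explicitly. The Python sketch below only collects the file paths with the extensions named in the documentation sentence quoted above. `find_raster_files` is a hypothetical helper, and turning each path into a raster object to pass as `images` is left as a comment because the exact WbW call is an assumption here.

```python
import tempfile
from pathlib import Path
from typing import List

# Extensions listed in the documentation sentence quoted above.
RASTER_EXTENSIONS = {".tif", ".tiff", ".rdc", ".flt", ".sdat", ".dep"}


def find_raster_files(directory: str) -> List[Path]:
    """Return raster files directly inside `directory`, sorted by name."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in RASTER_EXTENSIONS
    )


# Demonstration on a throwaway directory with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.tif", "b.TIFF", "notes.txt", "c.dep"):
        (Path(tmp) / name).touch()
    found = [p.name for p in find_raster_files(tmp)]

print(found)  # ['a.tif', 'b.TIFF', 'c.dep']

# Assumption, not verified against the WbW API: each path would then be read
# into a raster object and the resulting list passed as `images`.
```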