jblindsay / whitebox-tools

An advanced geospatial data analysis platform
https://www.whiteboxgeo.com/
MIT License

Improving I/O #429

Closed: InsolublePancake closed this issue 2 months ago

InsolublePancake commented 3 months ago

I have been experimenting with using cloud computing to run WhiteboxTools on DEMs that are too large to process locally. Memory is the biggest issue, of course, but I've noticed that I/O is also a bottleneck. I was wondering whether there is anything you could do to improve this within the tools. For example, I know that WhiteboxTools uses parallel processing in places, but does it take full advantage of this at the I/O stage? What about adjusting the read-ahead value when reading in large datasets? I'm afraid my knowledge of low-level programming is limited, so I won't presume to advise you, but any improvements you could make here would be appreciated.

Similarly, is there anything I could do at my end to improve I/O? I notice that the raster output of your tools is arranged one row per block. I infer from this that that layout is more efficient for the tools. Presuming that is true, would there be any advantage to increasing the block size further, say to two rows per block?
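
For anyone wanting to check this on their own outputs, the block layout of a GeoTIFF can be inspected with rasterio. A minimal sketch (not from the thread; the filename is a placeholder):

```python
# Inspect the internal block (strip/tile) layout of a GeoTIFF.
# A strip-oriented file typically reports blocks of (1, width).
import rasterio

with rasterio.open("dem_output.tif") as src:  # placeholder filename
    # block_shapes gives one (rows, cols) tuple per band
    for band, (block_rows, block_cols) in zip(src.indexes, src.block_shapes):
        print(f"band {band}: {block_rows} row(s) x {block_cols} col(s) per block")
```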

Any help or advice very welcome. Thank you

jblindsay commented 3 months ago

The problem of I/O is with the design of WhiteboxTools, unfortunately. Each tool is effectively an independent program, and it needs to read its inputs and write its outputs. When you have a long workflow involving many intermediate steps, this compounds the I/O requirements substantially. However, the I/O issue is largely resolved by using Whitebox Workflows for Python (WbW) rather than WhiteboxTools (WbT), since there is no need for I/O in the intermediate steps. Each WbW tool takes in-memory geospatial objects as input and produces in-memory objects as output. With WbW, you only need to read the input and write the final output, not any of the steps in between.
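
To illustrate the difference, here is a minimal sketch of an in-memory WbW workflow. It assumes the whitebox-workflows package; the tool calls follow the WbW documentation, but the specific tools, parameters, and file names are illustrative, so check the docs for your installed version:

```python
# Sketch of an in-memory WbW workflow: one read, one write,
# intermediates never touch the disk.
import whitebox_workflows

wbe = whitebox_workflows.WbEnvironment()

dem = wbe.read_raster("large_dem.tif")          # single read at the start
smoothed = wbe.gaussian_filter(dem, sigma=1.5)  # intermediate stays in memory
filled = wbe.fill_depressions(smoothed)         # no temp file written
wbe.write_raster(filled, "filled_dem.tif")      # single write at the end
```

By contrast, the equivalent WbT workflow would write smoothed and filled rasters to disk between steps, which is exactly the compounding I/O described above.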

InsolublePancake commented 2 months ago

Ok, that makes sense. Unfortunately, our company has a policy that we can only use software libraries with fully open-source licenses. They are very wary of freeware, or indeed anything but the most permissive licensing. They do not seem likely to budge on this, so I am limited to using the basic Python API.

Can the I/O be improved by parallelising it? Or is this already being done? Please excuse my ignorance of the code.

jblindsay commented 2 months ago

Reading/writing files is bound by the disk hardware, not by the CPU's parallelism. If you have an SSD, it should be quite fast, but if you are using a spinning hard disk instead, it will be relatively slow.
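
One way to check whether a workflow really is disk-bound is to time a raw sequential read of the input file and compare the throughput against the tool's wall-clock time. A rough sketch (the file path is a placeholder):

```python
# Estimate sequential read throughput of the storage holding a large DEM.
import time

CHUNK = 8 * 1024 * 1024  # read in 8 MiB chunks
path = "large_dem.tif"   # placeholder path

start = time.perf_counter()
total = 0
with open(path, "rb", buffering=0) as f:  # unbuffered to measure the device
    while chunk := f.read(CHUNK):
        total += len(chunk)
elapsed = time.perf_counter() - start

print(f"read {total / 1e6:.1f} MB in {elapsed:.2f} s "
      f"({total / 1e6 / elapsed:.1f} MB/s)")
```

Note that a second run over the same file may be much faster because of the operating system's page cache, so measure against a file that has not just been read.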

InsolublePancake commented 2 months ago

Ok, well that's good to know at least. Thank you