martibosch / pylandstats

Computing landscape metrics in the Python ecosystem
https://doi.org/10.1371/journal.pone.0225734
GNU General Public License v3.0

[feature thought] Xarray(Dask) as backend for performance increase #3

Closed CWen001 closed 4 years ago

CWen001 commented 4 years ago

Description

Describe your feature request:

First, thank you very much for the pythonic package, convenient and powerful.

I'm already impressed by the performance comparison in the pylandstats-notebook. Calculating landscape metrics is often computationally expensive, especially when the image is large or the number of classes increases. When testing my GLC10 TIF files (about 60 MB each), I wondered whether pylandstats could be made even more performant.

Recently, the Dask team posted a use case for accelerating raster analysis using Xarray and Dask. It may be relevant to PyLandStats and worth considering. Beyond raw speed, it could also benefit spatiotemporal analysis, which consumes a list of inputs. Is this path a potential direction for the package?

Additional notes: the following is related to the topic but possibly belongs in another issue, so I apologize in advance if this is not the proper place. For the same performance reason, I tried running the analysis on Colab to overcome my local machine's limits. However, the installs failed. So far I've tested:

!pip install pylandstats
# Trying to install 1.1.1, but get:
# ERROR: Command errored out with exit status 1: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmp539jdlhk Check the logs for full command output.

!pip install pylandstats==2.0.0b1
# Trying to install 2.0.0b1, but get:
#
# Installing build dependencies ... done
# Getting requirements to build wheel ... error
# ERROR: Command errored out with exit status 1: /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmppayw9hpt Check the logs for full command output.

# Using git clone and python setup.py install gives the same error message.

I couldn't figure out a way to install pylandstats, even though all of its dependencies install without problems. I'm wondering if you could provide some insight? Thank you very much.

martibosch commented 4 years ago

Hello! First of all, thank you for your feedback; I am glad that you find the package convenient.

Let me first focus on the installation error in Colab (I will write a separate response regarding the performance).

By running

$ !pip install pylandstats --log foo.txt

in a Colab notebook, I get the same error as you do, and then by running

$ !cat foo.txt

I found the part below:

...
 Successfully installed beniget-0.1.0 decorator-4.4.0 gast-0.2.2 networkx-2.3 numpy-1.17.2 ply-3.11 pythran-0.9.3.post1 setuptools-41.4.0 six-1.12.0 wheel-0.33.6
2019-10-07T12:05:00,372   Cleaning up...
2019-10-07T12:05:00,623   Cleaned build tracker '/tmp/pip-req-tracker-_fkg791l'
2019-10-07T12:05:00,748   Running command /usr/bin/python3 /usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py get_requires_for_build_wheel /tmp/tmp9fv_a_0u
2019-10-07T12:05:01,166   Traceback (most recent call last):
2019-10-07T12:05:01,166     File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py", line 207, in <module>
2019-10-07T12:05:01,166       main()
2019-10-07T12:05:01,166     File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py", line 197, in main
2019-10-07T12:05:01,166       json_out['return_val'] = hook(**hook_input['kwargs'])
2019-10-07T12:05:01,166     File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/pep517/_in_process.py", line 54, in get_requires_for_build_wheel
2019-10-07T12:05:01,166       return hook(config_settings)
2019-10-07T12:05:01,167     File "/usr/local/lib/python3.6/dist-packages/setuptools/build_meta.py", line 146, in get_requires_for_build_wheel
2019-10-07T12:05:01,167       return self._get_build_requires(config_settings, requirements=['wheel'])
2019-10-07T12:05:01,167     File "/usr/local/lib/python3.6/dist-packages/setuptools/build_meta.py", line 127, in _get_build_requires
2019-10-07T12:05:01,167       self.run_setup()
2019-10-07T12:05:01,167     File "/usr/local/lib/python3.6/dist-packages/setuptools/build_meta.py", line 237, in run_setup
2019-10-07T12:05:01,167       self).run_setup(setup_script=setup_script)
2019-10-07T12:05:01,167     File "/usr/local/lib/python3.6/dist-packages/setuptools/build_meta.py", line 142, in run_setup
2019-10-07T12:05:01,167       exec(compile(code, __file__, 'exec'), locals())
2019-10-07T12:05:01,167     File "setup.py", line 10, in <module>
2019-10-07T12:05:01,167       from pythran.dist import PythranExtension
2019-10-07T12:05:01,167     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/__init__.py", line 40, in <module>
2019-10-07T12:05:01,167       from pythran.toolchain import (generate_cxx, compile_cxxfile, compile_cxxcode,
2019-10-07T12:05:01,167     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/toolchain.py", line 6, in <module>
2019-10-07T12:05:01,167       from pythran.backend import Cxx, Python
2019-10-07T12:05:01,167     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/backend.py", line 8, in <module>
2019-10-07T12:05:01,167       from pythran.analyses import LocalNodeDeclarations, GlobalDeclarations, Scope
2019-10-07T12:05:01,168     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/analyses/__init__.py", line 12, in <module>
2019-10-07T12:05:01,168       from .aliases import Aliases, StrictAliases
2019-10-07T12:05:01,168     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/analyses/aliases.py", line 6, in <module>
2019-10-07T12:05:01,168       from pythran.syntax import PythranSyntaxError
2019-10-07T12:05:01,168     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/syntax.py", line 7, in <module>
2019-10-07T12:05:01,168       from pythran.tables import MODULES
2019-10-07T12:05:01,168     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/tables.py", line 173, in <module>
2019-10-07T12:05:01,168       BINARY_UFUNC = {"accumulate": FunctionIntr()}
2019-10-07T12:05:01,168     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/intrinsic.py", line 94, in __init__
2019-10-07T12:05:01,168       super(FunctionIntr, self).__init__(**kwargs)
2019-10-07T12:05:01,168     File "/tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran/intrinsic.py", line 58, in __init__
2019-10-07T12:05:01,168       [to_ast(d) for d in kwargs.get('defaults', [])])
2019-10-07T12:05:01,168     File "/usr/local/lib/python3.6/dist-packages/gast/gast.py", line 19, in create_node
2019-10-07T12:05:01,168       format(Name, nbparam, len(Fields))
2019-10-07T12:05:01,168   AssertionError: Bad argument number for arguments: 6, expecting 7
...

which suggests to me that this is related to an issue that I also encounter in the Azure pipelines, although in this case, pythran's version is correctly set to pythran-0.9.3.post1 (and gast==0.2.2 beniget==0.1.0). Do you have any thoughts on why that might happen, @serge-sans-paille @paugier (sorry to bring you into this)? By the way, the same error appears when trying to install v2.0.0b1 with !pip install pylandstats==2.0.0b1.

paugier commented 4 years ago

I wonder what the version of /usr/local/lib/python3.6/dist-packages/gast is.

From the error log, it could be the new gast, which is not compatible with pythran-0.9.3.post1.

It seems strange that /tmp/pip-build-env-924w_3mx/overlay/lib/python3.6/site-packages/pythran uses /usr/local/lib/python3.6/dist-packages/gast during an isolated build of pylandstats (using pep517)...

CWen001 commented 4 years ago

Thank you very much. Now I can use pip to install pylandstats on Colab without any problem.

!pip install pylandstats can successfully install version 1.1.1. !pip install pylandstats==2.0.0b1 can successfully install version 2.0.0b1.

martibosch commented 4 years ago

Sorry for the delay in my response regarding the performance. Here it goes.

I see a big issue with using Dask arrays: Dask splits arrays into chunks, which in our case would break the patches and therefore distort the metrics. If you do not mind such distortion, any raster file can be split into tiles outside PyLandStats, and each tile can then be instantiated as a Landscape object and analyzed (also in parallel). This would be similar to the "zonal statistics" featured in the LecoS QGIS plugin.
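A minimal sketch of such a tiling workflow (pure NumPy; the tile size, the toy landscape, and the idea of feeding each tile to a Landscape object are illustrative assumptions, not PyLandStats API guarantees):

```python
import numpy as np

def split_into_tiles(arr, tile_size):
    """Split a 2D landscape array into square tiles (edge tiles may be smaller)."""
    tiles = []
    n_rows, n_cols = arr.shape
    for i in range(0, n_rows, tile_size):
        for j in range(0, n_cols, tile_size):
            tiles.append(arr[i:i + tile_size, j:j + tile_size])
    return tiles

# Toy 4x4 landscape with two classes
landscape_arr = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
    [2, 2, 1, 1],
    [2, 2, 1, 1],
])
tiles = split_into_tiles(landscape_arr, 2)
print(len(tiles))  # 4 tiles of shape (2, 2)

# Each tile could then be analysed independently (and in parallel), e.g.
# (hypothetical call, resolution made up):
#   pls.Landscape(tile, res=(10, 10)).compute_class_metrics_df()
```

Note that, as said above, any patch crossing a tile boundary is counted once per tile, which is exactly the distortion mentioned.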

In any case, there is room for performance improvements, especially since the metrics are computed for each class separately (see the numerous for class_val in self.classes loops spread throughout landscape.py). These loops are in fact embarrassingly parallel, so significant speed-ups can be achieved (e.g., the file 000E00N.tif of your dataset has 9 distinct land use classes, so in theory the speed-up could be up to 9x).
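As a rough illustration of why such per-class loops parallelize trivially (using a toy proportion-of-landscape metric, not PyLandStats' actual internals):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def class_proportion(arr, class_val):
    """Proportion of the landscape occupied by `class_val` (toy per-class metric)."""
    return float((arr == class_val).mean())

arr = np.array([
    [1, 1, 2],
    [1, 3, 2],
    [3, 3, 2],
])
classes = np.unique(arr)

# Each class is computed independently of the others, so the per-class
# loop can simply be mapped over a worker pool.
with ThreadPoolExecutor() as pool:
    proportions = dict(zip(classes.tolist(),
                           pool.map(lambda c: class_proportion(arr, c), classes)))
# Each of the 3 classes covers 3 of the 9 pixels, i.e. 1/3 of the landscape.
```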

I have explored the file 000E00N.tif of your dataset on my desktop computer, the same machine used in the "performance notes" notebook, and I see room for a 216 s speed-up by caching the results of each call to scipy.ndimage.label.
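A sketch of what such caching could look like (the LabelCache helper is hypothetical, not part of PyLandStats):

```python
import numpy as np
from scipy import ndimage

class LabelCache:
    """Memoise scipy.ndimage.label per class so that repeated metric
    computations reuse the same patch labelling (hypothetical helper)."""

    def __init__(self, arr):
        self.arr = arr
        self._cache = {}

    def label(self, class_val):
        # Label the patches of `class_val` only once; later calls hit the cache.
        if class_val not in self._cache:
            self._cache[class_val] = ndimage.label(self.arr == class_val)
        return self._cache[class_val]

arr = np.array([
    [1, 1, 2],
    [2, 2, 2],
    [1, 2, 1],
])
cache = LabelCache(arr)
labels, num_patches = cache.label(1)  # 3 patches of class 1 (4-connectivity)
```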

Unfortunately, the computation of the Euclidean nearest neighbour and the pixel adjacency matrix (needed for several edge metrics and the contagion) cannot be parallelized so easily.
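For readers unfamiliar with the adjacency matrix: it counts how often pixels of each pair of classes are horizontal or vertical neighbours. A simplified sketch (illustrative only, not PyLandStats' implementation):

```python
import numpy as np

def adjacency_counts(arr, classes):
    """Count horizontal and vertical pixel adjacencies between classes
    (a simplified stand-in for the matrix used by edge metrics and contagion)."""
    idx = {c: i for i, c in enumerate(classes)}
    adj = np.zeros((len(classes), len(classes)), dtype=int)
    # horizontal neighbour pairs
    for a, b in zip(arr[:, :-1].ravel(), arr[:, 1:].ravel()):
        adj[idx[a], idx[b]] += 1
    # vertical neighbour pairs
    for a, b in zip(arr[:-1, :].ravel(), arr[1:, :].ravel()):
        adj[idx[a], idx[b]] += 1
    return adj

arr = np.array([
    [1, 1],
    [1, 2],
])
adj = adjacency_counts(arr, [1, 2])
# adj is [[2, 2], [0, 0]]: two 1-1 adjacencies and two 1-2 adjacencies.
```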

I have drafted some ideas here in case anybody feels adventurous. I could try to implement the two trivial improvements noted above in a separate branch and notify you when I push the changes. It shouldn't take me much time, and you could then experiment with it. Nevertheless, I cannot promise to improve the overall performance if you need the Euclidean nearest neighbour and the adjacency matrix (these are more complicated, and I do not have much time since I am in the last year of my PhD).

I hope this helps somehow. In any case, thank you for using PyLandStats. Cheers, Martí

CWen001 commented 4 years ago

Dear Martí,

Thank you very much for your thoughts. Now I see the point: the Dask use case of calculating NDVI is different from landscape metrics, since NDVI is computed per pixel and does not need to consider patches. I spent some time searching but indeed could not find any use case of Xarray (or Pangeo) computing landscape metrics. However, since PyLandStats now installs on Colab, using cloud infrastructure seems to be one way to mitigate heavy computation. Coming from a non-technical background, I'm happy that the ease of this package lets me move away from GUI tools and stay in the Python ecosystem. I will keep using it and hope to contribute one day as my coding skills grow.

I'm changing the issue title from [feature request] to [feature thought], and closing the issue as all questions have been answered. Thank you.

Greetings from Hannover,
Chen