r-spatial / qgisprocess

R package to use QGIS processing algorithms
https://r-spatial.github.io/qgisprocess/
GNU General Public License v3.0

qgis_crop runs slower on subset than on full dataset #160

Closed bholtdwyer closed 1 year ago

bholtdwyer commented 1 year ago

Hi! I'm noticing something very strange: when I run qgis_clip on a large dataset, the code finishes almost immediately; when I run it on a small subset of the same data, it hangs and does not finish for minutes on end. I have no idea why this should be the case. Example below (I can send the shapefiles I'm using if that would be helpful).

#Code runs very quickly when run on the full files:
shrug_centroids = my_read_sf("../../data/created/shrug_2_0_pakora_centroids.gpkg")
nrow(shrug_centroids)
#576000 points
nrow(single_final_command)
#824 disjoint polygons
microbenchmark({
  qgis::qgis_clip(shrug_centroids, single_final_command, "../../data/created/treatment_shrugs.gpkg")
  shrugs_in_final_command_qgis = my_read_sf("../../data/created/treatment_shrugs.gpkg")},
  times=1)
#Takes only about 30 seconds.

#Now trying on a small subsample of the data:
small_sample = shrug_centroids[1:100,]
microbenchmark(
  {qgis::qgis_clip(small_sample, single_final_command, "../../data/created/temp.gpkg")
  bob = read_sf("../../data/created/temp.gpkg")},
ntimes=1)
#Runs for several minutes before I give up and interrupt it.

#Bizarrely, seems to take more time on a subset. Why? Is the code faster if the input coincides with a file on disk?
write_sf(small_sample, "../../data/created/small_sample_temp.gpkg")
small_sample = read_sf("../../data/created/small_sample_temp.gpkg")
microbenchmark(
  {qgis::qgis_clip(small_sample, single_final_command, "../../data/created/temp.gpkg")
    bob = read_sf("../../data/created/temp.gpkg")},
  ntimes=1)
#Also does not finish running after several minutes.

Please let me know if there's anything I can do to be helpful, and thanks so much for making this package available!

bholtdwyer commented 1 year ago

I should add that if I export small_sample and single_final_command to .gpkg files, load them in the QGIS GUI, and call "Clip", the process finishes in 4.5 seconds. So it's not that there's something about the input data that should make the process stall out. Also, I've run st_make_valid and st_is_valid on both; they both have valid geometries.

florisvdh commented 1 year ago

Hi @bholtdwyer, can you provide a minimal reproducible example? Please also include the output of sessioninfo::session_info() and the QGIS version you used. You can use reprex::reprex(session_info = TRUE) to make this easy. Thanks!

bholtdwyer commented 1 year ago

So sorry, I went back this morning and found a bug in my code that was the source of the poor performance. (Can you spot it? Answer in reverse in postscript.) My apologies.

One question: would it be possible to add a global option to the package that makes its functions output sf objects? I do all my other spatial manipulation in R using sf objects, and having to call st_as_sf() on the result of each call to qgisprocess just uglies up the code. (In fact I only recently found out that a qgis_result was coercible to sf.)

Thanks so much for all your work on this! The speed of qgis_clip on my full dataset is currently orders of magnitude faster than the closest sf equivalent, which has made a seemingly slow computation blazingly fast.

P.S. [!eman tnemugra eht ton si "semitn"]
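
For readers who want the P.S. spelled out: microbenchmark() takes the expressions to time through `...` and the repetition count through `times` (default 100). A misspelled `ntimes = 1` is therefore most likely treated as one more expression to time, while the clip itself runs the default 100 times. The base-R sketch below mimics that argument matching. `bench_sketch()` is a hypothetical stand-in written for illustration only; it is not the microbenchmark implementation.

```r
# Hypothetical stand-in with the same leading signature as
# microbenchmark::microbenchmark(..., times = 100L). It only reports how
# its arguments were matched; it does not time anything.
bench_sketch <- function(..., times = 100L) {
  # Unevaluated expressions captured by `...`
  exprs <- as.list(match.call(expand.dots = FALSE)$...)
  list(n_expressions = length(exprs), times = times)
}

# Misspelled argument: `ntimes` does not match `times`, so it lands in `...`
typo <- bench_sketch({ Sys.sleep(0) }, ntimes = 1)

# Correct argument name
ok <- bench_sketch({ Sys.sleep(0) }, times = 1)

str(typo)
str(ok)
```

At roughly 4.5 seconds per clip (the GUI timing reported earlier in this thread), 100 repetitions would take several minutes, which is consistent with the hang described. For a one-off timing, base R's system.time() is an alternative that has no repetition argument to get wrong.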

florisvdh commented 1 year ago

Thanks for your feedback on this @bholtdwyer, great that you have found this yourself. No problem, we're all learning!

> would it be possible to add a global option to the package that makes its functions output sf objects?

Not all algorithms return a spatial object as part of the qgis_result, and some return several such objects, so a global option would create pitfalls when generalizing across algorithms. The main function qgis_run_algorithm() is best kept robust, with a predictable result. You could write your own wrapper function to use in your own scripts though, something like:

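# Assumes the qgisprocess and sf packages are installed
myfun <- function(algorithm, ...) {
  # Run the algorithm, then coerce the resulting qgis_result to an sf object
  result <- qgisprocess::qgis_run_algorithm(algorithm = algorithm, ...)
  sf::st_as_sf(result)
}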

Do note however that the more you hide, the harder it will be to debug in case of problems.
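
One way to keep such a wrapper debuggable is to have it fail loudly and name the algorithm whenever the result cannot be coerced. The sketch below is only an illustration of that idea: `run_as_sf()` and its `.run` and `.convert` arguments are hypothetical names, not part of qgisprocess, and the defaults assume that qgisprocess and sf are installed.

```r
# Hypothetical helper: run an algorithm and coerce its result to sf,
# stopping with a message that names the algorithm if coercion fails.
# `.run` and `.convert` are illustrative injection points (handy for testing
# with stand-ins); they are not arguments of any qgisprocess function.
run_as_sf <- function(algorithm, ...,
                      .run = qgisprocess::qgis_run_algorithm,
                      .convert = sf::st_as_sf) {
  result <- .run(algorithm = algorithm, ...)
  tryCatch(
    .convert(result),
    error = function(e) {
      # Re-raise with context so the failing algorithm is easy to identify
      stop(
        "Could not coerce the result of '", algorithm, "' to sf: ",
        conditionMessage(e),
        call. = FALSE
      )
    }
  )
}
```

With the defaults, a call would look like `run_as_sf("native:clip", INPUT = small_sample, OVERLAY = single_final_command)`. Algorithms that return no vector layer would then stop with an error naming the algorithm, rather than failing somewhere inside a longer pipeline.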