Closed twest820 closed 2 years ago
Can you provide an example that I can look at?
I can recommend the interesting blog post from @paleolimbot about using geoarrow and the GeoParquet file format for fast loading of vector data.
In readr::read_csv() you should use num_threads = 1, because {terra} uses one thread and the thread count therefore accounts for some of the difference in timings; but I believe read_csv() will still be an order of magnitude faster.
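For a like-for-like single-threaded comparison, the call might look like this (a sketch; "grid 100 m.csv" is the file discussed later in this thread):

```r
library(readr)

# Pin readr to one thread so the comparison with terra's
# single-threaded OGR reader is apples-to-apples.
system.time(
  x <- read_csv("grid 100 m.csv", num_threads = 1, show_col_types = FALSE)
)
```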
Can you provide an example that I can look at?
I see you've gotten the data link; feel free to email me back on specifics. Most of the columns are public data but if you need to redistribute there's a few which should probably be removed. Let me know if that's the case and I'll trim them out (and, yes, I'm aware one column that should be double is mistyped to string; a typo on my part which I've fixed on the PyQGIS side but isn't reflected in this iteration of the files).
In readr::read_csv() you should use num_threads = 1
While I understand the point you're making from a benchmarking perspective, there are many idle cores at this point in the workflow. If terra (or GDAL) chooses not to use them, that doesn't change my performance expectations as an end user. Besides, the majority of the slowness is attributable to as_tibble(), and read_csv(num_threads = 1) comes in at around 480 ms.
Looks like for GeoParquet to be an option here, GDAL 3.5 would be the minimum requirement, but QGIS 3.22.6 is on GDAL 3.4.2. I'll see about updating to 3.22.9 (some of the timings in Dewey's post look attractive, thanks!) but have to get to a meeting at the moment.
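For reference, once a GDAL build with the Parquet driver is available, reading a GeoParquet file through terra should need nothing beyond the usual vect() call. A sketch, assuming GDAL >= 3.5 and a hypothetical "grid 100 m.parquet" export of the same layer:

```r
library(terra)

# Requires GDAL >= 3.5 compiled with the Arrow/Parquet driver.
# "grid 100 m.parquet" is a hypothetical GeoParquet export of the layer.
v <- vect("grid 100 m.parquet")
```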
Here I made a reproducible benchmark; I hope it is helpful.
I don't know the internals of the CSV reading in readr::read_csv() vs. arrow::read_csv_arrow() vs. data.table::fread() vs. GDAL's OGR CSV reader, except that I know it's complicated and the ability to use multiple threads is key to increasing performance. Even though setting num_threads = 1 may be a more "fair" comparison, it doesn't reflect the average user's perception or experience. That said, it's unlikely that the terra package can do anything about it except pass on a request to OGR to use more than one thread if it can (I don't know if that's an option for the OGR driver).
I would personally use whatever CSV reader is fastest for whatever I happen to be doing... if read_csv() + dplyr is faster, do it! You can also try geos::geos_read_wkt() and wk::wk_handle(wk::new_wk_wkt(the_wkt_column), wk::sfc_writer()), which might be faster than the WKT readers I see in the benchmarks above.
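A sketch of both suggestions, assuming the WKT strings sit in a character vector (the two points below are stand-ins for the real WKT column):

```r
library(geos)
library(wk)

wkt <- c("POINT (0 1)", "POINT (2 3)")  # stand-in for the real WKT column

# Parse with geos: returns a geos_geometry vector
g <- geos_read_wkt(wkt)

# Parse with wk, writing directly to an sf-compatible sfc column
sfc <- wk_handle(new_wk_wkt(wkt), sfc_writer())
```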
If I were trying to optimize a CSV workflow today, I'd probably use arrow::open_dataset("the_file.csv", format = "csv") %>% filter(...) %>% collect() %>% <do some conversion to a spatial type>, which uses multiple threads for both the reading and the filtering (although today you can only filter on the non-spatial columns).
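Spelled out, that pipeline might look like the sketch below; the column names and the filter condition are hypothetical, and the final step reuses wk to convert a WKT column to an sfc geometry column:

```r
library(arrow)
library(dplyr)
library(wk)

# The multi-threaded scan and filter happen inside Arrow; only the
# matching rows are materialized in R by collect().
df <- open_dataset("the_file.csv", format = "csv") %>%
  filter(some_column > 0) %>%   # hypothetical non-spatial filter
  collect()

# Convert the (hypothetical) WKT geometry column afterwards
df$geometry <- wk_handle(new_wk_wkt(df$wkt), sfc_writer())
```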
Thank you for sharing the file. I have not been able to find anything that I can do to speed things up. read_csv is indeed very fast, but I do not see as large a difference as you report (perhaps your use of col_types = resourceUnitColumnTypes makes a big difference).
library(terra)
system.time(x <- readr::read_csv("grid 100 m.csv", show_col_types=F))
# user system elapsed
# 1.40 0.12 0.35
system.time(v <- vect("grid 100 m.gpkg"))
# user system elapsed
# 1.85 0.78 2.62
I have added an argument what to vect() so that you can read only the geometries or only the attributes. In this case, the comparison with what="attributes" is the most relevant, as the csv file does not have the geometries either.
system.time(w <- vect("grid 100 m.gpkg", what="attributes"))
# user system elapsed
# 1.19 0.36 1.55
system.time(w <- vect("grid 100 m.gpkg", what="geoms"))
# user system elapsed
# 0.83 0.39 1.22
I have a small vector layer of about 80,000 grid squares and not quite 50 attribute-table columns. Function timing with R 4.2.1, dplyr 1.0.9, readr 2.1.2, and terra 1.5-34 results in:
As a result, even if the layer only needs to be read into R once, it's faster to run PyQGIS to export it to .csv and read it with readr::read_csv() than it is to just leave the layer in a geospatial format.

Since GeoPackage and FlatGeobuf are binary file formats, I'd expect terra to be able to read the attribute table more quickly than readr; most of the columns are doubles, and using .csv therefore imposes substantial string-parsing costs. But vect() is an order of magnitude slower than read_csv(). And even if as_tibble() requires a full memcpy() of the attribute table, that should take less than 5 ms at the system's 25.6 GB/s bandwidth (DDR4-3200). So as_tibble() is three orders of magnitude slower than I would expect.

Is there a way to reduce the overhead of using terra in such cases? It would be nice to simplify the .csv files out of the surrounding workflows.