Closed twest820 closed 2 years ago
Can you provide an example that I can look at?
I can recommend the interesting blog post from @paleolimbot about using geoarrow and the GeoParquet file format for fast loading of vector data.
In readr::read_csv() you should use num_threads = 1, because {terra} uses one thread and the thread count therefore accounts for some of the difference in timings; but I believe read_csv() will still be an order of magnitude faster.
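For a like-for-like single-threaded comparison, the call might look like this (a sketch; "grid 100 m.csv" is the file discussed later in this thread):

```r
library(readr)

# Pin readr to one thread so the comparison with terra's
# single-threaded OGR reader is apples-to-apples.
system.time(
  x <- read_csv("grid 100 m.csv", num_threads = 1, show_col_types = FALSE)
)
```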
Can you provide an example that I can look at?
I see you've gotten the data link; feel free to email me back on specifics. Most of the columns are public data but if you need to redistribute there's a few which should probably be removed. Let me know if that's the case and I'll trim them out (and, yes, I'm aware one column that should be double is mistyped to string; a typo on my part which I've fixed on the PyQGIS side but isn't reflected in this iteration of the files).
In readr::read_csv() you should use num_threads = 1
While I understand the point you're making from a benchmarking perspective, there are many idle cores at this point in the workflow. If terra (or GDAL) chooses not to use them, that doesn't change my performance expectations as an end user. Besides, the majority of the slowness is attributable to as_tibble(), and read_csv(num_threads = 1) comes in at around 480 ms.
Looks like for GeoParquet to be an option here, GDAL 3.5 would be the minimum requirement, but QGIS 3.22.6 is on GDAL 3.4.2. I'll see about updating to 3.22.9 (some of the timings in Dewey's post look attractive, thanks!) but have to get to a meeting at the moment.
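For reference, once a GDAL build with the Parquet driver is available, reading a GeoParquet file through terra should need nothing beyond the usual vect() call. A sketch, assuming GDAL >= 3.5 and a hypothetical "grid 100 m.parquet" export of the same layer:

```r
library(terra)

# Requires GDAL >= 3.5 compiled with the Arrow/Parquet driver.
# "grid 100 m.parquet" is a hypothetical GeoParquet export of the layer.
v <- vect("grid 100 m.parquet")
```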
Here I made a reproducible benchmark; I hope it is helpful.
I don't know the internals of the CSV reading in readr::read_csv() vs. arrow::read_csv_arrow() vs. data.table::fread() vs. GDAL's OGR CSV reader, except that I know it's complicated and the ability to use multiple threads is key to increasing performance. Even though setting num_threads = 1 may be a more "fair" comparison, it doesn't reflect the average user's perception or experience. That said, it's unlikely that the terra package can do anything about it except pass on a request to OGR to use more than one thread if it can (I don't know if that's an option for the OGR driver).
I would personally use whatever CSV reader is fastest for whatever I happen to be doing... if read_csv() + dplyr is faster, do it! You can also try geos::geos_read_wkt() and wk::wk_handle(wk::new_wk_wkt(the_wkt_column), wk::sfc_writer()), which might be faster than the WKT readers I see in the benchmarks above.
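A sketch of both suggestions, assuming the WKT strings sit in a character vector (the two points below are stand-ins for the real WKT column):

```r
library(geos)
library(wk)

wkt <- c("POINT (0 1)", "POINT (2 3)")  # stand-in for the real WKT column

# Parse with geos: returns a geos_geometry vector
g <- geos_read_wkt(wkt)

# Parse with wk, writing directly to an sf-compatible sfc column
sfc <- wk_handle(new_wk_wkt(wkt), sfc_writer())
```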
If I were trying to optimize a CSV workflow today, I'd probably use arrow::open_dataset("the_file.csv", format = "csv") %>% filter(...) %>% collect() %>% <do some conversion to a spatial type>, which uses multiple threads for both the reading and the filtering (although today you can only filter on the non-spatial columns).
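Spelled out, that pipeline might look like the sketch below; the column names and the filter condition are hypothetical, and the final step reuses wk to convert a WKT column to an sfc geometry column:

```r
library(arrow)
library(dplyr)
library(wk)

# The multi-threaded scan and filter happen inside Arrow; only the
# matching rows are materialized in R by collect().
df <- open_dataset("the_file.csv", format = "csv") %>%
  filter(some_column > 0) %>%   # hypothetical non-spatial filter
  collect()

# Convert the (hypothetical) WKT geometry column afterwards
df$geometry <- wk_handle(new_wk_wkt(df$wkt), sfc_writer())
```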
Thank you for sharing the file. I have not been able to find anything that I can do to speed things up. read_csv is indeed very fast, but I do not see as large a difference as you report (perhaps your use of col_types = resourceUnitColumnTypes makes a big difference).
library(terra)
system.time(x <- readr::read_csv("grid 100 m.csv", show_col_types=F))
# user system elapsed
# 1.40 0.12 0.35
system.time(v <- vect("grid 100 m.gpkg"))
# user system elapsed
# 1.85 0.78 2.62
I have added an argument what to vect() so that you can read only the geometries or only the attributes. In this case, the comparison with what="attributes" is the most relevant, as the csv file does not have the geometries either.
system.time(w <- vect("grid 100 m.gpkg", what="attributes"))
# user system elapsed
# 1.19 0.36 1.55
system.time(w <- vect("grid 100 m.gpkg", what="geoms"))
# user system elapsed
# 0.83 0.39 1.22
I have a small vector layer of about 80,000 grid squares and not quite 50 attribute-table columns. Function timing with R 4.2.1, dplyr 1.0.9, readr 2.1.2, and terra 1.5-34 results in:
As a result, even if the layer only needs to be read into R once, it's faster to run PyQGIS to export it to .csv and read it with readr::read_csv() than it is to just leave the layer in a geospatial format.

Since GeoPackage and FlatGeobuf are binary file formats, I'd expect terra to be able to read the attribute table more quickly than readr; most of the columns are doubles, and using .csv therefore imposes substantial string-parsing costs. But vect() is an order of magnitude slower than read_csv(). And even if as_tibble() requires a full memcpy() of the attribute table, that should take less than 5 ms at the system's 25.6 GB/s bandwidth (DDR4-3200). So as_tibble() is three orders of magnitude slower than I would expect.

Is there a way to reduce the overhead of using terra in such cases? It would be nice to simplify the .csv files out of the surrounding workflows.