Status: Closed (sigmafelix closed this 1 year ago)
Basic configuration and data input examples are below (adapted from the sparklyr homepage, with the split code blocks merged into one):
if (!require(pacman)) {
  install.packages("pacman")
  library(pacman)
}
p_load(sparklyr, apache.sedona, catalog, dplyr, bench, nycflights13)
# sparklyr::spark_install(version="3.4.0")
conf <- spark_config() # Load variable with spark_config()
# Setting "executor"
conf$spark.executor.memory <- "4G"
conf$spark.executor.cores <- 4
conf$spark.executor.instances <- 2
conf$spark.dynamicAllocation.enabled <- "false"
# Setting driver memory
# In local mode everything runs inside the driver JVM, so this setting
# bounds the memory available for cached tables (repeated access, not file I/O)
conf$`sparklyr.shell.driver-memory` <- "16G"
sc <- spark_connect(master = "local",
                    config = conf)
# sparklyr dashboard: http://localhost:4040/
# Data load into Spark
flights_tbl <- copy_to(sc, nycflights13::flights, "flights")
airlines_tbl <- copy_to(sc, nycflights13::airlines, "airlines")
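Once the tables are in Spark, they can be queried with ordinary dplyr verbs. A minimal sketch, assuming the `flights_tbl` and `airlines_tbl` objects created above; the helper name `mean_delay_by_carrier` is hypothetical and only for illustration. Because dplyr verbs are backend-agnostic, the same function also runs on local data frames.

```r
library(dplyr)

# Hypothetical helper: mean departure delay and flight count per carrier,
# joined to the airline names. Works on Spark tables and local data frames.
mean_delay_by_carrier <- function(flights, airlines) {
  flights %>%
    group_by(carrier) %>%
    summarise(
      mean_dep_delay = mean(dep_delay, na.rm = TRUE),
      n = n()
    ) %>%
    inner_join(airlines, by = "carrier") %>%
    arrange(desc(mean_dep_delay))
}

# With the Spark connection from above (not run here):
# mean_delay_by_carrier(flights_tbl, airlines_tbl) %>% collect()
# spark_disconnect(sc)
```

Calling `collect()` pulls the summarised result back into R; until then the query is evaluated lazily inside Spark.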
Since Apache Sedona only accepts ESRI Shapefiles, WKT, GeoParquet, and GeoJSON, we will need to agree on a common data exchange format across the team, for both raster and vector data. Points to consider are:
Thank you. I have been dealing with errors in converting `apache.sedona`'s spatial resilient distributed dataset (SpatialRDD) back to an `sf` dataset with all attributes intact. The capabilities of `apache.sedona` and `sparklyr` are not as mature as the conventional approaches with `sf` or `terra`, so for now I will take the file-based (Zarr and GeoParquet/GeoPackage) and multithreaded approach with `sf` or `terra` to calculate covariates for alpha version development.
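A minimal sketch of what the file-based, multithreaded approach could look like with `sf`, under several assumptions: the covariate (distance to the nearest reference feature), the helper names `nearest_distance` and `calc_covariate_chunked`, and the file paths in the comments are all hypothetical. It uses `parallel::mclapply`, which only forks on Unix-like systems, so `cores` defaults to 1.

```r
library(sf)
library(parallel)

# Hypothetical covariate: distance from each point to its nearest feature.
nearest_distance <- function(points, features) {
  idx <- st_nearest_feature(points, features)
  as.numeric(st_distance(points, features[idx, ], by_element = TRUE))
}

# Split the points into contiguous chunks and process the chunks in parallel.
calc_covariate_chunked <- function(points, features, n_chunks = 4, cores = 1) {
  n <- nrow(points)
  chunk_id <- ceiling(seq_len(n) * n_chunks / n)
  chunks <- split(seq_len(n), chunk_id)
  res <- mclapply(
    chunks,
    function(i) nearest_distance(points[i, ], features),
    mc.cores = cores
  )
  # Chunks are contiguous and ordered, so this restores the input order.
  unlist(res, use.names = FALSE)
}

# File-based usage (paths are placeholders, not run here):
# points   <- st_read("sites.gpkg")
# features <- st_read("roads.gpkg")
# points$dist_road <- calc_covariate_chunked(points, features, cores = 4)
# st_write(points, "sites_with_covariates.gpkg")
```

Chunking by row keeps each worker's memory footprint small, at the cost of every worker holding a full copy of `features`.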
Test completed. May revisit the issue after figuring out alternative interfaces to Spark and Sedona (e.g., Python).
- `sparklyr`: a connector between Spark and `dplyr`
- `spark_install()`: no pain in Spark configuration
- `sedona`: a Spark extension for spatial data analysis (previously `geospark`)
- [...] (`*.jar` extension) that supports geospatial capabilities in the Spark engine
- `apache.sedona` offers functions to use Sedona with `sparklyr`
- [...] (`list` with geometry in WKT strings) back to `sf` column, as `sparklyr` strongly assumes that every column in a table is a vector.
- `rapids` extension: no R API exists. Might need to make one from scratch.
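Regarding the WKT-to-`sf` conversion noted in the list above, one possible workaround is to keep geometry as a plain character column of WKT strings on the Spark side and rebuild the `sf` object only after collecting. This is a sketch rather than a tested Sedona workflow: the helper names, the `geom_wkt` column name, and the default CRS are assumptions.

```r
library(sf)

# Hypothetical helper: flatten an sf object into a plain data frame whose
# geometry is stored as WKT strings, so every column is an atomic vector.
sf_to_wkt_table <- function(x, wkt_col = "geom_wkt") {
  out <- st_drop_geometry(x)
  out[[wkt_col]] <- st_as_text(st_geometry(x))
  out
}

# Hypothetical helper: rebuild an sf object from a collected table that has
# a WKT column. The CRS is not stored in WKT, so it must be supplied.
wkt_table_to_sf <- function(df, wkt_col = "geom_wkt", crs = 4326) {
  st_as_sf(as.data.frame(df), wkt = wkt_col, crs = crs)
}

# Possible use with the Spark connection (not run here):
# sites_tbl <- copy_to(sc, sf_to_wkt_table(sites), "sites")
# sites_sf  <- wkt_table_to_sf(collect(sites_tbl))
```

All attribute columns pass through unchanged, since only the geometry column is converted in either direction.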