harryprince / geospark

bring sf to spark in production
https://github.com/harryprince/geospark/wiki

Feature request: H3 indexing #22

Open jfulponi opened 1 year ago

jfulponi commented 1 year ago

Hi. I'm working with the geospark sparklyr extension on huge spatial datasets (mostly point datasets). When I need to compute a geospatial index like H3, I have to use spark_apply() with the R h3 package, but it usually takes hours. The same task with the H3 extension for PySpark is a lot faster, obviously because all the cores are working at the same time. Is there a plan to add H3 indexing functionality? I could help with some of the coding if you want; I think I can be useful. Thanks.

harryprince commented 5 months ago

Thanks for the suggestion. Actually, in my experience spark_apply() is pretty fast when the helper functions are passed in through the context argument. Did you use spark_apply() with package distribution enabled?

Here is an example of how I use spark_apply():


# Split a ";"-separated string of WKT geometries, union them, and return the
# resulting polygons as a single ";"-separated WKT string.
st_dump <- function(x){

    tryCatch({
        wkt_list_ <- stringr::str_split(x, ";")

        wkt_list_[[1]] %>%
          sf::st_as_sfc() %>%
          sf::st_sf() %>%
          sf::st_union() %>%
          sf::st_cast("POLYGON") %>%
          lwgeom::st_astext() %>%
          paste0(collapse=";")
    },error=function(cond){
        # On failure, log the error and fall back to the unmodified input.
        message("Original error message from st_dump:")
        message(conditionMessage(cond))
        x
    })
}

# Runs on each partition: e is the partition as a data.frame, context is the
# list passed to spark_apply().
gh_aggr_fun <- function(e, context){
    library(dplyr)

    # Expose every helper in the context list (here st_dump) by name on the worker.
    for (name in names(context)) assign(name, context[[name]], envir = .GlobalEnv)

    e %>%
    select(wkt_list_str=wkt_list, g7_list, g6) %>%
    mutate(g6_wkt = purrr::map_chr(.x = wkt_list_str,
                                   .f = st_dump))
}

context_f = list(st_dump = st_dump)

sdf_shd_ <- sdf_g6_ %>%
            sparklyr::spark_apply(gh_aggr_fun
                    ,columns = c("wkt_list","g7_list","g6","g6_wkt")
                    ,name = 'g6_well_tbl' # cache table name
                    ,memory = TRUE
                    ,context = context_f)

In my example, I convert the raw data to geohash, which is quite similar to H3. I hope this helps.
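For anyone who wants to see what the context argument is doing without a Spark connection, here is a minimal local sketch in base R. The helper shout() and the function apply_fun() are hypothetical stand-ins, not part of geospark or sparklyr, and calling apply_fun() directly only imitates how spark_apply() hands a partition and the context list to the function.

```r
# Hypothetical helper standing in for st_dump(); it only exists inside the
# context list, just like a helper shipped to a Spark worker would.
context_f <- list(shout = function(x) toupper(x))

# Same shape as gh_aggr_fun(): e is one partition as a data.frame,
# context is the list given as the `context` argument.
apply_fun <- function(e, context) {
  # Make every helper in the context list visible by name.
  for (name in names(context)) assign(name, context[[name]], envir = .GlobalEnv)

  # shout() is now resolvable because of the assign() above.
  e$out <- vapply(e$txt, shout, character(1), USE.NAMES = FALSE)
  e
}

# Local imitation of a single partition (assumption: real partitions arrive
# from Spark as plain data.frames with the table's columns).
partition <- data.frame(txt = c("a", "b"), stringsAsFactors = FALSE)
result <- apply_fun(partition, context = context_f)
print(result)
```

Testing the per-partition function this way on a small data.frame before handing it to spark_apply() tends to make worker-side errors much easier to track down.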

harryprince commented 5 months ago

Here is another approach, which removes the holes by working on the sf data.frame object directly:

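# Runs on each partition: parse the WKT column, drop interior rings (holes),
# and write the geometries back as WKT text.
st_rm_holes <- function(e, context){
    library(dplyr)
    library(sf)
    library(sfheaders)
    # Expose any helpers passed through the context list by name on the worker.
    for (name in names(context)) assign(name, context[[name]], envir = .GlobalEnv)

    e %>%
    sf::st_as_sf(wkt="wkt_val") %>%      # wkt_val becomes the geometry column
    sfheaders::sf_remove_holes() %>%
    as.data.frame() %>%
    mutate(wkt_val = lwgeom::st_astext(wkt_val))

}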