Support for Livy, Spark 2.4, dplyr example

harryprince / geospark

bring sf to spark in production

https://github.com/harryprince/geospark/wiki

Support for Livy, Spark 2.4, dplyr example #2

Closed javierluraschi closed 5 years ago

javierluraschi commented 5 years ago

@harryprince this is such a great extension! Nice job putting it together! I couldn't help but play with it and send some improvements for you to consider:

Support for Spark 2.4.
Enable support when using Livy connections.
Add a dplyr example to the README, move data files to a data folder, and move config/benchmarks to their own section.

We are also adding your extension to spark.rstudio.com. You might consider publishing to CRAN at some point; it's already quite useful!

harryprince commented 5 years ago

@javierluraschi thanks a lot, I will try my best to follow the guidelines.

Currently, geospark support for Spark 2.4 is under development. I will move to the latest version once the geospark Scala master is updated.

harryprince commented 5 years ago

I updated geospark to 1.2.0, which fixes lots of bugs and adds some new SQL functions. Here are the release notes: https://github.com/harryprince/geospark/releases

Because of that, this PR might need to update the geospark Scala package version.

harryprince commented 5 years ago

@javierluraschi

In this PR, the dplyr-related example does not work.

I added another example; see more discussion in #9:

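library(dplyr)

# polygons_wkt and points_wkt are Spark tables with a WKT column named geom.
# Parse the WKT strings into geometry columns (y for polygons, x for points).
polygons_wkt <- mutate(polygons_wkt, y = st_geomfromwkt(geom))
points_wkt <- mutate(points_wkt, x = st_geomfromwkt(geom))

# Join every polygon with every point through a constant dummy key,
# keep the pairs where the polygon contains the point, then count per group.
sc_res <- full_join(polygons_wkt %>% mutate(dummy = TRUE) %>% compute(),
                    points_wkt %>% mutate(dummy = TRUE) %>% compute(),
                    by = "dummy") %>%
  filter(sql("st_contains(y, x)")) %>%
  group_by(area, state) %>%
  summarise(cnt = n())

sc_res %>%
  head()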
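
The dummy-key join in the example above can be tried without Spark. The following is only a minimal local sketch in plain dplyr: the toy tables, their column names, and the 1-D interval check standing in for st_contains are all assumptions made for illustration, not part of geospark.

```r
library(dplyr)

# Hypothetical toy data standing in for the Spark tables.
polygons <- tibble(area = c("A", "B"), xmin = c(0, 10), xmax = c(5, 15))
points   <- tibble(id = 1:3, x = c(1, 12, 20))

# A constant key on both sides pairs every polygon with every point.
# Newer dplyr versions may warn about the many-to-many match; that is
# expected here, since the full cross product is the goal.
pairs <- full_join(mutate(polygons, dummy = TRUE),
                   mutate(points, dummy = TRUE),
                   by = "dummy")

# Keep only the pairs that satisfy the predicate (an interval check used
# here as a stand-in for st_contains), then count matches per area.
res <- pairs %>%
  filter(x >= xmin, x <= xmax) %>%
  group_by(area) %>%
  summarise(cnt = n())
```

The join materializes every polygon/point pair before filtering, so the intermediate result grows with the product of the two table sizes.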