sparklyr / sparklyr

R interface for Apache Spark
https://spark.rstudio.com/
Apache License 2.0

Connecting to Cluster and MLlib #353

Closed: rlumor closed this issue 7 years ago

rlumor commented 7 years ago

Does the package only work with data stored locally in R from a YARN client, or can I run a model against data sitting in the cluster itself without loading the data into R?

javierluraschi commented 7 years ago

@rlumor Yes, sparklyr works with data sitting in the cluster itself; you never have to load distributed data into R with sparklyr, at least as of version <= 0.5.
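A minimal sketch of what that looks like, assuming a YARN client connection and the Hive table referenced later in this thread (`yarn-client` was the usual master string for sparklyr at the time; the variable name `gov_tbl` is made up here). Nothing in this snippet pulls rows into R:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")

# tbl() only builds a lazy reference to the remote table
gov_tbl <- tbl(sc, dbplyr::in_schema("twn_prime_db", "varimpdata_governmentspark"))

# dplyr verbs are translated to Spark SQL and executed in the cluster;
# only the aggregated result comes back to the R session
gov_tbl %>% count()
```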

rlumor commented 7 years ago

```r
library(DBI)        # for dbGetQuery()
library(sparklyr)
library(dplyr)
library(magrittr)

# Pull the data into the R session and convert character columns to factors locally
dataspark <- dbGetQuery(sc, "select * from twn_prime_db.varimpdata_governmentspark limit 1000000")
dataspark[sapply(dataspark, is.character)] <-
  lapply(dataspark[sapply(dataspark, is.character)], as.factor)
table(dataspark$was_inq_gov)

# Copy the data back into Spark and get a reference to the remote table
copy_to(sc, dataspark, overwrite = TRUE)
dt_tbl <- tbl(sc, "dataspark")
summary(dt_tbl)

# Drop rows with missing values
dt_tbl <- na.omit(dt_tbl)
dt_tbl %>% count()

# Partition into training and test sets and register both as Spark tables
model_partition <- dt_tbl %>% sdf_partition(train = 0.8, test = 0.2, seed = 1234)
train_tbl <- sdf_register(model_partition$train, "data_train")
test_tbl  <- sdf_register(model_partition$test, "data_test")

# Fit a random forest classifier
glm_model <- train_tbl %>%
  ml_random_forest(
    was_inq_gov ~ averageofactsum_projected_income + number_open_revolving_trades,
    type = "classification"
  )
summary(glm_model)
```

I had to copy the data into sc instead of running the model directly against the entire table sitting remotely in the cluster.
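The round trip through R memory in the snippet above could likely be avoided by keeping the query lazy instead of collecting it with `dbGetQuery()` and pushing it back with `copy_to()`. A sketch, assuming the same connection `sc` and table as above:

```r
library(sparklyr)
library(dplyr)

# tbl() with a sql() expression returns a lazy remote reference,
# so the query runs in Spark rather than materializing rows in R
dataspark_tbl <- tbl(sc, sql(
  "select * from twn_prime_db.varimpdata_governmentspark limit 1000000"
))

dataspark_tbl %>% count()   # executes in the cluster; only the count returns
```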

javierluraschi commented 7 years ago

I don't quite understand the intention of applying as.factor locally rather than working with tables in Spark. If you provide more context, I can suggest something more suitable; closing for now.
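For illustration, the factor conversion can stay in Spark: a sketch using `ft_string_indexer` to index the character label column in the cluster instead of `as.factor()` in R. The output column name `was_inq_gov_idx` and the variable names are made up here; `na.omit` on a Spark table and the formula interface are as used earlier in the thread:

```r
library(sparklyr)
library(dplyr)

# Index the character label inside Spark rather than with as.factor() in R
dt_tbl <- tbl(sc, dbplyr::in_schema("twn_prime_db", "varimpdata_governmentspark")) %>%
  na.omit() %>%
  ft_string_indexer(input_col = "was_inq_gov", output_col = "was_inq_gov_idx")

rf_model <- dt_tbl %>%
  ml_random_forest(
    was_inq_gov_idx ~ averageofactsum_projected_income + number_open_revolving_trades,
    type = "classification"
  )
```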