rlumor closed this issue 7 years ago.
@rlumor Yes, sparklyr works with data sitting in the cluster itself; you never have to load distributed data into R with sparklyr, at least as of version 0.5.
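For illustration, a minimal sketch of the pattern (my example, not code from this issue; the master setting, database, table, and column names are placeholders, and exact behaviour depends on your sparklyr/dbplyr versions):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "yarn-client")   # assumes a YARN client deployment

# Reference a Hive table in the cluster; no rows are copied into R.
remote_tbl <- tbl(sc, dbplyr::in_schema("some_db", "some_table"))   # hypothetical table

# dplyr verbs are translated to Spark SQL and executed inside the cluster.
remote_tbl %>%
  group_by(some_column) %>%                   # hypothetical column
  summarise(n = n())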
library(sparklyr)
library(dplyr)
library(DBI)        # for dbGetQuery() against the Spark connection
library(magrittr)

# Pull a 1M-row sample of the Hive table into local R memory.
dataspark <- dbGetQuery(sc, "select * from twn_prime_db.varimpdata_governmentspark limit 1000000")
# Convert character columns to factors locally, then inspect the response.
dataspark[sapply(dataspark, is.character)] <- lapply(dataspark[sapply(dataspark, is.character)], as.factor)
table(dataspark$was_inq_gov)

# Copy the local data frame back into Spark and work with it through dplyr.
copy_to(sc, dataspark, overwrite = TRUE)
dt_tbl <- tbl(sc, "dataspark")
summary(dt_tbl)
dt_tbl <- na.omit(dt_tbl)
dt_tbl %>% count()

# Split train/test, register the splits, and fit a random forest classifier.
model_partition <- dt_tbl %>% sdf_partition(train = 0.8, test = 0.2, seed = 1234)
train_tbl <- sdf_register(model_partition$train, "data_train")
test_tbl <- sdf_register(model_partition$test, "data_test")
rf_model <- train_tbl %>% ml_random_forest(was_inq_gov ~ averageofactsum_projected_income + number_open_revolving_trades, type = "classification")
summary(rf_model)
I had to copy the data into sc with copy_to() instead of running the model directly against the entire database sitting remotely.
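One way to avoid that local round trip (a sketch of mine under the assumption of a reasonably recent sparklyr/dbplyr, not code from this thread) is to reference the Hive table lazily and feed it to sdf_partition() directly, so the sample and the split both run in Spark:

library(sparklyr)
library(dplyr)

# Lazy reference to the remote table; the LIMIT is executed in Spark, not in R.
remote_tbl <- tbl(sc, dbplyr::in_schema("twn_prime_db", "varimpdata_governmentspark"))
sample_tbl <- remote_tbl %>% head(1000000)

# Partition and register without ever collecting the data locally.
model_partition <- sample_tbl %>% sdf_partition(train = 0.8, test = 0.2, seed = 1234)
train_tbl <- sdf_register(model_partition$train, "data_train")
test_tbl  <- sdf_register(model_partition$test, "data_test")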
I don't quite understand the intention of applying as.factor locally rather than working with tables in Spark. If you provide more context I can suggest something more adequate; closing for now.
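As a sketch of that direction (my illustration, assuming the underscore-style ft_* argument names of newer sparklyr releases): the string-to-index step can be done in Spark with ft_string_indexer() instead of as.factor() in R, so the whole pipeline stays on the cluster.

library(sparklyr)
library(dplyr)

# Index the string response inside Spark instead of converting to factors in R.
indexed_tbl <- dt_tbl %>%
  ft_string_indexer(input_col = "was_inq_gov", output_col = "was_inq_gov_idx")

rf_model <- indexed_tbl %>%
  ml_random_forest(was_inq_gov_idx ~ averageofactsum_projected_income + number_open_revolving_trades,
                   type = "classification")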
Does the package work only with data stored locally in R from a YARN client, or can I run a model with data sitting in the cluster itself without loading the data into R?