sparklyr / sparkxgb - GitHub Issues

okiyuki99 commented 5 years ago

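What is it?

sparklyr binds Spark from R and provides an easy-to-use R interface, letting you work with dplyr syntax and the like
From reading the official site, my impression is that it wraps Spark's operations behind dplyr, MLlib, and so on to make them easier to use
There is also a package called sparkxgb for running XGBoost on Spark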

Connection

sparklyr::spark_connect
sparklyr::spark_disconnect

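# conf is a config object built with sparklyr::spark_config() (see the config section)
sc <- sparklyr::spark_connect(
  master = "yarn-client",
  config = conf
)
...
# spark_disconnect() takes the connection object, not master/config
sparklyr::spark_disconnect(sc)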

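Loading an sdf (Spark DataFrame)

dplyr::tbl(sc, table) : loads a table as an sdf by table name
sparklyr::sdf_sql(sc, sql) : loads the result of a SQL query as an sdf
sparklyr::sdf_copy_to : copies a local R data frame into Spark as an sdf (mainly for quick tests)
- sparklyr::copy_to = dplyr::copy_to does the same thing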

mtcars_tbl <- sparklyr::sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

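Table operations

sparklyr::spark_write_table(x = sdf, name=table, mode = "overwrite") : saves the sdf to a table
sparklyr::sdf_register(sdf, name = table) : registers the sdf under a table name as a temporary view, so that Spark SQL can refer to it (I first read this as a kind of cache, but registering only gives the sdf a name)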
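A minimal sketch of how registering and reading back might fit together. It assumes a local Spark installation (`master = "local"`) instead of the YARN cluster used above, and the view name `mtcars_view` is made up for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")

# Copy a local data frame into Spark (quick-test data only).
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Register a filtered sdf under a hypothetical name so Spark SQL can see it.
heavy <- mtcars_tbl %>% filter(wt > 3)
sdf_register(heavy, name = "mtcars_view")

# Read it back two ways: by table name and by SQL.
by_name <- tbl(sc, "mtcars_view")
by_sql  <- sdf_sql(sc, "SELECT cyl, COUNT(*) AS n FROM mtcars_view GROUP BY cyl")

# Pull small results into R before closing the connection.
dims   <- sdf_dim(by_name)   # c(rows, columns)
counts <- collect(by_sql)

spark_disconnect(sc)
```

Both `by_name` and `by_sql` are lazy references to data in Spark. Only `sdf_dim()` and `collect()` bring anything back into the R session.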

Inspecting data

sparklyr::sdf_dim(sdf) : returns the number of rows and columns of the sdf

Partition operations

Spark processes data in parallel in units called partitions, so partitioning matters a lot for performance

sparklyr::sdf_num_partitions(sdf) : counts the partitions
sparklyr::sdf_repartition(sdf, 10) : changes the number of partitions
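As a rough illustration of the two functions together, the sketch below checks the partition count before and after repartitioning. It assumes a local Spark installation, and the target of 4 partitions is an arbitrary number chosen for the example, not a recommendation.

```r
library(sparklyr)

# Assumption: Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Partition count of the sdf as it was copied in.
n_before <- sdf_num_partitions(mtcars_tbl)

# sdf_repartition() returns a new sdf; mtcars_tbl itself is left unchanged.
mtcars_4 <- sdf_repartition(mtcars_tbl, partitions = 4)
n_after  <- sdf_num_partitions(mtcars_4)

spark_disconnect(sc)
```

Since the result is a new sdf, it has to be assigned to a name for the new partitioning to be used downstream.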

ML general

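sparklyr::ml_predict : makes predictions
sparklyr::ml_*_evaluator : evaluates predictions

# rf_model and mtcars_test are assumed to have been created beforehand
pred <- sparklyr::ml_predict(rf_model, mtcars_test)
pred_proba <- sparklyr::ml_predict(glr, mtcars_tbl)

# label_col defaults to "label"; pass the actual label column name if it differs
ml_multiclass_classification_evaluator(pred)
ml_regression_evaluator(pred, label_col = "cyl")
ml_binary_classification_evaluator(pred)

sparklyr::ml_save : saves an ML model
sparklyr::ml_load : loads an ML model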
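The snippets above leave the model and the test set undefined, so here is one possible end-to-end sketch of fit, predict, evaluate, save, and load. It assumes a local Spark installation. The linear regression model, the 70/30 split, the seed, and the save path are all choices made for illustration and do not come from the original notes.

```r
library(sparklyr)

# Assumption: Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# Split into training and test sets (weights and seed are arbitrary).
splits <- sdf_random_split(mtcars_tbl, training = 0.7, test = 0.3, seed = 42)

# Fit a linear regression predicting mpg from wt and hp.
model <- ml_linear_regression(splits$training, mpg ~ wt + hp)

# ml_predict() adds a "prediction" column to the test sdf.
pred <- ml_predict(model, splits$test)

# label_col must name the column holding the true values.
rmse <- ml_regression_evaluator(pred, label_col = "mpg", metric_name = "rmse")

# Save and reload the fitted model; the path is a hypothetical temp location.
model_path <- file.path(tempdir(), "mtcars_lm")
ml_save(model, model_path, overwrite = TRUE)
reloaded <- ml_load(sc, model_path)

spark_disconnect(sc)
```

On a real cluster the save path would normally point at shared storage such as HDFS, not a local temp directory.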

ML methods

sparklyr::ml_als : Alternating Least Squares (ALS) matrix factorization
- https://rdrr.io/cran/sparklyr/man/ml_als.html
sparkxgb::xgboost_classifier : xgboost
- https://github.com/rstudio/sparkxgb
sparklyr::ml_generalized_linear_regression : glm

glr <- sparklyr::ml_generalized_linear_regression(
  mtcars_tbl, 
  vs ~ ., 
  family = "binomial"
)
tidy_glr <- broom::tidy(glr)

config

Deployment and Configuration
spark.* : options set on the Spark context

config	meaning
`spark.dynamicAllocation.maxExecutors`	Upper bound on the number of executors allocated to one application under dynamic allocation. For tasks that full-scan large data and extract from it, raising this value is desirable
`spark.executor.memory`	Maximum memory available to each executor
`spark.driver.memory`	Maximum memory available to the driver
`spark.yarn.executor.memoryOverhead`
`spark.yarn.driver.memoryOverhead`
`spark.executor.instances`	Number of executors
`spark.driver.cores`	Number of cores for the driver
`spark.executor.cores`	Number of cores per executor
`spark.serializer`	Serializer setting

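spark.sql.* : options for tuning Spark SQL performance

config	meaning
`spark.sql.shuffle.partitions`	Number of partitions of a DataFrame after a shuffle

Reference : Sparkの性能向上のためのパラメータチューニングとバッチ処理向けの推奨構成 (parameter tuning for better Spark performance and recommended configurations for batch processing) | Think IT（シンクイット）

sparklyr.* : options passed to the spark-submit command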

config	meaning
`sparklyr.shell.driver-memory`

# Configuration example
conf <- sparklyr::spark_config()
conf$spark.executor.cores <- 8
conf$spark.executor.memory <- "4G"
conf$spark.yarn.queue <- "dev"
conf$spark.dynamicAllocation.initialExecutors <- 10
conf$spark.dynamicAllocation.enabled <- "true"
conf$spark.dynamicAllocation.maxExecutors <- 100
conf$spark.shuffle.service.enabled <- "true"

Caveats

collect() : used to bring data into R's memory, but all of the data is first sent from the executors to the driver node, so the driver node also needs enough memory
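One common way to keep the collected result small is to aggregate inside Spark and call collect() only on the summary. This is a minimal sketch assuming a local Spark installation; the grouping column and the summary statistics are arbitrary choices for the example.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# The group_by/summarise runs in Spark; only one row per cyl value
# travels through the driver into the R session.
summary_local <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

spark_disconnect(sc)
```

`summary_local` is an ordinary local data frame, so it stays usable after the connection is closed.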

okiyuki99 commented 5 years ago

spark_web(sc) : opens the Spark web UI in a browser

okiyuki99 commented 5 years ago

The sparklyr package aids in using the “push compute, collect results” principle.

okiyuki99 commented 5 years ago

This is the intended way to use it

okiyuki99 commented 5 years ago

The compute() command can take the end of a dplyr piped command set and save the results to Spark memory.

So an intermediate result can be saved to Spark memory for the time being
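A small sketch of that pattern, assuming a local Spark installation. The filter, the derived column, and the table name `mtcars_hp100` are made up for illustration.

```r
library(sparklyr)
library(dplyr)

# Assumption: Spark is installed locally (e.g. via sparklyr::spark_install()).
sc <- spark_connect(master = "local")
mtcars_tbl <- sdf_copy_to(sc, mtcars, name = "mtcars_tbl", overwrite = TRUE)

# compute() runs the piped steps in Spark and stores the result
# under the given (hypothetical) table name.
cached <- mtcars_tbl %>%
  filter(hp > 100) %>%
  mutate(power_to_weight = hp / wt) %>%
  compute("mtcars_hp100")

# Later steps start from the stored result, not from the original pipeline.
n_rows <- sdf_nrow(cached)

spark_disconnect(sc)
```

Nothing is pulled into R here except the single row count, which fits the "push compute, collect results" idea quoted earlier in the thread.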

okiyuki99 / HowToR

sparklyr / sparkxgb #40
