gmobi-wush / gmobi-analytic-system


One-click deployment of zeppelin + spark + sparkR + spark-mongo #1

Open gmobi-wush opened 8 years ago

gmobi-wush commented 8 years ago

Still not clear how AWS EMR bootstrap actions are supposed to work.
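As a starting point, a minimal sketch of what I think the invocation looks like; the bucket, key name, and application names are assumptions, not tested values:

```shell
# hedged sketch: launch an EMR cluster with a custom bootstrap action
# (bucket, script path, key name, and application names are hypothetical)
aws emr create-cluster \
  --name "gmobi-analytic" \
  --release-label emr-4.2.0 \
  --applications Name=Spark Name=Zeppelin-Sandbox \
  --ec2-attributes KeyName=my-key \
  --instance-type m3.xlarge --instance-count 3 \
  --use-default-roles \
  --bootstrap-actions Path=s3://gmobi-emr-bootstrap/sync-required-jars.sh,Name="sync-jars"
```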

gmobi-wush commented 8 years ago

Found that this can be used: s3://elasticmapreduce/libs/script-runner/script-runner.jar

Reference: http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-hadoop-script.html

gmobi-wush commented 8 years ago

path not found

As of emr-4.0, script-runner.jar has been moved into the AMI itself.

Reference: http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-release-differences.html

gmobi-wush commented 8 years ago

Still failing. Trying steps instead.

Reference: http://docs.aws.amazon.com/cli/latest/reference/emr/add-steps.html
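A sketch of the add-steps form I'd try; the cluster id is a placeholder, and the script path is the sync script from my bucket:

```shell
# hedged sketch: run the bootstrap script as a step via script-runner
# (cluster id is a placeholder)
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX --steps \
  Type=CUSTOM_JAR,Name="run-bootstrap-script",ActionOnFailure=CONTINUE,\
Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,\
Args=["s3://gmobi-emr-bootstrap/sync-required-jars.sh"]
```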

gmobi-wush commented 8 years ago

Still failing. Switching to the run-if bootstrap action, via the GUI.
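For the record, this is the `--bootstrap-actions` fragment I believe the GUI corresponds to; `s3://elasticmapreduce/bootstrap-actions/run-if` is Amazon's published run-if script, and the sync arguments are my own:

```shell
# hedged sketch: run-if bootstrap action in CLI form -- only runs the
# command when instance.isMaster is true (i.e. on the master node)
--bootstrap-actions \
Path=s3://elasticmapreduce/bootstrap-actions/run-if,\
Name="sync-on-master",\
Args=["instance.isMaster=true","aws","s3","sync","s3://gmobi-emr-bootstrap","/home/hadoop/gmobi-emr-bootstrap"]
```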

gmobi-wush commented 8 years ago

Still failing: the aws cli fails to download the files.

Out of ideas!

gmobi-wush commented 8 years ago

Trying it locally:

ruby /usr/share/aws/emr/scripts/run-if instance.isMaster=true aws s3 sync s3://gmobi-emr-bootstrap ~/gmobi-emr-bootstrap
Executing command: aws "s3" "sync" "s3://gmobi-emr-bootstrap" "/home/hadoop/gmobi-emr-bootstrap"
warning: Skipping file /home/hadoop/gmobi-emr-bootstrap/. File does not exist.
download: s3://gmobi-emr-bootstrap/spark-mongo/casbah-core_2.10-2.8.0.jar to gmobi-emr-bootstrap/spark-mongo/casbah-core_2.10-2.8.0.jar
download: s3://gmobi-emr-bootstrap/sync-required-jars.sh to gmobi-emr-bootstrap/sync-required-jars.sh
download: s3://gmobi-emr-bootstrap/spark-mongo/casbah-query_2.10-2.8.0.jar to gmobi-emr-bootstrap/spark-mongo/casbah-query_2.10-2.8.0.jar
download: s3://gmobi-emr-bootstrap/spark-mongo/casbah-commons_2.10-2.8.0.jar to gmobi-emr-bootstrap/spark-mongo/casbah-commons_2.10-2.8.0.jar
download: s3://gmobi-emr-bootstrap/spark-mongo/spark-mongodb_2.10-0.11.1.jar to gmobi-emr-bootstrap/spark-mongo/spark-mongodb_2.10-0.11.1.jar
download: s3://gmobi-emr-bootstrap/spark-mongo/mongo-java-driver-2.13.0.jar to gmobi-emr-bootstrap/spark-mongo/mongo-java-driver-2.13.0.jar
download: s3://gmobi-emr-bootstrap/R/rstudio-server-rhel-0.99.893-x86_64.rpm to gmobi-emr-bootstrap/R/rstudio-server-rhel-0.99.893-x86_64.rpm

It runs fine here...

gmobi-wush commented 8 years ago

Trying with the ~ removed from the path.

gmobi-wush commented 8 years ago

Possibly failing because stderr is not empty. Trying to eliminate everything that writes to stderr.
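One way to keep stderr empty is a small wrapper that diverts stderr to a log file while preserving the exit code. `quiet_run` and the log path here are my own names, not part of run-if:

```shell
#!/bin/sh
# quiet_run: run a command with its stderr diverted to a log file, so
# the bootstrap action's own stderr stays empty; the command's exit
# code is preserved. (quiet_run and LOG are hypothetical names.)
LOG=/tmp/bootstrap-stderr.log

quiet_run() {
  "$@" 2>>"$LOG"
}

# usage: the sync command above would become
#   quiet_run aws s3 sync s3://gmobi-emr-bootstrap /home/hadoop/gmobi-emr-bootstrap
quiet_run echo "stdout still passes through"
```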

gmobi-wush commented 8 years ago

When stderr is non-empty, run-if returns error code '2'.

s3://gmobi-emr-bootstrap is now synced into /home/hadoop.

Next, trying to bootstrap sparkR and spark-mongo.

gmobi-wush commented 8 years ago

The sparkR part is now set up.

But rstudio hits two problems:

The rstudio user cannot establish a connection to spark:

Error in invokeJava(isStatic = TRUE, className, methodName, ...) : 
  org.apache.hadoop.security.AccessControlException: Permission denied: user=rstudio, access=WRITE, inode="/user":hdfs:hadoop:drwxr-xr-x

The hadoop user can establish a connection to spark, but cannot log in to rstudio... @#$!#$!

Found: http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-setting-system-directory-permissions.html which may solve the rstudio connection problem. Restarting services...
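Judging from the AccessControlException above, the fix presumably amounts to giving rstudio its own writable HDFS home directory; a sketch, with the user/group names assumed from the error message:

```shell
# hedged sketch: create an HDFS home directory for the rstudio user so
# Spark can write under /user/rstudio instead of /user
# (user and group names are assumed from the error above)
sudo -u hdfs hdfs dfs -mkdir -p /user/rstudio
sudo -u hdfs hdfs dfs -chown rstudio:rstudio /user/rstudio
```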

gmobi-wush commented 8 years ago

Successfully connected to mongodb and pulled data out of the database / collection:

library(SparkR)
Sys.setenv("HADOOP_USER_NAME" = "hadoop")
sc <- sparkR.init(sparkJars = paste(dir("/home/hadoop/spark-mongo", full.names = TRUE), collapse = ","))
sqlContext <- sparkRHive.init(sc)
df <- cache(read.df(sqlContext, source = "com.stratio.datasource.mongodb", host = "<ip>", 
              database = "<db>", collection = "<collection>", splitSize = 2,
              splitKey = "_id", samplingRatio = 1.0))
registerTempTable(df, "buffer")
collect(sql(sqlContext, "SELECT * FROM buffer LIMIT 1"))

gmobi-wush commented 8 years ago

sparkR has a data structure conversion problem:

Error in as.data.frame.default(x[[i]], optional = TRUE) : 
  cannot coerce class ""jobj"" to a data.frame

gmobi-wush commented 8 years ago

zeppelin cannot be reached directly yet (need to look into the network settings)

But via an ssh tunnel I confirmed it does start; haven't yet had time to verify whether it can reach mongodb.
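For reference, the tunnel I'd expect to work, assuming Zeppelin is on EMR's default port 8890; the key file and master DNS are placeholders:

```shell
# hedged sketch: forward Zeppelin's web UI over ssh
# (key file and master DNS are placeholders; port 8890 is assumed to be
# Zeppelin's default on EMR)
ssh -i ~/my-key.pem -N -L 8890:localhost:8890 hadoop@<master-public-dns>
# then open http://localhost:8890 in a local browser
```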

gmobi-wush commented 8 years ago

Before the interpreter starts working, enter:

%dep
z.load("/home/hadoop/spark-mongo/casbah-commons_2.10-2.8.0.jar")
z.load("/home/hadoop/spark-mongo/casbah-core_2.10-2.8.0.jar")
z.load("/home/hadoop/spark-mongo/casbah-query_2.10-2.8.0.jar")
z.load("/home/hadoop/spark-mongo/mongo-java-driver-2.13.0.jar")
z.load("/home/hadoop/spark-mongo/spark-mongodb_2.10-0.11.1.jar")

and then you can connect to mongodb.