piccolbo / dplyr.spark.hive

spark and hive backends for dplyr

[Proposition] additional database connection configuration parameters #18

Closed MarcinKosinski closed 8 years ago

MarcinKosinski commented 8 years ago

After reading the source code of your package I have learnt how to connect, via JDBC drivers, to HiveServer2 from R statistical package. It is brilliant in its simplicity.

I have realized that the following command will create a connection with HiveServer2 (even though I didn't specify a log4j properties file), and that the next statement will list the number of tables in a database.

> dbConnect(drv = JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
+                      classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
+                                    "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
+                                    "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar"),
+                      identifier.quote = "`"),
+           url = paste0("jdbc:hive2://", host = 'tools-1.hadoop.srv', ":", port = 10000, "/loghost;auth=noSasl"),
+           username = "mkosinski") -> conn
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> length(dbListTables(conn = conn))
[1] 1792

My connection works only when I pass user authentication parameters at the end of the url parameter, such as ;auth=noSasl. So far it is impossible to pass such arguments to src_HS2 https://github.com/piccolbo/dplyr.spark.hive/blob/master/pkg/R/src-HS2.R#L39, since url only allows passing host and port. Do you think it is possible to somehow pass them via the ... parameter to the JDBC function, or should one create a pull request adding a parameter that specifies what comes after port/ in the url built by src_HS2?

So far this is the workaround:

my_db = src_Hive(host = 'tools-1.hadoop.srv',  port = "10000/loghost;auth=noSasl" )
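A rough sketch of what the proposed extension could look like. The `url.suffix` argument and the `make_hs2_url` helper are hypothetical names used only to illustrate the interface; nothing like this exists in the package yet:

```r
# Hypothetical sketch: a URL builder with an extra url.suffix argument
# that appends the database name and ";key=value" connection options
# after the port, instead of smuggling them through `port`.
make_hs2_url <- function(host, port, url.suffix = NULL) {
  paste0("jdbc:hive2://", host, ":", port, "/",
         if (!is.null(url.suffix)) url.suffix else "")
}

make_hs2_url("tools-1.hadoop.srv", 10000, "loghost;auth=noSasl")
## "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl"
```

With something like this, src_HS2 could keep its current host/port defaults while still letting users append arbitrary connection options.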
piccolbo commented 8 years ago

So what is the relation between this and pull request #15 ? Sorry for being dense, but we can't solve the same problem twice.

MarcinKosinski commented 8 years ago

Via the ... added in #15 one can pass the arguments described at the end of this answer: http://stackoverflow.com/a/26119143/3857701. This means:

formalArgs(getMethod(dbConnect, "JDBCDriver"))
## [1] "drv"      "url"      "user"     "password" "..."

But how can one pass parameters to dbConnect, when the original database URL looks like this:

jdbc:hive2://<host>:<port>/<db>;principal=hive/_HOST@<REALM>;ssl=true;sslTrustStore=<trust_store_path>;trustStorePassword=<trust_store_password>;transportMode=http;httpPath=<http_endpoint>

when through dplyr.spark.hive (src_HS2) one can only pass jdbc:hive2://<host>:<port>/

(I have asked about that on SO: http://stackoverflow.com/questions/34861844/passing-additional-parameters-to-dbconnect-function-for-jdbcdriver-in-r?noredirect=1#comment57470540_34861844.) Since this might not be doable, maybe one should think about extending the url parameter in src_HS2.

My question should rather be entitled "additional database connection configuration parameters". I titled it "user authentication" because I am only interested in the auth=noSasl parameter.
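For reference, the long URL form above can be assembled in R from its pieces. All values below are placeholders taken from the examples in this thread, not real endpoints:

```r
# Assemble a full HiveServer2 JDBC URL by hand. The parts after <db>
# are ";"-separated key=value connection options (principal, ssl,
# transportMode, ... would go in `opts` the same way).
host <- "tools-1.hadoop.srv"
port <- 10000
db   <- "loghost"
opts <- c(auth = "noSasl")

url <- paste0("jdbc:hive2://", host, ":", port, "/", db, ";",
              paste(names(opts), opts, sep = "=", collapse = ";"))
url
## "jdbc:hive2://tools-1.hadoop.srv:10000/loghost;auth=noSasl"
```

This is exactly the string the workaround currently builds by abusing the port argument.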

MarcinKosinski commented 8 years ago

So far I am using this workaround:

> my_db <- src_Hive(host = 'tools-1.hadoop.srv', port = "10000/loghost;auth=noSasl" )

> classPath = c("/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
+   "/usr/share/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar",
+   "/usr/share/hadoop/share/hadoop/common/hadoop-common-2.4.1.jar")
> Sys.setenv(HADOOP_JAR = paste0(classPath, collapse=.Platform$path.sep))
> library(dplyr)

Attaching package: ‘dplyr’

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

> library(dplyr.spark.hive)
Warning: changing locked binding for ‘over’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘partial_eval’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning: changing locked binding for ‘default_op’ in ‘dplyr’ whilst loading ‘dplyr.spark.hive’
Warning messages:
1: package ‘dplyr.spark.hive’ was built under R version 3.2.2 
2: replacing previous import by ‘purrr::%>%’ when loading ‘dplyr.spark.hive’ 
3: replacing previous import by ‘purrr::order_by’ when loading ‘dplyr.spark.hive’ 
> 
> copy_to_from_local = dplyr.spark.hive:::copy_to_from_local
> 
> my_db <- src_Hive(host = 'tools-1.hadoop.srv', port = "10000/loghost;auth=noSasl" )
log4j:WARN No appenders could be found for logger (org.apache.hive.jdbc.Utils).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
> 
> 
> tbl(my_db, sql("select *
+                from TBL
+                where log_date = udf.current_day(-1)
+                limit 1000")) -> dane
> 
> #dplyr.spark.hive:::tbl.src_HS2
piccolbo commented 8 years ago

So you are telling me #15 doesn't work? Did you test it at all?

MarcinKosinski commented 8 years ago

It does, but it does not resolve all problems. I cannot pass 'username' and 'password' in the url, so I am passing them through '...' in dbConnect. Now I'm trying to somehow pass the other arguments that are specified in the url, but this might not be doable :)


piccolbo commented 8 years ago

The dbConnect method just stuffs all the ... args into a java object and passes it to a java connect call. What that accepts should be documented here: https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients. I may have missed it, but have you tried dbConnect(drv = mydrv, user = name, password = pwd, auth = token, etc.)? When you say "it might probably be not doable", do you mean you have tried and run into trouble?
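Spelled out, that suggestion would look like the sketch below. The classPath and host come from the earlier comments; whether `auth` is accepted as a named argument (rather than in the URL) is exactly what is in question, so treat it as untested:

```r
# Sketch of passing auth via dbConnect's ... instead of the URL.
# RJDBC forwards named ... arguments to the driver as connection
# properties; whether the Hive driver honors "auth" this way is
# the open question in this thread.
library(RJDBC)

drv <- JDBC(driverClass = "org.apache.hive.jdbc.HiveDriver",
            classPath = "/opt/hive/lib/hive-jdbc-1.0.0-standalone.jar",
            identifier.quote = "`")

conn <- dbConnect(drv,
                  url  = "jdbc:hive2://tools-1.hadoop.srv:10000/loghost",
                  user = "mkosinski",
                  auth = "noSasl")
```

If this worked, the short URL form would suffice and no change to src_HS2 would be needed.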

MarcinKosinski commented 8 years ago

I have tried various configurations of parameter names in ... in dbConnect, and none of them enables me to connect with the regular short url jdbc:hive2://host:port/. What do you think this token should look like? Just a character, or a specific S3/S4 class in R? I've tried with a character, but without success.


piccolbo commented 8 years ago

Token should be a character, and it gets converted to a character before being passed to the java side. So to recap: you have a token that works when provided as part of the URL but not when provided as an additional arg.

MarcinKosinski commented 8 years ago

Yes. So probably I should have a look at the JDBCDriver method for the dbConnect function in the RJDBC package, or at the java jars.


piccolbo commented 8 years ago

Moving this to rzilla/dplyr.spark.hive#1.