sparklyr / sparklyr

R interface for Apache Spark
https://spark.rstudio.com/
Apache License 2.0

Hive metastore not found after Spark 2.3 upgrade #1823

Closed rjsteckel closed 5 years ago

rjsteckel commented 5 years ago

The company I work for upgraded our data lake's HDP installation to 2.6.5.4-1 last month. This upgraded Spark to 2.3 (previously it was 2.1). Since then, sparklyr has been unable to find the data lake hive metastore. It works fine in spark-shell and pyspark.

I've included the log output for spark-shell and sparklyr. The two relevant lines I see are:

hive config file: file:/etc/spark2/2.6.5.4-1/0/hive-site.xml
spark.sql.warehouse.dir='file:/home/rs990e/spark-warehouse/'

Both logs show the same settings for these; however, spark-shell initializes a HiveMetastoreConnection after loading hive-site.xml and sparklyr does not. I assume I'm missing some configuration, but I don't know what it is. spark.sql.catalogImplementation is set to 'hive'.

Using sparklyr_0.9.3

Spark-shell logs:

scala> spark.sql("show tables")
INFO SharedState: loading hive config file: file:/etc/spark2/2.6.5.4-1/0/hive-site.xml
INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/rs990e/spark-warehouse/').
INFO SharedState: Warehouse path is 'file:/home/rs990e/spark-warehouse/'.
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6108fd23{/SQL,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@746f8520{/SQL/json,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@306bf4c3{/SQL/execution,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@5cf80dfb{/SQL/execution/json,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@17fb5184{/static/sql,null,AVAILABLE,@Spark}
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
***INFO HiveUtils: Initializing HiveMetastoreConnection version 1.2.1 using Spark classes.
INFO metastore: Trying to connect to metastore with URI thrift://ylpd270.kmdc.att.com:9083
INFO metastore: Connected to metastore.
INFO SessionState: Created local directory: /tmp/9d6d48d3-454e-4f53-956f-993362a4b858_resources
INFO SessionState: Created HDFS directory: /tmp/hive/rs990e/9d6d48d3-454e-4f53-956f-993362a4b858
INFO SessionState: Created local directory: /tmp/rs990e/9d6d48d3-454e-4f53-956f-993362a4b858
INFO SessionState: Created HDFS directory: /tmp/hive/rs990e/9d6d48d3-454e-4f53-956f-993362a4b858/_tmp_space.db
INFO HiveClientImpl: Warehouse location for Hive client (version 1.2.2) is file:/home/rs990e/spark-warehouse/

Sparklyr logs:

INFO SharedState: loading hive config file: file:/etc/spark2/2.6.5.4-1/0/hive-site.xml
INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/rs990e/spark-warehouse/').
INFO SharedState: Warehouse path is 'file:/home/rs990e/spark-warehouse/'.
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@47ae6251{/SQL,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@4f5346e1{/SQL/json,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@6a6a7e99{/SQL/execution,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@420d3811{/SQL/execution/json,null,AVAILABLE,@Spark}
INFO ContextHandler: Started o.s.j.s.ServletContextHandler@1ee42247{/static/sql,null,AVAILABLE,@Spark}
INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
INFO CodeGenerator: Code generated in 299.611091 ms
INFO SparkContext: Starting job: collect at utils.scala:44


harryprince commented 5 years ago

I guess your Hive-related config might be wrong.

My example is:

Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")
config <- sparklyr::spark_config()

# ship hive-site.xml with the application; only needed in yarn-cluster mode
config$sparklyr.shell.files <- "/etc/hive/conf/hive-site.xml"
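
Put together, a minimal sketch of the full connection might look roughly like this (the master value and the verification query are assumptions; adjust for your cluster):

library(sparklyr)

Sys.setenv(HADOOP_CONF_DIR = "/etc/hadoop/conf")

config <- spark_config()
# ship hive-site.xml with the application; only needed in yarn-cluster mode
config$sparklyr.shell.files <- "/etc/hive/conf/hive-site.xml"

sc <- spark_connect(master = "yarn-cluster", config = config)

DBI::dbGetQuery(sc, "SHOW TABLES")  # should list the Hive tables if the metastore is found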

Please paste your full Spark configuration code if you need more help.

rjsteckel commented 5 years ago

I don't think it's related to hive-site.xml because the same one is used in spark-shell and it works. That's why I posted the logs.

My spark_config object is:

> spark_config(sc)
$spark.env.SPARK_LOCAL_IP.local
[1] "127.0.0.1"

$sparklyr.connect.csv.embedded
[1] "^1.*"

$spark.sql.catalogImplementation
[1] "hive"

$spark.sql.hive.metastore.version
[1] "1.2.1"

$spark.sql.hive.metastore.jars
[1] "builtin"

$sparklyr.connect.cores.local
[1] 12

$spark.sql.shuffle.partitions.local
[1] 12

attr(,"config")
[1] "default"
attr(,"file")
[1] "/opt/data/share05/sandbox/sandbox37/sndbx_scripts/udfs/RLibs/sparklyr/conf/config-template.yml"

genobobeno commented 5 years ago

Same issue. HDP 2.6.5, Spark 2.3.0, Hive 1.2.1000, YARN 2.7.3, sparklyr 0.9.3.9001. We've had to completely stop all development using sparklyr. All of this Hadoop technology, frankly, is pretty pathetic in terms of its "robustness". I've been working with this stuff for 18 months and we still have problems that the Hortonworks software engineers can't even explain. Their suggestion is always: update your stack.

sparklyr connects, but RStudio reports "(No tables)".

sdanielzafar commented 5 years ago

Having the same error as I was having months back. After generating a Spark context with HDP3 and the following config:

conf$hive.metastore.uris <- "thrift://my_server:9083"

I'm getting the following errors:

> DBI::dbGetQuery(sc, "create table iris_hive as SELECT * FROM iris_spark_table") # not working!
Error: org.apache.spark.sql.AnalysisException: Hive support is required to CREATE Hive TABLE (AS SELECT);;

and

> DBI::dbGetQuery(sc, "use my_db")
Error: org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'my_db' not found;
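
For context, a minimal sketch of a connection built around that config line (the master value and the explicit catalog setting are illustrative assumptions, not necessarily what was used here):

conf <- sparklyr::spark_config()
# when the catalog implementation is "in-memory" rather than "hive", Spark raises
# the "Hive support is required to CREATE Hive TABLE" error shown above
conf$spark.sql.catalogImplementation <- "hive"
conf$hive.metastore.uris <- "thrift://my_server:9083"

sc <- sparklyr::spark_connect(master = "yarn", config = conf)
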
sdanielzafar commented 5 years ago

It looks to me like @kevinykuo solved one big issue in the previous thread, so now only Hortonworks users are experiencing issues. I volunteer to do a screen share (after rstudio::conf) if needed for a reprex.

javierluraschi commented 5 years ago

@sdanielzafar would we be able to reproduce this with the Hortonworks Sandbox VMs? It looks like Hortonworks 3.0.1 and 3.1.1 would be really easy to install; does this reproduce in both versions?

sdanielzafar commented 5 years ago

@javierluraschi I can't say for certain, as I'm using Hortonworks Data Platform 3.0.0, but my guess is that it will reproduce there since it has an updated Spark version.

sdanielzafar commented 5 years ago

I'm not sure if it's related, but it's worth mentioning that when I tried to read Hive tables via spark_read_jdbc I only got the schema, no data:

sparklyr::spark_read_jdbc(
  sc,
  "tbl",
  options = list(
    url = paste0("<url>/", db, ";serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2"),
    user = user,
    password = password,
    dbtable = "tbl",
    driver = "org.apache.hive.jdbc.HiveDriver"
  )
)

returns

# Source: spark<tbl> [?? x 5]
# ... with 5 variables: tbl.id <int>, tbl.format <chr>, tbl.account <chr>, tbl.premise <chr>,
#   tbl.svcpt_id <chr>
kevinykuo commented 5 years ago

About to spin up an HDP sandbox to investigate. It sounds like this issue is specific to Hortonworks? If you're running into this on another distro, please let us know.

genobobeno commented 5 years ago

We spoke on Friday about our current issue with the Hive metastore and sparklyr; it is consistent with the other Hortonworks reports here. Here are the details you requested. I'm also amenable to any of your engineers poking around in our cluster if you think that would be helpful. If so, please let me know and I'll provide more details about the architecture of our cluster, initialize my Kerberos ticket, and log in to all of the nodes of my cluster (with sudo privileges) as well as the RStudio server.

sparklyr version that DOES NOT work with the Hive metastore: 0.9.3.9001
sparklyr version that DOES work with the Hive metastore: 0.8.4.9012

Here is a snapshot of our Hortonworks stack via Ambari 2.6.2.2:

And here is the Versions tab:


kevinykuo commented 5 years ago

This seems relevant, although a couple of reports in this thread note HDP 2.6.5: https://community.hortonworks.com/articles/223626/integrating-apache-hive-with-apache-spark-hive-war.html

Note: From HDP 3.0, catalogs for Apache Hive and Apache Spark are separated, and they use their own catalog; namely, they are mutually exclusive - Apache Hive catalog can only be accessed by Apache Hive or this library, and Apache Spark catalog can only be accessed by existing APIs in Apache Spark. In other words, some features such as ACID tables or Apache Ranger with Apache Hive table are only available via this library in Apache Spark. Those tables in Hive should not directly be accessible within Apache Spark APIs themselves.
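
If that note applies here, Hive tables on HDP 3.x would have to be reached through the Hive Warehouse Connector rather than Spark's own catalog. A rough, unverified sketch of what that might look like from sparklyr, assuming the class and method names from the linked article; the jar path, JDBC URL, and table names are placeholders:

library(sparklyr)

config <- spark_config()
# placeholder path to the HWC assembly jar shipped with HDP 3.x
config$sparklyr.shell.jars <- "/usr/hdp/current/hive_warehouse_connector/hive-warehouse-connector-assembly.jar"
config$spark.sql.hive.hiveserver2.jdbc.url <- "jdbc:hive2://my_server:10000"  # placeholder

sc <- spark_connect(master = "yarn", config = config)

# build a HiveWarehouseSession and run a query through it
hive <- invoke_static(sc, "com.hortonworks.hwc.HiveWarehouseSession", "session", spark_session(sc)) %>%
  invoke("build")

result <- hive %>%
  invoke("executeQuery", "SELECT * FROM my_db.my_table LIMIT 10") %>%
  sdf_register("my_table_preview")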

sdanielzafar commented 5 years ago

What are the implications of that?

eddjberry commented 5 years ago

@kevinykuo I'm having the same issue here, with Cloudera rather than Hortonworks as our distro.

I encountered this issue using sparklyr 0.9.4, Spark 2.3.1, and Hadoop 2.6. I ran into it when trying to add dynamic allocation to the Spark config while simultaneously updating sparklyr. I don't think the dynamic allocation is a factor, as removing it makes no difference. I've also tried adding spark.sql.catalogImplementation <- 'hive', but it doesn't help.

Previously, using sparklyr 0.8.4.9005 with Spark 2.3.1 and Hadoop 2.6, things were working fine. [Update: I have now successfully rolled back by installing version 0.8.4.9005 again.]

Below is how the config is being specified.

# Set config options based on global variables
config <- sparklyr::spark_config()
config$spark.executor.memory <- "2G"
config$spark.executor.cores <- Sys.getenv("SPARK_CORES")
config$spark.yarn.executor.memoryOverhead <- "4G"
config$spark.dynamicAllocation.enabled <- "true"
config$spark.dynamicAllocation.maxExecutors <- Sys.getenv("SPARK_EXECUTORS")
config$spark.shuffle.service.enabled <- "true"
config$spark.shuffle.port <- "7337"
config$spark.dynamicAllocation.cachedExecutorIdleTimeout <- "7200"

# Add extra Java options
config$sparklyr.shell.conf <-
  paste0("spark.driver.extraJavaOptions=-Dhive.metastore.uris=",
         Sys.getenv("HIVE_METASTORE_URIS"))

sc <- spark_connect(master = "yarn",
                    config = config)
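
For completeness, an alternative to the extraJavaOptions approach is passing the metastore URI through Spark's spark.hadoop.* prefix, which copies the property into the Hadoop configuration; whether that resolves this particular issue is an assumption, not something verified on this cluster:

config$spark.hadoop.hive.metastore.uris <- Sys.getenv("HIVE_METASTORE_URIS")
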
javierluraschi commented 5 years ago

@eddjberry there is a related fix for Hortonworks worth trying out. Would you mind using this patch and reporting back whether it fixes this issue for you as well?

devtools::install_github("rstudio/sparklyr", ref = "bugfix/hive-context-spark2")
javierluraschi commented 5 years ago

This one should be resolved by https://github.com/rstudio/sparklyr/pull/1872; the fix can be installed through:

devtools::install_github("rstudio/sparklyr")
eddjberry commented 5 years ago

Awesome 🎉. I should get a chance to test this out later in the week. I'll let you know how it goes.

eddjberry commented 5 years ago

I couldn't wait, so I just tested it. It all works fine 😃. Thanks so much for your help!

I used devtools::install_github("rstudio/sparklyr", ref = "bugfix/hive-context-spark2") in order to lock down my build to a specific version.

sdanielzafar commented 5 years ago

Hey, I just tested this. It looks like there is a big improvement! I'm getting solid connections to the Hive metastore 👍

I'm not sure if it's just me, but I am having issues pulling my compressed ORC Hive tables. In the Hive CLI I have no issue, but from sparklyr I'm only getting metadata:

> # setting my db
> DBI::dbGetQuery(sc, "use loggers")

with sparklyr:

> tbl(sc, "ami_raw")
# Source: spark<logger_data> [?? x 8]
# … with 8 variables: id <chr>, read_date <chr>, read_time <chr>, value <dbl>, units <chr>, estimate <chr>, offset <int>

with DBI,

> DBI::dbGetQuery(sc, "SELECT * FROM logger_data LIMIT 10")
[1] id   read_date  read_time      value  units      estimate   offset
<0 rows> (or 0-length row.names)

When I collect I get 0 rows as well.

With Spark SQL:

> hive_context(sc) %>% 
      invoke("sql", "SELECT * FROM logger_data LIMIT 10") %>% 
      collect()
# A tibble: 0 x 5
# … with 5 variables: id <chr>, read_date <int>, read_time <int>, ... 

in Hive CLI:

SELECT COUNT(*) from logger_data;

results in:

+--------------+
|     _c0      |
+--------------+
| 45542072425  |
+--------------+

Are ORC tables supported?
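
One way to rule the metastore in or out, if your sparklyr version has spark_read_orc, is to point it at the table's files directly (a sketch; the HDFS path is hypothetical):

orc_raw <- sparklyr::spark_read_orc(
  sc,
  name = "logger_data_raw",
  path = "hdfs:///apps/hive/warehouse/loggers.db/logger_data"  # hypothetical table location
)
sparklyr::sdf_nrow(orc_raw)  # compare against the Hive CLI count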

sdanielzafar commented 5 years ago

Any updates here?

rjsteckel commented 5 years ago

It worked for me.

kellen-t-oconnor commented 5 years ago

Is there any confirmation that this works on HDP versions newer than 2.x? We recently updated to 3.1, and I'm afraid that R users are going to be hung out to dry if we're stuck on this version.

sdanielzafar commented 5 years ago

It seems my issues mentioned above are not related to sparklyr, I think we're good here!