hortonworks-spark / spark-llap


Dependency on Tez? #186

Closed · skliarpawlo closed this 6 years ago

skliarpawlo commented 6 years ago

Hi, I'm a bit confused: why would using this lib cause such an error? As I understand it, Tez is an alternative to the Spark execution engine, so since this is the spark-llap lib, Tez should not be involved at all.

How to reproduce: I built a pyspark package with the patch provided in this lib's docs:

  1. Build step

    # apply patch like in docs
    # then:
    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
  2. Used the sparkly lib (https://github.com/Tubular/sparkly) and ran a basic test script:

    pip install sparkly==2.2.0

    from sparkly import SparklySession

    class Session(SparklySession):
        repositories = ['http://repo.hortonworks.com/content/groups/public/']
        packages = ['com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1']
        options = {
            'spark.sql.hive.llap': 'true',
            'hive.metastore.uris': 'thrift://hive2-host:9083',
            'spark.sql.hive.hiveserver2.jdbc.url': 'jdbc:hive2://hive2-host:10000/',
        }

    spark = Session()

    spark.sql("SELECT * FROM ...").show()


What I got:
Stack trace:

Traceback (most recent call last):
  File "test_spark.py", line 18, in <module>
    spark.sql("SELECT * FROM qqq").show()
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/pyspark/sql/dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o34.showString.
: java.io.IOException: org.apache.hive.service.cli.HiveSQLException: java.lang.NoClassDefFoundError: org/apache/tez/dag/api/TezConfiguration
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:230)
        at org.apache.hadoop.hive.llap.LlapRowInputFormat.getSplits(LlapRowInputFormat.java:45)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.getPartitions(HadoopRDD.scala:408)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2390)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2792)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2389)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2396)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2822)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2346)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.lang.NoClassDefFoundError: org/apache/tez/dag/api/TezConfiguration
        at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
        at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
        at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:222)
        ... 50 more
Caused by: java.lang.RuntimeException: java.lang.NoClassDefFoundError: org/apache/tez/dag/api/TezConfiguration
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:89)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy42.fetchResults(Unknown Source)
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:559)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:751)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1717)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1702)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
Caused by: java.lang.NoClassDefFoundError: org/apache/tez/dag/api/TezConfiguration
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.createPlanFragment(GenericUDTFGetSplits.java:225)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.process(GenericUDTFGetSplits.java:190)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:116)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:438)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:430)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:494)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:307)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:878)
        at sun.reflect.GeneratedMethodAccessor72.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        ... 18 more
Caused by: java.lang.ClassNotFoundException: org.apache.tez.dag.api.TezConfiguration
        at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        ... 36 more



However, DDL operations like `show databases`, `show tables`, and `drop table` worked fine.

Any hints on what I'm doing wrong are much appreciated. Thanks
zjffdu commented 6 years ago

I believe this is because spark-llap depends on Hive, and Hive depends on Tez.

skliarpawlo commented 6 years ago

Thanks @zjffdu for the quick response. If Hive merely depends on Tez but doesn't actually use it in this specific case, that's OK. If Tez is actually used, then I'm really confused about how all this works: how is Tez related to Spark query execution? Either way, I still wonder: if spark-llap depends on Hive and Hive depends on Tez, why aren't all the requirements installed when I specify the 'spark-llap' package in Spark? And the overall question: how can I fix this, and should the usage docs be updated with instructions to avoid this kind of confusion? Thanks

Upd: currently I'm thinking of adding the Tez jars to the Spark env where I'm testing all of this, but I really don't know what I'm doing :)
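A minimal sketch of what I mean, assuming sparkly session options can carry plain Spark conf; the jar paths below are made-up placeholders (tez-api/tez-dag are real Tez artifacts, the locations are not):

    from sparkly import SparklySession

    # Hypothetical locations of locally downloaded Tez jars.
    TEZ_JARS = '/opt/tez/tez-api-0.9.0.jar,/opt/tez/tez-dag-0.9.0.jar'

    class Session(SparklySession):
        options = {
            # spark.jars takes a comma-separated list and ships the
            # jars to the driver and executors.
            'spark.jars': TEZ_JARS,
            # extraClassPath entries are colon-separated, not comma-separated.
            'spark.driver.extraClassPath': TEZ_JARS.replace(',', ':'),
        }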

jdere commented 6 years ago

spark-llap is a Spark connector to Hive/LLAP. It talks to Hive/LLAP for reads of Hive tables, rather than going through the normal Spark path. This results in spark-llap having dependencies on Hive/LLAP/Tez libs. That's just a fact of how it works.

If you are just trying to use vanilla Spark, you should set spark.sql.hive.llap=false; but since you have spark.sql.hive.llap=true in your config, I assume you are trying to use spark-llap.
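For instance, to rule spark-llap out while debugging, the flag can be flipped in the same sparkly-style session as in the report above (an illustrative sketch; option names are copied from that snippet, and some_table is a placeholder):

    from sparkly import SparklySession

    class VanillaSession(SparklySession):
        # With spark.sql.hive.llap=false, reads should go through Spark's
        # normal Hive support instead of the LLAP input format, so no
        # get_splits() call is issued against HiveServer2.
        options = {
            'spark.sql.hive.llap': 'false',
            'hive.metastore.uris': 'thrift://hive2-host:9083',
        }

    spark = VanillaSession()
    spark.sql('SELECT * FROM some_table').show()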

So the actual error seems to be occurring in HiveServer2... check your HiveServer2 logs to confirm that you see the same java.lang.NoClassDefFoundError there. That is a bit surprising, since I would have expected Hive to have the necessary Tez dependencies. If not, you might want to check that HiveServer2 is running properly and that it has the right Tez libraries on its classpath.

Are you running HDP, and if so, which version? Is the HiveServer2 your Spark config points to running HiveServer2 Interactive (with LLAP)?

skliarpawlo commented 6 years ago

Thanks for your response @jdere, that makes total sense to me. No, I'm not running HDP of any version; we deployed Hive/Hive Metastore to our own machines, and Hive is definitely not in LLAP mode, so that is what I need to figure out now. The strange thing to me is that Tez isn't present here anyway, so I probably need either some other Hive build or to add Tez to Hive's classpath. I'll try this after verifying that the error reproduces on the Hive side. Thanks

skliarpawlo commented 6 years ago

To be clear about my use case: we don't use HDFS in our stack (we use S3), so using LLAP (if I understand things right) is not something we can do. The only thing we need is to apply Apache Ranger policies when querying the Metastore. I know that is not fully secure, but that is what we have, and we can be happy with it if it works for us.

skliarpawlo commented 6 years ago

The HiveServer logs don't contain any errors, btw; there are only 'OK's for succeeded operations, so it seems the error is not on the HiveServer's side. I'm still going to look into how to run Hive with LLAP support, though.

skliarpawlo commented 6 years ago

Update on this: audit logs and permission checks on the Hive and Ranger side look good; there are no errors in the logs, and an auth error is raised when appropriate. So the problem is definitely on the Spark side. I wonder which jars I need to provide; I'll try all of Tez :) Hints are still welcome.

skliarpawlo commented 6 years ago

Okay, I tried adding the Tez 0.9.0 jars to $HIVE_HOME/ext in Hive and to $SPARK_HOME/ext in Spark, and now at least the error changed.

Summary of versions used:

| package | version |
| --- | --- |
| spark-llap-assembly_2.11 | 1.1.3-2.1 |
| Tez | 0.9.0 |
| Hive | 2.3.2 |
| Ranger | 0.6.2 |

Do these versions make sense together? I'm not sure where to find compatibility info.

Now it's a strange SQL parsing error coming from Hive; i.e. I see this error in the HiveServer2 logs as well as in the Spark job output when I run:

spark.sql('select * from xxx')

Exception is:

Traceback (most recent call last):
  File "test_spark.py", line 27, in <module>
    spark.sql("SELECT * FROM qqq").show()
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/pyspark/sql/dataframe.py", line 318, in show
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/py4j/java_gateway.py", line 1133, in __call__                                                    
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o40.showString.
: java.io.IOException: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:10 missing \' at 'from' near '<EOF>'
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:230)
        at org.apache.hadoop.hive.llap.LlapRowInputFormat.getSplits(LlapRowInputFormat.java:45)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.getPartitions(HadoopRDD.scala:408)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2390)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2792)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2389)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2396)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2822)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2346)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)                                                                                    
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:10 missing \' at 'from' near '<EOF>'
        at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
        at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
        at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:222)
        ... 50 more
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:10 missing \' at 'from' near '<EOF>'
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:499)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:307)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:878)
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy42.fetchResults(Unknown Source)
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:559)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:751)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1717)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1702)                                     
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
Caused by: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:10 missing \' at 'from' near '<EOF>'
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:165)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:494)
        ... 24 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:10 missing \' at 'from' near '<EOF>'
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.createPlanFragment(GenericUDTFGetSplits.java:234)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.process(GenericUDTFGetSplits.java:190)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:116)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:438)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:430)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
        ... 26 more
jdere commented 6 years ago

Hmm, if the HiveServer2 log only consists of OK lines, then you might be looking at the wrong log file. That file should have all of the log4j logging.

So LLAP can work with S3, but that doesn't really seem to be what you're trying to do anyway. Are you trying to restrict access to Hive Metastore objects (as opposed to access to the tables and table data) based on Ranger policies? I'm not sure spark-llap is going to help with that either. spark-llap can restrict table/column access to Hive tables based on the Ranger rules when you run SparkSQL queries. Asking around, it does not sound like Ranger actually has rules that restrict access to the Hive Metastore. The only model supported there, in terms of restricting Metastore access, is Hive StorageBasedAuthorization (https://cwiki.apache.org/confluence/display/Hive/Storage+Based+Authorization+in+the+Metastore+Server), where the Hive Metastore limits access to database/table metadata based on the user's FS permissions on the database/table locations.

skliarpawlo commented 6 years ago

@jdere sorry for my bad wording; what I'm trying to achieve is exactly what you explained spark-llap does. I need to restrict access to our warehouse tables, which are implemented using S3 + Hive Metastore. What I understood from the spark-llap docs and presentations is that it requires two things: the Ranger plugin in Hive, and the LLAP service/mode on the HDFS name node. But as I mentioned, we don't use HDFS (we use S3), so that part is not of interest to me.

Then, as I understand it, from a security perspective nothing will prevent users from reading directly from S3, but for now I assume this is OK. That's why I said I'm only interested in "protecting metadata access"; sorry again for the bad wording.

Can you please elaborate on "So llap can work with s3"? AFAIK S3 doesn't have file-level access modifiers, right?

jdere commented 6 years ago

I don't really have any specifics or details about LLAP with S3, but I don't believe your table directories living on S3 will cause errors if you are running LLAP.

I'm not sure what that query compilation error means... can you check in the Hive logs whether the query being sent and compiled matches "select * from xxx"? You can search the Hive logs for "Parsing " to see the query that was being compiled.

skliarpawlo commented 6 years ago

@jdere that's right: no errors, but no access control either, which is what I'm talking about. It seems like it should work, but only with metadata-retrieval protection; nothing will protect users from reading from S3 directly if they know where to look, even though Hive could raise a permission exception. But again, that is fine for our use case, so I'm still fighting to make this work. Thanks for your advice, very useful. Currently trying to make Hive logging more verbose.

skliarpawlo commented 6 years ago

UPD: I managed to set up log4j logging, and now I can see the full SQL that fails to compile:

select get_splits("select _1 from (select * from yyy.qqq) as q_2464617cc8ec41faac5c955c199e2433 ",2)

The inner query of it fails:

select _1 from (select * from yyy.qqq) as q_2464617cc8ec41faac5c955c199e2433

with this error:

Error: Error while compiling statement: FAILED: ParseException line 1:10 missing \' at 'from' near '<EOF>' (state=42000,code=40000)

Maybe the problem is with this specific column name. I feel like I'm so close :)

skliarpawlo commented 6 years ago

I gave up for a while on making things work on S3; there are a couple of exceptions which may be related to my local S3 settings. What I tried instead is to create a table located on the local fs where I run single-process Spark, and then read from it. Code:

from sparkly import SparklySession

class Session(SparklySession):
    repositories = ['http://repo.hortonworks.com/content/groups/public/']
    packages = ['com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1']
    options = {
               'spark.sql.hive.llap':'true',
               'hive.metastore.uris': 'thrift://host:9083',
               'spark.sql.hive.hiveserver2.jdbc.url': 'jdbc:hive2://host:10000/',
              }

spark = Session()

df = spark.createDataFrame([('foo', 1), ('bar', 2)], ['name', 'age'])
df.write.saveAsTable('tableA', format='parquet', path='/tmp/tableA', mode='overwrite')

spark.table('tableA').show()

The exception is raised on the last line (the read from the table):

  File "test_spark.py", line 23, in <module>
    spark.table('yyy.ttt').show()
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/pyspark/sql/dataframe.py", line 318, in show
    print(self._jdf.showString(n, 20))
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/mnt/pavlo/venv/lib/python3.5/site-packages/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o37.showString.
: java.io.IOException: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Was expecting a single TezTask.
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:230)
        at org.apache.hadoop.hive.llap.LlapRowInputFormat.getSplits(LlapRowInputFormat.java:45)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.getPartitions(HadoopRDD.scala:408)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)                                                                                        
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2390)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2792)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2389)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2396)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2822)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2346)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Was expecting a single TezTask.
        at org.apache.hive.jdbc.Utils.verifySuccess(Utils.java:256)
        at org.apache.hive.jdbc.Utils.verifySuccessWithInfo(Utils.java:242)
        at org.apache.hive.jdbc.HiveQueryResultSet.next(HiveQueryResultSet.java:365)
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:222)
        ... 50 more
Caused by: org.apache.hive.service.cli.HiveSQLException: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Was expecting a single TezTask.
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:499)
        at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:307)
        at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:878)
        at sun.reflect.GeneratedMethodAccessor9.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
        at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
        at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
        at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
        at com.sun.proxy.$Proxy42.fetchResults(Unknown Source)                                                                                        
        at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:559)
        at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:751)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1717)
        at org.apache.hive.service.rpc.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1702)
        at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
        at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
        at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:56)
        at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        ... 1 more
Caused by: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: Was expecting a single TezTask.
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:165)
        at org.apache.hadoop.hive.ql.Driver.getResults(Driver.java:2208)
        at org.apache.hive.service.cli.operation.SQLOperation.getNextRowSet(SQLOperation.java:494)
        ... 24 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Was expecting a single TezTask.
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.createPlanFragment(GenericUDTFGetSplits.java:242)
        at org.apache.hadoop.hive.ql.udf.generic.GenericUDTFGetSplits.process(GenericUDTFGetSplits.java:190)
        at org.apache.hadoop.hive.ql.exec.UDTFOperator.process(UDTFOperator.java:116)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:95)
        at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:897)
        at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:130)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:438)
        at org.apache.hadoop.hive.ql.exec.FetchOperator.pushRow(FetchOperator.java:430)
        at org.apache.hadoop.hive.ql.exec.FetchTask.fetch(FetchTask.java:147)
        ... 26 more

I have no idea what is going on; my only assumption is that I'm using incompatible Hive/Hadoop/spark-llap/Tez versions. Does anybody have a clue about this error?

skliarpawlo commented 6 years ago

To be clear: queries to Hive like this:

select get_splits("select * from yyy.ttt", 2);

fail with this error:

org.apache.hadoop.hive.ql.metadata.HiveException: Was expecting a single TezTask

I hope I've just set up Tez the wrong way; digging further.

skliarpawlo commented 6 years ago

Okay, after setting the execution engine on Hive to Tez, the problem is more obvious: I have to set up Tez properly and start the LLAP service daemon. Kind of obvious in hindsight; it just wasn't clear to me why we need LLAP in the first place and what its role is. Closing the ticket for now. If you have some final hints for me, they're still welcome :)

skliarpawlo commented 6 years ago

I actually do have one more question: if I use spark-llap and all this stuff, is Kerberos the only authentication option?

jdere commented 6 years ago

Yes, I believe Kerberos would be the only authentication option. You can try it out without any authentication and you should still see the per-user filters/masking applied, but you would need Kerberos for a real deployment.

I'm not sure how spark-llap would have generated the SQL "select _1 from (select * from yyy.qqq) as q_2464617cc8ec41faac5c955c199e2433". Can you do a DESCRIBE of yyy.qqq from both Spark and Hive? Is yyy.qqq a view?

Also surprised about the "Was expecting a single TezTask" error. What SQL was submitted to HiveServer2 for that query? Can you try running the Hive EXPLAIN for the inner query (inside the get_splits()) and see what that looks like?

skliarpawlo commented 6 years ago

@jdere yeah, I tried that, but I don't understand which user is being authenticated, and how I can change it to debug different policies locally without a local Kerberos deployment.

yyy.qqq is not a view; it's a simple table I created for testing, and _1 is its only column name. I will provide all these DESCRIBEs and EXPLAINs if the exception reoccurs; for now I'm trying to get a minimal yarn/hdfs/llap cluster to work so that I can test everything within the implied environment. Thank you very much for the debugging hints!

jdere commented 6 years ago

Without Kerberos, the user is specified via the user.name property, which is passed to the HiveServer2 connection. You might also be able to set the user in the HiveServer2 JDBC URL.
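For example, a hypothetical sketch based on the sparkly-style session from earlier in this thread; the `;user=...` session variable is standard HiveServer2 JDBC URL syntax, but whether spark-llap honors it here is an assumption to verify:

    from sparkly import SparklySession

    class Session(SparklySession):
        repositories = ['http://repo.hortonworks.com/content/groups/public/']
        packages = ['com.hortonworks.spark:spark-llap-assembly_2.11:1.1.3-2.1']
        options = {
            'spark.sql.hive.llap': 'true',
            'hive.metastore.uris': 'thrift://hive2-host:9083',
            # 'alice' is a placeholder identity for testing different
            # Ranger policies without a Kerberos deployment.
            'spark.sql.hive.hiveserver2.jdbc.url':
                'jdbc:hive2://hive2-host:10000/default;user=alice',
        }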

skliarpawlo commented 6 years ago

Hi guys. I finally managed to set up the Hive LLAP service in staging and came back to the test script. Now I've hit a new, quite confusing exception. Code:

... init spark session stuff
spark.sql("select name from yyy.ttt").show()

Exception:

: java.io.IOException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING' but 'STRING' is found.
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:230)
        at org.apache.hadoop.hive.llap.LlapRowInputFormat.getSplits(LlapRowInputFormat.java:45)
        at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:207)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.HadoopRDD$HadoopMapPartitionsWithSplitRDD.getPartitions(HadoopRDD.scala:408)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
        at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:311)
        at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
        at org.apache.spark.sql.Dataset$$anonfun$org$apache$spark$sql$Dataset$$execute$1$1.apply(Dataset.scala:2390)
        at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
        at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2792)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$execute$1(Dataset.scala:2389)
        at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collect(Dataset.scala:2396)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2132)
        at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2822)
        at org.apache.spark.sql.Dataset.head(Dataset.scala:2131)
        at org.apache.spark.sql.Dataset.take(Dataset.scala:2346)
        at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:280)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:214)
        at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'STRING' but 'STRING' is found.
        at shadehive.org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:372)
        at shadehive.org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.expect(TypeInfoUtils.java:355)
        at shadehive.org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseType(TypeInfoUtils.java:416)
        at shadehive.org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils$TypeInfoParser.parseTypeInfos(TypeInfoUtils.java:329)
        at shadehive.org.apache.hadoop.hive.serde2.typeinfo.TypeInfoUtils.getTypeInfoFromTypeString(TypeInfoUtils.java:831)
        at org.apache.hadoop.hive.llap.FieldDesc.readFields(FieldDesc.java:63)
        at org.apache.hadoop.hive.llap.Schema.readFields(Schema.java:72)
        at org.apache.hadoop.hive.llap.LlapInputSplit.readFields(LlapInputSplit.java:148)
        at org.apache.hadoop.hive.llap.LlapBaseInputFormat.getSplits(LlapBaseInputFormat.java:226)
        ... 50 more

We also dumped the Thrift response from the Hive server, which it apparently fails to parse:

fields=[(struct, 0, fields=[(string, 1, ttt), (string, 2, yyy), (string, 3, pavlo), (i32, 4, 1517174147), (i32, 5, 0), (i32, 6, 0), (struct, 7, fields=[(list, 1, [fields=[(string, 1, name), (string, 2, string)], fields=[(string, 1, age), (string, 2, bigint)]]), (string, 2, file:/tmp/hms/test1), (string, 3, org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat), (string, 4, org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat), (bool, 5, None), (i32, 6, -1), (struct, 7, fields=[(string, 2, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe), (map, 3, {'path': '/tmp/hms/test1', 'serialization.format': '1'})]), (list, 8, []), (list, 9, []), (map, 10, {}), (struct, 11, fields=[(list, 1, []), (list, 2, []), (map, 3, {})]), (bool, 12, None)]), (list, 8, []), (map, 9, {'spark.sql.sources.schema.part.0': '{"type":"struct","fields":[{"name":"name","type":"string","nullable":true,"metadata":{}},{"name":"age","type":"long","nullable":true,"metadata":{}}]}', 'transient_lastDdlTime': '1517174147', 'spark.sql.sources.provider': 'parquet', 'EXTERNAL': 'TRUE', 'spark.sql.sources.schema.numParts': '1'}), (string, 12, EXTERNAL_TABLE), (bool, 15, None)])]

I also see no errors on Hive's side this time. Note, however, that we're running Hive 2.3.2, while the version officially supported by spark-llap seems to be 2.1.x; could this be the problem?

Any clues are much appreciated, thanks.

jdere commented 6 years ago

Yeah, there may be a version conflict going on here... you can try rebuilding spark-llap, specifying the versions of Hive/Spark that you built or are using, by running "build/sbt assembly -Dhive.version=<hive version> -Dspark.version=<spark version>" from the spark-llap directory.