hortonworks-spark / spark-llap


df selects from table with LIMIT 10, but Spark shows 20 records #267

Closed rollingdeep closed 1 year ago

rollingdeep commented 5 years ago

Just tried it in spark-shell (Spark 2.3.2):

scala> import org.apache.spark.sql.{DataFrame, SparkSession}
scala> import com.hortonworks.hwc.HiveWarehouseSession
scala> val spark = SparkSession.builder.enableHiveSupport().appName("Test").getOrCreate()
scala> val hive = HiveWarehouseSession.session(spark).build()
scala> hive.setDatabase("my_db")
scala> val df = hive.executeQuery("select * from my_db.test_table limit 10")
scala> df.count()
res2: Long = 10   
scala> df.show(20)
It really shows 20 records here, even though the query has LIMIT 10.

Any help?

jasonjiang8866 commented 5 years ago

Use execute() instead of executeQuery() (LIMIT is not honored by executeQuery()).
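
For illustration, a minimal sketch against the session from the original report (same hive handle and table; as I understand it, execute() goes through HiveServer2 JDBC, which is why it honors LIMIT but caps the result size):

scala> val df = hive.execute("select * from my_db.test_table limit 10")
scala> df.count()   // res: Long = 10 -- the LIMIT is respected on this path
scala> df.show(20)  // shows at most the 10 rows that were actually returned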

rollingdeep commented 4 years ago

> Use execute() instead of executeQuery() (LIMIT is not honored by executeQuery()).

That works, but only up to 1000 rows. If you need more rows, you have to use executeQuery(), which does not work properly with LIMIT...
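
Put concretely (my_db.big_table is a made-up name for a table with more than 1000 rows):

scala> hive.execute("select * from my_db.big_table").count()        // capped: at most 1000 rows come back
scala> hive.executeQuery("select * from my_db.big_table").count()   // full row count, but LIMIT clauses misbehave here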

I found a workaround that lets Spark access Hive external tables (ACID tables are not supported) without spark-llap; see the sketch at the end of this comment. First, remove the spark-llap jar. Second, copy the CREATE TABLE clause from Hive and run it in spark-sql; in short, create the same schema in spark-sql pointing at the same HDFS location. Partitioned tables need their partitions added manually. Third, you can then use spark.sql() and DataFrame.write.orc() without the limitations of spark-llap.

Warning: on a partitioned Hive table you need to add the partitions via a bash or HQL script, so that Hive and Spark agree on the schema and the partition metadata.
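
A minimal sketch of the workaround (the column names, partition column, and HDFS location below are invented for illustration; take the real DDL from SHOW CREATE TABLE in Hive, and run this in a spark-shell without the spark-llap jar):

scala> spark.sql("CREATE EXTERNAL TABLE my_db.test_table (id INT, name STRING) PARTITIONED BY (dt STRING) STORED AS ORC LOCATION 'hdfs:///apps/hive/warehouse/my_db.db/test_table'")
scala> spark.sql("MSCK REPAIR TABLE my_db.test_table")   // one way to register the existing partitions
scala> spark.sql("select * from my_db.test_table limit 10").count()   // res: Long = 10, no spark-llap involved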