hortonworks-spark / shc

The Apache Spark - Apache HBase Connector is a library that lets Spark access HBase tables as an external data source or sink.
Apache License 2.0

Fail to insert a basic DataFrame + jar (shc-core-1.0.1-1.6-s_2.10.jar) on Hortonworks public repo is doc instead of a jar of classes !!! #73

Closed samuelsayag closed 7 years ago

samuelsayag commented 7 years ago

Hello,

The command line below is the one I use to launch my spark-shell:

spark-shell --master yarn \
  --deploy-mode client \
  --name "hive2hbase" \
  --repositories "http://repo.hortonworks.com/content/groups/public/" \
  --packages "com.hortonworks:shc:1.0.1-1.6-s_2.10" \
  --jars "shc-core-1.0.1-1.6-s_2.10.jar" \
  --files "/usr/hdp/current/hive-client/conf/hive-site.xml" \
  --driver-memory 1G \
  --executor-memory 1500m \
  --num-executors 6 2> ./spark-shell.log

I have a simple DataFrame with a count of 5:

scala> newDf
res5: org.apache.spark.sql.DataFrame = [offer_id: int, offer_label: string, universe: string, category: string, sub_category: string, sub_label: string]

It is made of elements of type Row:

scala> newDf.take(1)
res6: Array[org.apache.spark.sql.Row] = Array([28896458,Etui de protection bleu pour li...liseuse Cybook Muse Light liseuse Cybook Muse Light liseuse Cybook Muse HD Etui de protection bleu pour lis... Etui de protection noir pour lis... Etui de protection rose pour lis... Etui de protection orange liseus...,null,null,null,null])

I try to insert this with the following catalog:

scala> cat
res0: String =
{
  "table":{"namespace":"default", "name":"offDen3m"},
  "rowkey":"key",
  "columns":{
    "offer_id":{"cf":"rowkey", "col":"key", "type":"int"},
    "offer_label":{"cf":"cf1", "col":"col1", "type":"string"},
    "universe":{"cf":"cf2", "col":"col2", "type":"string"},
    "category":{"cf":"cf3", "col":"col3", "type":"string"},
    "sub_category":{"cf":"cf4", "col":"col4", "type":"string"},
    "sub_label":{"cf":"cf5", "col":"col5", "type":"string"}
  }
}

When I try to insert with the following code:

newDf.write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
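
For the record, this is the complete snippet I run in the shell, including the import the write call needs (a minimal sketch, assuming newDf and cat are already defined as above):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// `cat` is the catalog string shown above, `newDf` the DataFrame to persist.
// HBaseTableCatalog.newTable -> "5" asks shc to create the table with 5 regions
// if it does not exist yet.
newDf.write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat, HBaseTableCatalog.newTable -> "5"))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()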

And I obtain the following stack:

17/01/03 10:36:42 INFO BlockManagerInfo: Removed broadcast_2_piece0 on 149.202.161.158:37691 in memory (size: 6.4 KB, free: 511.1 MB)
java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog.initRowKey(HBaseTableCatalog.scala:142)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog.<init>(HBaseTableCatalog.scala:152)
at org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog$.apply(HBaseTableCatalog.scala:209)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.<init>(HBaseRelation.scala:163)
at org.apache.spark.sql.execution.datasources.hbase.DefaultSource.createRelation(HBaseRelation.scala:58)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:222)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)

My question is twofold:

  1. Is it possible to insert an org.apache.spark.sql.DataFrame[org.apache.spark.sql.Row] using shc and a catalog?
  2. Given my current catalog, is it supposed to work?

Thank you very much for helping

samuelsayag commented 7 years ago

Hello,

I want to flag something (I may be wrong about this, but it looks strange). After having done:

$ wget http://repo.hortonworks.com/content/groups/public/com/hortonworks/shc-core/1.0.1-1.6-s_2.10/shc-core-1.0.1-1.6-s_2.10.jar

$ mkdir test
$ cp shc-core-1.0.1-1.6-s_2.10.jar test
$ cd test
$ unzip shc-core-1.0.1-1.6-s_2.10.jar

I get the strange result:

$ ls -lah
total 556K
drwxr-xr-x 6 spark hadoop 4.0K Jan  3 13:48 .
drwxr-xr-x 3 spark spark  4.0K Jan  3 13:48 ..
drwxr-xr-x 2 spark hadoop 4.0K Dec 13 09:53 index
-rw-r--r-- 1 spark hadoop  16K Dec 13 09:53 index.html
-rw-r--r-- 1 spark hadoop 6.0K Dec 13 09:53 index.js
drwxr-xr-x 2 spark hadoop 4.0K Dec 13 09:53 lib
drwxr-xr-x 2 spark hadoop 4.0K Dec 13 09:53 META-INF
drwxr-xr-x 3 spark hadoop 4.0K Dec 13 09:53 org
-rw-r--r-- 1 spark hadoop 3.5K Dec 13 09:53 package.html
-rw-r--r-- 1 spark hadoop 504K Jan  3 13:48 shc-core-1.0.1-1.6-s_2.10.jar

and further:

$ ls -lah org/apache/spark/sql/execution/datasources/hbase/
AvroException.html    HBaseConnectionKey.html  RDDResources.html               SchemaConverters$$SchemaType.html
AvroSedes$.html       HBaseFilter$.html        ReferencedResource.html         SchemaMap.html
Bound.html            HBaseRelation.html       RegionResource.html             Sedes.html
BoundRange.html       HBaseRelation$.html      Resource.html                   SerializableConfiguration.html
BoundRange$.html      HBaseResources$.html     RowKey.html                     SerializedTypedFilter.html
BoundRanges.html      HBaseTableCatalog.html   ScanRange.html                  SparkHBaseConf$.html
DoubleSedes.html      HBaseTableCatalog$.html  ScanRange$.html                 TableResource.html
Field.html            HRF.html                 ScanResource.html               TypedFilter.html
FilterType$.html      HRF$.html                SchemaConversionException.html  TypedFilter$.html
GetResource.html      package.html             SchemaConverters$.html          Utils$.html

Unless this is meant to be a documentation jar, it is quite strange...

=> This explains why I had to add a self-compiled jar of the project at tag v1.0.1-1.6 to the spark-shell command line myself: otherwise the classes could not be found on the classpath.

I compiled the shc project myself, doing:

$ git clone https://github.com/hortonworks-spark/shc.git
$ git checkout v1.0.1-1.6
$ mvn clean compile package -P scala-2.10 -DskipTests

=> Is it possible that this compilation produces a version of the jar that causes the error from my first post? java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;

Thanks for helping

weiqingy commented 7 years ago

Thanks, @samouille666. I am looking into it. The jar should not contain "**$.html" files. I will check the Hortonworks repo.

(1) Is it possible to insert an org.apache.spark.sql.DataFrame[org.apache.spark.sql.Row] using shc and a catalog? => Yes.
(2) Given my current catalog, is it supposed to work? => Yes.
(3) Is it possible that this compilation gives a version of the jar that causes the error of my first post (java.lang.NoSuchMethodError: scala.runtime.IntRef.create(I)Lscala/runtime/IntRef;)? => You may want to check the Spark version you were using. v1.0.1-1.6 of SHC is for Spark 1.6.*.
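
For (1), here is a minimal read-back sketch using the same catalog, following the usage pattern in the shc README (it assumes `cat` is the catalog string from your first post):

import org.apache.spark.sql.execution.datasources.hbase.HBaseTableCatalog

// Read the HBase table back into a DataFrame using the same catalog string.
val readDf = sqlContext.read
  .options(Map(HBaseTableCatalog.tableCatalog -> cat))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .load()

// Quick check that the rows written above come back.
readDf.show(5)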

samuelsayag commented 7 years ago

Hello,

Many thanks for your answer. I am using Spark 1.6.2 (on HDP 2.5 I do export SPARK_MAJOR_VERSION=1, and my log displays "SPARK_MAJOR_VERSION is set to 1, using Spark"). This is what I get in the console:

[spark@cluster1-node10 ~]$ export SPARK_MAJOR_VERSION=1
[spark@cluster1-node10 ~]$ spark-shell --version
SPARK_MAJOR_VERSION is set to 1, using Spark
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.6.2
      /_/

Type --help for more information.

But a search on the Internet reveals that the IntRef.create method changed between Scala 2.10 and 2.11. Can you confirm that:

$ git checkout v1.0.1-1.6
$ mvn clean compile package -P scala-2.10 -DskipTests

is the correct way to compile against Scala 2.10.5?
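
As a quick sanity check (just a diagnostic sketch), I can run this in the spark-shell to see which Scala runtime is on the classpath and whether IntRef.create exists there:

// Scala runtime version the shell is actually running on,
// e.g. "version 2.10.5" on a Scala 2.10 spark-shell.
scala.util.Properties.versionString

// Does the runtime's IntRef class expose the create factory method?
// If this is false, a jar compiled against Scala 2.11 will fail with
// java.lang.NoSuchMethodError: scala.runtime.IntRef.create(...)
classOf[scala.runtime.IntRef].getMethods.exists(_.getName == "create")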

Many thanks

weiqingy commented 7 years ago

Yes, it is correct, but you can simply use: mvn clean -Pscala-2.10 -DskipTests package. The jars in the Hortonworks repo (http://repo.hortonworks.com/content/groups/public/com/hortonworks/) work well now; you can use them directly.