huaweicloud / obsa-hdfs


Under spark 2.3.2 + hadoop 3.1.1, the plugin cannot be used after configuring per the manual #3

Open zmzeng opened 3 years ago

zmzeng commented 3 years ago

Problem description: after configuring the plugin according to this repository's manual, accessing an OBS path from code still fails, with the error Class org.apache.hadoop.fs.obs.OBSFileSystem not found.

Environment: Spark 2.3.2, not deployed as a cluster, invoked directly via spark-submit; Hadoop 3.1.1, single-node pseudo-distributed deployment; Spark can access Hadoop normally (see the execution log at the end).

hadoop@ecs-c04d:~$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8, OpenJDK 64-Bit Server VM, 1.8.0_275
Branch
Compiled by user jshao on 2018-09-16T12:15:32Z
Revision
Url
Type --help for more information.
hadoop@ecs-c04d:~$ hdfs version
Hadoop 3.1.1
Source code repository https://github.com/apache/hadoop -r 2b9a8c1d3a2caf1e733d57f346af3ff0d5ba529c
Compiled by leftnoteasy on 2018-08-02T04:26Z
Compiled with protoc 2.5.0
From source with checksum f76ac55e5b5ff0382a9f7df36a3ca5a0
This command was run using /usr/local/hadoop/share/hadoop/common/hadoop-common-3.1.1.jar

I downloaded hadoop-huaweicloud-3.1.1-hw-40.jar and esdk-obs-java-3.20.6.1.jar and placed them in the Spark and Hadoop dependency directories: /usr/local/spark/jars/, /usr/local/hadoop/share/hadoop/common/lib/, /usr/local/hadoop/share/hadoop/tools/lib/, /usr/local/hadoop/share/hadoop/hdfs/lib
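
For reference, a minimal diagnostic sketch (not from the manual, just an assumption about how to narrow this down): with a pyspark session started with the same --jars options, ask the driver JVM via py4j to load the class directly. A Py4JJavaError here would indicate the jar is not visible to the driver's classloader, independent of the core-site.xml settings.

# classpath_check.py -- hypothetical sanity check, names chosen for illustration
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("obs classpath check").getOrCreate()

# _jvm is PySpark's internal py4j view of the driver JVM; Class.forName will
# raise through py4j if the OBS filesystem class is not on the driver classpath.
jvm = spark.sparkContext._jvm
obs_class = jvm.java.lang.Class.forName("org.apache.hadoop.fs.obs.OBSFileSystem")
print(obs_class.getName())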

core-site.xml contents (the OBS bucket is in the CN North-Beijing4 region):

<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/usr/local/hadoop/tmp</value>
        <description>Abase for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
    <property>
        <name>fs.obs.impl</name>
        <value>org.apache.hadoop.fs.obs.OBSFileSystem</value>
    </property>
    <property>
        <name>fs.obs.access.key</name>
        <value>replaced with my actual AK here</value>
    </property>
    <property>
        <name>fs.obs.secret.key</name>
        <value>replaced with my actual SK here</value>
    </property>
    <property>
        <name>fs.obs.endpoint</name>
        <value>obs.cn-north-4.myhuaweicloud.com</value>
    </property>
    <property>
        <name>fs.obs.buffer.dir</name>
        <value>/home/hadoop/obs-buffer</value>
    </property>
</configuration>
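
For comparison, the same OBS settings can also be passed through the SparkSession builder using the spark.hadoop. prefix (which Spark forwards into the Hadoop Configuration) instead of core-site.xml. This is only a sketch with placeholder AK/SK values, not a confirmed workaround:

# obs_conf_sketch.py -- same settings as core-site.xml, supplied via Spark conf
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("obs conf sketch") \
    .config("spark.hadoop.fs.obs.impl", "org.apache.hadoop.fs.obs.OBSFileSystem") \
    .config("spark.hadoop.fs.obs.access.key", "<my-ak>") \
    .config("spark.hadoop.fs.obs.secret.key", "<my-sk>") \
    .config("spark.hadoop.fs.obs.endpoint", "obs.cn-north-4.myhuaweicloud.com") \
    .config("spark.hadoop.fs.obs.buffer.dir", "/home/hadoop/obs-buffer") \
    .getOrCreate()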

Code:

# obs_test.py
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("obs test") \
    .getOrCreate()

df = spark.read.csv("obs://dev-modelarts/kaggle-CTR/data/data/train.csv", header=True, inferSchema=True)
df.printSchema()
df.show()

Submit command: spark-submit --jars hadoop-huaweicloud-3.1.1-hw-40.jar,esdk-obs-java-3.20.6.1.jar obs_test.py

Error log:

hadoop@ecs-c04d:~$ spark-submit --jars hadoop-huaweicloud-3.1.1-hw-40.jar,esdk-obs-java-3.20.6.1.jar obs_test.py
21/01/03 09:04:55 WARN Utils: Your hostname, ecs-c04d resolves to a loopback address: 127.0.1.1; using 192.168.0.230 instead (on interface eth0)
21/01/03 09:04:55 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/01/03 09:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
21/01/03 09:04:56 INFO SparkContext: Running Spark version 2.3.2
21/01/03 09:04:56 INFO SparkContext: Submitted application: obs test
21/01/03 09:04:56 INFO SecurityManager: Changing view acls to: hadoop
21/01/03 09:04:56 INFO SecurityManager: Changing modify acls to: hadoop
21/01/03 09:04:56 INFO SecurityManager: Changing view acls groups to:
21/01/03 09:04:56 INFO SecurityManager: Changing modify acls groups to:
21/01/03 09:04:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(hadoop); groups with view permissions: Set(); users  with modify permissions: Set(hadoop); groups with modify permissions: Set()
21/01/03 09:04:56 INFO Utils: Successfully started service 'sparkDriver' on port 36681.
21/01/03 09:04:56 INFO SparkEnv: Registering MapOutputTracker
21/01/03 09:04:56 INFO SparkEnv: Registering BlockManagerMaster
21/01/03 09:04:56 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
21/01/03 09:04:56 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
21/01/03 09:04:56 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-e294dec0-2f9d-4f3f-9e7e-3875c2b20d58
21/01/03 09:04:56 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
21/01/03 09:04:56 INFO SparkEnv: Registering OutputCommitCoordinator
21/01/03 09:04:57 INFO Utils: Successfully started service 'SparkUI' on port 4040.
21/01/03 09:04:57 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.230:4040
21/01/03 09:04:57 INFO SparkContext: Added JAR file:///home/hadoop/hadoop-huaweicloud-3.1.1-hw-40.jar at spark://192.168.0.230:36681/jars/hadoop-huaweicloud-3.1.1-hw-40.jar with timestamp 1609635897154
21/01/03 09:04:57 INFO SparkContext: Added JAR file:///home/hadoop/esdk-obs-java-3.20.6.1.jar at spark://192.168.0.230:36681/jars/esdk-obs-java-3.20.6.1.jar with timestamp 1609635897155
21/01/03 09:04:57 INFO SparkContext: Added file file:/home/hadoop/obs_test.py at file:/home/hadoop/obs_test.py with timestamp 1609635897166
21/01/03 09:04:57 INFO Utils: Copying /home/hadoop/obs_test.py to /tmp/spark-a51e2865-0465-4a1b-a6c5-1da954078da6/userFiles-6c5a6a09-bdbe-45bb-8629-a2041d237232/obs_test.py
21/01/03 09:04:57 INFO Executor: Starting executor ID driver on host localhost
21/01/03 09:04:57 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39711.
21/01/03 09:04:57 INFO NettyBlockTransferService: Server created on 192.168.0.230:39711
21/01/03 09:04:57 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
21/01/03 09:04:57 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.230:39711 with 366.3 MB RAM, BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.0.230, 39711, None)
21/01/03 09:04:57 INFO EventLoggingListener: Logging events to hdfs://localhost:9000/spark-logs/local-1609635897202
21/01/03 09:04:57 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/home/hadoop/spark-warehouse').
21/01/03 09:04:57 INFO SharedState: Warehouse path is 'file:/home/hadoop/spark-warehouse'.
21/01/03 09:04:58 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
21/01/03 09:04:58 WARN FileStreamSink: Error while looking for metadata directory.
Traceback (most recent call last):
  File "/home/hadoop/obs_test.py", line 11, in <module>
    df = spark.read.csv("obs://dev-modelarts/kaggle-CTR/data/data/train.csv", header=True, inferSchema=True)
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 441, in csv
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o55.csv.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.obs.OBSFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2596)
    at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3403)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3371)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:477)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.spark.sql.execution.datasources.DataSource$.org$apache$spark$sql$execution$datasources$DataSource$$checkAndGlobPathIfNecessary(DataSource.scala:709)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:390)
    at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$15.apply(DataSource.scala:390)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.immutable.List.foreach(List.scala:381)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.immutable.List.flatMap(List.scala:344)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:389)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
    at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.obs.OBSFileSystem not found
    at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:2500)
    at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2594)
    ... 30 more

21/01/03 09:04:58 INFO SparkContext: Invoking stop() from shutdown hook
21/01/03 09:04:58 INFO SparkUI: Stopped Spark web UI at http://192.168.0.230:4040
21/01/03 09:04:58 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
21/01/03 09:04:58 INFO MemoryStore: MemoryStore cleared
21/01/03 09:04:58 INFO BlockManager: BlockManager stopped
21/01/03 09:04:58 INFO BlockManagerMaster: BlockManagerMaster stopped
21/01/03 09:04:58 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
21/01/03 09:04:58 INFO SparkContext: Successfully stopped SparkContext
21/01/03 09:04:58 INFO ShutdownHookManager: Shutdown hook called
21/01/03 09:04:58 INFO ShutdownHookManager: Deleting directory /tmp/spark-a51e2865-0465-4a1b-a6c5-1da954078da6/pyspark-c89583cc-2dd6-419c-ae4a-c7119739455c
21/01/03 09:04:58 INFO ShutdownHookManager: Deleting directory /tmp/spark-c0e74089-56ef-4ee7-8944-446c3bd77482
21/01/03 09:04:58 INFO ShutdownHookManager: Deleting directory /tmp/spark-a51e2865-0465-4a1b-a6c5-1da954078da6
iRitiLopes commented 3 years ago

Same issue

chzzzyyyjjj commented 6 months ago

Hello, has this been resolved?