ibm-watson-data-lab / ibmos2spark

Facilitates Data I/O between Spark and IBM Object Storage services.

Permission Denied when writing to Softlayer COS #45

Closed: malladip-ibm closed this issue 6 years ago

malladip-ibm commented 6 years ago

Exception in thread "main" org.apache.hadoop.security.AccessControlException: Permission denied: s3n://sparktest/onesample.txt
  at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:449)
  at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
  at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
  at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
  at java.lang.reflect.Method.invoke(Method.java:508)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)

gadamc commented 6 years ago

We need more information.

The path "s3n://sparktest/onesample.txt" doesn't look valid to me. Please post code snippet of what you're trying to do.

gilv commented 6 years ago

@malladip-ibm why s3n://?

malladip-ibm commented 6 years ago

I was trying multiple things (including the direct S3 interface) to make the Spark integration work with COS.

I am getting a NoSuchMethodError with the Stocator integration, and also an NPE when I write to COS storage.

gilv commented 6 years ago

@malladip-ibm please provide more details about what you did... Stocator is working for others. s3n:// was deprecated about 5 years ago.
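
For reference, once Stocator is configured, the same object is addressed through a cos:// URI of the form cos://&lt;bucket&gt;.&lt;service&gt;/&lt;object&gt; rather than s3n://. A minimal sketch, reusing the bucket name and configuration name that appear later in this thread (nothing here was run against the cluster in question):

// Sketch only: "sparktest" is the bucket, "mycos" the configured service name.
val url = "cos://sparktest.mycos/onesample.txt"
val input = sc.textFile(url)
println(input.count())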

malladip-ibm commented 6 years ago

val pathOut = "/Spark/results/wordCountEn.txt";
val nameNodeIP = "hdfs://9.1.70.248:9000";

var credentials = scala.collection.mutable.HashMap[String, String](
  "endPoint"->"s3-api.us-geo.objectstorage.softlayer.net",
  "accessKey"->"xxx",
  "secretKey"->"xxx"
)

val accessKey = "xxx"
val secretKey = "zzz"

// val endPoint = "s3-api.sjc-us-geo.objectstorage.softlayer.net"
val endPoint = "s3-api.us-geo.objectstorage.softlayer.net"

var hconf = sc.hadoopConfiguration;
hconf.set("fs.s3n.service.endpoint",endPoint);
hconf.set("fs.s3n.awsAccessKeyId", accessKey);
hconf.set("fs.s3n.awsSecretAccessKey", secretKey);
hconf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.stocator.scheme.list", "cos")
hconf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
hconf.set("fs.stocator.cos.scheme", "cos")
hconf.set("fs.cos.mycos.access.key", accessKey)
hconf.set("fs.cos.mycos.endpoint", endPoint)
hconf.set("fs.cos.mycos.secret.key", secretKey)
hconf.set("spark.hadoop.fs.cos.softlayer.endpoint",
          endPoint)
hconf.set("spark.hadoop.fs.cos.softlayer.access.key",
        accessKey)
hconf.set("spark.hadoop.fs.cos.softlayer.secret.key",
        secretKey)

var configurationName = "mycos";
var cos = new CloudObjectStorage(sc, credentials, configurationName=configurationName);
var url = cos.url("sparktest", "sampledata.txt");

val input = sc.textFile(url);

cleanWorkSpace(pathOut, nameNodeIP);
val count = input.flatMap(line ⇒ line.split("\n")).map(word ⇒ (word, 1)).reduceByKey(_ + _)
count.saveAsTextFile(nameNodeIP + pathOut);

Here is the error I am getting when I run the above code

[main] INFO org.apache.spark.SparkContext - Starting job: saveAsTextFile at WordCount.scala:236
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Registering RDD 3 (map at WordCount.scala:235)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Got job 0 (saveAsTextFile at WordCount.scala:236) with 2 output partitions
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Final stage: ResultStage 1 (saveAsTextFile at WordCount.scala:236)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Parents of final stage: List(ShuffleMapStage 0)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Missing parents: List(ShuffleMapStage 0)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:235), which has no missing parents
[dag-scheduler-event-loop] INFO org.apache.spark.storage.MemoryStore - Block broadcast_1 stored as values in memory (estimated size 4.1 KB, free 133.5 KB)
[dag-scheduler-event-loop] INFO org.apache.spark.storage.MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.3 KB, free 135.8 KB)
[dispatcher-event-loop-4] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on 9.85.186.127:34271 (size: 2.3 KB, free: 159.0 MB)
[dag-scheduler-event-loop] INFO org.apache.spark.SparkContext - Created broadcast 1 from broadcast at DAGScheduler.scala:1006
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.DAGScheduler - Submitting 2 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCount.scala:235)
[dag-scheduler-event-loop] INFO org.apache.spark.scheduler.TaskSchedulerImpl - Adding task set 0.0 with 2 tasks
[dispatcher-event-loop-3] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 0.0 in stage 0.0 (TID 0, hdfsu10.almaden.ibm.com, partition 0, PROCESS_LOCAL, 2516 bytes)
[dispatcher-event-loop-3] INFO org.apache.spark.scheduler.TaskSetManager - Starting task 1.0 in stage 0.0 (TID 1, hdfsu4.almaden.ibm.com, partition 1, PROCESS_LOCAL, 2516 bytes)
[dispatcher-event-loop-4] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on hdfsu4.almaden.ibm.com:33272 (size: 2.3 KB, free: 511.1 MB)
[dispatcher-event-loop-7] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_1_piece0 in memory on hdfsu10.almaden.ibm.com:53347 (size: 2.3 KB, free: 511.1 MB)
[dispatcher-event-loop-7] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_0_piece0 in memory on hdfsu4.almaden.ibm.com:33272 (size: 10.4 KB, free: 511.1 MB)
[dispatcher-event-loop-4] INFO org.apache.spark.storage.BlockManagerInfo - Added broadcast_0_piece0 in memory on hdfsu10.almaden.ibm.com:53347 (size: 10.4 KB, free: 511.1 MB)
[task-result-getter-0] WARN org.apache.spark.scheduler.TaskSetManager - Lost task 1.0 in stage 0.0 (TID 1, hdfsu4.almaden.ibm.com): java.lang.NoSuchMethodError: org.apache.http.conn.ssl.SSLConnectionSocketFactory.<init>(Ljavax/net/ssl/SSLContext;Ljavax/net/ssl/HostnameVerifier;)V
  at com.amazonaws.http.conn.ssl.SdkTLSSocketFactory.<init>(SdkTLSSocketFactory.java:56)
  at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:91)
  at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:65)
  at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:58)
  at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:51)
  at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:39)
  at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:302)
  at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:164)
  at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:523)
  at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:503)
  at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:485)
  at com.amazonaws.services.s3.AmazonS3Client.<init>(AmazonS3Client.java:457)
  at com.ibm.stocator.fs.cos.COSAPIClient.initiate(COSAPIClient.java:340)

malladip-ibm commented 6 years ago

I am running from a driver (my laptop) connecting to the Spark cluster running remotely. It is Spark 1.6.

The above exception suggests that there is a classpath conflict for SSLConnectionSocketFactory. When I print, from the Scala code, the jar from which the class is loaded, it correctly shows httpclient-4.5.2.jar. The conflicting copy probably comes from the older httpclient jar bundled with Spark 1.6, and I don't know how I can override it on the cluster.
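
A common way around this kind of httpclient clash, if it applies here, is to ship the newer jar with the job and ask the executors to prefer it. This is only a sketch under that assumption; the jar path and app name are placeholders, not values from this thread:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: ship httpclient-4.5.2.jar with the job and let the executor class loaders
// prefer it over the older copy on the cluster (userClassPathFirst is experimental in Spark 1.6).
// On the driver side (the laptop) the newer httpclient just needs to come first on the
// application's own classpath, e.g. as an explicit build dependency.
val conf = new SparkConf()
  .setAppName("cos-write-test")
  .set("spark.jars", "/path/to/httpclient-4.5.2.jar")
  .set("spark.executor.userClassPathFirst", "true")
val sc = new SparkContext(conf)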

gadamc commented 6 years ago

The whole point of ibmos2spark is that you don't need most of that code. Also, instead of writing to an output, let's first make sure you can read from COS. Try this instead:

var credentials = scala.collection.mutable.HashMap[String, String](
  "endPoint"->"s3-api.us-geo.objectstorage.softlayer.net",
  "accessKey"->"xxx",
  "secretKey"->"xxx"
)

var configurationName = "mycos";
var cos = new CloudObjectStorage(sc, credentials, configurationName=configurationName);
var url = cos.url("sparktest", "sampledata.txt");

val input = sc.textFile(url);
val count = input.flatMap(line ⇒ line.split("\n")).map(word ⇒ (word, 1)).reduceByKey(_ + _).collect();

print(count);

malladip-ibm commented 6 years ago

That's exactly what I expected to happen based on the documentation, but unfortunately it did not work. It kept saying "unknown scheme: cos". When I set those property values, it moved further until it hit the exception. When I ran the code without those hconf settings on a DSX Spark instance, it worked just fine.
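
For completeness, the settings that make the "unknown scheme: cos" error go away are the Stocator registration properties already shown in the snippet above; isolated here as a minimal sketch (endpoint and credentials are placeholders):

// Minimal Stocator registration for the cos:// scheme; without these a stock Spark
// install cannot resolve cos:// URIs. Endpoint and keys below are placeholders.
val hconf = sc.hadoopConfiguration
hconf.set("fs.cos.impl", "com.ibm.stocator.fs.ObjectStoreFileSystem")
hconf.set("fs.stocator.scheme.list", "cos")
hconf.set("fs.stocator.cos.impl", "com.ibm.stocator.fs.cos.COSAPIClient")
hconf.set("fs.stocator.cos.scheme", "cos")
hconf.set("fs.cos.mycos.endpoint", "s3-api.us-geo.objectstorage.softlayer.net")
hconf.set("fs.cos.mycos.access.key", "xxx")
hconf.set("fs.cos.mycos.secret.key", "xxx")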

gadamc commented 6 years ago

Okay, so this did work on a DSX spark instance. That's good to know.

This looks like a Stocator configuration issue on your local Spark cluster and not an issue with ibmos2spark. See https://github.com/SparkTC/stocator.
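
One way to make Stocator itself available on a standalone cluster is to ship its jar with the job; a sketch only, where the jar path (and the assumption that a Stocator assembly jar has been built or downloaded) is hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: distribute a locally available Stocator jar to the executors.
// The path is a placeholder; the driver process also needs the jar on its own classpath.
val conf = new SparkConf()
  .setAppName("stocator-cos-read")
  .set("spark.jars", "/path/to/stocator-jar-with-dependencies.jar")
val sc = new SparkContext(conf)

When launching through spark-submit or spark-shell, the equivalent is the --jars option (or --packages with the com.ibm.stocator:stocator Maven coordinate).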

What is the error when you run my sample code above on your Spark cluster?

malladip-ibm commented 6 years ago

We are going to upgrade our Spark to 2.0; it should work fine there. Thanks for your help. I will close the issue.