ibm-watson-data-lab / ibmos2spark

Facilitates Data I/O between Spark and IBM Object Storage services.

swift 1 driver no longer works? #1

Closed · gadamc closed this issue 8 years ago

gadamc commented 8 years ago

This code works:

import ibmos2spark as oss

credentials = {'user_id': '...', 'pass_id': '...'}  # etc.
credentials['name'] = 'my_bluemix_os'
bmos = oss.bluemix2d(sc, credentials)  # NOTE: bluemix2d, the "2d" variant
data = sc.textFile(bmos.url('test_container', 'hello.go'))
data.count()  # returns 18

But if I don't use the 2d variant,

bmos = oss.bluemix(sc, credentials)  
data = sc.textFile(bmos.url('test_container', 'hello.go'))
data.count()

the same code raises an exception:


Py4JJavaErrorTraceback (most recent call last)
<ipython-input-64-0106b6ff8a89> in <module>()
----> 1 data.count()

/usr/local/src/spark160master/spark/python/pyspark/rdd.py in count(self)
   1002         3
   1003         """
-> 1004         return self.mapPartitions(lambda i: [sum(1 for _ in i)]).sum()
   1005 
   1006     def stats(self):

/usr/local/src/spark160master/spark/python/pyspark/rdd.py in sum(self)
    993         6.0
    994         """
--> 995         return self.mapPartitions(lambda x: [sum(x)]).fold(0, operator.add)
    996 
    997     def count(self):

/usr/local/src/spark160master/spark/python/pyspark/rdd.py in fold(self, zeroValue, op)
    867         # zeroValue provided to each partition is unique from the one provided
    868         # to the final reduce call
--> 869         vals = self.mapPartitions(func).collect()
    870         return reduce(op, vals, zeroValue)
    871 

/usr/local/src/spark160master/spark/python/pyspark/rdd.py in collect(self)
    769         """
    770         with SCCallSiteSync(self.context) as css:
--> 771             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    772         return list(_load_from_socket(port, self._jrdd_deserializer))
    773 

/usr/local/src/spark160master/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
    811         answer = self.gateway_client.send_command(command)
    812         return_value = get_return_value(
--> 813             answer, self.gateway_client, self.target_id, self.name)
    814 
    815         for temp_arg in temp_args:

/usr/local/src/spark160master/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     43     def deco(*a, **kw):
     44         try:
---> 45             return f(*a, **kw)
     46         except py4j.protocol.Py4JJavaError as e:
     47             s = e.java_exception.toString()

/usr/local/src/spark160master/spark/python/lib/py4j-0.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    306                 raise Py4JJavaError(
    307                     "An error occurred while calling {0}{1}{2}.\n".
--> 308                     format(target_id, ".", name), value)
    309             else:
    310                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.hadoop.fs.swift.exceptions.SwiftConfigurationException: Invalid host name
    at org.apache.hadoop.fs.swift.util.SwiftUtils.validSchema(SwiftUtils.java:222)
    at org.apache.hadoop.fs.swift.http.SwiftRestClient.<init>(SwiftRestClient.java:510)
    at org.apache.hadoop.fs.swift.http.SwiftRestClient.getInstance(SwiftRestClient.java:1914)
    at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystemStore.initialize(SwiftNativeFileSystemStore.java:81)
    at org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem.initialize(SwiftNativeFileSystem.java:129)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2596)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:91)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2630)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2612)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:296)
    at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:256)
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:228)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:313)
    at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:199)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:35)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.api.python.PythonRDD.getPartitions(PythonRDD.scala:58)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
    at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:237)
    at scala.Option.getOrElse(Option.scala:120)
    at org.apache.spark.rdd.RDD.partitions(RDD.scala:237)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:1934)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:405)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:95)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:55)
    at java.lang.reflect.Method.invoke(Method.java:507)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:209)
    at java.lang.Thread.run(Thread.java:785)

Is this wrong: https://github.com/ibm-cds-labs/ibmos2spark/blob/master/ibmos2spark/osconfig.py#L111
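
For reference, here is a sketch of what the url() method on that line appears to do; this is my reconstruction from the linked file, not a verbatim copy of the package source:

def url(self, container_name, object_name):
    # "<container>.<config name>" becomes the host component of the
    # URI, which is exactly where an underscore can break things.
    return "swift://{0}.{1}/{2}".format(container_name, self.name, object_name)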

Checking the URLs each class generates:

bmos = oss.bluemix2d(sc, credentials)
bmos.url('test_container','hello.go')

returns

'swift2d://test_container.my_bluemix_os/hello.go'

and

bmos = oss.bluemix(sc, credentials)
bmos.url('test_container','hello.go')

returns

'swift://test_container.my_bluemix_os/hello.go'

as expected.

gadamc commented 8 years ago

The problem is that the swift protocol does not allow underscores ("_") in the configuration name or in container names. The container and configuration name together form the host portion of the swift:// URI, and underscores are not legal in host names, which is why the driver reports 'Invalid host name'.
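
Here is a minimal Python sketch of the host-name rule at play, assuming the RFC 1123 label syntax that java.net.URI enforces (the function and names below are mine, for illustration only):

import re

# Host-name labels may contain only letters, digits, and hyphens,
# and may not begin or end with a hyphen.
LABEL = re.compile(r'^[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?$')

def is_valid_swift_host(container, config_name):
    # swift:// puts "<container>.<config name>" in the host slot, so
    # every dot-separated label must be a legal host-name label.
    host = '{0}.{1}'.format(container, config_name)
    return all(LABEL.match(label) for label in host.split('.'))

print(is_valid_swift_host('test_container', 'my_bluemix_os'))  # False: underscores
print(is_valid_swift_host('testcontainer', 'mybluemixos'))     # True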

There is no fix for this other than to avoid underscores when using the swift protocol; for example, rename both the configuration and the container as shown below.
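
This assumes the same credentials dict as above, and that the container itself exists in Object Storage under an underscore-free name (names here are illustrative):

credentials['name'] = 'mybluemixos'  # no underscores
bmos = oss.bluemix(sc, credentials)
data = sc.textFile(bmos.url('testcontainer', 'hello.go'))
data.count()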

As of release 0.0.6, this package raises an explicit ValueError when an underscore is found in a name used with the swift protocol. The swift2d protocol is unchanged.
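
A minimal sketch of the kind of check 0.0.6 adds (illustrative only, not the package's actual code):

def validate_swift_name(name):
    # Fail fast with a clear message instead of letting the swift
    # driver fail later with "Invalid host name".
    if '_' in name:
        raise ValueError("underscores are not allowed in names used "
                         "with the swift protocol: %r" % name)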