ibm-watson-data-lab / ibmos2spark

Facilitates Data I/O between Spark and IBM Object Storage services.
10 stars 8 forks source link

Python Cloud Object Storage Support #14

Closed bassel-zeidan closed 7 years ago

bassel-zeidan commented 7 years ago

Background

This PR adds a class which gives the ability to setup the hadoop configs to connect to the cloud object storage (COS).

gadamc commented 7 years ago

Remove the version bump from this commit. I bump the version when I make a release (https://github.com/gadamc/release-python)

gadamc commented 7 years ago

I think it will be good practice going forward that you post a link here directly to the tests that demonstrate it's working. That would be the notebook link you sent to me on Slack. Remember to use the #@hidden_cell magic to hide your credentials thought

gadamc commented 7 years ago

You need to add instructions and examples to the README. Also, you need to describe the difference between 'CloudObjectStorage' and the bluemix/softlayer object storage

gadamc commented 7 years ago

I'm not a fan of the CamelCase class 'CloudObjectStorage'. At the same time, I don't like 'cloudobjectstorage' or 'cloud_object_storage'.

gilv commented 7 years ago

@gadamc @bassel-zeidan I will review it. Some things not clear here, will comment later.

gilv commented 7 years ago

@bassel-zeidan why we need this class? why not a simple function?

gadamc commented 7 years ago

@gilv @bassel-zeidan -- the reason Bassel has added this as a class is because he's following the precedent I set when I first created this project. In order to configure a connection to 'softlayer' or 'bluemix' object stores, one instantiates the appropriate class (see the README.md).

I have separate classes, mostly for historical reasons, in order to support previous 'swift' and 'swift2d' protocols. But also, having these as classes that a user instantiates facilitates, slightly, using multiple different object storages simultaneously because the class will build the swift URL based on the protocol, and configuration name. So, it's meant to facilitate users experience. Perhaps there could have been a better design, but when I first built this I didn't actually expect anybody buy myself to do use it. :)

bassel-zeidan commented 7 years ago

@gadamc thanks for taking a look.

1) version upgrade removed. 2) regarding testing:

https://apsportal.ibm.com/analytics/notebooks/628a1960-6f48-4dc3-b57b-5666363831d9/view?access_token=4ada3a91af5a4e11c0ee47b7c0b3f18a1623ba70075d0a0ab26443ad713f33cf

3) I updated the Readme file and add an example on how to use COS.

bassel-zeidan commented 7 years ago

@gilv thanks for the feedback.

regarding having it as a class, as Adam mentioned this is convention that has been followed before and i think it makes sense because normally you define your configurations once (make an instance of the class) and try to access files multiple times (using url function) and such a structure helps with that.

bassel-zeidan commented 7 years ago

In case you have no more concerns, let's get it merged.

gadamc commented 7 years ago

@bassel-zeidan @gilv

In the 'softlayer' and 'bluemix' classes, we append a configuration name to the "prefix". https://github.com/ibm-watson-data-lab/ibmos2spark/blob/PythonCOSSupport/python/ibmos2spark/osconfig.py#L158

This allows a user to create and use multiple softlayer / bluemix object storage accounts. For example, if I need to read data from one account and then write to another. I think this would be especially important to be able to define multiple configurations in a parallelized process where each worker node is interacting with the COS. Right?

Can we do this for COS? That is, can we change line 219? That would allow a person with multiple accounts configure access to them independently with the same code.

bassel-zeidan commented 7 years ago

@gadamc unfortunately that is not possible. I tried to do that in different ways but it seems that it is not supported.

bassel-zeidan commented 7 years ago

@gilv do you know of a way to support connecting to multiple COSs by setting multiple configurations at the same time? I tried the same approach that has been followed in bluemix and softlayer classes for COS (appending unique OS id to the prefix of the settings e.g. "fs.swift2d.service." + name). This doesn't seem to work with COS. Any ideas?

gilv commented 7 years ago

@bassel-zeidan It was resolved actually. You can setup different service names for COS, similar to Swift. But it's not yet deployed to Spark as a Service. This commit https://github.ibm.com/cloud-platforms/stocator-s3/commit/519bf601e6d934f7c69de9739e4ea6afe6dc9c68 will part of 1.4 version that we will release in a week or so.

bassel-zeidan commented 7 years ago

Great! I think we should utilise this update in ibmos2spark implementation and support defining multiple configs. However, It might take some time until the new version of Stocator is released and until it gets deployed on Spark service. I would suggest that we introduce COS update to ibmos2spark (Python, Scala and R) and make a new release then we can support the multiple configs support in the next version. I will open a new item for that and assign it to myself.

@gadamc @gilv Do you agree to that?

bassel-zeidan commented 7 years ago

tests on the latest version of this PR:

https://apsportal.ibm.com/analytics/notebooks/b7a32505-8df5-49c4-a230-1dee851f0e03/view?access_token=8941ac5d33113abf988298a5c86c56c4d260510faa39f097150142a7098e6eb5

gadamc commented 7 years ago

@bassel-zeidan I'd rather wait. If @gilv thinks its only a week or two away, I don't see the rush. I'd rather have the COS usage consistent with the other classes -- which includes supplying a 'configuration_name'. And I'd also not want to release twice. @gilv -- how confident are you in the timeline for the update to support a multiple configurations?

bassel-zeidan commented 7 years ago

As discussed, i am merging this PR.