Closed bassel-zeidan closed 7 years ago
Remove the version bump from this commit. I bump the version when I make a release (https://github.com/gadamc/release-python)
I think it will be good practice going forward that you post a link here directly to the tests that demonstrate it's working. That would be the notebook link you sent to me on Slack. Remember to use the #@hidden_cell magic to hide your credentials thought
You need to add instructions and examples to the README. Also, you need to describe the difference between 'CloudObjectStorage' and the bluemix/softlayer object storage
I'm not a fan of the CamelCase class 'CloudObjectStorage'. At the same time, I don't like 'cloudobjectstorage' or 'cloud_object_storage'.
@gadamc @bassel-zeidan I will review it. Some things not clear here, will comment later.
@bassel-zeidan why we need this class? why not a simple function?
@gilv @bassel-zeidan -- the reason Bassel has added this as a class is because he's following the precedent I set when I first created this project. In order to configure a connection to 'softlayer' or 'bluemix' object stores, one instantiates the appropriate class (see the README.md).
I have separate classes, mostly for historical reasons, in order to support previous 'swift' and 'swift2d' protocols. But also, having these as classes that a user instantiates facilitates, slightly, using multiple different object storages simultaneously because the class will build the swift URL based on the protocol, and configuration name. So, it's meant to facilitate users experience. Perhaps there could have been a better design, but when I first built this I didn't actually expect anybody buy myself to do use it. :)
@gadamc thanks for taking a look.
1) version upgrade removed. 2) regarding testing:
https://apsportal.ibm.com/analytics/notebooks/628a1960-6f48-4dc3-b57b-5666363831d9/view?access_token=4ada3a91af5a4e11c0ee47b7c0b3f18a1623ba70075d0a0ab26443ad713f33cf
3) I updated the Readme file and add an example on how to use COS.
@gilv thanks for the feedback.
regarding having it as a class, as Adam mentioned this is convention that has been followed before and i think it makes sense because normally you define your configurations once (make an instance of the class) and try to access files multiple times (using url function) and such a structure helps with that.
In case you have no more concerns, let's get it merged.
@bassel-zeidan @gilv
In the 'softlayer' and 'bluemix' classes, we append a configuration name to the "prefix". https://github.com/ibm-watson-data-lab/ibmos2spark/blob/PythonCOSSupport/python/ibmos2spark/osconfig.py#L158
This allows a user to create and use multiple softlayer / bluemix object storage accounts. For example, if I need to read data from one account and then write to another. I think this would be especially important to be able to define multiple configurations in a parallelized process where each worker node is interacting with the COS. Right?
Can we do this for COS? That is, can we change line 219? That would allow a person with multiple accounts configure access to them independently with the same code.
@gadamc unfortunately that is not possible. I tried to do that in different ways but it seems that it is not supported.
@gilv do you know of a way to support connecting to multiple COSs by setting multiple configurations at the same time? I tried the same approach that has been followed in bluemix and softlayer classes for COS (appending unique OS id to the prefix of the settings e.g. "fs.swift2d.service." + name
). This doesn't seem to work with COS. Any ideas?
@bassel-zeidan It was resolved actually. You can setup different service names for COS, similar to Swift. But it's not yet deployed to Spark as a Service. This commit https://github.ibm.com/cloud-platforms/stocator-s3/commit/519bf601e6d934f7c69de9739e4ea6afe6dc9c68 will part of 1.4 version that we will release in a week or so.
Great! I think we should utilise this update in ibmos2spark implementation and support defining multiple configs. However, It might take some time until the new version of Stocator is released and until it gets deployed on Spark service. I would suggest that we introduce COS update to ibmos2spark (Python, Scala and R) and make a new release then we can support the multiple configs support in the next version. I will open a new item for that and assign it to myself.
@gadamc @gilv Do you agree to that?
@bassel-zeidan I'd rather wait. If @gilv thinks its only a week or two away, I don't see the rush. I'd rather have the COS usage consistent with the other classes -- which includes supplying a 'configuration_name'. And I'd also not want to release twice. @gilv -- how confident are you in the timeline for the update to support a multiple configurations?
As discussed, i am merging this PR.
Background
This PR adds a class which gives the ability to setup the hadoop configs to connect to the cloud object storage (COS).