ibm-watson-data-lab / ibmos2spark

Facilitates Data I/O between Spark and IBM Object Storage services.
10 stars 8 forks source link

COS Scheme Support + Multiple Config Support [Python] #22

Closed bassel-zeidan closed 7 years ago

bassel-zeidan commented 7 years ago

Test: https://apsportal.ibm.com/analytics/notebooks/06ff8a28-ec0f-44cc-a60c-c01c754c2bac/view?access_token=f42ec46ec715a74839316211c5dad5b9aa496bcdc064e055de149d58505170a9

gadamc commented 7 years ago

Hi @bassel-zeidan.

I have a couple of comments.

Having a default bucket_name is inconsistent with the other bluemix/softlayer objects. But since this lib is probably 100% used only in DSX, it's probably okay.

One thing though, NOT related to this PR, but something I noticed about the class is the "cos_id". There are two things with this that I'm not sure about.

  1. 'cos_id' is inconsistent with the other classes. In those classes, we called them 'name'. I'd prefer to be consistent and change cos_id to 'name'. What do you think?
  2. 'cos_id' may be confusing to the user. Especially this sentence here: https://github.com/ibm-watson-data-lab/ibmos2spark/blob/341544978ca89357d213cfdf0de4c319fe3cc360/python/ibmos2spark/osconfig.py#L193. It says 'cos_id' is the "cloud object storage unique id", which sounds like a unique ID that somebody is supposed to look up and not a unique ID that the creator of a CloudObjectStorage object should provide.

Also, you should update the README to include the configuration name.

cos_configuration_name = 'some_configID'
cos = ibmos2spark.CloudObjectStorage(sc, credentials, cos_configuration_name)

I have another comment about the comments here: https://github.com/ibm-watson-data-lab/ibmos2spark/blob/341544978ca89357d213cfdf0de4c319fe3cc360/python/ibmos2spark/osconfig.py#L189-L203. But I don't have time right now to say what I want to say. I'll add another comment to this PR tomorrow.

bassel-zeidan commented 7 years ago

Hi @gadamc,

Regarding bucket_name i think it is fine since it is passed as an "optional" param just as an optimisation step since most of the time, a user is trying to fetch data from the same bucket in a working session.

Regarding "cos_id" param, i think you have a very good point. I was thinking about this yesterday. I agree the name might be confusing to the consumer of the lib. I will change the name of the param. Although, "name" is not very clear to me. What do you think about "configuration_name" as a name for this param? It is straight to the point.

gadamc commented 7 years ago

configuration name sounds good and I agree that name isn't very clear either.

Maybe we should, probably in a separate PR just to be clear, change the parameter names in the other classes as well: name -> configuration_name.

Regarding the DSX documentation here: https://github.com/ibm-watson-data-lab/ibmos2spark/blob/341544978ca89357d213cfdf0de4c319fe3cc360/python/ibmos2spark/osconfig.py#L189-L203.

In principle, I'd like for the library to be as general as possible. I would like it to be useful for those submitting spark-submit jobs, and to compile the library into Java/Scala projects. With that goal in mind, we should remove mention of DSX in the documentation since DSX will be a user of the library and not intertwine them.

However, I DO think it's reasonable to mention how DSX uses the library in the README and provide examples there.

Make sense? What are your thoughts?

bassel-zeidan commented 7 years ago

@gadamc makes sense. I did some updates on the comments.

gadamc commented 7 years ago

Okay... Can you remove this section of text: https://github.com/ibm-watson-data-lab/ibmos2spark/blob/30526a1c2fee04a5321064069f2063e3e4272c71/python/ibmos2spark/osconfig.py#L189-L191

And, while we're at it, in a separate commit, could you remove https://github.com/ibm-watson-data-lab/ibmos2spark/blob/30526a1c2fee04a5321064069f2063e3e4272c71/python/ibmos2spark/osconfig.py#L135-L139

That information is good though and should probably be in the README, perhaps with a picture to show the 'Insert' dropdown.

gadamc commented 7 years ago

Once you're done with the above comment remove, and added something to the README if you have it, then +1 to merge.