ibm-watson-data-lab / ibmos2spark

Facilitates Data I/O between Spark and IBM Object Storage services.

Rethink design of object storage classes. #33

Open gadamc opened 6 years ago

gadamc commented 6 years ago

The current maze of if/else statements used to set up the connections to the different CloudObjectStorage services is neither appealing nor future-proof.

https://github.com/ibm-watson-data-lab/ibmos2spark/blob/master/python/ibmos2spark/osconfig.py#L171

This pattern persists in the Python, SparkR, and Scala implementations (though not in sparklyr, because we don't yet support COS there).

If any more options for COS are added, or more services that share the same name appear, this will get even worse. We should refactor this code. One idea would be to subclass CloudObjectStorage and use a factory pattern to generate the appropriate instances; a sketch follows below. In doing so, we may also be able to fold the 'softlayer' and 'bluemix' objects into the same refactor.
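
To make the idea concrete, here is a minimal sketch of the registry-plus-factory approach. All names below (the `_registry` dict, `create`, `SoftlayerCOS`, `_configure_hadoop`, the example URL scheme) are hypothetical, chosen for illustration; they are not the library's current API.

```python
# Hypothetical sketch of the factory idea; none of these names exist in
# ibmos2spark today.

class CloudObjectStorage(object):
    """Base class: each subclass encapsulates one service/auth combination."""

    _registry = {}  # maps (service_type, auth_method) -> subclass

    @classmethod
    def register(cls, service_type, auth_method):
        """Class decorator that records a subclass in the registry."""
        def decorator(subclass):
            cls._registry[(service_type, auth_method)] = subclass
            return subclass
        return decorator

    @classmethod
    def create(cls, sc, credentials, config_name='',
               service_type='', auth_method=''):
        """Factory: look up the right subclass instead of branching inline."""
        try:
            subclass = cls._registry[(service_type, auth_method)]
        except KeyError:
            raise ValueError(
                'unsupported combination: service_type=%r, auth_method=%r'
                % (service_type, auth_method))
        return subclass(sc, credentials, config_name)

    def __init__(self, sc, credentials, config_name=''):
        self.sc = sc
        self.credentials = credentials
        self.config_name = config_name
        self._configure_hadoop()

    def _configure_hadoop(self):
        """Each subclass sets its own Hadoop configuration keys."""
        raise NotImplementedError

    def url(self, bucket_or_container, object_name):
        """Each subclass builds its own scheme-specific URL."""
        raise NotImplementedError


@CloudObjectStorage.register('softlayer', 'api_key')
class SoftlayerCOS(CloudObjectStorage):
    def _configure_hadoop(self):
        # Set the Softlayer-specific Hadoop configuration here.
        pass

    def url(self, bucket_or_container, object_name):
        # Illustrative URL shape only; the real scheme lives in osconfig.py.
        return 'swift://%s.%s/%s' % (bucket_or_container,
                                     self.config_name, object_name)
```

Adding support for a new service or auth method would then be a new registered subclass rather than another branch in a shared if/else chain.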

Of course, we'll have to be careful about backwards compatibility. One option would be to add a new top-level entry point to the library while leaving the existing classes in place, so as to avoid breaking anybody's code. For example:

```python
import ibmos2spark

conf = ibmos2spark.configure(sc, credentials, config_name,
                             service_type='', auth_method='')
data_url = conf.url(bucket_or_container, object_name)
```
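
Such a `configure` entry point could be a thin wrapper over the factory, so the existing softlayer and bluemix classes keep working unchanged. A minimal sketch, again using the hypothetical names from the sketch above:

```python
# Hypothetical sketch: 'configure' delegates to the factory from the previous
# sketch, while the existing classes remain untouched for backwards
# compatibility.

def configure(sc, credentials, config_name='', service_type='', auth_method=''):
    return CloudObjectStorage.create(sc, credentials, config_name,
                                     service_type=service_type,
                                     auth_method=auth_method)

# Usage mirroring the example above (names are placeholders):
#   conf = configure(sc, credentials, 'my_cos_config',
#                    service_type='softlayer', auth_method='api_key')
#   data_url = conf.url(bucket_or_container, object_name)
```
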
bassel-zeidan commented 6 years ago

I completely agree that it would be good to have some internal conceptual separation inside the class for the different COS types. That could become very important once we have a big chunk of code (logic) for each type. At the moment, though, I don't think it's necessary, since the logic is quite small: just a couple of lines per type (at least, that's my personal opinion).

Still, it's good that you opened this issue so we keep it in mind, and please feel free to create PRs to demonstrate and realize your suggestion.