ibm-watson-data-lab / ibmos2spark

Facilitates Data I/O between Spark and IBM Object Storage services.

The case for `configuration_name` to be required and not optional. #36


gadamc commented 7 years ago

I advocate that, when/if we make changes to refactor this library, we make `configuration_name` a required parameter when instantiating a new object that configures an object store connection.

1. Leaving it optional is likely to cause errors, with no explicit notification to the user, in situations where multiple object stores are being used. For example:
    
```python
creds1 = {...}  # credentials for one object store
creds2 = {...}  # credentials for another object store

conf1 = ibmos2spark.configure(sc, creds1)
conf2 = ibmos2spark.configure(sc, creds2)
```

At this point, the user believes connections to both object stores have been successfully created.

```python
rdd = sc.textFile(conf1.url(bucket, object_name))

# do work

rddfinal.saveAsTextFile(conf2.url(bucket, object_name))
```


In the scenario above, significant problems could occur if both object stores contain objects with the same name, which is a likely situation when users are processing data and moving it between locations. The wrong piece of data would be retrieved, processed, and then written back, overwriting the original data, unless the user has taken care to create archive buckets/containers and configured the object store to track revisions (is that possible with COS S3? It is possible with OpenStack object storage).

There is no warning to the user in this scenario, and if a large Spark job is run, this could potentially wipe out a user's entire data set (a sketch of how the silent overwrite can happen follows at the end of this point).

If there is no object in the second object store with `object_name`, then the `rdd = sc.textFile(...)` line will fail. However, it would still be very confusing to the user as to *why* it failed. The stack trace from Spark would say something like "file not found", but there'd be no indication that it was looking in the wrong object store.

Another potential issue stems from situations where the configuration code is executed on worker nodes, e.g. when the `ibmos2spark.configure` call is made inside a function that is then parallelized via `map`. In that scenario the behavior may be unpredictable, scattering data across different object storage instances and/or failing in some partitions but not others.

This may feel like an 'edge case' that doesn't reflect how a large percentage of DSX/Object Storage users work. However, I think we should avoid any potential for catastrophic failure; justifiably angry users can spread bad news far and wide with ease.
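
To make the failure mode concrete, here is a minimal, purely illustrative sketch (not the library's actual code). It assumes that each configuration registers its credentials in the Hadoop configuration under keys derived from the configuration name, and that `url()` embeds that same name; the class and property names below are hypothetical:

```python
class HypotheticalConfig(object):
    """Illustrative stand-in for an ibmos2spark configuration object."""

    def __init__(self, sc, creds, name='service'):    # optional name with a shared default
        prefix = 'fs.swift2d.service.' + name          # illustrative property prefix
        hconf = sc._jsc.hadoopConfiguration()
        hconf.set(prefix + '.auth.url', creds['auth_url'])
        hconf.set(prefix + '.username', creds['username'])
        hconf.set(prefix + '.password', creds['password'])
        self.name = name

    def url(self, container, object_name):
        return 'swift2d://{}.{}/{}'.format(container, self.name, object_name)


conf1 = HypotheticalConfig(sc, creds1)   # registers creds1 under the default name
conf2 = HypotheticalConfig(sc, creds2)   # silently overwrites the exact same keys

# conf1.url(...) and conf2.url(...) now produce identical paths, so every
# read and write resolves against whichever credentials were set last (creds2).
```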

2. It seems like a very low burden to require a configuration name. In DSX, when one uses the Insert to Code button, we already provide a randomized configuration name. We should continue this policy.

#### Alternative solution 
One alternative option would be for the `ibmos2spark` library to randomly generate a `configuration_name` if one is not provided. As far as I can tell, this would essentially solve the problem, but I'd like to think more about the merits/demerits of the idea.
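
For concreteness, a minimal sketch of that fallback, reusing the same hypothetical `configure` entry point as the example above (the actual credential registration is elided):

```python
import uuid

def configure(sc, credentials, configuration_name=None):
    # If the caller does not supply a name, generate one that is effectively
    # unique, so two configurations can never silently share (and overwrite)
    # the same Hadoop configuration keys.
    if configuration_name is None:
        configuration_name = 'os_' + uuid.uuid4().hex
    # ... register `credentials` under `configuration_name`, as today ...
    return configuration_name
```

One consideration: if the configuration name is embedded in the object URLs (as in the sketch further above), auto-generated names would differ between runs, which seems worth weighing as part of the merits/demerits.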