BUCKET_HOST vs. BUCKET_SERVICE for cluster provided S3 services

guymguym commented 5 years ago

Hi @jeffvance @copejon CC: @dannyzaken @tamireran @nimrod-becker

The current provisioner requires to resolve the service to a single address that is assigned to the BUCKET_HOST and BUCKET_PORT in the configmap. This makes sense for a global service like AWS/Google/Azure endpoints.

However when the S3 endpoint is serving inside the cluster, there are multiple ways that clients may need to connect to the service - either using internal-DNS <service-name>.<namespace> or using nodePort, or using external load balancer, or using a route/ingress, etc. It seems weird that every OBC should be aware of the client network requirements.

I would expect that in this case the configmap will include just the BUCKET_SERVICE_NAME and BUCKET_SERVICE_NAMESPACE so that the OBC provisioner can assign a service to the client, and the client can decide on which service endpoint to use.

Thoughts?

copejon commented 5 years ago

That's a great point. In its infancy, the library was designed under the assumption of a single endpoint for an in-cluster object store. Services are probably the most common endpoint used, but ultimately the library is agnostic to which API resource, if any is used to form the connection. However, it also does not provide a mechanism for preferencing one route over another.

This could be addressed via StorageClasses. For every route to the object store, a StorageClass could be defined. This allows the library to remain agnostic to the endpoint and the ConfigMap data structure remains unchanged. The provisioner would be responsible ensuring the appropriate endpoint is returned respective of the StorageClass. e.g.

1 Class for Ingress

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: object-store-ingress
provisioner: bucket.io/provisioner
parameters:
  connectionType: ingress

1 Class for Service

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: object-store-service
provisioner: bucket.io/provisioner
parameters:
  connectionType: service

A possible downside is that now the developer/cluster-user has to know which one they want, which may not be acceptable.

guymguym commented 5 years ago

Hey @copejon

I dont think this makes sense as a storage class property - it is clearly and obc property. That is to say that the storage is the same, but there are different ways to claim it based on the client location.

I dont see any problem that a client application will tag its obc with a "in-cluster" tag or in general it will claim a specific address of a service endpoint based on its topology. In any case, this can and should be a general technique that clients use to connect to endpoints, rather than a specific solution for bucket provisioning, so I would just leave it out of the scope of the provisioner and allow it to return a service to the claimer.

jeffvance commented 5 years ago

@guymguym

the storage is the same, but there are different ways to claim it based on the client location.

True. But storage classes can and are used to point to the same underlying storage but able to customize the access to that storage, per claim, via the Parameters map. I don't know if this storage class idea is the best way to solve your issue but it doesn't seem unreasonable to me (yet).

jeffvance commented 5 years ago

Just another thought: Jon and I have basically supported the separation of concerns philosophy where the OBC author cares mostly about her app while the storage admin deals with all the storage details via the SC. What I don't have an answer to is will the proliferation of mostly similar, but slightly different, storage classes make the admin's job too difficult?

guymguym commented 5 years ago

I see several reasons why storage-class should not be the one carrying this information:

This is a general networking/service concern, not specific just to storage, it also applies to databases, web servers, and practically any network service inside/outside the cluster. As such it doesn't make sense to solve it by duplicating the storage-classes to represent the network possibilities.
Who will be responsible of creating all these storage-classes? The OLM CSV doesn't allow such logic AFAIK. It only supports a static set of cluster resources. By asking the storage-class to be cloned we are defering this problem to the poor system admin, rather than allowing someone to automate its solution for any service discovery.
How is this supposed to work for multiple applications sharing a few buckets? IIUC I should create 2 storage classes, one for internal-IP and one for external-IP, and then app1 will create a PVC on storage-class-1 to create a bucket, and app2 wants to claim that brownfield bucket but from storage-class-2? So would it require to create a storage-class per bucket per network endpoint? Too many storage classes.

jeffvance commented 5 years ago

I am also concerned with too many storage classes but at the same time I don't want to see storage-specific OBCs since that defeats the point of abstraction. I think you're suggesting service-specific OBCs and that is interesting...

copejon commented 5 years ago

Per Aug 8th meeting - it's agreed that using StorageClasses to represent different routes to the same object store is overly burdensome and doesn't scale.

The agreed on approach is that the configMap, in addition to the BUCKET_HOST endpoint value, can also supply the service name and namespace. As the ConfigMap is derived from the ObjectBucket returned by Provision(), the ObjectBucket spec will be changed to add the fields to the Endpoint substructure.

jeffvance commented 5 years ago

Summarizing:

no changes to the OBC
no changes to BucketOptions type passed to Provision() and Grant()
change to OB's Endpoint struct to support service ns + name.

Agree? Anything else?

kube-object-storage / lib-bucket-provisioner

BUCKET_HOST vs. BUCKET_SERVICE for cluster provided S3 services #130