singularityhub / sregistry-cli

Singularity Global Client for container management
https://singularityhub.github.io/sregistry-cli/
Mozilla Public License 2.0

s3: allow specifying bucket name in uri #205

Closed tschoonj closed 5 years ago

tschoonj commented 5 years ago

Is your feature request related to a problem? Please describe.

Currently pushing and pulling images from an s3 bucket requires the use of an environment variable SREGISTRY_S3_BUCKET to specify the bucket that will be used for these operations. I am thinking that it may be cleaner to use the bucket name in the uri.

I understand that this is not a trivial fix, especially if the current behavior still needs to be supported...

Even though I haven't used them, I guess that the other backends may benefit from a similar approach as well?

Describe the solution you'd like

Something like this would be ideal:

sregistry pull s3://bucket/library/image:tag
sregistry push --name s3://bucket/library/image:tag /path/to/image.sif
vsoch commented 5 years ago

If the bucket works akin to a registry, this is fairly reasonable to do. Could you give me an example of a URI that defines a bucket and a registry (is that possible?)

vsoch commented 5 years ago

Right now, here is how your uri above is parsed; notice that "registry" gets filled with the bucket:

In [4]: parse_image_name(remove_uri(image))
Out[4]: 
{'collection': 'library',
 'image': 'image',
 'original': 'bucket/library/image:tag',
 'registry': 'bucket',
 'storage': 'bucket/library/image-tag.sif',
 'storage_uri': 'bucket/library/image:tag.sif',
 'storage_version': 'bucket/library/image:tag.sif',
 'tag': 'tag',
 'tag_uri': 'bucket/library/image:tag',
 'uri': 'bucket/library/image-tag',
 'url': 'bucket/library/image',
 'version': None}

So we could move the bucket requirement to later, and then fail if it's not provided during push / pull (or similar). The one concern: if it's possible to have both a registry and a bucket, we wouldn't be able to tell them apart (and it's likely not possible unless we did something like):

sregistry push --bucket bucket --name s3://library/image:tag /path/to/image.sif

Thoughts?
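The "move the bucket requirement later" idea could be sketched roughly like this (the helper name and dict shape are hypothetical, not the actual sregistry internals):

```python
import os

def resolve_bucket(parsed):
    # Hypothetical helper: when an s3:// uri carries no real server host,
    # reuse the parsed 'registry' field as the bucket name, falling back
    # to the SREGISTRY_S3_BUCKET environment variable used today.
    bucket = parsed.get('registry') or os.environ.get('SREGISTRY_S3_BUCKET')
    if bucket is None:
        raise ValueError('no bucket in uri and SREGISTRY_S3_BUCKET is unset')
    return bucket
```

With the parse output above, this would return 'bucket'; with a bare library/image:tag uri it would fail unless the environment variable is set.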

tschoonj commented 5 years ago

What would the bucket need to do to function as a registry?

What gets put in 'registry' currently? The library?

vsoch commented 5 years ago

What would the bucket need to do to function as a registry?

It would need to be the case that there is no concept of a registry for Amazon S3. If there is both a registry and a bucket, we can't share the same parser. If you could give me a list of examples (with and without both registry and library, if that's possible), it would help.

What gets put in 'registry' currently? The library?

No, if there is no registry, it's just left as None.

vsoch commented 5 years ago

I think this needs more thought - there is in fact a difference between a registry (or base) and a bucket, see https://docs.aws.amazon.com/AmazonS3/latest/dev/UsingBucket.html. And in the example docs for s3 here, we connect to a Minio server (so the base/registry is served at localhost). This means that we can't adopt the common parser function, because it can't distinguish between a bucket and a registry. What we could do is provide another avenue for specifying a bucket, although we would need to change the init sequence to re-parse and connect to it.

tschoonj commented 5 years ago

Apologies, but I am afraid you lost me...

By registry for amazon s3: what do you mean exactly? As in a deployment of a singularity registry server in a bucket? If so, I don't think that's possible as s3 really is just for file storage...

vsoch commented 5 years ago

The registry is the server - so for Minio (which serves S3) it would be something like http://127.0.0.1:9000. For Amazon, the pattern would be something like s3.<region>.amazonaws.com. For example, both of these should technically be valid (although they aren't currently parsed); we require the registry to be set as an environment variable first:

sregistry push --name s3://s3.<region>.amazonaws.com/library/image:tag /path/to/image.sif
sregistry push --name s3://127.0.0.1:8000/library/image:tag /path/to/image.sif

The first parses s3 as the endpoint, s3.<region>.amazonaws.com as the registry, library as the collection namespace, image as the image, and tag as the tag. Does that make sense? The uri can't accept a bucket there because we wouldn't be able to distinguish between a bucket and a registry. We don't currently do special parsing to derive these from the URI (though someone might request that as a feature, and it would make sense).
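To illustrate why the ambiguity is hard (rather than to propose real parsing logic), a naive heuristic might guess that a server address contains a dot or a port separator while a bucket name does not - but bucket names may legally contain dots, so it can misfire. The function below is hypothetical:

```python
def looks_like_registry(name):
    # Naive heuristic (illustration only): server addresses such as
    # s3.<region>.amazonaws.com or 127.0.0.1:9000 contain a dot or a
    # port separator, while a plain bucket name often has neither.
    # Bucket names may legally contain dots, so this is unreliable.
    return '.' in name or ':' in name
```

For example, 'my.bucket' would be misclassified as a registry, which is exactly why the shared parser can't resolve this on its own.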

vsoch commented 5 years ago

By registry for amazon s3: what do you mean exactly? As in a deployment of a singularity registry server in a bucket? If so, I don't think that's possible as s3 really is just for file storage...

I mean the storage server. It's not a Singularity Registry, no, but it is a server with an address we push / pull / interact with.

tschoonj commented 5 years ago

Ok thanks for the clarification.

I think that things would be simpler if the S3 backend were used exclusively for actual AWS S3 buckets, in which case you would no longer need to worry about the endpoint. There would also be no need to specify the region (as is currently the case already).

In this case, to continue supporting the Minio server, perhaps it would be best to create a dedicated client for it (which could inherit from the S3 client?)

vsoch commented 5 years ago

There will always be a need to specify a different base, even for an S3 user who doesn't need Minio (for example, let's say that the zone changed). We specify this registry (what is called the base in the code) when we instantiate the client:

        self.s3 = boto3.resource('s3',
                                 # endpoint_url=None uses the s3 default; otherwise a custom base
                                 endpoint_url=self.base,
                                 aws_access_key_id=self._id,
                                 aws_secret_access_key=self._key,
                                 config=boto3.session.Config(signature_version=self._signature))

Minio is s3 storage (it has buckets too), so I don't think we would need to introduce a lot of redundant code. I think what we want to discuss is: what would be a better way to designate a bucket?

vsoch commented 5 years ago

There are a lot of labs / institutions that host their own s3 storage, or simply have something as trivial as a different zone, so the endpoint specification can't just be removed and everyone forced onto the same one. What about adding optional kwargs to push/pull so we could allow a bucket argument?
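The optional-kwarg idea might look roughly like this (the signature is a sketch, not the actual sregistry client API; the upload itself is omitted):

```python
import os

def push(path, name, bucket=None, **kwargs):
    # Hypothetical sketch: prefer an explicit bucket argument, falling
    # back to the SREGISTRY_S3_BUCKET environment variable used today.
    bucket = bucket or os.environ.get('SREGISTRY_S3_BUCKET')
    if bucket is None:
        raise ValueError('pass bucket= or set SREGISTRY_S3_BUCKET')
    # A real implementation would upload `path` under `name` here.
    return bucket, name
```

This keeps the current environment-variable behavior working while letting callers override the bucket per command.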

tschoonj commented 5 years ago

You cannot actually change the zone of an existing bucket. For that you would need to copy the bucket's contents to an intermediate bucket, delete the old one, wait 24 hours, create a new bucket with the old name in the zone of your choice, and copy the contents of the intermediate bucket into it.

Since bucket names have to be unique across all AWS S3 zones, I don't think one would ever need to specify the endpoint_url argument to use them.

Using a command-line option to specify the bucket name would be slightly cleaner than using an environment variable, but I'm not sure that really adds much.

vsoch commented 5 years ago

@tschoonj is this ok to close?