inveniosoftware / invenio-s3

S3 file storage support for Invenio.
https://invenio-s3.readthedocs.io
MIT License

config: support for multiple S3 locations #14

Open ppanero opened 4 years ago

ppanero commented 4 years ago

Currently only one bucket and one endpoint are supported. Some use cases require multiple buckets or endpoint URLs, so the configuration should be changed to allow that.

wgresshoff commented 4 years ago

I'll start by looking into how the configuration works. Then I can estimate the time I need to commit.

wgresshoff commented 4 years ago

After looking into how invenio-s3 and invenio-files-rest are actually implemented, I've come to the conclusion that this is not that easy to implement. I see two targets to achieve:

The problem I see is that there is no obvious relation between a location and the files configuration at all. That's not a problem if the location is just a directory somewhere in the file system, but to define an S3 endpoint you need at least the URL of the S3 server and a secret. The storage factory creates the storage class from the fileurl alone.

I think that should be refactored by adding at least the name of the location, so there is a key to distinguish the S3 configuration properties. An S3 endpoint URL per location would then be configured by the property invenio_s3.config.S3_ENDPOINT_URL.location_name (with invenio_s3.config.S3_ENDPOINT_URL as the fallback if there is just one S3 endpoint). The other configuration options would work similarly.

egabancho commented 4 years ago

@ppanero @wgresshoff what about moving the endpoint to the location URI? (I probably thought about this at some point.) We could have something like s3://myserver.com/b1. The only open question I have is the default base URL: if one uses AWS S3 you don't actually need to specify the URL of the server, since the boto3 library already sets it internally.

I don't really like the idea of adding a configuration variable to solve this because it'll add complexity to the location creation, which we want to avoid. Plus, I see this as the kind of thing that I would forget and then wonder for a day or two why my files are not in the right place 😂
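To make the URI idea concrete, here is a minimal sketch of how a location URI such as s3://myserver.com/b1 could be split into an endpoint and a bucket. The function name `split_location_uri`, the `endpoint_scheme` default of https, and the choice to return `None` when no host is given (leaving boto3 to pick its default endpoint) are all assumptions made for illustration, not part of invenio-s3. Note that this reading differs from the common s3://bucket/key convention, where the host part is the bucket.

```python
from urllib.parse import urlsplit


def split_location_uri(uri, endpoint_scheme="https"):
    """Split a proposed-style location URI into (endpoint, bucket, key_prefix).

    Hypothetical helper: assumes the host part names the S3 server and the
    first path segment names the bucket.
    """
    parts = urlsplit(uri)
    if parts.scheme != "s3":
        raise ValueError(f"not an s3 URI: {uri!r}")

    # First path segment is the bucket, the rest (if any) is a key prefix.
    bucket, _, key_prefix = parts.path.lstrip("/").partition("/")
    if not bucket:
        raise ValueError(f"no bucket in URI: {uri!r}")

    # No host means "use the client's default endpoint" (assumption).
    endpoint = f"{endpoint_scheme}://{parts.netloc}" if parts.netloc else None
    return endpoint, bucket, key_prefix


endpoint, bucket, prefix = split_location_uri("s3://myserver.com/b1")
```

As the rest of the thread points out, this only recovers the server URL and bucket; credentials would still have to come from somewhere else.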

wgresshoff commented 4 years ago

@egabancho @ppanero I need to clarify my idea: I would like to add the location name in the parameter list of the storage factory, so the storage knows which configuration to use. The location would not be changed. The base URL itself without credentials won't help (there could be different credentials for the same server URL with different storage prices).
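A rough sketch of why the location name, rather than the server URL, would be the lookup key: two locations can share one endpoint and still need different credentials. Everything here is hypothetical (the `S3Settings` class, the `SETTINGS_BY_LOCATION` registry, the `storage_kwargs_for` function, the example.org endpoint and the placeholder keys); it does not reflect the actual invenio-files-rest factory signature.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Settings:
    endpoint_url: str
    access_key_id: str
    secret_access_key: str


# Hypothetical registry keyed by location name. Both locations point at the
# same server but use different credentials, so the URL alone is ambiguous.
SETTINGS_BY_LOCATION = {
    "standard": S3Settings("https://s3.example.org", "key-a", "secret-a"),
    "archive": S3Settings("https://s3.example.org", "key-b", "secret-b"),
}


def storage_kwargs_for(fileurl, location_name, registry=SETTINGS_BY_LOCATION):
    """Collect what a storage class would need for one file in one location."""
    try:
        settings = registry[location_name]
    except KeyError:
        raise LookupError(f"no S3 settings for location {location_name!r}") from None
    return {
        "fileurl": fileurl,
        "endpoint_url": settings.endpoint_url,
        "key": settings.access_key_id,
        "secret": settings.secret_access_key,
    }


standard = storage_kwargs_for("s3://b1/file.bin", "standard")
archive = storage_kwargs_for("s3://b1/file.bin", "archive")
```

Here `standard` and `archive` carry the same `endpoint_url` but different keys, which is the case a URL-only lookup could not distinguish.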

egabancho commented 4 years ago

You are definitely right, a URL without credentials is not going to work. You have probably already discussed this with @ppanero, but could you put here an example of what this configuration would look like? It'd really help me understand what you have in mind ☺️

ppanero commented 4 years ago

@egabancho we did not really sketch how it would look at that point. I thought of having something like for ES, where you can specify multiple hosts, each with its credentials.

However, I think we would end up in the same issue @wgresshoff is mentioning, and need to change the files factory.

wgresshoff commented 4 years ago

Sorry, I needed some time, but finally... Ok, the example: if there are two locations defined, say names are amazon_aws and cephfs the configuration would look like this:

S3_ENDPOINT_URL.amazon_aws = https://amazon.com
S3_ENDPOINT_URL.cephfs = https://ceph.com
S3_ACCESS_KEY_ID.amazon_aws = xyz
S3_ACCESS_KEY_ID.cephfs = abc
S3_SECRET_ACCESS_KEY.amazon_aws = sdsdsdsds
S3_SECRET_ACCESS_KEY.cephfs = abcabdafah

And finally there might be some default configuration (which would surely lead to some nice errors if forgotten) as fallback:

S3_ENDPOINT_URL = https://default.com
S3_ACCESS_KEY_ID = ghz
S3_SECRET_ACCESS_KEY = lkhjsdafkjhfdskjh

So everywhere the config is consulted the location_name should be known. This leads to some more code in invenio-s3 but a function to read the configuration is rather simple to implement (and only needed in invenio-s3).
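A minimal sketch of such a read function, assuming the configuration behaves like a plain dict (as Flask's config object does) and that the per-location keys are stored literally with the dotted suffix shown above. The name `get_s3_setting` and the decision to raise `KeyError` when neither the per-location key nor the fallback exists are assumptions for illustration.

```python
def get_s3_setting(config, name, location_name=None):
    """Return the per-location value of `name`, else the global fallback.

    Hypothetical helper: looks up "NAME.location_name" first, then "NAME".
    """
    if location_name is not None:
        scoped_key = f"{name}.{location_name}"
        if scoped_key in config:
            return config[scoped_key]
    if name in config:
        return config[name]
    raise KeyError(f"{name} is not configured for location {location_name!r}")


# Example values taken from the configuration proposed in this comment.
config = {
    "S3_ENDPOINT_URL.amazon_aws": "https://amazon.com",
    "S3_ENDPOINT_URL.cephfs": "https://ceph.com",
    "S3_ACCESS_KEY_ID.amazon_aws": "xyz",
    "S3_ACCESS_KEY_ID.cephfs": "abc",
    "S3_ENDPOINT_URL": "https://default.com",
    "S3_ACCESS_KEY_ID": "ghz",
}

ceph_url = get_s3_setting(config, "S3_ENDPOINT_URL", "cephfs")
other_url = get_s3_setting(config, "S3_ENDPOINT_URL", "some_other_location")
```

With these values, `ceph_url` resolves to the cephfs endpoint, while `other_url` falls back to the default because no key exists for that location. Raising on a missing fallback makes the "forgotten default" case fail loudly instead of silently using the wrong server.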