jprante / elasticsearch-knapsack

Knapsack plugin is an import/export tool for Elasticsearch
Apache License 2.0

Adding support for backups to AWS S3 #29

Open nicktgr15 opened 10 years ago

nicktgr15 commented 10 years ago

Hello,

We are working on extending knapsack to support backups to AWS S3. However, the AWS Java SDK is about 20 MB, while knapsack is roughly 2 MB. Do you consider the increased package size an issue?

If size is an issue, we could probably use the Maven Assembly Plugin to generate two different packages (i.e. elasticsearch-knapsack-2.5.1.zip and elasticsearch-knapsack-aws-2.5.1.zip). Of course, we are open to suggestions or alternative approaches.

Regards, Nick

jprante commented 10 years ago

Cool idea, yes, another zip in the assembly plugin would be ok. It's not the size, but the feature set. AWS S3 is just optional, so everybody can choose.

nicktgr15 commented 10 years ago

Hello,

Thanks for the quick reply. I have a working implementation, but I wonder whether it is better to have two zip files or just one. Since size doesn't matter (the package is about 10 MB), I think a single .zip file would make things simpler.

In the current implementation, an s3.bucket_name parameter has to be defined in elasticsearch.yml to enable backups/restores to/from S3 (this keeps S3 storage an optional feature).
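For example (the bucket name here is just a placeholder):

s3.bucket_name: my-backup-bucket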

To back up an index to S3, a request like the following has to be executed:

curl -XPOST 'localhost:9200/index_name/_export?target=/tmp/index_name.tar.gz&s3path=path/in/bucket/index_name.tar.gz'

Similarly, a restore from S3 can be done like this:

curl -XPOST 'localhost:9200/index_name/_import?target=/tmp/index_name.tar.gz&s3path=path/in/bucket/index_name.tar.gz'

Both the target and s3path parameters are required for S3 to work properly.

I'd like to hear your thoughts on packaging, and any suggestions (or concerns) you may have, before opening a pull request.

Nick

jprante commented 10 years ago

Many thanks for your suggestions.

Is it possible to add new hooks to the REST API? I'd like to have _export and _import for local dumps only. How about _export/s3/... and _import/s3/..., using something like url as the parameter name instead of s3path?

Example for url: https://s3.amazonaws.com/[your_bucket]/[your_file_name]

I think you need to add the AWS key somewhere; a key parameter could make that possible.
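So a call could look roughly like this (just a sketch, the exact parameter names are not settled, and key stands in for however the AWS credentials end up being passed):

curl -XPOST 'localhost:9200/index_name/_export/s3?target=/tmp/index_name.tar.gz&url=https://s3.amazonaws.com/[your_bucket]/index_name.tar.gz&key=[your_aws_key]'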

From what I understand, the S3 API does not allow upload streaming, so the knapsack tar archive is first created locally and then uploaded to S3? Will it be deleted locally after a successful upload?
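Something like this is what I imagine happening, roughly (a sketch against the AWS SDK for Java v1, all names made up, certainly not your actual code):

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3Client;

public class S3Upload {

    // upload the locally created tar archive, then remove it after success
    public static void uploadAndClean(String accessKey, String secretKey,
                                      String bucket, String path, File archive) {
        AmazonS3Client s3 = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
        s3.putObject(bucket, path, archive); // blocks until the upload finishes
        if (!archive.delete()) {
            archive.deleteOnExit(); // fall back if the file cannot be removed right away
        }
    }
}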

I can build two zips with Maven profiles and offer them via Bintray, no need to worry about that. The S3 feature could then be available only in the S3 knapsack plugin zip by using the ServiceLoader functionality; this needs a little rework though.
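Roughly what I have in mind with ServiceLoader (interface and class names made up, just a sketch):

import java.io.File;
import java.util.ServiceLoader;

// ArchiveTransport.java -- the S3 plugin zip ships an implementation plus the
// matching META-INF/services entry; the plain zip ships none
public interface ArchiveTransport {
    boolean canHandle(String url);
    void upload(File archive, String url) throws Exception;
}

// TransportLookup.java -- returns the first registered transport that accepts
// the url, or null if none is on the classpath
class TransportLookup {
    static ArchiveTransport find(String url) {
        for (ArchiveTransport transport : ServiceLoader.load(ArchiveTransport.class)) {
            if (transport.canHandle(url)) {
                return transport;
            }
        }
        return null;
    }
}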