numblr / glaciertools

Command line (bash) scripts to upload large files to AWS glacier using multipart upload and to calculate the required tree hash
MIT License

Question rather than issue #17

Open 80kk opened 5 years ago

80kk commented 5 years ago

What will happen if I run your script with the option to split into 4GB chunks against a folder containing large files (>4GB) as well as files smaller than 4GB? Will it fail on the 'small' files and continue splitting the large ones? Let's say I have the following folder structure:

`/data/files/x/` `/data/files/y/` `/data/files/`

All of them contain files ranging from 300MB to 100GB. Ideally I'd like to run the script in a tmux/screen session and check on it once a day.

numblr commented 5 years ago

Hi,

It should work fine even if some (or all) of the files are smaller than the split size. The chunk size only sets a maximum size for each chunk in the multipart upload; if there is just one chunk and it is smaller than the maximum, that should not break anything. By the way, I changed the code in the meantime to support multiple files, so you can now use wildcards when invoking the script. The wildcard expressions are expanded by the shell, so you can check what they cover by trying them with `ls` first. In your case `./glacierupload -v myvault /data/files/**/* /data/files/*` should work, and you can test the wildcards beforehand with `ls /data/files/**/* /data/files/*`.
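A minimal sketch of that workflow; the vault name `myvault` is just the example used in this thread, and enabling `globstar` is an assumption about the shell configuration (in bash, `**` only recurses into subdirectories when that option is set):

```bash
#!/usr/bin/env bash
# Assumption: bash >= 4, where globstar makes ** match files in subdirectories.
shopt -s globstar

# Preview exactly which files the wildcards expand to before uploading.
ls /data/files/**/* /data/files/*

# Run the upload against the same expansion ("myvault" is the example vault).
./glacierupload -v myvault /data/files/**/* /data/files/*
```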

Best, Thomas

80kk commented 5 years ago

Why is the maximum value of the split-size 2^22, which would mean 4TB, whilst the maximum chunk size accepted by Glacier is 4GB (2^12)?

numblr commented 5 years ago

Sorry, that is a typo in the documentation; it should indeed be 12.
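For reference, assuming (as the exchange above implies) that the split-size value is the exponent of a power-of-two number of megabytes, the arithmetic behind the correction looks like this:

```bash
# 2^12 MB is the largest part size Glacier accepts; 2^22 MB was the typo.
echo "$(( 2**12 )) MB"   # 4096 MB   = 4 GB  (valid maximum)
echo "$(( 2**22 )) MB"   # 4194304 MB = 4 TB (far beyond the Glacier limit)
```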

numblr commented 5 years ago

By the way, the advantage of a split size smaller than the file size is that the multiple parts are uploaded in parallel. If you specify multiple files in the glacierupload command, however, they are still processed in sequence (this might be a point for improvement). That means if you want to speed up the upload, you might consider not uploading all files in one command invocation, but starting several upload commands in parallel, each on a subset of your files, as in the sketch below.
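A minimal sketch of running several invocations concurrently, one per subdirectory; the directory names and the `myvault` vault are just the examples from this thread, and backgrounding with `&` plus `wait` is one simple way to do it:

```bash
#!/usr/bin/env bash
# Start one upload per subdirectory in the background so the files in
# different subdirectories are uploaded concurrently.
./glacierupload -v myvault /data/files/x/* > upload-x.log 2>&1 &
./glacierupload -v myvault /data/files/y/* > upload-y.log 2>&1 &

# Wait for both background uploads to finish before the script exits.
wait
```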

80kk commented 5 years ago

What if I have 8TB of data to upload and the total space available on the local hard drive for the cache/split is 250GB? Will the script clean up after each upload completes?

80kk commented 5 years ago

It would also be great if the script could write to a log file. It doesn't have to be JSON; plain text would do.

numblr commented 5 years ago

That should be fine: it only caches on disk the parts that are currently being uploaded, i.e. at most 4GB × (number of parallel uploads), and it cleans up after each part completes (in case of an error there might be some data left in tmp folders, but that should be cleaned up on a restart of the operating system). The number of parallel uploads for a single invocation of glacierupload is determined by the `parallel` command and is, I think, the number of CPUs available. To get a log file you can simply redirect the output: `./glacierupload -v myvault > result.txt 2> upload.log`, or `./glacierupload -v myvault * > upload.log 2>&1` (all in one file). I can't say off the top of my head, though, which output goes to stderr and which to stdout. The final result is already stored as a JSON file in the folder from which you start the upload (see the docs).
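The redirection variants from the comment above as a runnable sketch (the `*` glob and the `myvault` vault name are just the examples used in this thread; the `tee` variant is an extra option, not from the original comment):

```bash
# Split the output: stdout to result.txt, stderr to upload.log.
./glacierupload -v myvault * > result.txt 2> upload.log

# Or collect stdout and stderr together in a single log file.
./glacierupload -v myvault * > upload.log 2>&1

# Or keep the output visible in the terminal while also logging it.
./glacierupload -v myvault * 2>&1 | tee upload.log
```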

80kk commented 5 years ago

Thanks. I am testing it now.

numblr commented 5 years ago

Just out of curiosity, did it work properly or did you run into any problems? If it worked fine I'll tag the current state as a release :)

80kk commented 5 years ago

It is currently running (500GB out of 8TB so far). It is slow because the source data is mounted onto an EC2 instance using s3fs, and the files are then split into 2GB chunks.

numblr commented 5 years ago

Then fingers crossed ;) By the way, if you already have the data in S3(?), there might be easier options to get it into Glacier. I'm not really an expert on this, but I found this question, for example.