Open 80kk opened 5 years ago
Hi,
It should work fine even if some (or all) of the files are smaller than the split size. The chunk size only sets a maximum size for each chunk in the multipart upload; if there is just one chunk and it is smaller than the maximum, that should not break anything. Btw, I changed the code in the meantime to support multiple files, so you can now also use wildcards to invoke the script. The wildcard expressions are expanded by the shell, so you can test what they cover by just trying them with `ls`. In your case `./glacierupload -v myvault /data/files/**/* /data/files/*` should work, but you can test the wildcards first with `ls /data/files/**/* /data/files/*`.
Best, Thomas
Why the maximum value of the split-size is 2^22 which means 4TB, whilst the maximum chunk accepted by glacier is 4GB (2^12)?
Sorry, that is a typo in the documentation; it should indeed be 12.
Btw, the advantage of a split size smaller than the file size is that the multiple parts are uploaded in parallel. If you specify multiple files in the `glacierupload` command, those are, however, still processed in sequence (this might be a point for improvement). That means if you want to speed up the upload, you might consider not uploading all files in one command invocation, but starting several upload commands in parallel, each on a subset of your files.
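A minimal sketch of that pattern, using a placeholder `upload` function in place of the real `./glacierupload -v myvault` invocation (the subset globs are just examples):

```shell
# Stand-in for: ./glacierupload -v myvault "$@"  (replace before real use)
upload() { echo "uploading: $*"; }

# Start one upload per subset in the background, each with its own log,
# then wait for all of them to finish.
upload /data/files/a* > subset-a.log 2>&1 &
upload /data/files/b* > subset-b.log 2>&1 &
upload /data/files/[c-z]* > subset-c.log 2>&1 &
wait   # returns once every background upload has exited
```

Each background job gets its own log file so the interleaved output stays readable.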
What if I have 8TB of data to upload and only 250GB of local hard drive space available for cache/split? Will the script clean up after each upload completes?
It would also be great if the script could write to a log file. It can be txt, not necessarily json.
That should be fine; it only caches the parts that are currently uploading on disk, i.e. at most 4GB × (number of parallel uploads), and it cleans up after completion of each part (in case of error there might be some data left in tmp folders, but that should be cleaned up on a restart of the operating system). The number of parallel uploads for a single invocation of `glacierupload` is determined by the `parallel` command and is, I think, the number of CPUs available.
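If you want to check or cap that parallelism yourself (this assumes GNU parallel is installed; `-j` is its standard job-count flag):

```shell
nproc                        # CPU cores the OS reports
parallel --number-of-cores   # the count GNU parallel detects by default
# Capping the job count also caps peak temp-disk usage: at most
# <split size> * <jobs> bytes of chunk data exist at any one time,
# e.g.  parallel -j 2 ...  keeps only two chunks in flight.
```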
To get a log file you can just redirect the output to a file: `./glacierupload -v myvault > result.txt 2> upload.log` (stdout and stderr separated) or `./glacierupload -v myvault * > upload.log 2>&1` (all in one file). Can't tell off the top of my head, though, which output goes to stderr and which to stdout. The final result is already stored as a json file in the folder from which you start the upload (see the docs).
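To watch progress live while still keeping a combined log, `tee` works too; here is a runnable sketch with a subshell standing in for the real command:

```shell
# Replace the subshell with: ./glacierupload -v myvault *
( echo "progress on stdout"; echo "warning on stderr" >&2 ) 2>&1 | tee upload.log
# upload.log now holds both streams, and they were also printed live.
```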
Thanks. I am testing it now.
Just out of curiosity, did it work properly, or did you run into any problems? If it worked fine, I'll tag the current state as a release :)
It is currently running (500GB out of 8TB so far). It is slow because the source data is mounted via s3fs onto an EC2 instance, and the files are then being chunked into 2GB chunks.
What will happen if I run your script with the option to split into 4GB chunks against a folder containing both large files (>4GB) and small files (<4GB)? Will it fail on the 'small' files and continue splitting the large ones? Let's say that I have the following folder structure:
/data/files/x/ /data/files/y/ /data/files/
All of them contain files from 300MB to 100GB in size. Ideally I'd like to run the script in a tmux/screen session and check on it once a day.
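A sketch of that tmux workflow (the session name `glacier` is just an example; this assumes the wildcard invocation shown earlier in the thread):

```shell
# Start the upload detached so it survives logout; check on it once a day.
tmux new-session -d -s glacier \
  './glacierupload -v myvault /data/files/**/* /data/files/* > upload.log 2>&1'

# Later: reattach with `tmux attach -t glacier`, or just follow the log:
# tail -f upload.log
```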