s3tools / s3cmd

Official s3cmd repo -- Command line tool for managing S3 compatible storage services (including Amazon S3 and CloudFront).
https://s3tools.org/s3cmd
GNU General Public License v2.0

Enhancement Request: use `sqlite3` instead of in-memory lists #408

Open jaddison opened 9 years ago

jaddison commented 9 years ago

Several people have issues with s3cmd consuming all available memory, causing OOM errors - mostly while syncing large directory structures (see #206, #405, #364 and a few others listed here: https://github.com/s3tools/s3cmd/search?q=memory&type=Issues&utf8=✓).

I can't help but wonder why Python's standard sqlite3 module isn't being used in place of in-memory lists? This would surely greatly reduce the memory usage. It could be an --use-sqlite3 option or completely replace existing functionality.

Creating a temporary on-disk database in the system's temp directory should be feasible. It would/should be used for both the local and remote file listings.
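Roughly the kind of thing I have in mind - purely a sketch, not based on s3cmd's internals, with made-up table and column names:

import os
import sqlite3
import tempfile

# Sketch only: keep both listings in an on-disk SQLite file and let SQL do the diff.
db_path = os.path.join(tempfile.gettempdir(), "s3cmd-sync.sqlite3")
db = sqlite3.connect(db_path)
db.executescript("""
    CREATE TABLE IF NOT EXISTS local  (path TEXT PRIMARY KEY, size INTEGER, mtime REAL);
    CREATE TABLE IF NOT EXISTS remote (path TEXT PRIMARY KEY, size INTEGER);
""")

def index_local(root):
    # Walk the local tree and store one row per file instead of one list entry per file.
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            db.execute("INSERT OR REPLACE INTO local VALUES (?, ?, ?)",
                       (os.path.relpath(full, root), st.st_size, st.st_mtime))
    db.commit()

# Remote rows would be filled the same way while listing the bucket; then, for example,
# files present locally but missing (or differently sized) remotely:
to_upload = db.execute("""
    SELECT l.path FROM local AS l
    LEFT JOIN remote AS r ON r.path = l.path
    WHERE r.path IS NULL OR r.size != l.size
""")

The point is that the listings and the comparison live on disk, so memory stays flat no matter how many files are involved.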

Has this been considered before?

mattbillenstein commented 9 years ago

IMO a better option would be to iterate over the bucket listing only considering one key at a time - I believe the api will page the results for you 1000 or so keys at a time. So you only end up keeping 1000 keys in memory at a time.

If that's not easy in the existing code, using key prefixes to only look at part of the filesystem being considered would work as well.

jaddison commented 9 years ago

I'm not too familiar with the code, but I don't think that would work, @mattbillenstein. Both the remote file list and the local file list contribute to the memory issue - they're both stored in memory when syncing.

If you looped through the remote list 1000 files at a time, you'd either need the local file list in memory to compare against (still creating the memory 'pressure' problem) or repeatedly hit the disk to see if each file exists.

mattbillenstein commented 9 years ago

So my point is you do a stat on each file either way - it's just a matter of whether you do it all up front, where you need to keep that info in memory, or as you page through the s3 keys...

It's really just a matter of reordering the operations, you do the same amount of work in either case - the difference is you only have that information in memory as you need it, not loading it all at the beginning.

jaddison commented 9 years ago

Fair enough - sounds like we agree that there are solutions besides "increase your available memory".

mdomsch commented 9 years ago

You need a recursive directory listing on both sides. The local list (os.walk()) doesn't return entries in alphabetical order, unlike the S3 LIST command. Doing an os.stat() on every file in the list from S3 is indeed necessary, but to also handle newly created local files you want to copy to S3 (the very common case), you have to get the full local directory listing at the very least. Otherwise you're effectively calling s3cmd info for every local file to determine whether it's present (instead of getting it from the directory listing), which would be very slow, and the price would add up in a hurry too.

The sqlite idea is a good one, especially if it's kept on disk and not in memory.

mattbillenstein commented 9 years ago

A database for this is totally overkill - stick the s3 key string in a set as you iterate over them - 1MM 256 byte keys takes around 370MB on my system -- then make another walk over the local directory at the end to upload new files to s3 that exist on the filesystem but you didn't catch when walking s3...

So all together:

  1. Walk s3, compare to local, mark s3 paths in a set
  2. Walk local, upload what isn't in the set

I think this minimizes the number of round trips to s3, and you should only stat each file on the filesystem once - although os.walk may do some of this under the hood.

I looked at the code a little bit - it doesn't seem straightforward to implement something like this, but using the raw boto apis, I think you could do this with a fairly small special-purpose script.
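Something along these lines, sketched here with today's boto3 client rather than the old boto API (bucket, prefix and local_root are placeholders, and the prints stand in for the real transfer calls):

import os
import boto3

bucket, prefix, local_root = "my-bucket", "backup/", "/data"  # placeholders
s3 = boto3.client("s3")
seen = set()  # keys already reconciled during the S3 walk

# Pass 1: page through the bucket ~1000 keys at a time and compare each key
# against the corresponding local file; only one page is held in memory.
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        seen.add(key)
        local_path = os.path.join(local_root, key[len(prefix):])
        if not os.path.exists(local_path):
            print("remote only (candidate for delete):", key)
        elif os.path.getsize(local_path) != obj["Size"]:
            print("size differs (candidate for upload):", key)

# Pass 2: walk the local tree and upload anything that was never seen on S3.
for dirpath, _dirs, files in os.walk(local_root):
    for name in files:
        path = os.path.join(dirpath, name)
        key = prefix + os.path.relpath(path, local_root)
        if key not in seen:
            print("new file (upload):", key)
            # s3.upload_file(path, bucket, key)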

jaddison commented 9 years ago

1MM 256 byte keys takes around 370MB on my system

This may be non-trivial for some, but not everyone. Having your system memory-constrained is pretty common - a sudden 370MB+ increase in memory usage can easily result in OOM errors and processes getting killed. Mitigating this makes sense.

mattbillenstein commented 9 years ago

Be practical - if you have a system that needs to sync 1MM files to s3 using Python, you probably have 370MB of RAM to spare - the original report is using several gigabytes, so this would be at least a 10x improvement...

Also, there are further optimizations possible by using fancier algorithms - only considering branches of the file tree at a time, recursive traversal, divide and conquer, etc.

jaddison commented 9 years ago

I am being practical - it is entirely feasible for a low-memory VPS to have that many files. Taking into account other running processes such as gunicorn and celery (via supervisord), the situation I mention can, does and will happen frequently enough. Correlating memory with the number of files or disk space is just an assumption.

That said, if there are ways to alleviate the issue using fancy algorithms, that's great. There's no reason not to solve this via other means.

mdomsch commented 9 years ago

Another reason to consider the whole tree is duplicate file detection, followed by local hardlinking or remote copy of the duplicate files. In the original use case for this, the Fedora mirror system, hardlinks accounted for huge time and space savings. Like rsync, this only works if you can see the duplicates in whatever subset of the trees you're looking at.


mattbillenstein commented 9 years ago

s3cmd does file deduplication via hardlinks?

if 370MB is too much - add hash(path) to the set - ~90MB ...
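For example, something like this (just a sketch; md5 here is only a cheap fixed-width fingerprint, not a security measure, and collisions are theoretically possible):

import hashlib

seen = set()

def remember(key):
    # Store a 16-byte digest instead of the full key string.
    seen.add(hashlib.md5(key.encode("utf-8")).digest())

def already_seen(key):
    return hashlib.md5(key.encode("utf-8")).digest() in seen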

denydias commented 7 years ago

I'm in a huge fight here. I've spent a week and a half trying to back up an instance's data with s3cmd. The dataset to be backed up is about 800GB spread across 1,434,238 objects. The most recent attempt has been running against a new, empty bucket for about 30 minutes.

The command used looks like:

s3cmd --skip-existing --delete-removed --no-check-md5 --no-guess-mime-type \
      --storage-class=STANDARD_IA sync file1 file2 dir1/ dir2/ s3://bucket/

s3cmd is eating 45.2% of a t2.large's 7.8GB of memory just for the listing task. PUT operations are actually running as I write this, so I don't think the memory figure is going to rise any more, as the listings are done by now.

The thing that bothers me is that I was only able to get here after several days of trial and error to reach the

--skip-existing --delete-removed --no-check-md5 --no-guess-mime-type

magic combo. Before I figured out these options (which are poorly documented on the usage page, considering the effect they can have), s3cmd was eating all the available memory plus lots of swap, up to the point that the instance itself was rendered unusable and the backup task never finished. I hope it finishes now... let's see. At least this time I don't have a full-memory CloudWatch alarm ringing before s3cmd has even finished its listing stuff.

So, if you guys have something in mind to improve s3cmd memory usage, please do so. It'll be very welcome for people like me who need to back up huge datasets on a daily basis.

mattbillenstein commented 7 years ago

@denydias curious if you could 'pip install awscli' and take a look at the aws s3 sync command -- I haven't used this tool in quite a while...

denydias commented 7 years ago

Unfortunately not, @mattbillenstein.

These instances are production stuff. They have only the minimal set of tools for the purpose. Our engineering team ruled awscli out of that equation.

I don't have a similar dataset to test with aws s3 sync in any dev environments to compare its memory usage.

fviard commented 7 years ago

@denydias (First, as a side note, for documentation, it is better to look at the README in the project and at the inline help: s3cmd --help)

Your issues with a huge data set are understandable. Currently what is done with a sync is this:

  1. Create in memory a list LOCAL with all the local files and their "stat" info.
  2. Create in memory a list REMOTE with all the remote files, using an S3 LIST command (at least, both of them only for the adequate subdir).
  3. Compare these lists for differences, to finally have 3 more lists (still in memory): TO_UPLOAD, TO_COPY, TO_DELETE. At the same time, there can also be an in-memory list of all the unique "md5" sums of files, so that duplicates can be detected and remote copies performed.
  4. Then, the 3 lists are processed.

Why it did speed up and used less memory when using "--no-check-md5": when creating the "local" list, an md5sum of all files normally has to be computed; in that mode you don't have to do that. When creating the "remote" list, AWS S3 is not able to give you directly the "md5sum" (or some hash) for the "big files" that were uploaded in multiple parts, so for each such file seen in an AWS "ls", s3cmd has to issue an individual "file meta" request to retrieve a custom attribute with this value ("big files" are those exceeding multipart_chunk_size_mb in the config file, 15MB by default).

The drawback of this mode is that we can't detect differences in file content in order to upload the file. So files will be uploaded only if they are new (not already on the remote) or different in size. It is important to keep in mind that if a file was modified but kept the exact same size, it will not be uploaded, as it is considered "equal". In your case, you also used "--skip-existing", which means the size check will not even be done; just "new" files will be uploaded.
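Schematically, the decision in that mode boils down to something like this (illustrative names only, not the actual s3cmd code):

def should_upload(local_size, remote_size, skip_existing=False):
    # remote_size is None when the file does not exist on the remote side yet.
    if remote_size is None:
        return True
    if skip_existing:
        return False                      # any existing remote copy is kept as-is
    return local_size != remote_size      # without md5, size is the only remaining signal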

In the end, high memory usage often results in high CPU usage from moving memory blocks around, even if the task is not intensive.

So, having the option of using an on-disk sqlite3 database is indeed what would be the perfect solution for such usage, especially as sqlite3 "temp" tables can be used for the temporary lists like to_upload, to_copy, to_delete. That would also provide a performance improvement, like keeping locally a table of the hashes of remote big files. I don't think it would be hard to do, and it is on the "nice to have" list, but implementing it would require a long stretch of work.
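For illustration only (table names invented, not actual s3cmd code), assuming LOCAL and REMOTE had already been loaded into on-disk tables, the work lists could be built as SQLite TEMP tables so nothing accumulates in Python lists:

import sqlite3

db = sqlite3.connect("/tmp/s3cmd-sync.sqlite3")
db.execute("PRAGMA temp_store = FILE")   # keep TEMP tables backed by disk, not RAM
db.executescript("""
    CREATE TABLE IF NOT EXISTS local  (path TEXT PRIMARY KEY, size INTEGER);
    CREATE TABLE IF NOT EXISTS remote (path TEXT PRIMARY KEY, size INTEGER);
""")
db.execute("""
    CREATE TEMP TABLE to_upload AS
        SELECT l.path FROM local AS l
        LEFT JOIN remote AS r ON r.path = l.path
        WHERE r.path IS NULL OR r.size != l.size
""")
db.execute("""
    CREATE TEMP TABLE to_delete AS
        SELECT r.path FROM remote AS r
        LEFT JOIN local AS l ON l.path = r.path
        WHERE l.path IS NULL
""")
for (path,) in db.execute("SELECT path FROM to_upload"):
    pass  # upload path here, one at a time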

For your specific situation, there is a hack that could help you improve things a little: you can partition your job. Let's say that you want to back up this folder:

myserver/
myserver/subfolder1
myserver/subfolder2
myserver/subfolder3
...

Instead of doing:

s3cmd sync myserver s3://mybucket/

issue the following commands:

s3cmd sync myserver/subfolder1 s3://mybucket/
s3cmd sync myserver/subfolder2 s3://mybucket/
...

If you can partition one level further down, it is even better:

s3cmd sync myserver/subfolder1/subsub1 s3://mybucket/subfolder1/
s3cmd sync myserver/subfolder1/subsub2 s3://mybucket/subfolder1/

The added value of this strategy is that you can even run the jobs in parallel, as they work on separate data paths. The drawback is that there will be no "duplicate" detection for remote copying between items in different partitions.

danielmotaleite commented 7 years ago

I worked around this on a huge set of small static content files by running a separate s3cmd for each main directory and using the cache file, since only this process changes the S3 files:

cd /mnt/nfs/
for i in $(find product/ -mindepth 1 -maxdepth 1 -type d | sort -n ) ; do
    /usr/bin/s3cmd -v \
        -c /home/backup/s3cfg  \
        --delete-removed \
        $* \
        --cache-file=/home/backuppc/$site/md5-product-${i/*\/}.cache \
        sync $i/ s3://$site/$i/ 1>>/home/backup/$site/s3-$venture.log 2>&1
done

This makes everything faster: smaller directories make s3cmd faster, and the cache makes it faster still. Instead of spending almost a day copying data, I can now do it in a few hours. So until s3cmd behaves better with huge directories/many files, it is better to partition your copy/sync.

mattbillenstein commented 7 years ago

@denydias awscli is one of the official CLI tools built by amazon -- I would guess it handles this case better than this tool does -- you should probably figure out a way to use it.

denydias commented 7 years ago

Thanks for your reply, @fviard! I'll address the most important points of it.

@denydias (First, as a side note, for documentation, it is better to look at the README in the project and at the inline help: s3cmd --help)

I know (and have carefully read both). Neither addresses the huge-repository sync use case.

Your issues with a huge data set are understandable. Currently what is done with a sync is this:

  1. Create in memory a list LOCAL with all the local files and their "stat" info.
  2. Create in memory a list REMOTE with all the remote files, using an S3 LIST command (at least, both of them only for the adequate subdir).
  3. Compare these lists for differences, to finally have 3 more lists (still in memory): TO_UPLOAD, TO_COPY, TO_DELETE. At the same time, there can also be an in-memory list of all the unique "md5" sums of files, so that duplicates can be detected and remote copies performed.
  4. Then, the 3 lists are processed.

Thank you for this detailed explanation.

Why it did speed up and used less memory when using "--no-check-md5":

Yeap. I noticed that after trial and error. That said, --no-check-md5 alone* did not allow s3cmd to handle the huge dataset; it required the other options to be usable.

*At least that was the case in my scenario, though I don't know the s3cmd code well enough to make a bold statement here.

When creating the "local" list, an md5sum of all files normally has to be computed; in that mode you don't have to do that. When creating the "remote" list, AWS S3 is not able to give you directly the "md5sum" (or some hash) for the "big files" that were uploaded in multiple parts, so for each such file seen in an AWS "ls", s3cmd has to issue an individual "file meta" request to retrieve a custom attribute with this value ("big files" are those exceeding multipart_chunk_size_mb in the config file, 15MB by default). The drawback of this mode is that we can't detect differences in file content in order to upload the file. So files will be uploaded only if they are new (not already on the remote) or different in size. It is important to keep in mind that if a file was modified but kept the exact same size, it will not be uploaded, as it is considered "equal". In your case, you also used "--skip-existing", which means the size check will not even be done; just "new" files will be uploaded.

This is OK for my use case. Updated files do not overwrite their parents; they live beside the old ones, and the old ones never really change again (business rule).

In the end, high memory usage often results in high CPU usage from moving memory blocks around, even if the task is not intensive.

...which leads to that unusable instance I've reported before.

So, having the option of using an on-disk sqlite3 database is indeed what would be the perfect solution for such usage, especially as sqlite3 "temp" tables can be used for the temporary lists like to_upload, to_copy, to_delete. That would also provide a performance improvement, like keeping locally a table of the hashes of remote big files. I don't think it would be hard to do, and it is on the "nice to have" list, but implementing it would require a long stretch of work.

That's good to know! Please note that my first message on this issue was not a request, although at the end of it I expressed a desire that you guys do something about it. It was more of a report from someone having real-world memory issues with s3cmd and large dataset syncs. I'm fully aware that this might not be a trivial thing to put in place.

For your specific situation, there is a hack that could help you improve things a little:

....

The added value of this strategy is that you can even run the jobs in parallel, as they work on separate data paths. The drawback is that there will be no "duplicate" detection for remote copying between items in different partitions.

I wish I could! It's not our system that generates the data, nor is it a normalized thing. All the synced data comes from our customers' business rules, which vary a great deal for each customer, e.g. a financial services company has different ways of organizing its data than a law office does. We can't force those rules on the customers, which in turn renders any partitioning strategy useless.

But thank you anyway for the help attempt. :wink:


@mattbillenstein I'm aware of that, as is our engineering staff. The fact that awscli is the official thing doesn't change the fact that it addresses much more than what the use case needs. s3cmd, in turn, delivers just what is required.

There is No CODE that is more flexible than NO Code!

-- Brad Appleton

mattbillenstein commented 7 years ago

@fviard you do not need a sqlite database, you need a better algorithm.

@denydias if you want a quote: "Use the right tool for the job." --unknown

denydias commented 7 years ago

LOL, @mattbillenstein! No, this is not a quote battle! :stuck_out_tongue_closed_eyes:

I agree with you about the better algorithm.

mattbillenstein commented 7 years ago

And one last point re sqlite -- think of rsync, you're trying to do the same thing rsync does, and what it doesn't do is use a database, or use a crapton of ram -- and it does this over the network...

krushik commented 7 years ago

@denydias after I stumbled upon this memory issue, I switched to s3-parallel-put -- it works great for your case

denydias commented 7 years ago

Tks for the tip, @krushik! I didn't know it. It looks quite promising indeed!

fviard commented 7 years ago

@mattbillenstein @denydias Sometimes things are more complicated to implement than they look. So, really, the problem here is not "to have a better algorithm". You can trust me on the topic; I work on "backup" for resource-constrained devices (rsync, samba, gdrive, dropbox, clouds...).

You mention rsync; rsync has a similar behavior/algorithm, and its memory usage can grow quite a bit if you have a large file tree (a lot of files). The difference is that the specifics of the S3 protocol, and the impossibility of tweaking the server, can make the memory usage a multiple of rsync's - for example if we have to store 3 paths instead of one.

First, I would not say that s3cmd has the perfect algorithm; there are a number of things to be improved and a lot of legacy. With just a little time, a contributor might be able to reduce memory use by 5 to 30% in some cases. (There are some points like that on my todo list.) But there are good reasons for some design choices or limitations.

Another point is that the tool can have a lot of usages and so has to be generic in some ways. So, in the end, a custom development could always be a better fit for a specific use case. This is similar to "cp" versus "rsync": if you basically just copy a folder and some files, cp might be a lot faster and more efficient than rsync. And so, in your case, the set of options that you have chosen is not the one needed to be able to "put" such a dataset, but the one fitting your system resource constraints with just the needed features.

So, let's make all of that explicit with an example: s3-parallel-put looks to be the kind of implementation that you would expect: a process walks the local filesystem source and treats each file individually (or not, depending on what is needed), so there is no need to maintain the complete list of files in memory. That strategy can be more efficient in some cases, but it has the following drawbacks:

One last point, just a warning for people using "s3-parallel-put": this tool assumes that the etag always provides the md5 sum of the file on AWS S3. That is not true: the etag gives meaningless information (and so not the md5) for big files that were uploaded in multipart (i.e. it depends on the boto configuration, but I would say all files exceeding 15 to 35MB). So all these files will be re-uploaded at each job in "update" mode, even if they are already up-to-date.

denydias commented 7 years ago

@fviard,

Thanks a bunch for this thorough explanation! From my side (as well as our engineers'), s3cmd is the tool of choice. Any other tool would have to meet our criteria before hitting production.

The parameters I came up with look reasonable for the job. Right now, 808,512 of 1,434,236 objects have been uploaded and memory usage is 9.2% (the maximum was 45.2%). It started on May 23 at 7:56 AM UTC. I expect it to be finished by tomorrow night.

As this is the first sync to a new, empty bucket, I was expecting it to take that long. I hope the next ones, run on a daily basis, will finish in a few hours (~2h).

All in all, it looks that the worst is gone now.

denydias commented 7 years ago

Oh dear! Still having a hard time here sorting out this OOM thing.

The first sync attempt finished fine. But for the daily sync I had to remove the --skip-existing argument so that some files get updated during the sync process.

Now I have this:

May 30 11:15:00 server kernel: [177852.473269] Out of memory: Kill process 2719 (s3cmd) score 833 or sacrifice child

Also, sync is taking way too long to complete. See the output of my s3cmd wrapper script from yesterday sync, still with --skip-existing in effect:

[2017-05-29T01:22:01-03:00] Backing up to AWS S3...
[2017-05-29T01:22:01-03:00] Importing defaults from /etc/s3bkprc...
[2017-05-29T01:22:01-03:00] Creating temporary directory /tmp/bkpdump...
[2017-05-29T01:22:01-03:00] Dumping database...
[2017-05-29T01:22:57-03:00] Database dump done.
[2017-05-29T01:22:57-03:00] Backing up to AWS S3 bucket: s3://syncbucket
[2017-05-29T18:19:56-03:00] Backup to AWS S3 done.
[2017-05-29T18:19:56-03:00] Cleaning up...
[2017-05-29T18:19:56-03:00] Clean up done.
[2017-05-29T18:19:56-03:00] Total runtime: 0:16:57:55
[2017-05-29T18:19:56-03:00] Backup to AWS S3 done.

16:57:55 is A LOT of time for a sync process without md5 checksums.

One interesting thing to note is that up to a few weeks ago, this same instance was running Ubuntu 12.04 (also on a t2.large). With Precise reaching its EOL, it was then upgraded to Xenial. When still on Precise, the same sync was taking 2-4 hours, depending on how much the dataset had grown over the last day. Now I have yet to see it run under 6h, which is the backup window I have. With the current performance I'm almost 11h over that window.

I could set oom_score_adj for the s3cmd PID upon start, but that may cause the oom-killer to assassinate other processes instead. I could even disable the oom-killer system-wide, but that could lead to serious kernel woes, including panics. Anyway, none of these workarounds will get me closer to a sync performed within the time span I need (and that I had before).

Isn't there really anything I can do to improve the sync process run time?

fviard commented 7 years ago

@denydias I forgot to reply to your previous message, but I was expecting the memory usage to be worse after the initial sync, because both the local and remote lists would be full after the initial run.

For your specific case, there are 2 things I'm thinking of that could improve performance:

You can use the "-v" option when running your job to try to estimate the time taken by each step.

denydias commented 7 years ago

Thanks for the attention, @fviard! I greatly appreciate that!

You are right. Taking into account the memory workflow you explained in detail above, that 45.2% of memory usage with the empty bucket corresponds to the LOCAL list only, while the REMOTE list is just empty. Now that I have objects on both sides, of course memory usage is going to be higher. My logic was plain wrong.

So, I have made a few changes to .s3cfg. Here they are:

$ diff -u .s3cfg.old .s3cfg
--- .s3cfg.old  2017-05-30 13:28:45.955336016 -0300
+++ .s3cfg      2017-05-30 13:51:58.293541460 -0300
@@ -1,38 +1,62 @@
-delete_removed = False
+delete_removed = True
-get_continue = False
+get_continue = True
-guess_mime_type = True
+guess_mime_type = False
-human_readable_sizes = False
+human_readable_sizes = True
-multipart_chunk_size_mb = 15
+multipart_chunk_size_mb = 50
-put_continue = False
+put_continue = True
-socket_timeout = 100
+socket_timeout = 10
-use_mime_magic = True
+use_mime_magic = False

I've also changed the s3cmd arguments to:

--no-check-md5 --storage-class=STANDARD_IA sync

I'm holding off on adding -v right now, as I'm afraid it could add even more overhead to the process.

The automated backup is going to run at 1:22 AM. Tomorrow I come back here to let you know how it went with those changes. :crossed_fingers:

EDIT: if things don't get better with these changes, I'll have no other option but to improve my wrapper script to partition the source directories as you and @danielmotaleite have suggested. The thing is, I don't know how much improvement I can expect from that measure. If it turns out to be marginal, I'll only end up with a lot of operational overhead.

fviard commented 7 years ago

You should not put "+get_continue = True" in the config. It will change nothing for your current task; it is more for one-shot things with the get command. And regarding the "fix" with the multipart chunk size, FYI, it will only apply to newly uploaded files. Smaller files that were already uploaded as multipart will stay that way until they are updated.

denydias commented 7 years ago

Thanks for the advice, @fviard. All noted.

denydias commented 7 years ago

Well, the last changes didn't produce any improvement. In fact, it's a bit worse than before, as the oom_killer ran later than in yesterday's attempt. :cry:

May 31 15:47:29 server kernel: [280602.399309] Killed process 1487 (s3cmd) total-vm:14028492kB, anon-rss:5732316kB, file-rss:944kB

Now the only path left is to implement the process partitioning. This is going to take some days, but I'll come back here to report the results once it's done.

mattbillenstein commented 7 years ago

Omg, just use a different tool already...


denydias commented 7 years ago

Am I bothering you, @mattbillenstein?

I may not be the right person to talk about this subject, but I think real-world feedback is the most valuable asset for developers and for users facing problems in similar contexts.

So, if I sound annoying to you, first, I'm sorry. Second, hit that unsubscribe button so you stop receiving notifications from this issue, because I do not intend to stop giving s3cmd contributors feedback on my use case unless one of the contributors tells me to.

denydias commented 7 years ago

Partitioning s3cmd to the first level produced no gains. Same OOM issues, still taking too long to complete, and so on...

So I've improved the backup script to partition up to 3rd level. Here's an example for Bash:

echo "Backup 1st level dir files in $ROOT/"
cd $ROOT
f=0
ft=$(find assets/ -mindepth 1 -maxdepth 1 -type f | wc -l)
for file in assets/*.*; do
  f=$((f + 1))
  echo "| 1st level file $f/$ft: $file"
  $S3CMD $S3PARAM "$file" $S3PATH/bkp/assets/
done
echo "Backup stage 2nd level dirs in $ROOT/"
r=0
rt=$(find assets/ -mindepth 1 -maxdepth 1 -type d | wc -l)
find assets/ -mindepth 1 -maxdepth 1 -type d -print0 | sort -z | while read -d '' -r rd; do
  r=$((r + 1))
  echo "|- 2nd level dir $r/$rt: $rd/"
  f=0
  ft=$(find "$rd" -mindepth 1 -maxdepth 1 -type f | wc -l)
  find "$rd" -mindepth 1 -maxdepth 1 -type f -print0 | sort -z | while read -d '' -r file; do
    f=$((f + 1))
    echo "|- 2nd level file $f/$ft (from $r/$rt): $file"
    $S3CMD $S3PARAM "$file" "$S3PATH/bkp/$rd/"
  done
  w=0
  wt=$(find "$rd" -mindepth 1 -maxdepth 1 -type d | wc -l)
  find "$rd" -mindepth 1 -maxdepth 1 -type d -print0 | sort -z | while read -d '' -r wd; do
    w=$((w + 1))
    echo "|-- 3rd level dir $w/$wt recursively (from dir $r/$rt): $wd/"
    $S3CMD $S3PARAM "$wd/" "$S3PATH/bkp/$wd/"
  done
done

It produces an output similar to:

Backup 1st level dir files in /srv/assets
| 1st level file 1/3: assets/f1.pdf
| 1st level file 2/3: assets/f2 with spaces.pdf
| 1st level file 3/3: assets/f3.pdf
Backup 2nd level dirs in /srv/assets
|- 2nd level dir 1/6: assets/d1/
|- 2nd level dir 2/6: assets/d2/
|- 2nd level dir 3/6: assets/d3/
|- 2nd level dir 4/6: assets/d4/
|-- 3rd level dir 1/8 recursively (from 2nd level dir 4/6): assets/d4/1600001/
|-- 3rd level dir 2/8 recursively (from 2nd level dir 4/6): assets/d4/1600002/
|-- 3rd level dir 3/8 recursively (from 2nd level dir 4/6): assets/d4/1600003/
|-- 3rd level dir 4/8 recursively (from 2nd level dir 4/6): assets/d4/1600004/
|-- 3rd level dir 5/8 recursively (from 2nd level dir 4/6): assets/d4/1610114/
|-- 3rd level dir 6/8 recursively (from 2nd level dir 4/6): assets/d4/1610115/
|-- 3rd level dir 7/8 recursively (from 2nd level dir 4/6): assets/d4/1610116/
|-- 3rd level dir 8/8 recursively (from 2nd level dir 4/6): assets/d4/1610117/
|- 2nd level dir 5/6: assets/d5 with spécial/
|-- 3rd level dir 1/3 recursively (from 2nd level dir 5/6): assets/d5 with spécial/001/
|-- 3rd level dir 2/3 recursively (from 2nd level dir 5/6): assets/d5 with spécial/002/
|-- 3rd level dir 3/3 recursively (from 2nd level dir 5/6): assets/d5 with spécial/003/
|- 2nd level dir 6/6: assets/d6 with space/
|- 2nd level file 1/3 (from 2nd level dir 6/6): assets/d6 with space/f1.pdf
|- 2nd level file 2/3 (from 2nd level dir 6/6): assets/d6 with space/f2 with spaces.pdf
|- 2nd level file 3/3 (from 2nd level dir 6/6): assets/d6 with space/f3.pdf
|-- 3rd level dir 1/3 recursively (from 2nd level dir 6/6): assets/d6 with space/001/
|-- 3rd level dir 2/3 recursively (from 2nd level dir 6/6): assets/d6 with space/002/
|-- 3rd level dir 3/3 recursively (from 2nd level dir 6/6): assets/d6 with space/003/

Not the fastest thing in the world, but as it's now very partitioned, memory usage and processing overhead are marginal. It looks reasonable to leave it running even for a whole day.

The backup is now running with this new script. It started at 2017-06-02T02:01:00-03:00. The most recent log entry as I write this is from 2017-06-02T04:23:58-03:00 saying Backing up 2nd level dir 61106/671234 (from 6/9).

I'll report when it's done.

fviard commented 7 years ago

The good thing is that each job part is now independent. So later you can try to run multiple parts in parallel, with the GNU parallel command for example.

But from what you told me, I think that maybe most of your tree is fairly small and the huge folders are just somewhere in the subtree. If that is the case, an alternative is possible for you. Let's say the only 2 big things in your dataset are:

Root/subdirectory/bigAssets1
Root/subdirectory/bigAssets3

In that case you can run:

s3cmd sync root s3://buck/ --exclude bigAssets1 --exclude bigAssets3

And then:

s3cmd sync root/subdirectory/bigAssets1 s3://buck/subdirectory/
...

In your case it could be interesting to study your dataset, for example by doing du -shc somefolder/* to get the size of subfolders, or find somefolder | wc -l for the number of files.


denydias commented 7 years ago

@fviard, sure! parallel could definitely help here. Right now, I just want to implement the serialized, non-parallel thing so I can see what the raw performance of s3cmd interacting with the S3 API for almost every object would be.

The backup process status now is:

[2017-06-02T15:19:42-03:00] !! Backing up 2nd level dir 339361/671234 (from 6/9)

At that pace, I don't think it's going to cover the whole dataset within the 24h time span. I still need some improvements. You've mentioned s3cmd include/exclude rules. I've tried to figure them out but I couldn't make good use of them. I think they could be handy here.

A typical tree looks like:

$ tree
l1_d1
├── l1_f1.txt
├── l1_f2.txt
├── l1_f3.txt
├── l2_d1
├── l2_d2
│   ├── l2_f1.txt
│   ├── l2_f2.txt
│   ├── l2_f3.txt
│   ├── l3_d1
│   │   ├── l3_f1.txt
│   │   ├── l3_f2.txt
│   │   └── l3_f3.txt
│   ├── l3_d2
│   │   ├── l3_f1.txt
│   │   ├── l3_f2.txt
│   │   └── l3_f3.txt
│   └── l3_d3
│       ├── l3_f1.txt
│       ├── l3_f2.txt
│       └── l3_f3.txt
├── l2_d3
├── l2_d4
│   ├── l3_d1
│   │   └── l3_f1.xml
│   ├── l3_d2
│   │   └── l3_f2.xml
│   ├── l3_d3
│   │   └── l3_f3.xml
│   ├── l3_d4
│   │   └── l3_f4.xml
│   ├── l3_d5
│   │   └── l3_f5.xml
│   ├── l3_d6
│   │   └── l3_f6.xml
│   ├── l3_d7
│   │   └── l3_f7.xml
│   └── l3_d8
│       └── l3_f8.xml
├── l2_d5
└── l2_d6
    ├── l3_d1
    │   ├── l3_f1.txt
    │   ├── l3_f2.txt
    │   └── l3_f3.txt
    ├── l3_d2
    │   ├── l3_f1.txt
    │   ├── l3_f2.txt
    │   └── l3_f3.txt
    └── l3_d3
        ├── l3_f1.txt
        ├── l3_f2.txt
        └── l3_f3.txt

The above example can vary a great deal from one customer to another, but usually the tree is filled following this pattern:

  1. l1 dirs have some l2 dirs and don't usually contain any files.
  2. some l2 dirs have LOTS of l3 dirs and don't usually contain any files.
  3. l3 dirs hold many files and don't usually contain any deeper dirs.

The keyword above is 'some'. I can't predict which l2 dirs are going to end up receiving lots of l3 dirs. In the example above, these are l2_d2, l2_d4 and l2_d6, but it could easily be l2_d1, l2_d3 and l2_d5. These content-packed l2 dirs are the ones presenting OOM challenges to s3cmd, so they must be partitioned.

You've mentioned the include/exclude rules. I tried to use them while implementing the snippet I posted above, but I was not successful. For instance: using filter rules, how do I include only the files in l1_d1 while excluding all dirs at the same level? If I'm able to do that, I can greatly improve the script logic to avoid standalone requests for individual files in l1 and l2.

denydias commented 7 years ago

Well, after days of struggling with this, countless hours of programming and many operations policy revisions, I had to surrender to @mattbillenstein's suggestion to test awscli.

I downloaded the whole dataset to a local VM and did full and incremental backups using awscli. Its performance simply destroys what I saw with s3cmd: awscli did in less than 20 minutes what took s3cmd close to a day, namely the incremental daily sync. No kidding. I've tested this myself, more than once.

As a bonus, awscli does a better job encoding filenames and directories with special characters and spaces.

So, I stand corrected with @mattbillenstein's quote:

Use the right tool for the job.

awscli is the right tool for syncing huge repositories; s3cmd isn't. After using s3cmd since v1.0.1, it's hard to say goodbye. Anyway, I would like to say a big thank you to all the s3cmd devs, especially @fviard.

fviard commented 7 years ago

@denydias Sad to hear, but rarely do things come for free, and so I suspect that "too good to be true" implies some drawbacks. I'm not really sure, but I think aws s3 sync uses the same boto code and core as s3-parallel-put, just without the parallel part. Do you still have the "--delete-removed" behavior, i.e. files removed from the source are removed from your S3 side after the sync?

Btw, regarding your comment on special characters, I'm wondering (because I think I didn't ask): what version of s3cmd are you using? 1.6.1? Since then, the "encoding mode" was changed on MASTER. It probably won't fix your memory issues, but it could still be interesting for you to give it a try on your test node, as a lot has changed since 1.6.1.

denydias commented 7 years ago

@fviard, yes, that performance increase of awscli over s3cmd looks really suspicious - almost too bold to be real. That's why I ran the test dozens of times. All numbers (files and time) were consistent across runs. So even though I had reason to be suspicious, that was gone after the tests.

Yes, I was using --delete-removed with s3cmd. I'm also using --delete with awscli.

Yes, I was using 1.6.1. Although I can take the master branch for a test spin, I can't use it in production - only released software is allowed. Anyway, I'll give it a try in my spare time.

For now I'm just happy I could solve the S3 backup issues we've been having for the last couple of weeks.

EDIT: an automated backup just finished, the first with awscli in place. It took 20m49s on the huge repository and 21s on a small one. I can't argue with those figures.

atodorov commented 7 years ago

First, I would not say that s3cmd has the perfect algorithm; there are a number of things to be improved and a lot of legacy. With just a little time, a contributor might be able to reduce memory use by 5 to 30% in some cases. (There are some points like that on my todo list.) But there are good reasons for some design choices or limitations.

@fviard can you share your list of TODO points about optimization? I think I will have some time to work on this. Also I looked at the code today (local to remote copy in particular) and was trying to figure out how stuff works. Here are a few quick observations:

1) When the local file list is created we also do stat and md5. These md5 hashes are used in the compare_filelists (lines 610-625) and even in the case of new files we keep track of them for deduplication detection purposes. Maybe we can add (yet another) option to disable this so that we don't need stat and md5 of local files and look for them in the remote list only by name?

EDIT: ^^^ I just realized the code actually does this when --no-check-md5 is given. Am I right ?

2) When fetching the remote list md5 and size come from the response. Maybe we can ignore them in some cases, e.g. --no-check-md5 given, just to save some storage space ?

3) --skip-existing in compare_filelists deletes objects from the local and remote lists thus freeing memory. However we can move this if statement into fetch_remote_list and skip the files directly there. For multi-part uploads this will have the benefit that we don't issue another request to Amazon to get the md5. The drawback is that fetch_remote_list will need to receive the local list as parameter.

wedgef5 commented 4 years ago

I know this was only rated as a "Nice to have", but it would be REALLY nice to have for us. We're trying to back up meteorological data cases. Each case can have up to 1M files, and we have dozens of cases. Trying to do it all at once resulted in >50 GB of memory usage JUST creating the local file list. We need to retain the local file metadata (esp. modification time), so s3cmd is what we want to use. I will just have to break the job up. It would be nice if we didn't have to do that!

mattbillenstein commented 4 years ago

I know this was only rated as a "Nice to have", but it would be REALLY nice to have for us.

This issue is ~6 years old - I don't think it's happening - can you try the latest awscli and report back if that works? I think you want 'aws s3 sync ...'

https://docs.aws.amazon.com/cli/latest/userguide/install-cliv2.html