s3tools / s3cmd

Official s3cmd repo -- Command line tool for managing S3 compatible storage services (including Amazon S3 and CloudFront).
https://s3tools.org/s3cmd
GNU General Public License v2.0

s3cmd hangs with wildcard argument for hundreds of files #902

Open alexandrnikitin opened 7 years ago

alexandrnikitin commented 7 years ago

I have a few hundred files I want to upload to S3. I specify a wildcard as the file argument, but s3cmd hangs in my case.

prod [root@stagetrainworker shared-profiles]# ls -lh
total 76G
-rw-r--r-- 1 root root 188M Jul 24 13:58 file_000000.gz
-rw-r--r-- 1 root root 188M Jul 24 14:00 file_000001.gz
-rw-r--r-- 1 root root 188M Jul 24 14:02 file_000002.gz
.......
-rw-r--r-- 1 root root 179M Jul 25 01:19 file_000428.gz

prod [root@stagetrainworker shared-profiles]# s3cmd put file_* s3://bucket-rpp-dev-permanent/alex/shared-profiles/

BTW, it works for a dozen files.

fviard commented 7 years ago

Can you tell me what version of s3cmd you are using, and on what platform? Also, running the command with the -d flag enables debug output and can give you more insight into what is going on.

alexandrnikitin commented 7 years ago

Basically, it does start uploading, but only after 30 minutes or so. There is plenty of RAM in the server, disk I/O is reasonably fast, and the upload speed varies from 10 MB/s to 30 MB/s.

s3cmd version 2.0.0
Python 2.7.5
prod [root@stagetrainworker shared-profiles]# s3cmd put file_000* s3://adform-rpp-dev-permanent/alex/shared-profiles-temp/ -d -n &> log
prod [root@stagetrainworker shared-profiles]# cat log
DEBUG: s3cmd version 2.0.0
...
DEBUG: Unicodising 'put' using UTF-8
DEBUG: Unicodising 'file_000000.gz' using UTF-8
DEBUG: Unicodising 'file_000001.gz' using UTF-8
...
DEBUG: Unicodising 'file_000428.gz' using UTF-8
DEBUG: Unicodising 's3://adform-rpp-dev-permanent/alex/shared-profiles-temp/' using UTF-8
DEBUG: Command: put
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
...
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
INFO: Compiling list of local files...
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: Unicodising 'file_000000.gz' using UTF-8
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: Unicodising '' using UTF-8
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: Unicodising 'file_000000.gz' using UTF-8
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
INFO: Compiling list of local files...
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
DEBUG: Unicodising 'file_000001.gz' using UTF-8
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
DEBUG: Unicodising '' using UTF-8
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
DEBUG: Unicodising 'file_000001.gz' using UTF-8
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
...
INFO: Compiling list of local files...
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
DEBUG: Unicodising 'file_000428.gz' using UTF-8
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
DEBUG: Unicodising '' using UTF-8
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
DEBUG: Unicodising 'file_000428.gz' using UTF-8
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
DEBUG: Applying --exclude/--include
DEBUG: CHECK: file_000000.gz
DEBUG: PASS: u'file_000000.gz'
DEBUG: CHECK: file_000001.gz
DEBUG: PASS: u'file_000001.gz'
...
DEBUG: CHECK: file_000428.gz
DEBUG: PASS: u'file_000428.gz'
INFO: Running stat() and reading/calculating MD5 values on 429 files, this may take some time...
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: doing file I/O to read md5 of file_000000.gz
DEBUG: DeUnicodising u'file_000000.gz' using UTF-8
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
DEBUG: doing file I/O to read md5 of file_000001.gz
DEBUG: DeUnicodising u'file_000001.gz' using UTF-8
...
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
DEBUG: doing file I/O to read md5 of file_000428.gz
DEBUG: DeUnicodising u'file_000428.gz' using UTF-8
INFO: Summary: 429 local files to upload
upload: 'file_000000.gz' -> 's3://adform-rpp-dev-permanent/alex/shared-profiles-temp/file_000000.gz'
upload: 'file_000001.gz' -> 's3://adform-rpp-dev-permanent/alex/shared-profiles-temp/file_000001.gz'
...
upload: 'file_000428.gz' -> 's3://adform-rpp-dev-permanent/alex/shared-profiles-temp/file_000428.gz'
WARNING: Exiting now because of --dry-run

Full log file in the gist: https://gist.github.com/alexandrnikitin/da7c5ed2a66a7b4a340c32d0d06f8821

alexandrnikitin commented 7 years ago

An excerpt of the profiling results, using python -m cProfile /usr/bin/s3cmd ... The full log is in the gist.

         5549592 function calls (5549195 primitive calls) in 1101.296 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
...
   429    2.915    0.007 1100.503    2.565 Utils.py:256(hash_file_md5)
...
  2479089  912.556    0.000  912.556    0.000 {method 'read' of 'file' objects}
...
  2478659  184.972    0.000  184.972    0.000 {method 'update' of '_hashlib.HASH' objects}

Update: note that this profile is from a dry run.
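
The breakdown is consistent with the delay simply being the cost of reading the data: of the ~1100 seconds, ~912 are spent in file read() calls and ~185 in MD5 updates, and 76 GB at the implied ~70 MB/s works out to roughly 18 minutes, which is in the same ballpark as the observed delay. A rough way to confirm this outside s3cmd (assuming GNU coreutils md5sum is available on the box) is to time the hashing on its own:

time md5sum file_000*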

fviard commented 7 years ago

You see no issue when you just do the dry run? The interesting thing is to use the "-d" flag when you do the real operation, so that you can see exactly at which step it is hanging.

alexandrnikitin commented 7 years ago

It hangs in both cases for ~20 minutes. During the real operation it hangs for 20 minutes with no output (without -d, of course) and then starts uploading. Aside from the unstable speed, the uploading itself looks OK to me.

fviard commented 7 years ago

OK, so I think that your issue is just a lack of feedback. First, you can try running the command with the "--progress" flag; that should give you real-time information for the actual upload (but maybe not for the MD5 generation step).

Otherwise, for your case, you should probably run the command with the "-v" flag to display basic info; s3cmd is designed not to be too verbose by default. Please tell me if that is enough to address your issue.
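
For illustration, with the same bucket and prefix as in the original command, that would be something like:

s3cmd put -v --progress file_000* s3://bucket-rpp-dev-permanent/alex/shared-profiles/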

alexandrnikitin commented 7 years ago

Hmm, not quite... I would expect it to run its stages concurrently: reading files, calculating hashes, and uploading (with a single progress report). That would be far more efficient. But I understand that this is more a design change than a bug fix, and not an easy one to make.

The "--progress" flag acts the same, no output in the beginning. The "-v" flag gives some feedback:

...
INFO: Compiling list of local files...
INFO: Compiling list of local files...
INFO: Running stat() and reading/calculating MD5 values on 429 files, this may take some time...

And it hangs after that, but at least it warns 😀

fviard commented 7 years ago

Maybe progress info could be added for this stage, as it can indeed take some time.

As for the parallel work, that is indeed missing, so it would be a feature request. Currently there is some parallelism for operations on "multiple destinations", but I guess that case is very rare.

Adding parallel operations is on the todo list, and I have some ideas and experiments, but some details are still tricky for a clean implementation, for example how to display/report progress when multiple files are uploaded at the same time.
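
In the meantime, a possible workaround outside of s3cmd (a sketch assuming GNU xargs and independent per-file uploads, with the interleaved-output problem mentioned above left unsolved) is to drive several s3cmd processes in parallel:

printf '%s\n' file_000* | xargs -P 4 -I{} s3cmd put {} s3://bucket-rpp-dev-permanent/alex/shared-profiles/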

cgough commented 6 years ago

I am seeing the same issue... attempting to s3cmd sync a directory of flat text files to S3.

The job has been running for 20 minutes or so; all I see on the tmux console is:

DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-10.gz' using UTF-8
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-11.gz' using UTF-8
DEBUG: doing file I/O to read md5 of api_access.log.2017-06-11.gz
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-11.gz' using UTF-8
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-12.gz' using UTF-8
DEBUG: doing file I/O to read md5 of api_access.log.2017-06-12.gz
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-12.gz' using UTF-8
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-13.gz' using UTF-8
DEBUG: doing file I/O to read md5 of api_access.log.2017-06-13.gz
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-13.gz' using UTF-8
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-14.gz' using UTF-8
DEBUG: doing file I/O to read md5 of api_access.log.2017-06-14.gz
DEBUG: DeUnicodising u'/data/logs/application/api_access.log.2017-06-14.gz' using UTF-8

...progress is very, very slow. I see no files uploaded.

s3cmd sync /data/logs/application/ s3://backup.production.mycompany.com/logs/ -v -d --multipart-chunk-size=15 --progress

I am using the latest s3cmd, installed today:

[admin004.prod:/data/logs/application] cgough% pip list | grep s3cmd
DEPRECATION: The default format will switch to columns in the future. You can use --format=(legacy|columns) (or define a format=(legacy|columns) in your pip.conf under the [list] section) to disable this warning.
s3cmd (2.0.1)

I have no problem uploading individual files.

fviard commented 6 years ago

As for the previous reporter, the slow part before the upload is when s3cmd has to calculate the MD5 sum of the local files and then compare them with the files on the S3 side to know what to upload.

In your case, there are two things you can do to speed up the task:

1) Use --no-check-md5: do not check MD5 sums when comparing files for [sync]; only sizes will be compared. This may significantly speed up the transfer, but may also miss some changed files.

2) Use --cache-file=FILE: cache the calculated MD5 values of local files in FILE, so that s3cmd does not have to recalculate the MD5 sums of known unmodified local files on subsequent runs.
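
Applied to the sync command from the previous comment, the two options would look something like this (the cache file path is arbitrary); the first skips MD5 comparison entirely, while the second keeps it but avoids recomputing sums for unchanged files on later runs:

s3cmd sync --no-check-md5 /data/logs/application/ s3://backup.production.mycompany.com/logs/
s3cmd sync --cache-file=/var/tmp/s3cmd-md5.cache /data/logs/application/ s3://backup.production.mycompany.com/logs/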
