Open Petahhh opened 9 years ago
This is the big difference between the two versions, and I'm still not sure which is the 'better' option.
Using a multiprocessing pool to create workers that call the swift library directly might be more efficient, but it's not as 'safe'. We could reuse the code from before the PR.
Using subcommand to fire off batches of 'swift upload...' shell commands is less efficient, but 'safer'.
The latest PR was a big downgrade in terms of speed. I definitely want to fix that.
I'm willing to try replacing subcommand with http://docs.openstack.org/developer/python-swiftclient/swiftclient.html#module-swiftclient.multithreading or our own multiprocess pool, to see if we can get the upload speed up while still doing md5sum checks.
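For reference, here is a minimal sketch of what the md5sum check could look like on the client side, independent of how the upload itself is done. It assumes the object is a plain (non-segmented) Swift object, where the ETag is the MD5 hex digest of the content; segmented large objects use a different ETag and would need separate handling. The function names `md5_of_file` and `etag_matches` are made up for illustration and are not from this repo.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a local file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files are never held in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def etag_matches(path, etag):
    """Compare a local file against an ETag string reported by Swift.

    Assumption: plain object, so the ETag is the content MD5.
    Some responses wrap the ETag in double quotes, so strip them.
    """
    return md5_of_file(path) == etag.strip('"').lower()
```

Hashing is cheap next to the network round trip for 1-2 kb files, so the check itself shouldn't be what limits throughput.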
While doing the large batch upload for #9, it's become pretty clear that a single thread calling 'swift upload' millions of times isn't going to cut it. Way too slow. Even if we don't call the library directly, and continue to use subcommand, we'll need a multiprocess pool to call 'swift upload' in parallel. We can use existing code in bulkupload.py.
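A rough sketch of the "pool calling 'swift upload' in parallel" idea, not the code in bulkupload.py. It assumes the `swift` CLI is on the PATH and that auth is already set up in the environment. It also swaps in a thread pool (`multiprocessing.pool.ThreadPool`) instead of a process pool: each task already runs in its own child process, so threads are enough to dispatch them. The function names, batch size, and worker count are placeholders.

```python
import subprocess
from multiprocessing.pool import ThreadPool


def chunked(items, size):
    """Yield consecutive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_command(container, batch):
    """Build one 'swift upload <container> <file> [<file> ...]' command."""
    return ["swift", "upload", container] + list(batch)


def run_batch(command):
    """Run one command and return its exit code (0 means success)."""
    return subprocess.call(command)


def upload_in_parallel(container, paths, batch_size=100, workers=8,
                       runner=run_batch):
    """Run batched upload commands concurrently.

    Returns one result per batch, in batch order. `runner` is
    injectable so the batching can be tested without a Swift cluster.
    """
    commands = [build_command(container, batch)
                for batch in chunked(list(paths), batch_size)]
    pool = ThreadPool(workers)
    try:
        return pool.map(runner, commands)
    finally:
        pool.close()
        pool.join()
```

Any batch with a non-zero exit code could be retried or logged. Passing several files per command also saves the cost of starting a new `swift` process for every file.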
On my test VM, uploading 1,000,000 1-2 KB files to a container using swift upload took 838m28.226s. We should aim for this tool's runtime to fall within 110% of that.
For directories with 20+ million files, multiple processes will be necessary to complete the upload within a reasonable amount of time.
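To put numbers on this, here is a small back-of-the-envelope calculation from the benchmark above. It makes two assumptions that aren't stated in the thread: "within 110%" is read as "at most 10% slower than the baseline", and upload time scales linearly with file count.

```python
import re


def parse_time_output(text):
    """Convert a shell `time` value such as '838m28.226s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError("unrecognised duration: %r" % text)
    return int(match.group(1)) * 60 + float(match.group(2))


BASELINE_SECONDS = parse_time_output("838m28.226s")
BASELINE_FILES = 1000000


def files_per_second():
    """Throughput of the baseline run."""
    return BASELINE_FILES / BASELINE_SECONDS


def target_seconds(factor=1.10):
    """Time budget for the tool, read as at most 110% of the baseline."""
    return BASELINE_SECONDS * factor


def projected_days(n_files):
    """Single-process time for n_files, assuming linear scaling."""
    return n_files / files_per_second() / 86400.0


if __name__ == "__main__":
    print("baseline rate: %.1f files/s" % files_per_second())
    print("110%% budget: %.1f minutes" % (target_seconds() / 60))
    print("20M files, one process: %.1f days" % projected_days(20000000))
```

Under those assumptions the baseline is roughly 20 files per second, the budget for 1,000,000 files is about 922 minutes, and 20 million files would take a single process more than 11 days.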