scholarsportal / SwiftBulkUploader

Script that reads through an entire directory, records the paths of all files in a MySQL database, and uploads all the files. Recording the paths lets the upload be cancelled and later resumed, which was necessary for directories with millions of files.

Multiprocess Option #10

Open Petahhh opened 9 years ago

Petahhh commented 9 years ago

For directories with 20+ million files, multiple processes will be necessary to complete the upload within a reasonable amount of time.

cudevmaxwell commented 9 years ago

This is the big difference between the two versions, and I'm still not sure which is the 'better' option.

Using a multiprocessing pool to create workers who use the raw swift library calls might be more efficient, but not as 'safe'. We could reuse the code from before the PR.

Using subcommand to fire off batches of 'swift upload...' shell commands is less efficient, but 'safer'.

The latest PR was a big downgrade in terms of speed. I definitely want to fix that.

I'm willing to try replacing subcommand with http://docs.openstack.org/developer/python-swiftclient/swiftclient.html#module-swiftclient.multithreading or our own multiprocess pool, to see whether we can get the upload speed up while still doing md5sum checks.
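For reference, the md5sum check mentioned here could look roughly like the sketch below. It assumes the remote checksum is already available as a string, since Swift reports the MD5 hex digest as the ETag of a regular, non-segmented object. The helper names `md5_hexdigest` and `upload_verified` are hypothetical and do not come from this repository.

```python
import hashlib


def md5_hexdigest(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a local file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def upload_verified(path, remote_etag):
    """Compare the local digest with the ETag reported for the uploaded object.

    Assumption: remote_etag is the MD5 of the object body, which holds for
    regular objects but not for segmented (large) objects.
    """
    return md5_hexdigest(path) == remote_etag.strip('"').lower()
```

Reading in chunks keeps memory use flat, although for the 1-2 KB files discussed in this thread the cost of the check should be small next to the per-request overhead.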

cudevmaxwell commented 9 years ago

While doing the large batch upload for #9, it's become pretty clear that a single thread calling 'swift upload' millions of times isn't going to cut it. Way too slow. Even if we don't call the library directly, and continue to use subcommand, we'll need a multiprocess pool to call 'swift upload' in parallel. We can use existing code in bulkupload.py.
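A minimal sketch of what calling 'swift upload' in parallel could look like, not the code in bulkupload.py. It assumes the `swift` CLI is on the PATH and that its credentials are already set in the environment. It uses a thread pool rather than a multiprocessing pool, because each worker only waits on its own child `swift` process, so the parallelism still comes from separate processes. The function names, batch size, and worker count are all hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def chunked(paths, batch_size):
    """Yield successive batches of at most batch_size paths."""
    for start in range(0, len(paths), batch_size):
        yield paths[start:start + batch_size]


def swift_upload_command(container, batch):
    """Build one 'swift upload <container> <file> [<file> ...]' command."""
    return ["swift", "upload", container, *batch]


def run_batch(command):
    """Run one command to completion and return (command, exit code)."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return command, completed.returncode


def upload_in_parallel(container, paths, batch_size=100, workers=8,
                       build_command=swift_upload_command):
    """Run one command per batch, several at a time.

    Returns the commands that exited non-zero so the caller can retry them.
    """
    commands = [build_command(container, batch)
                for batch in chunked(paths, batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_batch, commands))
    return [command for command, code in results if code != 0]
```

Passing several files to each 'swift upload' call is one way to avoid paying process start-up cost per file. Whether that or a higher worker count helps more would need to be measured.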

cudevmaxwell commented 9 years ago

On my test VM, uploading 1,000,000 1-2 KB files to a container using swift upload took 838m28.226s (about 14 hours). We should aim for this tool's run time to be no more than 110% of that.
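To make the target concrete, the arithmetic can be worked out as below. This reads "within 110%" as a run time of at most 1.10 times the baseline, which is an assumption about the intent. The helper name `parse_time_real` is hypothetical.

```python
def parse_time_real(value):
    """Convert a shell 'time' string such as '838m28.226s' to seconds."""
    minutes, seconds = value.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


files = 1_000_000
baseline_seconds = parse_time_real("838m28.226s")

# Average throughput of the plain 'swift upload' baseline.
files_per_second = files / baseline_seconds

# Assumed reading of the goal: at most 10% slower than the baseline.
budget_seconds = baseline_seconds * 1.10

print(f"baseline: {baseline_seconds:.3f} s ({baseline_seconds / 3600:.2f} h)")
print(f"throughput: {files_per_second:.2f} files/s")
print(f"110% budget: {int(budget_seconds // 60)}m{budget_seconds % 60:.3f}s")
```

That works out to a baseline of roughly 19.9 files per second and a budget of about 922 minutes for the same 1,000,000 files.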