s3tools / s3cmd

Official s3cmd repo -- Command line tool for managing S3 compatible storage services (including Amazon S3 and CloudFront).
https://s3tools.org/s3cmd
GNU General Public License v2.0

MD5 cache file not updated after each calculated hash #641

Open ramonsmits opened 8 years ago

ramonsmits commented 8 years ago

I'm currently running an s3cmd sync operation on my laptop from S3 to my NAS. The operation has been running for hours because the MD5 hashes need to be generated. I added the --cache-file option so that the MD5 hashes will be stored.

However, I just looked at the folder and I don't see that file. Does that mean that the cache file is only flushed to storage at the end of the sync operation?

Why isn't the cache file flushed after each calculated MD5? I cannot stop the current operation, as all calculated MD5 hashes would then be lost.

I'm running the operation with --verbose, and after almost 11 hours it says 9000/40244, so it is now at about 22%.

mdomsch commented 8 years ago

Yes, the cache is stored as a Python pickle at the end of the local directory walk, not at the end of the sync operation. It depends which direction you are syncing: if syncing local to remote, the local file list is read first, the cache is saved, and then the remote file list is read. If syncing remote to local, the remote list is read first, then the local list is read and the cache is saved. Rewriting the pickle after each file is read would be crazy. It could be stored in a different format, one more suitable for appending, I suppose, but pickles were the easy choice and have worked so far.
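
In rough terms, the idea is something like this (just a sketch with made-up names, not the actual s3cmd code):

```python
# Sketch of a pickle-based MD5 cache (hypothetical names, not s3cmd's code).
import hashlib
import os
import pickle

class Md5Cache:
    def __init__(self, path):
        self.path = path
        self.entries = {}  # filename -> (mtime, size, md5)

    def load(self):
        # Read the whole cache back in one go, if it exists.
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.entries = pickle.load(f)

    def save(self):
        # Written once, after the local directory walk -- not per file.
        with open(self.path, "wb") as f:
            pickle.dump(self.entries, f)

    def md5(self, filename):
        # Reuse the cached hash if mtime and size still match.
        st = os.stat(filename)
        cached = self.entries.get(filename)
        if cached and cached[:2] == (st.st_mtime, st.st_size):
            return cached[2]
        h = hashlib.md5()
        with open(filename, "rb") as f:
            for chunk in iter(lambda: f.read(1024 * 1024), b""):
                h.update(chunk)
        self.entries[filename] = (st.st_mtime, st.st_size, h.hexdigest())
        return h.hexdigest()
```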

ramonsmits commented 8 years ago

I'm not familiar with Python, but is a pickle a sort of hash table or dictionary? Yes, it would be weird to save that to disk after each file.

Is a graceful abort possible that still writes the pickle to disk, in case you want to reboot?
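
For example, something along these lines would already help (just a sketch; it assumes a cache object with a save() method, like the one sketched above, not anything from s3cmd itself):

```python
# Sketch: flush the cache on Ctrl-C so an aborted run keeps its hashes.
# Assumes a cache object with a save() method (hypothetical, see above).
import signal
import sys

def install_graceful_abort(cache):
    def handler(signum, frame):
        print("Interrupted -- flushing MD5 cache before exit...")
        cache.save()
        sys.exit(1)
    signal.signal(signal.SIGINT, handler)   # Ctrl-C
    signal.signal(signal.SIGTERM, handler)  # kill <pid>
```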

A solution would be a log file that gets appended to. At start you load the pickle, then the log if it exists, and update the pickle with the log entries.

After the file scan, store the pickle and delete the log file.
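
Roughly what I have in mind (a sketch with made-up file names, nothing from s3cmd itself):

```python
# Sketch of the log-file idea: append each newly computed hash to a plain-text
# log, and fold the log back into the pickle on the next start.
import os
import pickle

LOG_PATH = "md5-cache.log"        # hypothetical file names
PICKLE_PATH = "md5-cache.pickle"

def record_hash(filename, md5):
    # Cheap per-file append; survives an aborted run.
    with open(LOG_PATH, "a") as log:
        log.write("%s  %s\n" % (md5, filename))

def load_cache():
    entries = {}  # filename -> md5
    if os.path.exists(PICKLE_PATH):
        with open(PICKLE_PATH, "rb") as f:
            entries = pickle.load(f)
    # Replay the log from a previous, possibly aborted run.
    if os.path.exists(LOG_PATH):
        with open(LOG_PATH) as log:
            for line in log:
                md5, filename = line.rstrip("\n").split("  ", 1)
                entries[filename] = md5
    return entries

def save_cache(entries):
    # After the file scan: store the pickle, then delete the log.
    with open(PICKLE_PATH, "wb") as f:
        pickle.dump(entries, f)
    if os.path.exists(LOG_PATH):
        os.remove(LOG_PATH)
```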

Another option would be to store the pickle at a given interval, so that if the operation is aborted only the work of one interval is lost.
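
That could reuse the cache sketch from the earlier comment (SAVE_EVERY is just an example value):

```python
# Sketch: save the pickle every N hashed files, so an abort loses at most
# one interval of work. Assumes the Md5Cache sketch from above.
SAVE_EVERY = 500

def hash_all(cache, filenames):
    for i, name in enumerate(filenames, 1):
        cache.md5(name)          # compute or reuse the hash
        if i % SAVE_EVERY == 0:
            cache.save()         # periodic flush
    cache.save()                 # final flush after the scan
```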

I prefer the first.

This explains why people mention that the MD5 cache is not working. I'm now syncing 40,000+ files, and if I quit the terminal all MD5 hash data is gone.

Let me dive into Python; maybe I can contribute to s3cmd.

ramonsmits commented 8 years ago

Today my Wi-Fi connection failed and the sync quit. No MD5 hashes were flushed to disk at all.

Also, having a binary file that is written only at the end of the batch makes it impossible for multiple invocations to share the same cache file.

An alternative would be to create a .md5 file for each file.

Or use a file per folder, perhaps in the same text format as md5sum.

Or use a single file per tree.
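
For example, a per-folder file in md5sum's text format could look like this (a sketch; the file name is made up):

```python
# Sketch: one md5sum-style text file per folder, e.g. ".s3cmd-md5sums"
# (hypothetical name), which the standard md5sum -c tool could also read.
import os

CACHE_NAME = ".s3cmd-md5sums"  # hypothetical per-folder cache file

def read_folder_cache(folder):
    entries = {}  # filename -> md5
    path = os.path.join(folder, CACHE_NAME)
    if os.path.exists(path):
        with open(path) as f:
            for line in f:
                # md5sum text format: "<md5>  <filename>"
                md5, name = line.rstrip("\n").split("  ", 1)
                entries[name] = md5
    return entries

def write_folder_cache(folder, entries):
    path = os.path.join(folder, CACHE_NAME)
    with open(path, "w") as f:
        for name, md5 in sorted(entries.items()):
            f.write("%s  %s\n" % (md5, name))
```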

d4v3y0rk commented 3 years ago

Wow, this does not seem to have gotten any love in a long time. Was there ever a resolution? I am currently facing the same issue: every time I run the sync command it has to calculate 40k MD5 hashes...

rchavez-neu commented 1 year ago

Same issue here. When syncing 8,000 files, s3cmd seems to generate an MD5 every time and doesn't save the MD5 results to a cache on local disk (s3cmd version 2.3.0). It makes for really long sync times.

Does anyone have any ideas?