havardgulldahl / jottalib

A library to access files stored at jottacloud.com.
GNU General Public License v3.0
83 stars, 19 forks

Option for only checking modified date and size with jottacloudclientscanner #53

Open ariselseng opened 9 years ago

ariselseng commented 9 years ago

I want to back up a lot of data to jotta with jottacloudclientscanner. Using md5 to check each file every time takes ages to finish and puts a lot of strain on my drives. How hard would it be to make it check only whether size and modification time have changed before checksumming the file, rsync-style?

havardgulldahl commented 9 years ago

hey @cowai! Nice to see you.

We could do that -- although it would have to be an optional thing, and it would come with a big warning.

That's because metadata like size and time are signals, but I consider a checksum more like truth.

There's one thing I would like to hear your opinion on, though. I recently added the possibility to store checksums in the file system, so we don't have to do all the checksum calculations over and over again.

What do you think about that?

ariselseng commented 9 years ago

Hi! This would also work and have the same effect, but it would still be "signals". I mean, how do you determine when to recalculate a checksum without relying on metadata? Locally storing checksums is fine by me; it would double the local I/O, but that is much better than checksumming each time :)

Related question: does jottalib store the file's modification time at jottacloud, or does it just create a new date at upload?

Btw, thank you so much for your work!


havardgulldahl commented 9 years ago

Yeah, you're right, of course. The only way to be certain is to calculate the checksum every time. So, we try to be as close to certain as possible.

What I'm thinking is that by storing the checksum, the last modified time and the file size together, we trust the cached checksum as long as the other two are still correct. If the file size or the last modified time has changed, we recalculate.

What do you think?

Regarding your other question: I guess we'll be able to store mtime at jottacloud. I haven't tried it. Currently, we're storing the time of upload. Look at JFS.post() for details.


ariselseng commented 9 years ago

I guess it would work well. I'm not convinced it's necessary, though. If rsync doesn't do it, it can't be that bad :P But of course it won't hurt either :) One question about that, though: is there ever a situation where a file changes and the mtime doesn't? I can only think of one case, and that is bitrot, and in that case I don't want my broken file to be reuploaded to jotta :P

Just remember that it will be a huge cache for some. We need to store the path, the size and the mtime, which takes around 100-200 bytes per file. In my case that's several hundred megabytes just for a cache that, in the end, is not that much more secure. It will also, like I said earlier, double the disk lookups (depending on whether you store it as plain JSON or some sort of db).
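A quick back-of-the-envelope check of that estimate (the ~2 million file count is the one mentioned later in the thread):

```python
# rough cache-size estimate: path + size + mtime (+ checksum) per file
files = 2_000_000        # roughly the file count mentioned in this thread
bytes_per_entry = 150    # midpoint of the 100-200 byte guess
total_mb = files * bytes_per_entry / 1_000_000
print(total_mb)  # 300.0 -> i.e. a few hundred megabytes of cache
```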

If you are going with that, I would also save a snapshot of the state at jotta (as an option), so that we don't need to look things up remotely to know what to do, but can just push the changes. In my use case I will never use the web interface or some other platform.

What do you think?

havardgulldahl commented 9 years ago

Yeah, I don't disagree with you. So I'm happy to add the option --no-checksum or something like that.

But we'll keep md5 checksumming as the default, because

  1. We keep feature parity with the official client (and we're exactly as safe as they are)
  2. This is backups we're talking about, so we want to err on the side of caution
  3. The implementation would be dependent on a feature (server side mtime) that we don't know much about
havardgulldahl commented 9 years ago

Regarding local md5 cache.

I'm not particularly interested in maintaining a central cache, be it sqlite or a structured, flat json file. Caching is hard, and keeping that cache in sync sounds like a quick way to get in a bad mood.

But take a look at db30d406a04a3689f482e8a2a5e72a2fb889a32a. It's a way of keeping the calculated checksum along with the file itself, using xattr. No central cache. Just some bytes added in the file system, attached to the file.

Of course, you need a file system that supports this. So, it's not for everyone.

I'd appreciate it if you tried it out and let me know your thoughts!

ariselseng commented 9 years ago

About md5 checksumming being always-on: I didn't know the official client did this every time. It makes sense to make ours do the same thing.

xattr seems like a good idea now that I actually know how it works! :)

havardgulldahl commented 9 years ago

Well, I don't think they recalculate the checksum every time. They keep a sqlite db around where they store a lot of metadata:

    CREATE TABLE jwt_fl (
        jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT,
        jwc_name, jwc_path, jwc_hash, jwc_phash, jwc_chksum,
        jwc_size, jwc_created, jwc_modified, jwc_mp, jwc_revision,
        jwc_lastchecked, jwc_err, jwc_nextupload,
        jwc_parentfolder, jwc_folderid
    );

So I reckon they keep using the cached checksum as long as the file size and date still match. But they always compare checksums with the online copy to see if they need to replace it with the local file.
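Using the schema above, the lookup the official client presumably performs can be sketched like this (the row values are invented, and the column meanings are inferred from their names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jwt_fl (jwc_id INTEGER PRIMARY KEY ASC AUTOINCREMENT, "
    "jwc_name, jwc_path, jwc_hash, jwc_phash, jwc_chksum, jwc_size, "
    "jwc_created, jwc_modified, jwc_mp, jwc_revision, jwc_lastchecked, "
    "jwc_err, jwc_nextupload, jwc_parentfolder, jwc_folderid)"
)
# pretend we scanned this file earlier and cached its checksum
conn.execute(
    "INSERT INTO jwt_fl (jwc_path, jwc_chksum, jwc_size, jwc_modified) "
    "VALUES (?, ?, ?, ?)",
    ("/data/report.pdf", "9e107d9d372bb6826bd81d3542a419d6", 1024, 1442044387),
)

def cached_checksum(path, size, mtime):
    """Reuse the stored checksum only while size and mtime still match."""
    row = conn.execute(
        "SELECT jwc_chksum FROM jwt_fl "
        "WHERE jwc_path = ? AND jwc_size = ? AND jwc_modified = ?",
        (path, size, mtime),
    ).fetchone()
    return row[0] if row else None  # None means: recalculate
```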

ariselseng commented 9 years ago

If we are going to implement this option, we need to save the date in xattr too, right? So that we can check whether the size and date in xattr match the actual file's mtime and size?
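One way that could look, sketched with Python's os.getxattr/os.setxattr (Linux-only; the attribute name and value layout here are made up, not what the commit actually uses):

```python
import hashlib
import os

XATTR_KEY = b"user.jottalib.md5"  # hypothetical attribute name

def file_md5(path):
    """Stream the file through md5 to avoid loading it into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def md5_via_xattr(path):
    """Return the md5, trusting a checksum cached in an xattr while
    the stored size and mtime match; recalculate and re-store otherwise."""
    st = os.stat(path)
    stamp = "%d:%d" % (st.st_size, int(st.st_mtime))
    try:
        saved_stamp, saved_md5 = os.getxattr(path, XATTR_KEY).decode().split("|", 1)
        if saved_stamp == stamp:
            return saved_md5
    except OSError:
        pass  # no cached value yet, or the file system lacks xattr support
    digest = file_md5(path)
    try:
        os.setxattr(path, XATTR_KEY, ("%s|%s" % (stamp, digest)).encode())
    except OSError:
        pass  # xattrs unsupported here: just skip the cache
    return digest
```

Because the size/mtime stamp travels with the checksum, a stale cache entry is invalidated by the same comparison that decides whether to recalculate.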

havardgulldahl commented 8 years ago

fixed in https://github.com/havardgulldahl/jottalib/commit/64fdf1e480e85eb9c2d56f38df0d8232da7ce87d

ariselseng commented 8 years ago

@havardgulldahl So now it can check only by mtime/size?

havardgulldahl commented 8 years ago

Hmm I might have been a bit too eager here. ;)

We still have to patch jottacloud.replace_if_changed to only look at mtime if the right argument is passed.

Thanks for paying attention :)

ariselseng commented 8 years ago

@havardgulldahl I will see if I can add that option to check only size and mtime, like the default rsync behaviour. I think that will be a lot faster with thousands of files than looking up xattr for each file. I want to back up 10 TB with ~2 million files without that taking hours each time. Every millisecond counts in my case :)
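For reference, the rsync-style quick check boils down to a stat() comparison; a sketch (hypothetical helper, not the actual jottacloud.replace_if_changed signature):

```python
import os

def quick_check_changed(path, remote_size, remote_mtime, window=1.0):
    """rsync-style test: treat the file as unchanged when size and
    mtime both match, without reading a single byte of file data."""
    st = os.stat(path)
    if st.st_size != remote_size:
        return True
    # allow a small window, since some backends store coarse timestamps
    return abs(st.st_mtime - remote_mtime) > window
```

This never touches file contents, so a scan over millions of files is bounded by directory traversal and stat() calls alone.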