juliangruber / backer

wip distributed backup / file mirroring tool
MIT License

relying on mtimes #9

Closed juliangruber closed 11 years ago

juliangruber commented 11 years ago

Can we rely on mtimes?

If not, we have to rehash the whole sync folder every time backer is started, because we have to reliably determine what files changed. As noted in https://github.com/juliangruber/backer/issues/8 this can be a lot of work.
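For a sense of what "rehash the whole sync folder" means in practice, here is a minimal sketch. It assumes SHA-256 over file contents and synchronous reads; `hashTree` is a hypothetical name for illustration, not backer's actual code.

```javascript
const crypto = require('crypto');
const fs = require('fs');
const path = require('path');

// Hypothetical helper: walk `dir` recursively and return an object mapping
// each file's path (relative to the root) to the SHA-256 of its contents.
function hashTree(dir, base = dir, out = {}) {
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      hashTree(full, base, out);
    } else if (entry.isFile()) {
      // Reads the whole file into memory; a real tool would stream large files.
      const data = fs.readFileSync(full);
      out[path.relative(base, full)] = crypto
        .createHash('sha256')
        .update(data)
        .digest('hex');
    }
  }
  return out;
}
```

Every byte of every file gets read, which is why the cost grows with the size of the folder rather than with the number of changes.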

If we can rely on mtimes, replication can be based on timestamps rather than content hashes.

dominictarr commented 11 years ago

you can set the mtime manually, so you can't really rely on it. maybe you could use it for an optimistic optimisation, though?
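To illustrate the point about setting mtimes manually, here is a small sketch using Node's `fs.utimesSync`. The function name is made up for this example; it rewrites a file and then restores the old timestamps, so an mtime-only check would miss the change.

```javascript
const fs = require('fs');

// Hypothetical demo: change a file's contents, then put the previous
// atime/mtime back so the modification is invisible to an mtime check.
function changeWithoutTouchingMtime(file, newContent) {
  const { atime, mtime } = fs.statSync(file);
  fs.writeFileSync(file, newContent);
  // utimesSync accepts Date objects for both timestamps.
  fs.utimesSync(file, atime, mtime);
}
```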

pipobscure commented 11 years ago

Can't rely on mtimes, but on Mac OS X there is the whole FSEvents thing that allows you to detect state across restarts.

Maybe there are similar OS features elsewhere. Definitely keep change detection modular to enable different solutions on different platforms.

(See https://npmjs.org/package/fsevents, which will be updated to a newer version; that update will add the starting-point feature back in.)

juliangruber commented 11 years ago

According to Wikipedia, FSEvents only tells you the directory in which something changed.

And I really don't want to implement Mac things and Linux things separately, at least not for now, so the solution has to truly work cross-platform.

It seems to me like Dropbox rehashes the whole sync folder on boot, and since people rarely reboot their PCs or restart daemons nowadays, that's not too bad.

juliangruber commented 11 years ago

yeah, no mtimes, rehash everything on boot and keep cached trees in environments where the fs doesn't change without the daemon running.

juliangruber commented 11 years ago

Unless someone comes up with a better solution? Rehashing 20GB of data doesn't sound like the best idea to me.

juliangruber commented 11 years ago

argument by @hij1nx: if you set your mtime manually to an older timestamp, it's your fault. counterargument: your clock might have been off for some time, e.g. while you were changing time zones.

pipobscure commented 11 years ago

Wikipedia is wrong. FSEvents usually tells you the actual file. Only if it can't for some reason (e.g. too many simultaneous changes, an entire tree being moved, ...) will it tell you to rescan a folder. In that case it tells you the deepest folder you need to rescan to get all changes. The whole thing was designed for this very purpose (think Time Machine).

I agree that we should keep as many things as possible identical across systems/platforms. At the same time I think we would be smart to modularize the system as much as possible.

See #10 for more thoughts on modularity.

dominictarr commented 11 years ago

there is fs.watchFile, this works pretty well in my experience. @juliangruber how often do you restart your computer, anyway?
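A rough sketch of the `fs.watchFile` approach, for reference. The wrapper name and the one-second polling interval are assumptions for the example. Note that `fs.watchFile` polls with `stat`, so it only notices changes while the process is running.

```javascript
const fs = require('fs');

// Hypothetical wrapper: poll `file` and call `onChange(curr, prev)` whenever
// its mtime moves. Returns a function that stops the polling again.
function watchForChanges(file, onChange, intervalMs = 1000) {
  const listener = (curr, prev) => {
    if (curr.mtimeMs !== prev.mtimeMs) onChange(curr, prev);
  };
  fs.watchFile(file, { interval: intervalMs }, listener);
  return () => fs.unwatchFile(file, listener);
}
```

Usage would be something like `const stop = watchForChanges('notes.txt', () => console.log('changed'))`, with `stop()` called on shutdown so the poller doesn't keep the process alive.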

If the mtime hasn't changed, then you could assume the file hasn't. I think that is reasonable.

you could persist that - but even in memory, how often do you restart your computer, these days?

juliangruber commented 11 years ago

My last restart was a month ago. OK, so restarts happen so rarely that rebuilding the tree on each restart is fine.

I just committed a file hash function which uses a combination of mtime and filesize, as rsync does, so rebuilding the tree really shouldn't take that long. Plus it's safer than just mtimes.
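Roughly, the idea looks like the sketch below. This is an illustration of the mtime-plus-size check, not the committed function; `quickSignature`, `changedFiles`, and the cache shape are hypothetical.

```javascript
const fs = require('fs');

// Hypothetical cheap fingerprint: file size and mtime only, no content read.
function quickSignature(file) {
  const stat = fs.statSync(file);
  return `${stat.size}-${stat.mtimeMs}`;
}

// Return the files whose fingerprint differs from the cached one.
// `cache` is a plain object mapping path -> signature from the previous run;
// files missing from the cache count as changed.
function changedFiles(files, cache) {
  return files.filter((file) => quickSignature(file) !== cache[file]);
}
```

Only a `stat` per file is needed, so a rebuild touches metadata rather than contents. A change that keeps both size and mtime identical would still slip through.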

I don't see a need for persisting that tree; if it were persisted, it would only be as a side effect of it not fitting into memory.

juliangruber commented 11 years ago

Oh, well, you need to persist that tree in order to have one ready when a new peer connects or an old one reconnects.

dominictarr commented 11 years ago

hmm, isn't this gonna be a service that is running in the background? if it was you could keep the tree in memory.

juliangruber commented 11 years ago

only persist it in a db if it's too big for memory