richfitz / remake

Make-like declarative workflows in R

FR: Consider detecting changes by looking at size+timestamp only #116

Open krlmlr opened 8 years ago

krlmlr commented 8 years ago

Use case: a large SQLite database that I'd rather not read in full just to detect that it hasn't changed.

This should be configurable per target and as a global option, or even be made the default. (How often does a file keep its size and mod time when its contents change?)

Also, a file should be considered unchanged (with its cached timestamp information updated) if the hash is unchanged, that is, if the size and contents are the same but the timestamp differs. This could also be implemented as a separate option.
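
A minimal sketch of the check described above, written in Python purely for illustration (it is not remake's R API). `FileRecord`, `hash_file`, and `check_file` are hypothetical names, and SHA-256 stands in for whatever hash would actually be used.

```python
import hashlib
import os
from dataclasses import dataclass


@dataclass
class FileRecord:
    """Hypothetical cache entry; not remake's actual storage format."""
    size: int
    mtime_ns: int
    digest: str


def hash_file(path, chunk_size=1 << 20):
    """Full-content hash, read in chunks so large files are not loaded at once."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def check_file(path, record):
    """Return (changed, record, hashed); `hashed` says whether the file was read."""
    st = os.stat(path)
    if record is not None and (st.st_size, st.st_mtime_ns) == (record.size, record.mtime_ns):
        # Fast path: size and mtime both match, so trust the cached hash.
        return False, record, False
    digest = hash_file(path)
    changed = record is None or digest != record.digest
    # Refresh the cached size/mtime even when the contents are unchanged,
    # so the next check can take the fast path again.
    return changed, FileRecord(st.st_size, st.st_mtime_ns, digest), True
```

With this shape, a file whose mtime was touched but whose contents are identical is read once, reported as unchanged, and then skipped on later checks until its size or mtime changes again.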

richfitz commented 8 years ago

So, at the moment there is a check: option for targets, which can be set to "exists". That would seem to be the sensible place to put this sort of alternative.

It might be worth a global remake option for the maximum size of file to hash before falling back on a faster approach.
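
One way such a size cap could behave, sketched in Python for illustration only. `MAX_HASH_BYTES` and `fingerprint` are hypothetical names and the 100 MiB default is an arbitrary assumption; remake has no such option as far as this thread shows.

```python
import hashlib
import os

# Hypothetical default threshold; not an existing remake option.
MAX_HASH_BYTES = 100 * 1024 * 1024


def fingerprint(path, max_hash_bytes=MAX_HASH_BYTES):
    """Return ("hash", digest) for small files, ("stat", (size, mtime_ns)) otherwise."""
    st = os.stat(path)
    if st.st_size <= max_hash_bytes:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return ("hash", h.hexdigest())
    # Too large to hash cheaply: fall back to metadata only, without reading the file.
    return ("stat", (st.st_size, st.st_mtime_ns))
```

Tagging the result with its kind keeps a stored hash from being compared against a stat tuple if the threshold changes between runs.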

In terms of things to check:

krlmlr commented 8 years ago

http://cs.stackexchange.com/a/19044/4265 mentions that edits to video files are prone to leaving the file size and the head/tail data unchanged. We could make partial/full hashing an option though.
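
For concreteness, a partial hash could look roughly like the following Python sketch (illustration only, not remake code). `partial_hash` and the 64 KiB block size are assumptions; it hashes the file size plus the first and last block.

```python
import hashlib
import os


def partial_hash(path, block=64 * 1024):
    """Hash the file size plus the first and last `block` bytes of the file."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(str(size).encode())
    with open(path, "rb") as f:
        h.update(f.read(block))
        if size > block:
            # Start the tail after the head so the two reads never overlap.
            f.seek(max(size - block, block))
            h.update(f.read(block))
    return h.hexdigest()
```

This reads at most two blocks regardless of file size, but a same-size edit confined to the middle of the file goes undetected, which is why full hashing would need to stay available.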

Nothing should be rebuilt if the mtime changes but the contents do not, but we need to compute the hash (partial/full) to be sure. Also, we want to compute the hash only once, until the mtime changes again.

Does this mean we need to store size, mtime, and hash for each file? We still seem to need only the hash for R objects. How does that interact with storr?

richfitz commented 8 years ago

For files, storr won't care. It's a different matter when storing objects as those will always be hashed on write, but then they're also essentially guaranteed not to be modified by anything outside remake as the storage is opaque.

I think the approach you outline is correct; I'll see if I can work up a test case that demonstrates an undesirable number of reads and with that try to get an approach working that avoids rehashing.

krlmlr commented 8 years ago

Also, xxhash seems to be supported by digest -- could be a faster option once the data is in RAM.

After a successful make() run, no file reads should be necessary at all if we check mtime.