This PR updates file dependencies in the doit database even if the task is already up to date. The change improves performance for large file dependencies whose timestamp changes while their content stays the same.
Consider the following task which simply copies large_file.txt to output.txt.
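The task definition itself did not survive in this description, so the following is only a minimal sketch of what such a dodo.py might look like. The task name `copy` matches the output shown in the session; the use of `cp` as the action is an assumption.

```python
# dodo.py -- hypothetical reconstruction, not the original task definition.

def task_copy():
    """Copy large_file.txt to output.txt."""
    return {
        # Assumption: the copy is done with a plain shell command.
        "actions": ["cp large_file.txt output.txt"],
        # doit tracks timestamp, size, and md5 for each file dependency.
        "file_dep": ["large_file.txt"],
        "targets": ["output.txt"],
    }
```

With `large_file.txt` declared as a `file_dep`, doit decides whether the task is up to date by comparing the file's current state against the state saved in `.doit.db`.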
The first time doit runs, it saves the timestamp, size, and md5 hash. On the second run, doit smartly skips calculating the md5 hash of large_file.txt because the timestamps match. So far so good.
Now suppose the timestamp changes but the content does not. This might happen if we delete an intermediate file which is then regenerated. On the next run, doit will compute the md5 hash of large_file.txt and skip the task because it's up to date, as expected. But it won't update the timestamp in the database, so the saved timestamp never matches again and every subsequent run recomputes the md5 hash of large_file.txt.
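The effect can be reproduced with a small standalone simulation. This is a toy model, not doit's actual implementation: `simulate`, `refresh_state`, and `hash_calls` are hypothetical names, and the saved state is kept in a local variable rather than in `.doit.db`.

```python
import hashlib
import os
import tempfile

hash_calls = 0  # how many times the expensive hash was computed


def get_file_md5(path):
    global hash_calls
    hash_calls += 1
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def check_modified(path, state):
    """Same three-step check: timestamp, then size, then md5."""
    timestamp, size, file_md5 = state
    st = os.stat(path)
    if st.st_mtime == timestamp:
        return False
    if st.st_size != size:
        return True
    return file_md5 != get_file_md5(path)


def simulate(refresh_state, runs=3):
    """Count hash computations over `runs` runs after a touch-like change."""
    global hash_calls
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "large_file.txt")
        with open(path, "wb") as f:
            f.write(b"x" * 1024)
        st = os.stat(path)
        state = (st.st_mtime, st.st_size, get_file_md5(path))
        # Like `touch`: change the timestamp, leave the content alone.
        os.utime(path, (st.st_atime + 60, st.st_mtime + 60))
        hash_calls = 0
        for _ in range(runs):
            modified = check_modified(path, state)
            if not modified and refresh_state:
                # Save the new timestamp even though the task is up to date.
                new = os.stat(path)
                state = (new.st_mtime, new.st_size, state[2])
        return hash_calls
```

Under these assumptions, `simulate(refresh_state=False)` hashes the file on every run, while `simulate(refresh_state=True)` hashes it only once, because the refreshed timestamp lets later runs return at the first check.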
This PR ensures the file dependencies are updated in the database even if the task is already up to date. Here's a concrete example using touch to update the timestamp. I've modified the check_modified function to report some debugging information (see end of description for details).
$ (master) rm -f .doit.db # Start clean.
$ (master) doit
. copy
$ (master) doit
-- copy
$ (master) touch large_file.txt
$ (master) doit
large_file.txt was modified at 15:53:09.664308; expected 15:51:36.076443
-- copy
$ (master) doit # Evaluates md5 hash again (and will indefinitely).
large_file.txt was modified at 15:53:09.664308; expected 15:51:36.076443
-- copy
$ (check_modified) rm -f .doit.db # Start clean.
$ (check_modified) doit
. copy
$ (check_modified) doit
-- copy
$ (check_modified) touch large_file.txt
$ (check_modified) doit
large_file.txt was modified at 15:51:36.076443; expected 15:49:30.170537
-- copy
$ (check_modified) doit # Does not evaluate md5 hash again (updated timestamp saved in previous run).
-- copy
Updated check_modified to report debug information.
def check_modified(self, file_path, file_stat, state):
    """
    Check if file in file_path is modified from previous "state".
    """
    timestamp, size, file_md5 = state
    # 1 - if timestamp is not modified file is the same
    if file_stat.st_mtime == timestamp:
        return False
    from datetime import datetime
    print(f"{file_path} was modified at {datetime.fromtimestamp(file_stat.st_mtime).time()}; "
          f"expected {datetime.fromtimestamp(timestamp).time()}")
    # 2 - if size is different file is modified
    if file_stat.st_size != size:
        return True
    # 3 - check md5
    return file_md5 != get_file_md5(file_path)