Closed tclose closed 3 months ago
Attention: Patch coverage is 99.04762% with 1 line in your changes missing coverage. Please review.

Project coverage is 84.22%. Comparing base (ff01e4c) to head (921979c).
Files | Patch % | Lines |
---|---|---|
pydra/utils/hash.py | 98.97% | 1 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.
AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location, which was set to a temporary folder unless overridden with the `cache_dir` argument.

Each task's cache folder was composed of its name and a hash value, the latter computed from the task's input field values. The cache would be re-used if the name and hash value did not change and the same cache folder was specified as `cache_dir`. Otherwise, everything would be recomputed in a new cache location somewhere in temp.
Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?
> I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

No worries, I just put you all down as reviewers so you were notified. Don't feel like you need to contribute if the area isn't familiar.
> AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the `cache_dir` argument. Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be re-used, if the name and hash values did not change, and the same cache folder was specified as `cache_dir`. Otherwise, it would recompute everything in a new cache location somewhere in temp.
>
> Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?
Yes, your understanding is correct. This is how the execution cache works, both currently and in this PR. This PR seeks to improve (or more accurately, restore) performance by caching the hashes of file/directory types themselves, so files/directories don't need to be rehashed (which can be an expensive operation) each time the checksum is accessed.
It does this by hashing the path and mtime of the file/directory (not to be confused with the hashing of the file/directory contents), and using it as a key to look up a "hash cache" (not to be confused with the execution cache) that contains previously computed file/directory hashes.
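To make the two-level scheme concrete, here is a minimal sketch of a (path, mtime)-keyed hash cache. The names (`hash_cache`, `cached_file_hash`, `expensive_content_hash`) are illustrative only, not Pydra's actual API, and the real implementation hashes the path/mtime pair rather than using the raw tuple as a key:

```python
import hashlib
import os

# The "hash cache" (not the execution cache): maps (path, mtime) -> content hash.
hash_cache: dict[tuple[str, float], str] = {}

def expensive_content_hash(path: str) -> str:
    """Hash the file's contents -- the costly operation being cached."""
    h = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def cached_file_hash(path: str) -> str:
    """Look up the content hash by (path, mtime); recompute only on a miss."""
    key = (os.path.abspath(path), os.stat(path).st_mtime)
    if key not in hash_cache:
        hash_cache[key] = expensive_content_hash(path)
    return hash_cache[key]
```

If the file is modified, its mtime changes, the key misses, and the content is rehashed; repeated checksum accesses on an unchanged file are dictionary lookups.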
> I have had to put a sleep call inside the hash test to ensure the mtimes are different

Better to mock the call than to sleep. Here's an example where I've done that with `time.time()`:

https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464
> > I have had to put a sleep call inside the hash test to ensure the mtimes are different
>
> Better to mock the call than to sleep. Here's an example where I've done that with `time.time()`: https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464
Interesting, I assumed that the value for the mtime was controlled by the file system, not Python.
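The mtime is indeed stored by the filesystem, but Python can set it explicitly with `os.utime`, which avoids both the `sleep` and clock mocking. Whether this suits the pydra test in question is untried; this is just a sketch of the mechanism:

```python
import os
import tempfile

# Instead of sleeping to let the clock advance, set distinct mtimes directly.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "data.txt")
    with open(path, "w") as f:
        f.write("contents")

    os.utime(path, (0, 1_000_000_000))  # (atime, mtime) in seconds
    first = os.stat(path).st_mtime

    os.utime(path, (0, 1_000_000_001))  # bump mtime by one second
    second = os.stat(path).st_mtime

    assert second > first
```

Since `os.utime` writes the timestamp through the OS rather than patching Python-level calls, it may also sidestep the Windows mocking problem mentioned below, though that is untested here.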
@tclose - you can try with the new Slurm testing workflow, should work now!
> > I have had to put a sleep call inside the hash test to ensure the mtimes are different
>
> Better to mock the call than to sleep. Here's an example where I've done that with `time.time()`: https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464
@effigies I'm not sure this works on Windows (it seems to work on Ubuntu), see test failures
@effigies, I think that I have addressed the outstanding issues with this PR so it just needs your review. However, I had to revert the mtime mocking as it doesn't appear to work for Windows (see https://github.com/nipype/pydra/actions/runs/6307685788/job/17124695852), unless you have some ideas on how to do it.
@tclose - the recent commit to this PR comes from some rebasing?
> @tclose - the recent commit to this PR comes from some rebasing?

I rebased it on top of master after the environment PR was merged and fixed up a few things related to those changes.
ok, trying to figure out if I should review it again, since I already accepted
Sorry, I'm sure we discussed it at some point, but I'm not sure about one important thing... it looks like if I move a file around my filesystem I have no way of reusing the previous tasks now, is that right?
Sorry, I was wrong, the task has the correct hash when the file is the same with a different path.

Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568

(I was planning to create a PR to you, but I must have merged more things into my branch)
> sorry, I was wrong, the task has correct hash when the file is the same with a different path. Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568 (I was planning to create a PR to you, but I must have merged more things into my branch)
Ok, nice addition
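The property being tested above can be illustrated without any pydra machinery: the (path, mtime) pair is only a cache *key*, while the *value* it stores is the content hash that feeds into the task checksum, so two copies of the same file hash identically even though their cache keys differ. A generic sketch (all names here are hypothetical, not pydra's API):

```python
import hashlib
import os
import shutil
import tempfile

def content_hash(path: str) -> str:
    """Hash of the file contents -- what the task checksum actually sees."""
    with open(path, "rb") as f:
        return hashlib.blake2b(f.read()).hexdigest()

d = tempfile.mkdtemp()
src = os.path.join(d, "original.txt")
dst = os.path.join(d, "moved.txt")
with open(src, "w") as f:
    f.write("same bytes")
shutil.copy(src, dst)

# Cache keys differ (different paths), but the cached value -- the content
# hash -- is identical, so the task hash is unaffected by moving the file.
key_src = (src, os.stat(src).st_mtime)
key_dst = (dst, os.stat(dst).st_mtime)
assert key_src != key_dst
assert content_hash(src) == content_hash(dst)
```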
I've just realized that we should also have a similar test for when the persistent cache is used, but it's not enough to set `PYDRA_HASH_CACHE`. Sorry for asking for more work, but can we have a test where the persistent cache is used together with a running task or workflow?
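A framework-agnostic sketch of the kind of test being requested: wrap the expensive hashing step with a call counter, so the test can prove the second lookup was served from the cache rather than rehashed. The names (`persistent_cache`, `hash_with_cache`) are stand-ins, not pydra's actual implementation, and a real test would exercise the on-disk persistent cache through a task run:

```python
import hashlib
import os
import tempfile

calls = {"n": 0}  # counts how many times the expensive hash actually ran
persistent_cache: dict[tuple[str, float], str] = {}  # stand-in for the on-disk cache

def hash_with_cache(path: str) -> str:
    key = (path, os.stat(path).st_mtime)
    if key not in persistent_cache:
        calls["n"] += 1  # cache miss: the costly content hash runs
        with open(path, "rb") as f:
            persistent_cache[key] = hashlib.blake2b(f.read()).hexdigest()
    return persistent_cache[key]

d = tempfile.mkdtemp()
p = os.path.join(d, "inputs.txt")
with open(p, "w") as f:
    f.write("payload")

h1 = hash_with_cache(p)
h2 = hash_with_cache(p)  # should be a cache hit, not a rehash
```

Asserting `calls["n"] == 1` after two lookups is what demonstrates the cache was actually hit.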
@djarecka I have added a new test to ensure that the persistent cache gets hit during the running of tasks. Let me know if there is anything else you need me to do
@tclose - thanks so much for the work!
Types of changes

Summary

Will address #683 by:

- adding `bytes_repr()` overloads for FileSets to yield a "time-stamp" object, consisting of a key (file path) and modification time, as the first item in the generator
- having `hash_single()` check for these key/mtime pairs in a global "local hashes cache" dict

Checklist

Notes

I'm pretty happy with how this turned out, with the exception of a few wrinkles. Any suggestions would be most welcome:

- … the `cache_location` path (my original plan) given that `checksum` and `hash` are object properties instead of methods
- … `~/.pydra-hash-cache`