nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Caching of file-set hashes by local path and mtimes #700

Closed tclose closed 3 months ago

tclose commented 9 months ago

Types of changes

Summary

Will address #683 by

Checklist

Notes

I'm pretty happy with how this turned out, with the exception of a few wrinkles. Any suggestions would be most welcome.

codecov[bot] commented 9 months ago

Codecov Report

Attention: Patch coverage is 99.04762% with 1 line in your changes missing coverage. Please review.

Project coverage is 84.22%. Comparing base (ff01e4c) to head (921979c).

| Files | Patch % | Lines |
|---|---|---|
| pydra/utils/hash.py | 98.97% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #700      +/-   ##
==========================================
+ Coverage   83.93%   84.22%   +0.29%
==========================================
  Files          24       25       +1
  Lines        5029     5123      +94
  Branches     1429     1449      +20
==========================================
+ Hits         4221     4315      +94
  Misses        802      802
  Partials        6        6
```

| [Flag](https://app.codecov.io/gh/nipype/pydra/pull/700/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/nipype/pydra/pull/700/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | `84.22% <99.04%> (+0.29%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ghisvail commented 9 months ago

I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.

Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be reused if the name and hash value did not change and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.

Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?
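The folder-naming scheme described in this summary can be sketched roughly as follows. This is a minimal illustration of the idea only; the function name, hash algorithm, and folder-name format are assumptions for the example, not Pydra's actual scheme:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical sketch: each task gets a folder under the cache root named
# from the task's name plus a hash computed from its input field values.
def task_cache_dir(cache_root: Path, name: str, inputs: dict) -> Path:
    digest = hashlib.blake2b(
        json.dumps(inputs, sort_keys=True).encode(), digest_size=8
    ).hexdigest()
    return cache_root / f"{name}_{digest}"

# Same name + same inputs -> same folder, so previous results are reused;
# any change to the inputs yields a new folder and a fresh computation.
d1 = task_cache_dir(Path("/tmp/cache"), "add", {"x": 1, "y": 2})
d2 = task_cache_dir(Path("/tmp/cache"), "add", {"x": 1, "y": 2})
assert d1 == d2
d3 = task_cache_dir(Path("/tmp/cache"), "add", {"x": 1, "y": 3})
assert d3 != d1
```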

tclose commented 9 months ago

> I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

No worries, I just put you all down as reviewers so you would be notified. Don't feel like you need to contribute if the area isn't familiar.

> AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.
>
> Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be reused if the name and hash value did not change and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.
>
> Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?

Yes, your understanding is correct. This is how the execution cache works, both currently and in this PR. This PR seeks to improve (or more accurately, restore) performance by caching the hashes of file/directory types themselves, so files/directories don't need to be rehashed (which can be an expensive operation) each time the checksum is accessed.

It does this by hashing the path and mtime of the file/directory (not to be confused with the hashing of the file/directory contents), and using it as a key to look up a "hash cache" (not to be confused with the execution cache) that contains previously computed file/directory hashes.

effigies commented 9 months ago
> I have had to put a sleep call inside the hash test to ensure the mtimes are different

Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464
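The mocking approach suggested here might look something like the following sketch, which replaces `time.time` with a fake clock so successive calls return strictly increasing values without any real delay (an illustration of the pattern only, not the nibabel or Pydra test code):

```python
import time
from unittest import mock

# Fake clock: each call advances by `step`, so no sleeping is needed to
# obtain distinct timestamps.
def fake_clock(start: float = 1_000_000.0, step: float = 1.0):
    t = start
    def _time():
        nonlocal t
        t += step
        return t
    return _time

# mock.patch forwards side_effect to the replacement Mock, so every call
# to time.time() inside the block goes through the fake clock.
with mock.patch("time.time", side_effect=fake_clock()):
    t1 = time.time()
    t2 = time.time()
    assert t2 > t1  # distinct timestamps, no real sleep
```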

tclose commented 9 months ago
> I have had to put a sleep call inside the hash test to ensure the mtimes are different

> Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

Interesting, I assumed that the value for the mtime was controlled by the file-system not Python.

djarecka commented 9 months ago

@tclose - you can try with new slurm testing workflow, should work now!

tclose commented 9 months ago
> I have had to put a sleep call inside the hash test to ensure the mtimes are different

> Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

@effigies I'm not sure this works on Windows (it seems to work on Ubuntu), see test failures

tclose commented 9 months ago

@effigies, I think that I have addressed the outstanding issues with this PR so it just needs your review. However, I had to revert the mtime mocking as it doesn't appear to work for Windows (see https://github.com/nipype/pydra/actions/runs/6307685788/job/17124695852), unless you have some ideas on how to do it.
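For reference, since mtime is filesystem metadata (mocking Python's clock does not change what `stat()` reports), one portable way to force distinct mtimes deterministically is to set them explicitly with `os.utime`, which works on both POSIX and Windows. A sketch of the idea; this is an alternative worth considering, not what the PR ended up doing:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "data.txt"
    p.write_text("contents")
    # Force two distinct, deterministic mtimes without sleeping or mocking.
    # The ns tuple is (atime_ns, mtime_ns).
    os.utime(p, ns=(0, 1_000_000_000))
    first = p.stat().st_mtime_ns
    os.utime(p, ns=(0, 2_000_000_000))
    second = p.stat().st_mtime_ns
    assert second > first
```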

djarecka commented 4 months ago

@tclose - the recent commit to this PR comes from some rebasing?

tclose commented 4 months ago

> @tclose - the recent commit to this PR comes from some rebasing?

I rebased it on top of master after the environment PR was merged and fixed up a few things related to those changes

djarecka commented 4 months ago

ok, trying to figure out if I should review it again, since I already accepted

djarecka commented 3 months ago

Sorry, I'm sure we discussed this at some point, but I'm not sure about one important thing... it looks like if I move a file around my filesystem, I now have no way of reusing the previous tasks. Is that right?

djarecka commented 3 months ago

Sorry, I was wrong: the task has the correct hash when the file is the same but at a different path.

Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568

(I was planning to create a PR to you, but I must have merged more things to my branch.)

tclose commented 3 months ago

> Sorry, I was wrong: the task has the correct hash when the file is the same but at a different path.
>
> Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568
>
> (I was planning to create a PR to you, but I must have merged more things to my branch.)

Ok, nice addition
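The property being tested here, that a content-based hash is unchanged when the same bytes live at a different path, can be illustrated with a small stdlib-only sketch (`file_hash` is a hypothetical helper for illustration, not Pydra's actual hash function or the linked test):

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical content-based hash: depends only on the bytes, not the path.
def file_hash(path: Path) -> str:
    return hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()

with tempfile.TemporaryDirectory() as d:
    a = Path(d) / "a" / "file.txt"
    b = Path(d) / "b" / "renamed.txt"
    for p in (a, b):
        p.parent.mkdir()
        p.write_bytes(b"same contents")
    # Moving/renaming the file does not change the hash, so the task's
    # checksum (and hence its execution cache) can still be reused.
    assert file_hash(a) == file_hash(b)
```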

djarecka commented 3 months ago

I've just realized that we should also have a similar test for when the persistent cache is used, but it's not enough just to set PYDRA_HASH_CACHE. Sorry for asking for more work, but can we have a test where the persistent cache is used together with a running task or workflow?
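The kind of persistent-cache hit being asked about might be checked along these lines. This is a sketch of the concept only: the one-file-per-key layout and the `persistent_lookup` helper are assumptions for illustration, not Pydra's actual on-disk format or the PYDRA_HASH_CACHE machinery:

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical on-disk "persistent cache": one file per (path, mtime) key,
# whose contents are the previously computed content hash.
def persistent_lookup(cache_dir: Path, path: Path) -> str:
    stat = path.stat()
    key = hashlib.blake2b(
        f"{path.resolve()}:{stat.st_mtime_ns}".encode(), digest_size=16
    ).hexdigest()
    entry = cache_dir / key
    if entry.exists():                       # cache hit: reuse stored hash
        return entry.read_text()
    digest = hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()
    entry.write_text(digest)                 # cache miss: compute and store
    return digest

with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "hash_cache"
    cache.mkdir()
    f = Path(d) / "input.txt"
    f.write_text("payload")
    h1 = persistent_lookup(cache, f)
    # Second lookup returns the stored hash without re-reading the file's
    # contents, and exactly one cache entry exists on disk.
    assert persistent_lookup(cache, f) == h1
    assert len(list(cache.iterdir())) == 1
```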

tclose commented 3 months ago

@djarecka I have added a new test to ensure that the persistent cache gets hit during the running of tasks. Let me know if there is anything else you need me to do

djarecka commented 3 months ago

@tclose - thanks so much for the work!