nipype / pydra

Pydra Dataflow Engine
https://nipype.github.io/pydra/

Caching of file-set hashes by local path and mtimes #700

Closed tclose closed 3 months ago

tclose commented 9 months ago

Types of changes

Summary

Will address #683 by

Checklist

Notes

I'm pretty happy with how this turned out, with the exception of a few wrinkles. Any suggestions would be most welcome.

codecov[bot] commented 9 months ago

Codecov Report

Attention: Patch coverage is 99.04762% with 1 line in your changes missing coverage. Please review.

Project coverage is 84.22%. Comparing base (ff01e4c) to head (921979c).

| Files | Patch % | Lines |
|---|---|---|
| pydra/utils/hash.py | 98.97% | 1 Missing :warning: |

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #700      +/-   ##
==========================================
+ Coverage   83.93%   84.22%   +0.29%
==========================================
  Files          24       25       +1
  Lines        5029     5123      +94
  Branches     1429     1449      +20
==========================================
+ Hits         4221     4315      +94
  Misses        802      802
  Partials        6        6
```

| [Flag](https://app.codecov.io/gh/nipype/pydra/pull/700/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | Coverage Δ | |
|---|---|---|
| [unittests](https://app.codecov.io/gh/nipype/pydra/pull/700/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype) | `84.22% <99.04%> (+0.29%)` | :arrow_up: |

Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=nipype#carryforward-flags-in-the-pull-request-comment) to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ghisvail commented 9 months ago

I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.

Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be reused if the name and hash value did not change and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.

Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?
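The folder-naming scheme described in this summary can be sketched roughly as follows. This is a minimal illustration of the idea only; the function name, hash algorithm, and folder-name format are assumptions for the example, not Pydra's actual scheme:

```python
import hashlib
import json
from pathlib import Path

# Hypothetical sketch: each task gets a folder under the cache root named
# from the task's name plus a hash computed from its input field values.
def task_cache_dir(cache_root: Path, name: str, inputs: dict) -> Path:
    digest = hashlib.blake2b(
        json.dumps(inputs, sort_keys=True).encode(), digest_size=8
    ).hexdigest()
    return cache_root / f"{name}_{digest}"

# Same name + same inputs -> same folder, so previous results are reused;
# any change to the inputs yields a new folder and a fresh computation.
d1 = task_cache_dir(Path("/tmp/cache"), "add", {"x": 1, "y": 2})
d2 = task_cache_dir(Path("/tmp/cache"), "add", {"x": 1, "y": 2})
assert d1 == d2
d3 = task_cache_dir(Path("/tmp/cache"), "add", {"x": 1, "y": 3})
assert d3 != d1
```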

tclose commented 9 months ago

> I am not confident I understand enough of the previous and proposed caching methods to have an opinion and assess this PR.

No worries, I just put you all down as reviewers so you would be notified. Don't feel like you need to contribute if the area isn't familiar.

> AFAIK, the previous caching mechanism generated cache folders for each task (including the workflow) in a common cache location which was set to a temporary folder, unless overridden with the cache_dir argument.
>
> Each task's cache folder was composed of its name and a hash value, the latter being computed from the task's input field values. The cache would be reused if the name and hash value did not change and the same cache folder was specified as cache_dir. Otherwise, it would recompute everything in a new cache location somewhere in temp.
>
> Could you confirm whether my summary is accurate and whether this PR is proposing to change the fundamentals of this mechanism?

Yes, your understanding is correct. This is how the execution cache works, both currently and in this PR. This PR seeks to improve (or more accurately, restore) performance by caching the hashes of file/directory types themselves, so files/directories don't need to be rehashed (which can be an expensive operation) each time the checksum is accessed.

It does this by hashing the path and mtime of the file/directory (not to be confused with the hashing of the file/directory contents), and using it as a key to look up a "hash cache" (not to be confused with the execution cache) that contains previously computed file/directory hashes.

effigies commented 9 months ago
> I have had to put a sleep call inside the hash test to ensure the mtimes are different

Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464
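The mocking approach suggested here might look something like the following sketch, which replaces `time.time` with a fake clock so successive calls return strictly increasing values without any real delay (an illustration of the pattern only, not the nibabel or Pydra test code):

```python
import time
from unittest import mock

# Fake clock: each call advances by `step`, so no sleeping is needed to
# obtain distinct timestamps.
def fake_clock(start: float = 1_000_000.0, step: float = 1.0):
    t = start
    def _time():
        nonlocal t
        t += step
        return t
    return _time

# mock.patch forwards side_effect to the replacement Mock, so every call
# to time.time() inside the block goes through the fake clock.
with mock.patch("time.time", side_effect=fake_clock()):
    t1 = time.time()
    t2 = time.time()
    assert t2 > t1  # distinct timestamps, no real sleep
```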

tclose commented 9 months ago
> I have had to put a sleep call inside the hash test to ensure the mtimes are different

> Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

Interesting, I assumed that the value for the mtime was controlled by the file-system not Python.

djarecka commented 9 months ago

@tclose - you can try with new slurm testing workflow, should work now!

tclose commented 9 months ago
> I have had to put a sleep call inside the hash test to ensure the mtimes are different

> Better to mock the call than to sleep. Here's an example where I've done that with time.time(): https://github.com/nipy/nibabel/blob/5f37398a2f8211c175b798eb42298b963c693ae0/nibabel/tests/test_openers.py#L457-L464

@effigies I'm not sure this works on Windows (it seems to work on Ubuntu), see test failures

tclose commented 9 months ago

@effigies, I think that I have addressed the outstanding issues with this PR so it just needs your review. However, I had to revert the mtime mocking as it doesn't appear to work for Windows (see https://github.com/nipype/pydra/actions/runs/6307685788/job/17124695852), unless you have some ideas on how to do it.
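For reference, since mtime is filesystem metadata (mocking Python's clock does not change what `stat()` reports), one portable way to force distinct mtimes deterministically is to set them explicitly with `os.utime`, which works on both POSIX and Windows. A sketch of the idea; this is an alternative worth considering, not what the PR ended up doing:

```python
import os
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as d:
    p = Path(d) / "data.txt"
    p.write_text("contents")
    # Force two distinct, deterministic mtimes without sleeping or mocking.
    # The ns tuple is (atime_ns, mtime_ns).
    os.utime(p, ns=(0, 1_000_000_000))
    first = p.stat().st_mtime_ns
    os.utime(p, ns=(0, 2_000_000_000))
    second = p.stat().st_mtime_ns
    assert second > first
```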

djarecka commented 4 months ago

@tclose - the recent commit to this PR comes from some rebasing?

tclose commented 4 months ago

> @tclose - the recent commit to this PR comes from some rebasing?

I rebased it on top of master after the environment PR was merged and fixed up a few things related to those changes

djarecka commented 4 months ago

ok, trying to figure out if I should review it again, since I already accepted

djarecka commented 3 months ago

Sorry, I'm sure we discussed this at some point, but I'm not sure about one important thing... it looks like if I move a file around my filesystem, I now have no way of reusing the previous tasks. Is that right?

djarecka commented 3 months ago

Sorry, I was wrong: the task has the correct hash when the file is the same but at a different path.

Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568

(I was planning to create a PR to you, but I must have merged more things to my branch.)

tclose commented 3 months ago

> Sorry, I was wrong: the task has the correct hash when the file is the same but at a different path.
>
> Perhaps you can add this test: https://github.com/djarecka/pydra/blob/db782c20890797bd58eb1d52545b7104f7d41aa4/pydra/engine/tests/test_node_task.py#L1568
>
> (I was planning to create a PR to you, but I must have merged more things to my branch.)

Ok, nice addition
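The property being tested here, that a content-based hash is unchanged when the same bytes live at a different path, can be illustrated with a small stdlib-only sketch (`file_hash` is a hypothetical helper for illustration, not Pydra's actual hash function or the linked test):

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical content-based hash: depends only on the bytes, not the path.
def file_hash(path: Path) -> str:
    return hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()

with tempfile.TemporaryDirectory() as d:
    a = Path(d) / "a" / "file.txt"
    b = Path(d) / "b" / "renamed.txt"
    for p in (a, b):
        p.parent.mkdir()
        p.write_bytes(b"same contents")
    # Moving/renaming the file does not change the hash, so the task's
    # checksum (and hence its execution cache) can still be reused.
    assert file_hash(a) == file_hash(b)
```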

djarecka commented 3 months ago

I've just realized that we should also have a similar test for when the persistent cache is used, but it's not enough just to set PYDRA_HASH_CACHE. Sorry for asking for more work, but can we have a test where the persistent cache is used together with a running task or workflow?
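The kind of persistent-cache hit being asked about might be checked along these lines. This is a sketch of the concept only: the one-file-per-key layout and the `persistent_lookup` helper are assumptions for illustration, not Pydra's actual on-disk format or the PYDRA_HASH_CACHE machinery:

```python
import hashlib
import tempfile
from pathlib import Path

# Hypothetical on-disk "persistent cache": one file per (path, mtime) key,
# whose contents are the previously computed content hash.
def persistent_lookup(cache_dir: Path, path: Path) -> str:
    stat = path.stat()
    key = hashlib.blake2b(
        f"{path.resolve()}:{stat.st_mtime_ns}".encode(), digest_size=16
    ).hexdigest()
    entry = cache_dir / key
    if entry.exists():                       # cache hit: reuse stored hash
        return entry.read_text()
    digest = hashlib.blake2b(path.read_bytes(), digest_size=16).hexdigest()
    entry.write_text(digest)                 # cache miss: compute and store
    return digest

with tempfile.TemporaryDirectory() as d:
    cache = Path(d) / "hash_cache"
    cache.mkdir()
    f = Path(d) / "input.txt"
    f.write_text("payload")
    h1 = persistent_lookup(cache, f)
    # Second lookup returns the stored hash without re-reading the file's
    # contents, and exactly one cache entry exists on disk.
    assert persistent_lookup(cache, f) == h1
    assert len(list(cache.iterdir())) == 1
```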

tclose commented 3 months ago

@djarecka I have added a new test to ensure that the persistent cache gets hit during the running of tasks. Let me know if there is anything else you need me to do

djarecka commented 3 months ago

@tclose - thanks so much for the work!