ledatelescope / bifrost

A stream processing framework for high-throughput applications.
BSD 3-Clause "New" or "Revised" License

Map disk cache #132

Closed jaycedowell closed 2 years ago

jaycedowell commented 4 years ago

Adds a disk-based caching system for bifrost.map() calls to help reduce startup time in pipelines that are run in the same configuration multiple times. The feature can be turned off by setting the "NOMAPCACHE" variable to 1 in the user.mk file.

coveralls commented 4 years ago

Coverage decreased (-0.02%) to 61.899% when pulling 90291f0dd9fb0e8f2cd45d6627259b86d7242eb1 on jaycedowell:map-disk-cache into 72e031595bf034948fce30b33e40b44f74697455 on ledatelescope:master.

coveralls commented 4 years ago

Coverage decreased (-0.02%) to 61.343% when pulling ab4ece0055409b93ba36696c72dcc36f6b78c077 on jaycedowell:map-disk-cache into 1681fde6e643fcc03a3cea10e411b8411aeb31cc on ledatelescope:master.

codecov-io commented 4 years ago

Codecov Report

Merging #132 into master will decrease coverage by <.01%. The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   58.78%   58.77%   -0.01%     
==========================================
  Files          63       63              
  Lines        5095     5097       +2     
==========================================
+ Hits         2995     2996       +1     
- Misses       2100     2101       +1
Impacted Files Coverage Δ
python/bifrost/__init__.py 80% <100%> (ø) ↑
python/bifrost/map.py 21.05% <50%> (+1.6%) ↑

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 72e0315...90291f0.

jaycedowell commented 4 years ago

Oh, the "coverage" thing again.

telegraphic commented 4 years ago

+1 for this functionality -- as this is at the C++ level I would defer to others for review, though

codecov-commenter commented 4 years ago

Codecov Report

Merging #132 into master will not change coverage. The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #132   +/-   ##
=======================================
  Coverage   61.62%   61.62%           
=======================================
  Files          64       64           
  Lines        5300     5300           
=======================================
  Hits         3266     3266           
  Misses       2034     2034           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a53b35b...2874bd7.

codecov-io commented 3 years ago

Codecov Report

Merging #132 (f6b075e) into master (e1cea0b) will decrease coverage by 2.87%. The diff coverage is 66.66%.

@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   61.62%   58.75%   -2.88%     
==========================================
  Files          64       64              
  Lines        5300     5302       +2     
==========================================
- Hits         3266     3115     -151     
- Misses       2034     2187     +153     
Impacted Files Coverage Δ
python/bifrost/map.py 22.00% <50.00%> (-55.09%) ↓
python/bifrost/__init__.py 80.76% <100.00%> (ø)
python/bifrost/fir.py 0.00% <0.00%> (-100.00%) ↓
python/bifrost/romein.py 0.00% <0.00%> (-100.00%) ↓
python/bifrost/blocks/scrunch.py 37.03% <0.00%> (-59.26%) ↓
python/bifrost/linalg.py 36.84% <0.00%> (-52.64%) ↓
python/bifrost/fft.py 45.00% <0.00%> (-50.00%) ↓
python/bifrost/fdmt.py 50.00% <0.00%> (-50.00%) ↓
python/bifrost/transpose.py 41.66% <0.00%> (-50.00%) ↓
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update e1cea0b...f6b075e.

codecov-commenter commented 3 years ago

Codecov Report

Merging #132 (d75cf87) into master (a57a83f) will decrease coverage by 0.13%. The diff coverage is 23.07%.

@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   58.50%   58.36%   -0.14%     
==========================================
  Files          67       67              
  Lines        5731     5753      +22     
==========================================
+ Hits         3353     3358       +5     
- Misses       2378     2395      +17     
Impacted Files Coverage Δ
python/bifrost/version/__main__.py 0.00% <ø> (ø)
python/bifrost/map.py 24.00% <20.00%> (-2.00%) ↓
python/bifrost/__init__.py 80.00% <100.00%> (ø)
python/bifrost/ndarray.py 73.15% <0.00%> (ø)
python/bifrost/telemetry/__main__.py 0.00% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update a57a83f...d75cf87.

jaycedowell commented 2 years ago

The cache should be controllable through configure now.

jaycedowell commented 2 years ago

I think this PR is probably ready for someone to look at it again, pending tests of whether or not this still works on a GPU.

league commented 2 years ago

One other thing I thought to look into… calling clear_cache in the test setUp is appropriate to reduce test dependence and ordering effects. But we should make sure testing includes the possibility of a cache hit, in case that path is (or becomes) buggy.

jaycedowell commented 2 years ago

My original motivation for calling clear_cache so much in the tests was to make sure that we didn't end up with failures due to cache hits on entries left over from a previous version of Bifrost. Maybe we should only be calling it once at the start of test_map/test_transpose.

jaycedowell commented 2 years ago

I did some testing today and fixed a few problems that I found. Everything is now working as far as I can tell:

- Nothing happens with regard to the cache if you compile with --disable-map-cache.
- The cache gets cleared when the tests start.
- The cache doesn't get cleared if you comment out the call to clear_map_cache.
- When the cache clearing isn't called, the tests run faster the second time.

The only outstanding issue I know of is the fileutils.hpp/process_exists() one. Additional testing with another version of CUDA (I used 10.1) would also be nice.

league commented 2 years ago

I should be able to try this PR again today, on CUDA 11. I'll also start to look at one of the others, whatever is most relevant or ready: is that the mmap ring space one, #135?

jaycedowell commented 2 years ago

Cool, thanks. I just finished going through all of my tests (again) under 10.1 and things still seem ok.

If you did want to look at another one, either #135 or #167 would work. #167 is short, so it has that going for it.

league commented 2 years ago

Tests worked great for me on cuda11, tried with and without map cache enabled. It makes a huge difference in speed. Just out of curiosity, I messed with overwriting one of the files with random bytes and sure, it core dumped. Not surprising, and not worth detecting… just trying to push the limits, find the "robustness boundary." Then I tried editing one of the numbers in cache.version, and it seemed to hang instead of dump. (But that's with the corrupt file still there. Not sure whether that's significant. If the version doesn't match expectations, we shouldn't be trying to read one of the other files?)

If I delete the whole directory and start over, it works again of course. Then if I change cache.version without also corrupting one of the other files, it seems to replace it cleanly and keep moving. So I think that's okay.

I started to experiment with adding a fileutils.cpp in addition to the hpp but it's not ready yet, and maybe not necessary. I'm also slightly uncomfortable with dumping together a command line and calling system() in these routines, but I get it… it's definitely easier to use mkdir -p and rm -rf and *.inf than to code those things in C++ (or maybe even to find the libraries to do it).
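
For reference, here is a rough sketch of what shell-free versions of those two helpers could look like using only POSIX calls (mkdir() and nftw()). The function names are illustrative, not the ones in fileutils.hpp:

```cpp
// Sketch only: shell-free equivalents of "mkdir -p" and "rm -rf" built on
// POSIX calls.  Names are hypothetical, not the actual fileutils.hpp API.
#include <stdexcept>
#include <string>
#include <cerrno>
#include <cstdio>
#include <sys/stat.h>
#include <sys/types.h>
#include <ftw.h>

// Create a directory and any missing parents (like "mkdir -p").
void make_dir_recursive(const std::string& path, mode_t mode=0755) {
    std::string partial;
    for(size_t i = 0; i < path.size(); ++i) {
        partial += path[i];
        if(path[i] == '/' || i + 1 == path.size()) {
            if(::mkdir(partial.c_str(), mode) != 0 && errno != EEXIST) {
                throw std::runtime_error("mkdir failed: " + partial);
            }
        }
    }
}

// nftw() callback: remove each entry as it is visited.
static int remove_entry(const char* fpath, const struct stat*, int, struct FTW*) {
    return std::remove(fpath);
}

// Remove a directory tree (like "rm -rf"), children before parents.
void remove_dir_recursive(const std::string& path) {
    // FTW_DEPTH gives a post-order walk; FTW_PHYS avoids following symlinks
    // out of the cache directory.
    ::nftw(path.c_str(), remove_entry, 16, FTW_DEPTH | FTW_PHYS);
}
```

(C++17's std::filesystem::create_directories()/remove_all() would be the tidiest option, if requiring C++17 is acceptable.)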

league commented 2 years ago

I remember seeing an issue about machines with >1 GPU, and I'm wondering if the map cache makes that any worse… is it embedding GPU-0 architecture info? Or what if $HOME is shared across machines with different GPUs? Maybe not worth worrying about, since if something goes wrong it's a simple matter to wipe the directory and start fresh. Just thinking out loud about potential complications.

jaycedowell commented 2 years ago

> Then I tried editing one of the numbers in cache.version, and it seemed to hang instead of dump. (But that's with the corrupt file still there. Not sure whether that's significant. If the version doesn't match expectations, we shouldn't be trying to read one of the other files?)

The hang is surprising. The cache.version should have been verified first, and then the entire cache should have been dropped when the version information was wrong.
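
For illustration, a minimal sketch of that intended order of operations, using hypothetical names rather than the actual map-cache code:

```cpp
// Sketch of the intended behavior: check cache.version before reading any
// cached entries, and wipe the whole cache directory on a mismatch.
// All names here are hypothetical.
#include <fstream>
#include <string>
#include <sys/types.h>

// Hypothetical directory helpers, as in the earlier sketch in this thread.
void make_dir_recursive(const std::string& path, mode_t mode);
void remove_dir_recursive(const std::string& path);

bool map_cache_is_usable(const std::string& cache_dir,
                         const std::string& expected_version) {
    std::ifstream vfile(cache_dir + "/cache.version");
    std::string found;
    if(!vfile || !std::getline(vfile, found) || found != expected_version) {
        // Unreadable or mismatched version: drop everything, including any
        // stale .lock or corrupted .ptx files, and start over empty.
        remove_dir_recursive(cache_dir);
        make_dir_recursive(cache_dir, 0755);
        std::ofstream(cache_dir + "/cache.version") << expected_version << '\n';
        return false;  // nothing cached; callers fall back to a fresh compile
    }
    return true;
}
```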

> I started to experiment with adding a fileutils.cpp in addition to the hpp but it's not ready yet, and maybe not necessary. I'm also slightly uncomfortable with dumping together a command line and calling system() in these routines, but I get it… it's definitely easier to use mkdir -p and rm -rf and *.inf than to code those things in C++ (or maybe even to find the libraries to do it).

I had thought about this too. The options are:

jaycedowell commented 2 years ago

> I remember seeing an issue about machines with >1 GPU, and I'm wondering if the map cache makes that any worse… is it embedding GPU-0 architecture info? Or what if $HOME is shared across machines with different GPUs?

It shouldn't make things worse on multi-GPU systems, but it might be problematic if those GPUs are heterogeneous. The cache is stored as PTX, which should be forward compatible. The hitch is that when bifrost.map generates the PTX it targets the architecture of the GPU it was requested on, and that is what gets saved to the disk cache. On a mixed-GPU system you could end up in a situation where you cache a function for a compute capability 75 device, save it as 75, and then try to load it/use it from a 52 device. I don't think that works.

The alternative would be to do something like always generating the PTX for the lowest compute capability that Bifrost was compiled for when the disk cache is enabled. That would solve the problems above, maybe with a little slowdown, but would leave the homogeneous-GPU case largely unchanged.

league commented 2 years ago

I don't think any of this necessarily has to hold up this branch. Could leave a TODO here or there as a reminder for things to pursue in the future or revisit if there's a relevant bug report. Or write it as a low-priority github issue with appropriate label (wish list, improvement, for future consideration, etc.)?

jaycedowell commented 2 years ago

Your point about splitting up the cache by arch does make sense, but I'm not sure how necessary that is if we stick with PTX for the cache. As far as I can tell, PTX gets interpreted by the CUDA driver and JIT-compiled to a binary when it is run, so as long as we aren't using features in the PTX that the card cannot support it should be ok to have a flat cache. That should be true if we always target the lowest arch requested during the install. Getting fancier is an option, but I think how we do that is going to need some more thought about the details of compatibility and what CUDA really does with PTX files/binaries. Maybe a "TODO" is the way to go.
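
To make that concrete, here is a sketch of what targeting a fixed lowest virtual architecture could look like. It assumes an NVRTC-based JIT path (an assumption about the implementation); names and error handling are simplified:

```cpp
// Sketch only: compile a kernel to PTX for a fixed, lowest compute
// capability so the driver's JIT can run it on any newer GPU in the system.
// Assumes NVRTC is the runtime compiler; error handling is minimal.
#include <nvrtc.h>
#include <stdexcept>
#include <string>

std::string compile_to_portable_ptx(const std::string& source,
                                    int lowest_cc) {   // e.g. 52
    nvrtcProgram prog;
    if(nvrtcCreateProgram(&prog, source.c_str(), "bf_map_kernel.cu",
                          0, nullptr, nullptr) != NVRTC_SUCCESS) {
        throw std::runtime_error("nvrtcCreateProgram failed");
    }
    // Target a virtual architecture (compute_XX), not a real one (sm_XX),
    // so the output is PTX that the driver can JIT forward to newer cards.
    std::string arch = "--gpu-architecture=compute_" + std::to_string(lowest_cc);
    const char* opts[] = { arch.c_str() };
    if(nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        nvrtcDestroyProgram(&prog);
        throw std::runtime_error("nvrtcCompileProgram failed");
    }
    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::string ptx(ptx_size, '\0');
    nvrtcGetPTX(prog, &ptx[0]);
    nvrtcDestroyProgram(&prog);
    return ptx;  // this string is what would land in the disk cache
}
```

With a flat cache, lowest_cc would come from the minimum architecture selected at install time rather than from whichever device the call happens to land on.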

league commented 2 years ago

Diagnosed the hang that I experienced before. After it core-dumped because I corrupted a ptx file, it left behind the .lock file, so the next time the cache was opened it was stuck waiting on the lock. For whatever that's worth. I was messing around and probably got what I deserved. :)
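
For what it's worth, a sketch of one way to avoid that kind of hang, in the spirit of the fileutils.hpp/process_exists() discussion above: record the owner's PID in the .lock file and treat the lock as stale when that process is gone. The names here are hypothetical:

```cpp
// Sketch of stale-lock recovery: the .lock file records the PID of the
// process that created it; if that process no longer exists (e.g. it core
// dumped), the lock is removed instead of waited on.  Hypothetical names.
#include <cerrno>
#include <cstdio>
#include <fstream>
#include <string>
#include <signal.h>
#include <sys/types.h>

// True if a process with this PID exists (signal 0 probes without sending).
bool process_is_alive(pid_t pid) {
    return ::kill(pid, 0) == 0 || errno == EPERM;
}

// Returns true if the lock file was stale (or unreadable) and was removed.
bool clear_stale_lock(const std::string& lock_path) {
    std::ifstream f(lock_path);
    pid_t owner = 0;
    if(!f || !(f >> owner) || !process_is_alive(owner)) {
        // Dead owner or garbage contents: clear the lock so the next
        // attempt to open the cache doesn't block forever.
        std::remove(lock_path.c_str());
        return true;
    }
    return false;  // a live process still holds the lock
}
```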

jaycedowell commented 2 years ago

I ran through my tests again this morning and I think I'm happy with this. @league if you are happy with it too then I will merge it.

league commented 2 years ago

LGTM, ship it. 🚢