I double-checked the compute capability of the K80 and it's actually 3.7, not >= 6.0. Since only atomicAdd<double>
needed the higher compute capability, I've implemented it in terms of compare-and-swap and lowered the requirements so you can use it on the K80. Could you please try it out and let me know if it works for you?
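For reference, the standard way to emulate a double-precision atomicAdd on pre-6.0 hardware is a compare-and-swap loop, as described in the CUDA C Programming Guide. A minimal sketch of that pattern is below; the function name is mine and the actual HASTE implementation may differ in details:

```cuda
// Sketch: double-precision atomicAdd via atomicCAS, usable on devices with
// compute capability < 6.0 (e.g. the K80). Pattern from the CUDA C Programming Guide.
__device__ double atomicAddDouble(double* address, double val) {
  unsigned long long int* address_as_ull = (unsigned long long int*)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
    // If another thread updated *address between the read and the CAS,
    // old != assumed and we retry with the freshly observed value.
  } while (assumed != old);
  return __longlong_as_double(old);
}
```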
Confirmed working! // might be a good idea to update the docs?
PS: The install still required some hacking of the Makefile paths to the compilers, CUDA libraries, etc. I am fine with this, but some potential users might be put off by it. // It may be a good idea to note in the docs that the Makefile might require fine-tuning to get up and running. On my systems (including a standard AWS setup with an out-of-the-box TF environment), some voodoo magic was needed to get this to compile.
I have tested HASTE on two different instance types on AWS (for reproducibility):
p2.xlarge (K80 instance)
p3.2xlarge (V100 instance)
Both instances were using the stock Deep Learning AMI (Amazon Linux 2) Version 29.0 - ami-0b0b075706e19de29.
The following sequence of commands was used to install HASTE:
(0) Change the /usr/local/cuda symlink to point from /usr/local/cuda-10.0 to /usr/local/cuda-10.1 (see the other issue; without this, HASTE does not install properly).
(1) source activate tensorflow2_p36
(2) git clone https://github.com/lmnt-com/haste
(3) cd haste
(4) make haste_tf
(5) pip install haste_tf-*.whl
Then the following was run from a Jupyter notebook:
On p2.xlarge (K80) the following is the output:
env: CUDA_VISIBLE_DEVICES=0
HASTE has total 240800 trainable variables!
CuDNN has total 240800 trainable variables!
HASTE maxabs of each grad: 0.0 0.0
Non-HASTE maxabs of each grad: 6.3259706 0.0 7.397908
On p3.2xlarge (V100) the following is the output:
env: CUDA_VISIBLE_DEVICES=0
HASTE has total 240800 trainable variables!
CuDNN has total 240800 trainable variables!
HASTE maxabs of each grad: 7.004616 6.2311497
Non-HASTE maxabs of each grad: 6.231148 0.0 7.0048447
Gradients appear to be broken on the K80 device.