GPU Support #9

Open mdbarnesUCSD opened 3 years ago

mdbarnesUCSD commented 3 years ago


I am looking to run tensorsignatures on an AWS g3 instance. I was hoping to make use of the GPU support but am receiving an error. The AWS conda environment that I am using is tensorflow_p36 and comes with tensorflow-gpu version 1.15.3 installed. After running 'pip install tensorsignatures' the packages are:

tensorboard 1.15.0
tensorboard-plugin-wit 1.7.0
tensorflow 1.15.0
tensorflow-estimator 1.15.1
tensorflow-gpu 1.15.3
tensorflow-serving-api 1.15.0
tensorsignatures 0.5.0

The code runs when 'tensorflow 1.15.0' is installed, but not when only 'tensorflow-gpu 1.15.3' is installed (because tensorflow cannot be imported).

Is there a way to verify that GPU is working?

Thank you!

sagar87 commented 3 years ago

Hi mdbarnesUCSD,

Interesting. I haven't tried to run tensorsignatures in an AWS environment myself, but since it doesn't seem to be possible to import tensorflow, there might be something wrong with the Python installation, or the respective conda environment may not be active. Could you open a Python shell on the AWS machine and test whether the package can be imported? That should look something like this:

$ python
Python 3.6.1 (default, Sep 22 2017, 15:04:10)
[GCC 5.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
/home/hsv23/tensorflow/lib/python3.6/site-packages/h5py/ FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
>>> tf.__version__

Otherwise it might help to downgrade tensorflow to 1.15.0. Perhaps running conda install tensorflow=1.15.0 works?

To check in general whether the GPU is working, you can run $ nvidia-smi or $ nvcc --version, which should report the installed CUDA version on the AWS machine. Let me know if that helps.
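
The two command-line checks above can also be scripted. The following is a minimal sketch that uses only the Python standard library; the helper name `cuda_tool_report` is made up for illustration. It reports whether `nvidia-smi` and `nvcc` are on the PATH and captures their output when they are.

```python
import shutil
import subprocess


def cuda_tool_report(commands=(("nvidia-smi",), ("nvcc", "--version"))):
    """Return {tool name: output text, or None if the tool is not on PATH}."""
    report = {}
    for cmd in commands:
        tool = cmd[0]
        if shutil.which(tool) is None:
            report[tool] = None  # tool not installed or not on PATH
            continue
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        report[tool] = result.stdout
    return report


if __name__ == "__main__":
    for tool, output in cuda_tool_report().items():
        print(tool, "->", "not found" if output is None else output.strip())
```

A missing tool does not by itself mean the GPU is unusable, only that the command could not be found in the active environment.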

mdbarnesUCSD commented 3 years ago

Thanks for your response. I ran the $ nvidia-smi command and see that the GPU is being used. I was initially concerned that the CPU was being used instead, but now see that it is functioning as expected.


mdbarnesUCSD commented 3 years ago


I am reopening this issue because when I run the GPU version of the code, GPU-Util stays at 0% while tensorsignatures train is running. I installed the GPU version of tensorsignatures by running: pip install --upgrade pip setuptools wheel && pip install -r requirements-gpu.txt

I am using an AWS g3 instance with the following 'tensor' packages:

tensorboard 1.15.0
tensorboard-plugin-wit 1.7.0
tensorflow-estimator 1.15.1
tensorflow-gpu 1.15.0
tensorflow-serving-api 1.15.0
tensorsignatures 0.5.0

Here is the output from running $ nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.80.02    Driver Version: 450.80.02    CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla M60           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   40C    P0    37W / 150W |     70MiB /  7618MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     26132      C   ...ensorflow2_p36/bin/python       67MiB |
+-----------------------------------------------------------------------------+

Also, this is the output from $ nvcc --version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2018 NVIDIA Corporation
Built on Sat_Aug_25_21:08:01_CDT_2018
Cuda compilation tools, release 10.0, V10.0.130
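
Note that the two tools report different things: the "CUDA Version" in the nvidia-smi header is the newest CUDA version the driver supports, while nvcc reports the toolkit that is actually installed. Below is a small sketch for pulling the toolkit release out of the nvcc text so it can be compared with what the TensorFlow build expects (the official tensorflow-gpu 1.15 pip builds were built against CUDA 10.0). The helper name `nvcc_release` is hypothetical.

```python
import re


def nvcc_release(nvcc_output):
    """Extract the toolkit release (e.g. '10.0') from `nvcc --version` text."""
    match = re.search(r"release\s+(\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


# Sample text copied from the nvcc output quoted in this thread.
sample = (
    "nvcc: NVIDIA (R) Cuda compiler driver\n"
    "Copyright (c) 2005-2018 NVIDIA Corporation\n"
    "Built on Sat_Aug_25_21:08:01_CDT_2018\n"
    "Cuda compilation tools, release 10.0, V10.0.130\n"
)
print(nvcc_release(sample))  # prints 10.0
```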

Please let me know if I can provide any additional information. Thank you!

sagar87 commented 3 years ago

Hi mdbarnesUCSD,

this certainly looks wrong. Hard to diagnose the problem remotely ... Can you paste the output of pip freeze ? What happens if you execute this test script (taken from ?

import tensorflow as tf

# gpu_device_name() returns an empty string when no GPU device is registered
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

mdbarnesUCSD commented 3 years ago

Here is the output from running the test script:

1.15.0
Please install GPU version of TF

Additionally, I ran this (from this Stack Overflow post):

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

and got this output:

Device mapping:
/job:localhost/replica:0/task:0/device:XLA_CPU:0 -> device: XLA_CPU device
/job:localhost/replica:0/task:0/device:XLA_GPU:0 -> device: XLA_GPU device

Also, from the same post:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

producing this output:

[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 2796909284998702720
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 9824271307598288295
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6226426787505711414
physical_device_desc: "device: XLA_GPU device"
]
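
In the device listing above, only XLA_GPU appears and there is no plain "GPU" device. Here is a rough sketch of that check, assuming the device types have already been collected into a list of strings (the function name `has_plain_gpu` is made up here):

```python
def has_plain_gpu(device_types):
    """True if a regular 'GPU' device is listed; XLA_GPU alone does not count."""
    return any(device_type == "GPU" for device_type in device_types)


# Device types copied from the listing in this thread.
listed = ["CPU", "XLA_CPU", "XLA_GPU"]
print(has_plain_gpu(listed))  # prints False

# With TensorFlow 1.x installed, the list could be built like this:
#   from tensorflow.python.client import device_lib
#   listed = [d.device_type for d in device_lib.list_local_devices()]
```

A False result is consistent with the test script's message: TensorFlow sees the card through XLA but did not register a regular GPU device, which often points at CUDA or cuDNN libraries that do not match the TensorFlow build.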

Here is the output from pip freeze:

absl-py==0.11.0 alabaster==0.7.12 anaconda-client==1.7.2 anaconda-project==0.8.3 argh==0.26.2 asn1crypto==1.3.0 astor==0.8.1 astroid==2.4.2 astropy==4.0 astunparse==1.6.3 atomicwrites==1.3.0 attrs==19.3.0 autopep8==1.4.4 autovizwidget==0.16.0 Babel==2.8.0 backcall==0.1.0 backports.shutil-get-terminal-size==1.0.0 beautifulsoup4==4.8.2 bitarray==1.2.1 bkcharts==0.2 bleach==1.5.0 bokeh==1.4.0 boto==2.49.0 boto3==1.16.9 botocore==1.19.9 Bottleneck==1.3.2 cachetools==4.1.1 certifi==2020.6.20 cffi==1.14.0 chardet==3.0.4 Click==7.0 cloudpickle==1.3.0 clyent==1.2.2 colorama==0.4.3 contextlib2==0.6.0.post1 cryptography==2.8 cycler==0.10.0 Cython==0.29.15 cytoolz==0.10.1 dask==2.11.0 decorator==4.4.1 defusedxml==0.6.0 diff-match-patch==20181111 distributed==2.11.0 docutils==0.16 entrypoints==0.3 environment-kernels==1.1.1 et-xmlfile==1.0.1 fastcache==1.1.0 filelock==3.0.12 flake8==3.7.9 Flask==1.1.1 flatbuffers==1.12 fsspec==0.6.2 future==0.18.2 gast==0.2.2 gevent==1.4.0 glob2==0.7 gmpy2==2.0.8 google-auth==1.23.0 google-auth-oauthlib==0.4.2 google-pasta==0.2.0 greenlet==0.4.15 grpcio==1.32.0 h5py==2.10.0 hdijupyterutils==0.16.0 HeapDict==1.0.1 horovod==0.19.5 html5lib==0.9999999 hypothesis==5.5.4 idna==2.8 imageio==2.6.1 imagesize==1.2.0 importlib-metadata==1.5.0 intervaltree==3.0.2 ipykernel==5.1.4 ipyparallel @ file:///tmp/build/80754af9/ipyparallel_1593440601845/work ipython==7.12.0 ipython-genutils==0.2.0 ipywidgets==7.5.1 isort==4.3.21 itsdangerous==1.1.0 jdcal==1.4.1 jedi==0.14.1 jeepney==0.4.2 Jinja2==2.11.1 jmespath @ file:///tmp/build/80754af9/jmespath_1594304593830/work joblib==0.14.1 json5==0.9.1 jsonschema==3.2.0 jupyter==1.0.0 jupyter-client==5.3.4 jupyter-console==6.1.0 jupyter-core==4.6.1 jupyterlab==1.2.6 jupyterlab-server==1.0.6 Keras==2.3.0 Keras-Applications==1.0.8 Keras-Preprocessing==1.1.2 keyring==21.1.0 kiwisolver==1.1.0 lazy-object-proxy==1.4.3 libarchive-c==2.8 lief==0.9.0 llvmlite==0.31.0 locket==0.2.0 lxml==4.5.0 Markdown==3.3.3 MarkupSafe==1.1.1 
matplotlib==3.1.2 mccabe==0.6.1 mistune==0.8.4 mkl-fft==1.0.15 mkl-random==1.1.0 mkl-service==2.3.0 mock==4.0.1 more-itertools==8.2.0 mpi4py==3.0.3 mpmath==1.1.0 msgpack==0.6.1 multipledispatch==0.6.0 nb-conda==2.2.1 nb-conda-kernels @ file:///tmp/build/80754af9/nb_conda_kernels_1598624781735/work nbconvert==5.6.1 nbformat==5.0.4 networkx==2.4 nltk==3.4.5 nose==1.3.7 notebook==6.0.3 numba==0.48.0 numexpr==2.7.1 numpy==1.16.1 numpydoc==0.9.2 oauthlib==3.1.0 olefile==0.46 opencv-python== openpyxl==3.0.3 opt-einsum==3.3.0 packaging==20.1 pandas==0.25.3 pandocfilters==1.4.2 parso==0.5.2 partd==1.1.0 path==13.1.0 pathlib2==2.3.5 pathtools==0.1.2 patsy==0.5.1 pep8==1.7.1 pexpect==4.8.0 pickleshare==0.7.5 Pillow==7.0.0 pkginfo== plotly==4.12.0 pluggy==0.13.1 ply==3.11 prometheus-client==0.7.1 prompt-toolkit==3.0.3 protobuf==3.14.0 protobuf3-to-dict==0.1.5 psutil==5.6.7 psycopg2==2.7.5 PTable==0.9.2 ptyprocess==0.6.0 py==1.8.1 pyasn1==0.4.8 pyasn1-modules==0.2.8 pycodestyle==2.5.0 pycosat==0.6.3 pycparser==2.19 pycrypto==2.6.1 pycurl== pydocstyle==4.0.1 pyflakes==2.1.1 pygal==2.4.0 Pygments==2.5.2 pykerberos==1.2.1 pylint==2.5.3 pyodbc===4.0.0-unsupported pyOpenSSL==19.1.0 pyparsing==2.4.6 pyrsistent==0.15.7 PySocks==1.7.1 pytest==5.3.5 pytest-arraydiff==0.3 pytest-astropy==0.8.0 pytest-astropy-header==0.1.2 pytest-doctestplus==0.5.0 pytest-openfiles==0.4.0 pytest-remotedata==0.3.2 python-dateutil==2.8.1 python-jsonrpc-server==0.3.4 python-language-server==0.31.7 pytz==2019.3 PyWavelets==1.1.1 pyxdg==0.26 PyYAML==5.3.1 pyzmq==18.1.1 QDarkStyle==2.8 QtAwesome==0.6.1 qtconsole==4.6.0 QtPy==1.9.0 requests==2.22.0 requests-kerberos==0.12.0 requests-oauthlib==1.3.0 retrying==1.3.3 rope==0.16.0 rsa==4.6 Rtree==0.9.3 ruamel-yaml==0.15.87 s3fs==0.4.2 s3transfer==0.3.3 sagemaker==2.16.1 scikit-image==0.16.2 scikit-learn==0.21.3 scipy==1.3.2 seaborn==0.10.0 SecretStorage==3.1.2 Send2Trash==1.5.0 simplegeneric==0.8.1 singledispatch== six==1.15.0 smdebug-rulesconfig==0.1.5 
snowballstemmer==2.0.0 sortedcollections==1.1.2 sortedcontainers==2.1.0 soupsieve==1.9.5 sparkmagic==0.15.0 Sphinx==2.4.0 sphinxcontrib-applehelp==1.0.1 sphinxcontrib-devhelp==1.0.1 sphinxcontrib-htmlhelp==1.0.2 sphinxcontrib-jsmath==1.0.1 sphinxcontrib-qthelp==1.0.2 sphinxcontrib-serializinghtml==1.1.3 sphinxcontrib-websupport==1.2.0 spyder==4.0.1 spyder-kernels==1.8.1 SQLAlchemy==1.3.13 statsmodels==0.11.0 sympy==1.5.1 tables==3.6.1 tblib==1.6.0 tensorboard==1.15.0 tensorboard-plugin-wit==1.7.0 tensorflow-estimator==1.15.1 tensorflow-gpu==1.15.0 tensorflow-serving-api==1.15.0 tensorsignatures==0.5.0 termcolor==1.1.0 terminado==0.8.3 testpath==0.4.4 toml==0.10.1 toolz==0.10.0 tornado==6.0.3 tqdm==4.39.0 traitlets==4.3.3 typed-ast==1.4.1 typing-extensions== ujson==1.35 unicodecsv==0.14.1 urllib3==1.25.10 watchdog==0.10.2 wcwidth==0.1.8 webencodings==0.5.1 Werkzeug==1.0.0 widgetsnbextension==3.5.1 wrapt==1.12.1 wurlitzer==2.0.0 xlrd==1.2.0 XlsxWriter==1.2.7 xlwt==1.3.0 yapf==0.28.0 zict==1.0.0 zipp==2.2.0
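
To pick the TensorFlow-related entries out of a long pip freeze listing like this one, something along these lines may help. It is only a sketch: `tensorflow_packages` is a hypothetical helper, and it expects one requirement per line, as pip freeze normally prints them.

```python
def tensorflow_packages(freeze_text):
    """Map package name -> version for names starting with 'tensor'."""
    found = {}
    for line in freeze_text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skips blank lines and 'name @ file://...' entries
        name, _, version = line.partition("==")
        if name.lower().startswith("tensor"):
            found[name] = version
    return found


# A few entries copied from the listing in this thread.
sample = """\
tensorboard==1.15.0
tensorflow-estimator==1.15.1
tensorflow-gpu==1.15.0
tensorsignatures==0.5.0
termcolor==1.1.0
"""
pkgs = tensorflow_packages(sample)
print(pkgs)
print("tensorflow" in pkgs, "tensorflow-gpu" in pkgs)  # prints False True
```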

Thanks for the help. Please let me know if I can provide any more information.

mdbarnesUCSD commented 3 years ago

The issue was resolved when I switched from the tensorflow2_p36 to the tensorflow_p36 AWS environment and ran the installation commands from the README:

pip install --upgrade pip setuptools wheel && pip install -r requirements-gpu.txt
python setup.py install

The installed version is tensorflow-gpu==1.15.3.

sagar87 commented 3 years ago

That's great! Thanks for letting me know. I am thinking about porting the code to TF2 or PyTorch; TF1 is indeed a pain in the neck.