taichi-dev / taichi

Productive, portable, and performant GPU programming in Python.
https://taichi-lang.org
Apache License 2.0

Error calling .to_torch on taichi field #2190

Open davidmatthews1uvm opened 3 years ago

davidmatthews1uvm commented 3 years ago

Describe the bug
Taichi crashes when I call .to_torch() on a Taichi field in a cluster (SLURM) environment.

To Reproduce

# sample code here
import taichi as ti
ti.init(arch=ti.cuda)
x = ti.field(ti.f32, shape=(1,1))
x.to_torch()

Log/Screenshots

$ python test.py 
[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-u5oe1tbb
[Taichi] version 0.7.14, llvm 10.0.0, commit 58feee37, linux, python 3.7.7
[Taichi] Starting on arch=cuda
[Taichi] materializing...
Traceback (most recent call last):
  File "test.py", line 4, in <module>
    x.to_torch()
  File "/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/lang/util.py", line 214, in wrapped
    return func(*args, **kwargs)
  File "/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/lang/expr.py", line 186, in to_torch
    tensor_to_ext_arr(self, arr)
  File "/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/lang/kernel.py", line 571, in wrapped
    return primal(*args, **kwargs)
  File "/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/lang/kernel.py", line 500, in __call__
    return self.compiled_functions[key](*args)
  File "/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/lang/kernel.py", line 432, in func__
    gpu_v = v.cuda()
RuntimeError: CUDA error: all CUDA-capable devices are busy or unavailable
[E 02/22/21 22:51:37.623] [cuda_driver.h:operator()@81] CUDA Error CUDA_ERROR_CONTEXT_IS_DESTROYED: context is destroyed while calling mem_free (cuMemFree_v2)

***********************************
* Taichi Compiler Stack Traceback *
***********************************
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::lang::CUDADriverFunction<void*>::operator()(void*)
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::lang::UnifiedAllocator::~UnifiedAllocator()
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::lang::MemoryPool::~MemoryPool()
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::lang::Program::~Program()
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so(+0x58059e) [0x2b7d934d759e]
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so(+0x3f1f2b) [0x2b7d93348f2b]
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so(+0x3f1bfc) [0x2b7d93348bfc]
python(+0x1246a0) [0x55a6a84346a0]
python(_PyGC_CollectNoFail+0x2a) [0x55a6a852d48a]
python(PyImport_Cleanup+0x4b0) [0x55a6a84b9b70]
python(Py_FinalizeEx+0x67) [0x55a6a8534817]
python(+0x2373d3) [0x55a6a85473d3]
python(_Py_UnixMain+0x3c) [0x55a6a85476fc]
/lib64/libc.so.6: __libc_start_main
python(+0x1dc3c0) [0x55a6a84ec3c0]

Internal error occurred. Check out this page for possible solutions:
https://taichi.readthedocs.io/en/stable/install.html#troubleshooting
terminate called after throwing an instance of 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >'
[E 02/22/21 22:51:37.626] Received signal 6 (Aborted)

***********************************
* Taichi Compiler Stack Traceback *
***********************************
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::Logger::error(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, bool)
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so: taichi::signal_handler(int)
/lib64/libc.so.6(+0x36340) [0x2b7d8b9f1340]
/lib64/libc.so.6: gsignal
/lib64/libc.so.6: abort
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/bin/../lib/libstdc++.so.6: __gnu_cxx::__verbose_terminate_handler()
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/bin/../lib/libstdc++.so.6(+0xabf47) [0x2b7d966fbf47]
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/bin/../lib/libstdc++.so.6(+0xabf7d) [0x2b7d966fbf7d]
/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib/taichi_core.so(+0x353a0b) [0x2b7d932aaa0b]

Internal error occurred. Check out this page for possible solutions:
https://taichi.readthedocs.io/en/stable/install.html#troubleshooting
...

Additional comments
The following versions of torch and torchvision were installed with pip:

torch==1.6.0+cu101
torchvision==0.7.0+cu101

ti diagnose output below:

(taichi-new) [dmatthe1@dg-gpunode02 ~]$ ti diagnose
[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-ksywpzan
[Taichi] version 0.7.14, llvm 10.0.0, commit 58feee37, linux, python 3.7.7

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://taichi.rtfd.io/en/stable
GitHub: https://github.com/taichi-dev/taichi
Forum:  https://forum.taichi.graphics

Taichi system diagnose:

python: 3.7.7 (default, May  7 2020, 21:25:33) 
[GCC 7.3.0]
system: linux
executable: /users/d/m/dmatthe1/anaconda3/envs/taichi-new/bin/python
platform: Linux-3.10.0-957.21.3.el7.x86_64-x86_64-with-redhat-7.6-Maipo
architecture: 64bit 
uname: uname_result(system='Linux', node='dg-gpunode02.cluster', release='3.10.0-957.21.3.el7.x86_64', version='#1 SMP Fri Jun 14 02:54:29 EDT 2019', machine='x86_64', processor='x86_64')
locale: en_US.UTF-8
PATH: /gpfs1/arch/spack-0.11.2/opt/spack/linux-rhel7-x86_64/gcc-7.3.0/gcc-7.3.0-khalm3u67h2z4zkci37yawjxdpnkglgb/bin:/users/d/m/dmatthe1/anaconda3/envs/taichi-new/bin:/gpfs1/arch/x86_64-rhel7/bin:/usr/lib64/qt-3.3/bin:/gpfs1/arch/x86_64-rhel7/openmpi-3.1.6-cuda-slurm-ib-verbs/bin:/users/d/m/dmatthe1/utils:/users/d/m/dmatthe1/julia/julia-1.0.2/bin:/gpfs1/arch/x86_64-rhel7/bin:/gpfs1/arch/x86_64-rhel7/bin:/usr/lib64/qt-3.3/bin:/gpfs1/arch/x86_64-rhel7/openmpi-3.1.6-cuda-slurm-ib-verbs/bin:/users/d/m/dmatthe1/utils:/users/d/m/dmatthe1/anaconda3/condabin:/users/d/m/dmatthe1/julia/julia-1.0.2/bin:/gpfs1/arch/x86_64-rhel7/bin:/gpfs1/arch/spack-0.11.2/bin:/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/var/cfengine/bin:/usr/local/cuda/bin:/gpfs1/arch/x86_64-rhel7/bin:/usr/lpp/mmfs/bin:/opt/ibutils/bin:/opt/env-switcher/bin:/users/d/m/dmatthe1/.local/bin:/users/d/m/dmatthe1/bin
PYTHONPATH: ['/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/bin', '/users/d/m/dmatthe1/src/research_code', '/users/d/m/dmatthe1/softbots', '/gpfs1/home/d/m/dmatthe1', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python37.zip', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/lib-dynload', '/users/d/m/dmatthe1/.local/lib/python3.7/site-packages', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/EvoSoroCore-0.1.0.0-py3.7.egg', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/lxml-4.5.2-py3.7-linux-x86_64.egg', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/networkx-2.4-py3.7.egg', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/decorator-4.4.2-py3.7.egg', '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/Parallel_Computation-0.9-py3.7.egg', '/gpfs1/home/d/m/dmatthe1/manual_python_packages/mpi4py-3.0.0', '/gpfs1/home/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/core/../lib']

LSB Version:    :core-4.1-amd64:core-4.1-noarch:cxx-4.1-amd64:cxx-4.1-noarch:desktop-4.1-amd64:desktop-4.1-noarch:languages-4.1-amd64:languages-4.1-noarch:printing-4.1-amd64:printing-4.1-noarch
Distributor ID: RedHatEnterpriseServer
Description:    Red Hat Enterprise Linux Server release 7.6 (Maipo)
Release:    7.6
Codename:   Maipo

TAICHI_REPO_DIR=

import: <module 'taichi' from '/users/d/m/dmatthe1/anaconda3/envs/taichi-new/lib/python3.7/site-packages/taichi/__init__.py'>

cc: True
cpu: True
metal: False
opengl: False
cuda: True

`glewinfo` not available: [Errno 2] No such file or directory: 'glewinfo': 'glewinfo'

Mon Feb 22 22:54:13 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:1F:00.0 Off |                    0 |
| N/A   32C    P0    42W / 300W |      0MiB / 32480MiB |      0%   E. Process |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-d5apz6xc
[Taichi] version 0.7.14, llvm 10.0.0, commit 58feee37, linux, python 3.7.7

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-fpezv7cc
[Taichi] version 0.7.14, llvm 10.0.0, commit 58feee37, linux, python 3.7.7
[Taichi] Starting on arch=x64

[W 02/22/21 22:54:19.181] [__init__.py:adaptive_arch_select@589] Arch=[<Arch.opengl: 6>] is not supported, falling back to CPU
[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-ax5owccu
[Taichi] version 0.7.14, llvm 10.0.0, commit 58feee37, linux, python 3.7.7
[Taichi] Starting on arch=x64

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-rbzzgqca
[Taichi] version 0.7.14, llvm 10.0.0, commit 58feee37, linux, python 3.7.7
[Taichi] Starting on arch=cuda

[Taichi] mode=release
[Taichi] preparing sandbox at /tmp/taichi-8yzj5jpb
[Taichi] version 0.7.14, llvm 10.0.0, commit 58feee37, linux, python 3.7.7

*******************************************
**      Taichi Programming Language      **
*******************************************

Docs:   https://taichi.rtfd.io/en/stable
GitHub: https://github.com/taichi-dev/taichi
Forum:  https://forum.taichi.graphics

Running example minimal ...
[Taichi] Starting on arch=x64
[Taichi] materializing...
>>> Running time: 1.75s
42

Consider attaching this log when maintainers ask about system information.
>>> Running time: 22.58s
k-ye commented 3 years ago

Could this be due to the exclusive-process compute mode (the E. Process setting shown in your nvidia-smi output) on your cluster? See https://discuss.pytorch.org/t/distributeddataparallel-runtimeerror-cuda-error-all-cuda-capable-devices-are-busy-or-unavailable/102763/7
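
For reference, one way to confirm the compute mode from the node itself is a quick nvidia-smi query; a minimal sketch, assuming nvidia-smi is on PATH:

import subprocess

# Query the per-GPU compute mode; "Exclusive_Process" corresponds to the
# "E. Process" value shown in the nvidia-smi table above.
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=compute_mode", "--format=csv,noheader"],
    text=True,
))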

davidmatthews1uvm commented 3 years ago

Thanks for the advice. I checked and you are right -- the cluster is in exclusive-process mode. Do you know of any workarounds for this issue?

k-ye commented 3 years ago

There is a suggestion for how to turn it off: https://discuss.pytorch.org/t/distributeddataparallel-runtimeerror-cuda-error-all-cuda-capable-devices-are-busy-or-unavailable/102763/5. But I'm not sure whether you are authorized to change that setting on your cluster.

Also, could you try printing out CUDA_VISIBLE_DEVICES? You may be able to widen the set of devices you can use; see https://stackoverflow.com/a/39661999/12003165. (Of course, this depends on how many physical GPUs your cluster machine has.)
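
As a minimal sketch of that suggestion, you could inspect and, if your allocation permits it, widen CUDA_VISIBLE_DEVICES before Taichi initializes; the device IDs below are placeholders, not a recommendation for this cluster:

import os

# See what the SLURM allocation currently exposes.
print(os.environ.get("CUDA_VISIBLE_DEVICES"))

# Hypothetical: make two devices visible before importing/initializing Taichi.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

import taichi as ti
ti.init(arch=ti.cuda)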

Last but not least, your stack trace implies that the PyTorch tensor is on the CPU (the else branch in the code below handles the CPU case):

https://github.com/taichi-dev/taichi/blob/a936eef45b09ae2061b454b6f5ceeb229f39e4ee/python/taichi/lang/kernel.py#L429-L432.

Could you try directly passing in a tensor that is already on the GPU?
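
A minimal sketch of that suggestion, assuming the 0.7.x-era ti.ext_arr() annotation for external arrays (illustrative only, not a confirmed fix under exclusive-process mode):

import taichi as ti
import torch

ti.init(arch=ti.cuda)
x = ti.field(ti.f32, shape=(4, 4))

@ti.kernel
def copy_to(arr: ti.ext_arr()):
    for i, j in x:
        arr[i, j] = x[i, j]

# Allocate the destination tensor on the GPU up front, so the CPU->GPU
# fallback path in kernel.py (gpu_v = v.cuda()) that raised the error is never hit.
out = torch.zeros((4, 4), device="cuda")
copy_to(out)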

davidmatthews1uvm commented 3 years ago

Thanks for the suggestions. I was able to get Taichi and PyTorch to pass data directly between GPU memory, with one caveat: it only works when PyTorch is using a different GPU than Taichi. In that case, x.to_torch(device=device) and x.from_torch(x_torch) work as expected.
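
For completeness, a minimal sketch of that two-GPU setup; the device indices are examples, and whether PyTorch sees a second device depends on the allocation:

import taichi as ti
import torch

ti.init(arch=ti.cuda)                  # Taichi takes one GPU
device = torch.device("cuda:1")        # PyTorch is pointed at a different GPU

x = ti.field(ti.f32, shape=(4, 4))
x_torch = x.to_torch(device=device)    # Taichi field -> torch tensor on cuda:1
x.from_torch(x_torch)                  # and back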

Do you think it would be feasible to expose the Taichi CUDA context and pass it directly to PyTorch, or vice versa? I can use two GPUs if needed, but ideally I would only need to allocate one.

k-ye commented 3 years ago

Do you think it would be feasible to expose the Taichi CUDA context and pass it directly to PyTorch, or vice versa? I can use two GPUs if needed, but ideally I would only need to allocate one.

I see. While it is possible for Taichi to borrow the existing CUDA context created by PyTorch (or any other CUDA-enabled library) in the same process, this assumes that the borrowed context is not destroyed or changed during Taichi's lifetime. I couldn't find any statement in the PyTorch documentation about whether this holds.

Would it be a bit easier if, while you would still need to use two GPUs, Taichi allowed you to specify which GPU device ID to use?

Filed https://discuss.pytorch.org/t/share-the-cuda-context-created-by-pytorch/113359