Open sosofun opened 5 years ago
Our distributed version currently does not work well on GPU; we are working on it.
Could you please try out the thread-based scheduling?
import mars.tensor as mt
from mars.session import new_session
sess = new_session()
a = mt.random.rand(2000, 2000, gpu=True)
b = mt.random.rand(2000, 2000, gpu=True)
c = mt.dot(a, b)
c.execute(session=sess)
code:
import mars.tensor as mt
from mars.session import new_session
sess = new_session()
a = mt.random.rand(20000, 20000, gpu=True)
b = mt.random.rand(20000, 20000, gpu=True)
c = mt.dot(a,b)
c.execute(session=sess)
Error:
File "demo.py", line 12, in <module>
c.execute(session=sess)
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/core.py", line 446, in execute
return session.run(self, **kw)
File "/usr/local/lib/python3.5/dist-packages/mars/session.py", line 124, in run
result = self._sess.run(*run_tensors, **kw)
File "/usr/local/lib/python3.5/dist-packages/mars/session.py", line 47, in run
return self._executor.execute_tensors(tensors, **kw)
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 128, in execute_tensors
sparse_mock_percent=sparse_mock_percent)
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 77, in execute_graph
prefetch=self._prefetch, retval=True)
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 300, in execute_graph
[f.result() for f in fs.values()]
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 300, in <listcomp>
[f.result() for f in fs.values()]
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 405, in result
return self.__get_result()
File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
raise self._exception
File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
result = self.fn(*self.args, **self.kwargs)
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 159, in execute_chunk
executor.handle(chunk, chunk_result)
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 60, in handle
return self._op_runners[cls](results, chunk)
File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/linalg.py", line 111, in _tensordot
ctx[chunk.key] = xp.tensordot(a, b, axes)
File "/usr/local/lib/python3.5/dist-packages/cupy/linalg/product.py", line 195, in tensordot
return core.tensordot_core(a, b, None, n, m, k, ret_shape)
File "cupy/core/core.pyx", line 4112, in cupy.core.core.tensordot_core
File "cupy/core/core.pyx", line 4147, in cupy.core.core.tensordot_core
File "cupy/core/core.pyx", line 150, in cupy.core.core.ndarray.__init__
File "cupy/cuda/memory.pyx", line 517, in cupy.cuda.memory.alloc
File "cupy/cuda/memory.pyx", line 1064, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
File "cupy/cuda/memory.pyx", line 899, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
File "cupy/cuda/memory.pyx", line 920, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
File "cupy/cuda/memory.pyx", line 694, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 134217728 bytes (total 11593883648 bytes)
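The failure is consistent with a simple back-of-envelope estimate: each 20000 × 20000 float64 matrix (float64 being the default dtype of `random.rand`) occupies 3.2 GB, so `a`, `b`, and the dot result alone need about 9.6 GB, leaving little of the card's roughly 11.6 GB for tensordot intermediates:

```python
# Back-of-envelope memory check for the failing example
# (assumes float64, the default dtype of random.rand).
n = 20000
bytes_per_elem = 8
per_matrix = n * n * bytes_per_elem
total_for_a_b_c = 3 * per_matrix  # a, b, and the dot result

print(per_matrix)       # 3200000000 bytes, ~3.2 GB per matrix
print(total_for_a_b_c)  # 9600000000 bytes, close to the ~11.6 GB card limit
```

The 134217728-byte (128 MiB) allocation that fails is just one intermediate chunk; the pool was already near the device limit when it was requested.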
OK, this computation may not fit within the limited resources of a single GPU card. I think there are two main solutions:
- Try to take advantage of the spill mechanism, which we do not mention much in our documents; the downside is that speed may be affected heavily.
- Use multiple cards; we currently have not tested this much, so nothing can be guaranteed.
I think this issue is a good starting point; we can do more work on multiple cards, as well as on the distributed runtime with GPU.
I expect to be able to achieve single-node computing with multiple cards; it is the foundation of large-scale distributed computing!
By the way, can you give an example of the "spill mechanism"?
Some background: this version of Mars is actually the refactored one, and during the refactoring some of the GPU capabilities did not catch up, because the main workloads we faced before were too large for GPU, so we focused more on the CPU side.
Hence, the GPU spill mechanism is absent after the refactoring; apologies for that.
We will start full GPU support soon. Please keep in touch, and you are really welcome to join us in the coming development.
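For intuition only (this is not Mars's actual spill implementation, which is absent in this version), spilling amounts to keeping the full operands in larger, slower memory and streaming small blocks through the limited fast memory. A minimal NumPy sketch of such a blocked matmul:

```python
import numpy as np

def blocked_dot(a, b, block=512):
    # Compute a @ b one block at a time, so that at any moment only a
    # few small blocks need to be resident in "fast" (e.g. GPU) memory;
    # everything else stays "spilled" in the larger backing store.
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=np.result_type(a, b))
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                out[i:i+block, j:j+block] += (
                    a[i:i+block, p:p+block] @ b[p:p+block, j:j+block]
                )
    return out

a = np.random.rand(1000, 800)
b = np.random.rand(800, 600)
assert np.allclose(blocked_dot(a, b), a @ b)
```

The peak fast-memory footprint is a few `block × block` tiles instead of the full matrices, at the cost of extra transfers between the two memory tiers.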
How can I use multiple GPUs on a single machine?
env: Ubuntu 16.04, CUDA 9.0, 8 × 1080 Ti
code: