mars-project / mars

Mars is a tensor-based unified framework for large-scale data computation which scales numpy, pandas, scikit-learn and Python functions.
https://mars-project.readthedocs.io
Apache License 2.0

Big tensor will OOM on GPU; how to use multiple GPUs on a single machine? #138

Open sosofun opened 5 years ago

sosofun commented 5 years ago

How can I use multiple GPUs on a single machine?

Environment: Ubuntu 16.04, CUDA 9.0, 8 × GTX 1080 Ti

Error: CUDA out of memory
  File "demo.py", line 9, in <module>
    c.execute()
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/core.py", line 446, in execute
    return session.run(self, **kw)
  File "/usr/local/lib/python3.5/dist-packages/mars/session.py", line 124, in run
    result = self._sess.run(*run_tensors, **kw)
  File "/usr/local/lib/python3.5/dist-packages/mars/session.py", line 47, in run
    return self._executor.execute_tensors(tensors, **kw)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 128, in execute_tensors
    sparse_mock_percent=sparse_mock_percent)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 77, in execute_graph
    prefetch=self._prefetch, retval=True)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 300, in execute_graph
    [f.result() for f in fs.values()]
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 300, in <listcomp>
    [f.result() for f in fs.values()]
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 398, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 159, in execute_chunk
    executor.handle(chunk, chunk_result)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 60, in handle
    return self._op_runners[cls](results, chunk)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/linalg.py", line 111, in _tensordot
    ctx[chunk.key] = xp.tensordot(a, b, axes)
  File "/usr/local/lib/python3.5/dist-packages/cupy/linalg/product.py", line 195, in tensordot
    return core.tensordot_core(a, b, None, n, m, k, ret_shape)
  File "cupy/core/core.pyx", line 4112, in cupy.core.core.tensordot_core
  File "cupy/core/core.pyx", line 4147, in cupy.core.core.tensordot_core
  File "cupy/core/core.pyx", line 150, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 517, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1064, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 899, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 920, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 694, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 134217728 bytes (total 11592040448 bytes)

Code:

import mars.tensor as mt
a = mt.random.rand(20000, 20000, gpu=True)
b = mt.random.rand(20000, 20000, gpu=True)
c = mt.dot(a, b)
c.execute()

Error:
Unhandled exception in promise
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/mars/promise.py", line 141, in _log_unexpected_error
    six.reraise(*args)
  File "/usr/local/lib/python3.5/dist-packages/mars/lib/six.py", line 702, in reraise
    raise value.with_traceback(tb)
  File "/usr/local/lib/python3.5/dist-packages/mars/promise.py", line 86, in _wrapped
    result = func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/mars/utils.py", line 287, in _wrapped
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/mars/worker/calc.py", line 72, in _try_put_chunk
    ref = self._chunk_store.put(session_id, chunk_key, _calc_result_cache[chunk_key][1])
  File "/usr/local/lib/python3.5/dist-packages/mars/worker/chunkstore.py", line 153, in put
    serialized = pyarrow.serialize(value, self._serialize_context)
  File "pyarrow/serialization.pxi", line 337, in pyarrow.lib.serialize
  File "pyarrow/serialization.pxi", line 136, in pyarrow.lib.SerializationContext._serialize_callback
pyarrow.lib.SerializationCallbackError: pyarrow does not know how to serialize objects of type <class 'cupy.core.core.ndarray'>.

Code:

import mars.tensor as mt
from mars.deploy.local import new_cluster
cluster = new_cluster()

a = mt.random.rand(2000, 2000, gpu=True)
b = mt.random.rand(2000, 2000, gpu=True)
c = mt.dot(a, b)
c.execute()
qinxuye commented 5 years ago

Our distributed version does not currently work well on GPU; we are working on it.

Could you please try out thread-based scheduling?

import mars.tensor as mt
from mars.session import new_session

sess = new_session()

a = mt.random.rand(2000, 2000, gpu=True)
b = mt.random.rand(2000, 2000, gpu=True)
c = mt.dot(a, b)
c.execute(session=sess)
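
For the larger 20000 × 20000 case it may also be worth constraining the chunk size, since Mars splits a tensor into chunks and each chunk is what gets allocated on the GPU. A minimal sketch, assuming the chunk_size keyword accepted by Mars tensor creation routines; the value 5000 is illustrative, and this alone may not avoid the OOM because all chunk results still end up in GPU memory:

import mars.tensor as mt
from mars.session import new_session

sess = new_session()

# Smaller chunks mean smaller per-chunk allocations on the GPU.
# chunk_size=5000 is an illustrative value, not a tuned setting.
a = mt.random.rand(20000, 20000, gpu=True, chunk_size=5000)
b = mt.random.rand(20000, 20000, gpu=True, chunk_size=5000)
c = mt.dot(a, b)
c.execute(session=sess)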
sosofun commented 5 years ago

> Our distributed version does not currently work well on GPU; we are working on it.
>
> Could you please try out thread-based scheduling?
>
> import mars.tensor as mt
> from mars.session import new_session
>
> sess = new_session()
>
> a = mt.random.rand(2000, 2000, gpu=True)
> b = mt.random.rand(2000, 2000, gpu=True)
> c = mt.dot(a, b)
> c.execute(session=sess)

Code:

import mars.tensor as mt
from mars.session import new_session
sess = new_session()

a = mt.random.rand(20000, 20000, gpu=True)
b = mt.random.rand(20000, 20000, gpu=True)
c = mt.dot(a, b)
c.execute(session=sess)

Error:
  File "demo.py", line 12, in <module>
    c.execute(session=sess)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/core.py", line 446, in execute
    return session.run(self, **kw)
  File "/usr/local/lib/python3.5/dist-packages/mars/session.py", line 124, in run
    result = self._sess.run(*run_tensors, **kw)
  File "/usr/local/lib/python3.5/dist-packages/mars/session.py", line 47, in run
    return self._executor.execute_tensors(tensors, **kw)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 128, in execute_tensors
    sparse_mock_percent=sparse_mock_percent)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 77, in execute_graph
    prefetch=self._prefetch, retval=True)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 300, in execute_graph
    [f.result() for f in fs.values()]
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 300, in <listcomp>
    [f.result() for f in fs.values()]
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 405, in result
    return self.__get_result()
  File "/usr/lib/python3.5/concurrent/futures/_base.py", line 357, in __get_result
    raise self._exception
  File "/usr/lib/python3.5/concurrent/futures/thread.py", line 55, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 159, in execute_chunk
    executor.handle(chunk, chunk_result)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/core.py", line 60, in handle
    return self._op_runners[cls](results, chunk)
  File "/usr/local/lib/python3.5/dist-packages/mars/tensor/execution/linalg.py", line 111, in _tensordot
    ctx[chunk.key] = xp.tensordot(a, b, axes)
  File "/usr/local/lib/python3.5/dist-packages/cupy/linalg/product.py", line 195, in tensordot
    return core.tensordot_core(a, b, None, n, m, k, ret_shape)
  File "cupy/core/core.pyx", line 4112, in cupy.core.core.tensordot_core
  File "cupy/core/core.pyx", line 4147, in cupy.core.core.tensordot_core
  File "cupy/core/core.pyx", line 150, in cupy.core.core.ndarray.__init__
  File "cupy/cuda/memory.pyx", line 517, in cupy.cuda.memory.alloc
  File "cupy/cuda/memory.pyx", line 1064, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 1085, in cupy.cuda.memory.MemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 899, in cupy.cuda.memory.SingleDeviceMemoryPool.malloc
  File "cupy/cuda/memory.pyx", line 920, in cupy.cuda.memory.SingleDeviceMemoryPool._malloc
  File "cupy/cuda/memory.pyx", line 694, in cupy.cuda.memory._try_malloc
cupy.cuda.memory.OutOfMemoryError: out of memory to allocate 134217728 bytes (total 11593883648 bytes)
qinxuye commented 5 years ago

OK, this computation may not fit within the memory of a single GPU card (a rough estimate is sketched below). I think there are two main options:

  1. Try to take advantage of the spill mechanism, which we do not mention much in our documentation; the downside is that speed may be heavily affected.
  2. Use multiple cards; we have not done much testing on this yet, so nothing can be guaranteed.

I think this issue is a good starting point; we can begin more work on multiple cards, as well as on the distributed runtime with GPU.
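
The back-of-the-envelope estimate behind that statement (the total device memory figure is taken from the CuPy error above):

# Rough memory estimate for c = mt.dot(a, b) with 20000 x 20000 float64 tensors
bytes_per_tensor = 20000 * 20000 * 8         # float64: ~3.2 GB per tensor
inputs_and_output = 3 * bytes_per_tensor     # a, b and c alone: ~9.6 GB
gtx_1080ti_total = 11592040448               # "total ... bytes" reported in the traceback
print(inputs_and_output / gtx_1080ti_total)  # ~0.83: little headroom left for tensordot
                                             # intermediates and CuPy's memory pool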

sosofun commented 5 years ago

> OK, this computation may not fit within the memory of a single GPU card. I think there are two main options:
>
>   1. Try to take advantage of the spill mechanism, which we do not mention much in our documentation; the downside is that speed may be heavily affected.
>   2. Use multiple cards; we have not done much testing on this yet, so nothing can be guaranteed.
>
> I think this issue is a good starting point; we can begin more work on multiple cards, as well as on the distributed runtime with GPU.

I hope to be able to do single-node computation with multiple cards; it is the foundation of large-scale distributed computing!

By the way, can you give an example of the "spill mechanism"?

qinxuye commented 5 years ago

> By the way, can you give an example of the "spill mechanism"?

Some background: this version of Mars is actually the refactored one, and during the refactoring some GPU capabilities did not catch up, because the main workloads we faced before were too large for GPU, so we focused more on the CPU side.

Hence, the GPU spill mechanism is missing from the refactored version; apologies for that.

We will start full GPU support soon; please keep in touch, and you are very welcome to join us in the coming development.