marmotlab / PRIMAL2

Training code PRIMAL2 - Public Repo
MIT License

Scaling down PRIMAL2 for testing #12

Open JacksonArthurClark opened 1 year ago

JacksonArthurClark commented 1 year ago

I'm currently running a system with Ubuntu 22, an i7 8700, 16 GB of RAM, and a GTX 1080 with 8 GB of VRAM. I keep running into this error:

```
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1024,2048] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node gradients_1/global/qvalues/rnn/while/basic_lstm_cell/MatMul_grad/MatMul_1}} = MatMul[T=DT_FLOAT, transpose_a=true, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:CPU:0"](gradients_1/global/qvalues/rnn/while/basic_lstm_cell/MatMul_grad/MatMul_1/StackPopV2, gradients_1/global/qvalues/rnn/while/basic_lstm_cell/split_grad/concat)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
```

I've tried requesting fewer resources in `ray.init()` and turning down `NUM_META_AGENTS`, but it still seems to be too much for my system.
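For reference, here is roughly what I mean by scaling down. This is only a sketch: the helper function, the specific values, and the keyword arguments passed to `ray.init()` are my own assumptions for a small local run, not PRIMAL2 defaults.

```python
# Hypothetical sketch: collect reduced resource settings for a small local
# test run, then pass the Ray portion to ray.init(). The names and values
# here are assumptions, not the repo's actual defaults.

def scaled_down_config(num_meta_agents=1, cpus=2, store_gb=2):
    """Return reduced resource settings for a single-workstation test."""
    return {
        "NUM_META_AGENTS": num_meta_agents,  # fewer parallel meta-agents
        "ray": {
            "num_cpus": cpus,  # cap the CPUs Ray schedules workers on
            "object_store_memory": store_gb * 1024**3,  # plasma store, in bytes
        },
    }

cfg = scaled_down_config()
# ray.init(**cfg["ray"])  # then start Ray with the reduced limits
```

Even with settings along these lines, the OOM above still occurs, so I suspect something beyond the Ray-level knobs (e.g. the LSTM gradient buffers themselves) is what exhausts memory.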

I will eventually deploy on a much larger system, but since compute time is expensive I'd like to have it running at a small scale to test my changes before deploying.