shyamsn97 / mario-gpt

[NeurIPS 2023] Generating Mario Levels with GPT2. Code for the paper "MarioGPT: Open-Ended Text2Level Generation through Large Language Models" https://arxiv.org/abs/2302.05981
https://huggingface.co/shyamsn97/Mario-GPT2-700-context-length
MIT License

RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED #12

Closed · TheFiZi closed 1 year ago

TheFiZi commented 1 year ago

This happens randomly when generating a level.

Using the prompts: no blocks, no pipes, many goombas, fireball

shape: torch.Size([1, 678]), torch.Size([1, 1393]) first: 56, last: 51:  99%|██████████████████████████████████████████████████████████████████▌| 1392/1400 [02:58<00:01,  7.82it/s]
Traceback (most recent call last):
  File "/home/me/apps/mariogpt/capturePlay.py", line 38, in <module>
    generated_level = mario_lm.sample(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/lm/gpt.py", line 54, in sample
    return sampler(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/sampler.py", line 248, in __call__
    return self.sample(*args, **kwargs)
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/sampler.py", line 223, in sample
    next_tokens, encoder_hidden_states = self.step(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/mario_gpt/sampler.py", line 158, in step
    out = self.mario_lm.lm(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 1043, in forward
    transformer_outputs = self.transformer(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 887, in forward
    outputs = block(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 388, in forward
    attn_outputs = self.attn(
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 329, in forward
    attn_output, attn_weights = self._attn(query, key, value, attention_mask, head_mask)
  File "/home/me/apps/mariogpt/lib/python3.10/site-packages/transformers/models/gpt2/modeling_gpt2.py", line 216, in _attn
    attn_output = torch.matmul(attn_weights, value)
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemmStridedBatched( handle, opa, opb, m, n, k, &alpha, a, lda, stridea, b, ldb, strideb, &beta, c, ldc, stridec, num_batches)`

I'm using:

generated_level = mario_lm.sample(
    prompts=prompts,
    num_steps=1400,
    #num_steps=100,
    temperature=2.0,
    use_tqdm=True
)
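
For context, the rest of the script follows the standard README-style setup. A minimal sketch of it, assuming the 0.1.x import path and with the device handling written out (illustrative, not a copy of my capturePlay.py):

import torch

from mario_gpt.lm import MarioLM

# load the pretrained MarioGPT model (shyamsn97/Mario-GPT2-700-context-length)
mario_lm = MarioLM()

# move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
mario_lm = mario_lm.to(device)

prompts = ["no blocks, no pipes, many goombas, fireball"]

# generate a 1400-token level; temperature 2.0 gives more varied layouts
generated_level = mario_lm.sample(
    prompts=prompts,
    num_steps=1400,
    temperature=2.0,
    use_tqdm=True,
)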

It feels like this was happening less frequently in 0.1.2. I just upgraded to 0.1.3.

TheFiZi commented 1 year ago

It just dawned on me: I wonder if these are happening because the generation process sometimes exceeds the amount of memory available on my GPU. I'm just using a dinky Quadro P620 right now, which only has 2GB of VRAM.
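
A quick way to sanity-check that theory is to print how much memory the card actually has free before sampling. A sketch, assuming a torch build recent enough to have torch.cuda.mem_get_info:

import torch

if torch.cuda.is_available():
    # free and total memory (in bytes) on the current CUDA device
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(torch.cuda.get_device_name(0))
    print(f"free {free_bytes / 1024**3:.2f} GiB / total {total_bytes / 1024**3:.2f} GiB")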

shyamsn97 commented 1 year ago

Not sure, actually. What torch version are you using? Maybe an upgrade is needed.
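
Something like this prints the relevant bits (the torch build, the CUDA toolkit it was compiled against, and the GPU it sees):

import torch

print(torch.__version__)              # installed torch version
print(torch.version.cuda)             # CUDA version torch was compiled against
print(torch.cuda.is_available())      # whether torch can see a GPU at all
print(torch.cuda.get_device_name(0))  # which GPU it is using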

TheFiZi commented 1 year ago

Not sure, actually. What torch version are you using? Maybe an upgrade is needed.

Same response as https://github.com/shyamsn97/mario-gpt/issues/13#issuecomment-1440646703 :)

shyamsn97 commented 1 year ago

Yeah, I find it strange that it's happening at a random point in the generation. Seems like it could be some weird CUDA issue lol

TheFiZi commented 1 year ago

I am going to close this off as an out-of-memory issue. I ran the default generation example and it peaked at ~6GB of VRAM.

The Quadro I was running it on only has 2GB.
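
For anyone who lands here later: the peak usage is easy to confirm by wrapping the sample call with torch's memory stats (a sketch, with mario_lm and prompts set up as above):

import torch

torch.cuda.reset_peak_memory_stats()

generated_level = mario_lm.sample(
    prompts=prompts,
    num_steps=1400,
    temperature=2.0,
    use_tqdm=True,
)

# highest amount of memory torch allocated on the GPU during sampling
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak allocated: {peak_gib:.2f} GiB")

If that number is above what the card has, dropping num_steps or keeping the model on CPU should at least avoid the CUBLAS error, just much more slowly.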