CUDA Error on 3090 - Githubissues

sewageseaweed commented 3 years ago

Hey guys,

Completely new here. I went through the installation process, ran the sampling command, and got the error below. I also changed the command to

python jukebox/sample.py --model=1b_lyrics --name=sample_1b --levels=2 --sample_length_in_seconds=20 --total_sample_length_in_seconds=180 --sr=44100 --n_samples=2 --hop_fraction=0.5,0.5,0.125

and still got the same error:

Traceback (most recent call last):
  File "jukebox/sample.py", line 279, in <module>
    fire.Fire(run)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/sample.py", line 276, in run
    save_samples(model, device, hps, sample_hps)
  File "jukebox/sample.py", line 244, in save_samples
    ancestral_sample(labels, sampling_kwargs, priors, hps)
  File "jukebox/sample.py", line 127, in ancestral_sample
    zs = _sample(zs, labels, sampling_kwargs, priors, sample_levels, hps)
  File "jukebox/sample.py", line 102, in _sample
    zs = sample_level(zs, labels[level], sampling_kwargs[level], level, prior, total_length, hop_length, hps)
  File "jukebox/sample.py", line 85, in sample_level
    zs = sample_single_window(zs, labels, sampling_kwargs, level, prior, start, hps)
  File "jukebox/sample.py", line 69, in sample_single_window
    z_samples_i = prior.sample(n_samples=z_i.shape[0], z=z_i, z_conds=z_conds_i, y=y_i, **sampling_kwargs)
  File "/home/clyde/Projects/jukebox/jukebox/prior/prior.py", line 271, in sample
    top_k=top_k, top_p=top_p, chunk_size=chunk_size, sample_tokens=sample_tokens)
  File "/home/clyde/Projects/jukebox/jukebox/prior/autoregressive.py", line 309, in primed_sample
    x_prime = self.transformer(x_prime, encoder_kv=encoder_kv, sample=True, fp16=fp16)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/clyde/Projects/jukebox/jukebox/transformer/transformer.py", line 187, in forward
    x = l(x, encoder_kv=None, sample=sample)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/clyde/Projects/jukebox/jukebox/transformer/transformer.py", line 64, in forward
    a = self.attn(self.ln_0(x), encoder_kv, sample)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/clyde/Projects/jukebox/jukebox/transformer/factored_attention.py", line 291, in forward
    x = self.c_attn(x)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/clyde/Projects/jukebox/jukebox/transformer/ops.py", line 99, in forward
    x = t.addmm(self.b.type_as(x), x.view(-1, x.size(-1)), self.w.type_as(x)) # If x if float then float else half
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)

My system's on CUDA 11.4. I wasn't sure whether it was running out of memory (OOM) or hitting a conflicting CUDA version.
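One way to tell the two suspicions apart is to look at which CUDA toolkit the installed PyTorch build targets, since conda builds of PyTorch bundle their own toolkit and the system-wide CUDA version matters less. The sketch below compares that version with the GPU's compute capability. The lookup table and the helper names (`MIN_CUDA_FOR_CAPABILITY`, `build_supports_gpu`, `describe_environment`) are made up for illustration and are not part of jukebox or PyTorch; only `torch.version.cuda`, `torch.cuda.is_available`, `torch.cuda.get_device_capability`, and `torch.cuda.get_device_name` are real PyTorch calls.

```python
# Minimum CUDA toolkit release that can target each compute capability.
# Illustrative table covering a few common architectures only.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),    # Volta
    (7, 5): (10, 0),   # Turing
    (8, 0): (11, 0),   # Ampere (A100)
    (8, 6): (11, 1),   # Ampere (RTX 30 series, including the 3090)
}


def parse_cuda_version(version):
    """Turn a string such as '11.1' into the tuple (11, 1)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def build_supports_gpu(build_cuda, capability):
    """Return True/False, or None when the capability is not in the table."""
    required = MIN_CUDA_FOR_CAPABILITY.get(tuple(capability))
    if required is None:
        return None
    return parse_cuda_version(build_cuda) >= required


def describe_environment():
    """Report what the installed PyTorch build says about itself."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed in this environment."
    if torch.version.cuda is None or not torch.cuda.is_available():
        return "This PyTorch build cannot see a CUDA device."
    capability = torch.cuda.get_device_capability(0)
    verdict = build_supports_gpu(torch.version.cuda, capability)
    return "{}: capability {}, PyTorch built for CUDA {}, supported: {}".format(
        torch.cuda.get_device_name(0), capability, torch.version.cuda, verdict)


if __name__ == "__main__":
    print(describe_environment())
```

If the report says the build targets a toolkit older than 11.1 on a 3090, a version mismatch is a plausible cause of the cuBLAS failure. If the build is recent enough, running out of memory becomes the more likely suspect.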

Thank you!

sewageseaweed commented 3 years ago

Reducing the chunk size and batch size and installing PyTorch with the following command:

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia

seems to have fixed it. It now seems to sample, but I run into this new error:

Traceback (most recent call last):
  File "jukebox/sample.py", line 279, in <module>
    fire.Fire(run)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 127, in Fire
    component_trace = _Fire(component, args, context, name)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 366, in _Fire
    component, remaining_args)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/fire/core.py", line 542, in _CallCallable
    result = fn(*varargs, **kwargs)
  File "jukebox/sample.py", line 276, in run
    save_samples(model, device, hps, sample_hps)
  File "jukebox/sample.py", line 244, in save_samples
    ancestral_sample(labels, sampling_kwargs, priors, hps)
  File "jukebox/sample.py", line 127, in ancestral_sample
    zs = _sample(zs, labels, sampling_kwargs, priors, sample_levels, hps)
  File "jukebox/sample.py", line 120, in _sample
    save_html(logdir, x, zs, labels[-1], alignments, hps)
  File "/home/clyde/Projects/jukebox/jukebox/save_html.py", line 24, in save_html
    _save_item_html(item_dir, item, item, data)
  File "/home/clyde/Projects/jukebox/jukebox/save_html.py", line 50, in _save_item_html
    max_attn_at_token = np.max(alignment, axis=0)
  File "<__array_function__ internals>", line 6, in amax
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 2734, in amax
    keepdims=keepdims, initial=initial, where=where)
  File "/home/clyde/anaconda3/envs/jukebox/lib/python3.7/site-packages/numpy/core/fromnumeric.py", line 87, in _wrapreduction
    return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
ValueError: zero-size array to reduction operation maximum which has no identity
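For context, this `ValueError` is what NumPy raises when `np.max` reduces along an axis of length zero, which suggests the `alignment` array reaching `save_html.py` was empty along axis 0. The snippet below reproduces the error on a synthetic array and shows one possible guard. The helper `max_attn_per_token` is hypothetical and is not the jukebox code; why the alignment comes out empty in this run is not established here.

```python
import numpy as np


def max_attn_per_token(alignment):
    """Column-wise maximum, or None when there are no rows to reduce.

    Hypothetical guard for illustration only.
    """
    alignment = np.asarray(alignment)
    if alignment.shape[0] == 0:
        return None
    return np.max(alignment, axis=0)


# Reproduce the failure: axis 0 has length zero, so there is nothing to reduce.
empty = np.zeros((0, 4))
try:
    np.max(empty, axis=0)
except ValueError as err:
    print("NumPy raised:", err)

# The guarded helper returns None instead of raising.
print(max_attn_per_token(empty))

# With at least one row the reduction works as usual.
print(max_attn_per_token([[0.1, 0.9], [0.5, 0.2]]))
```

Returning `None` is just one choice. Passing `initial=0.0` to `np.max` would instead yield an array of zeros, which may or may not be what the calling code expects.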

sewageseaweed commented 3 years ago

Looks like I'm no longer running into the zero-size array error with --levels=3. Currently sampling. Hopefully it finishes without any hitches.

allynee commented 1 year ago

you saved our lives 😭 thank you!!

openai / jukebox

CUDA Error on 3090 #241