sacdallago / bio_embeddings


Protocol prottrans_t5_xl_u50: RuntimeError: "baddbmm__mkl" not implemented for 'Half' #129

sacdallago closed this issue 3 years ago

sacdallago commented 3 years ago

Metadata

| key     | value |
|---------|-------|
| version | 0.1.7 |
| cuda    | True  |

Parameter

| key                                | value                                         |
|------------------------------------|-----------------------------------------------|
| type                               | embed                                         |
| protocol                           | prottrans_t5_xl_u50                           |
| model_directory                    | /mnt/project/bio_embeddings/models/lms/t5_u50 |
| half_precision_model               | True                                          |
| half_precision                     | True                                          |
| reduce                             | True                                          |
| discard_per_amino_acid_embeddings  | True                                          |
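
For orientation, here is a minimal sketch of how these parameters map onto a config passed to `execute_pipeline_from_config` (the function in the traceback below). The stage name and the `global` values are illustrative assumptions, not values from this run:

```python
# Sketch only: the stage name and the "global" values are assumptions;
# the embed-stage keys mirror the parameter table above.
from bio_embeddings.utilities.pipeline import execute_pipeline_from_config

config = {
    "global": {
        "sequences_file": "sequences.fasta",   # assumed input FASTA
        "prefix": "t5_run",                    # assumed output directory prefix
    },
    "t5_embeddings": {                         # assumed stage name
        "type": "embed",
        "protocol": "prottrans_t5_xl_u50",
        "model_directory": "/mnt/project/bio_embeddings/models/lms/t5_u50",
        "half_precision_model": True,   # load the fp16 T5 checkpoint
        "half_precision": True,         # store embeddings as fp16
        "reduce": True,                 # also produce per-protein embeddings
        "discard_per_amino_acid_embeddings": True,
    },
}

execute_pipeline_from_config(config)
```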

Traceback

Traceback (most recent call last):
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interfaces.py", line 172, in embed_batch
    yield from self._embed_batch_impl(batch, self._model)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/prottrans_t5_embedder.py", line 107, in _embed_batch_impl
    embeddings = model(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1728, in forward
    encoder_outputs = self.encoder(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 948, in forward
    layer_outputs = layer_module(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 631, in forward
    self_attention_outputs = self.layer[0](
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 538, in forward
    attention_output = self.SelfAttention(
RuntimeError: CUDA out of memory. Tried to allocate 8.71 GiB (GPU 0; 47.46 GiB total capacity; 21.82 GiB already allocated; 2.12 GiB free; 22.17 GiB reserved in total by PyTorch)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 280, in execute_pipeline_from_config
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 404, in run
    return embed_and_write_batched(embedder, file_manager, result_kwargs, kwargs.get("half_precision", False))
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 231, in embed_and_write_batched
    for sequence_id, original_id, embedding in zip(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/tqdm/std.py", line 1178, in __iter__
    for obj in iterable:
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interfaces.py", line 122, in embed_many
    yield self.embed(seq)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/prottrans_t5_embedder.py", line 143, in embed
    [embedding] = self.embed_batch([sequence])
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interfaces.py", line 180, in embed_batch
    yield from self._embed_batch_impl(batch, self._get_fallback_model())
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/prottrans_t5_embedder.py", line 107, in _embed_batch_impl
    embeddings = model(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1728, in forward
    encoder_outputs = self.encoder(
RuntimeError: "baddbmm__mkl" not implemented for 'Half'
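
For context: the first error is an ordinary CUDA OOM on a long sequence. `embed_batch` then retries the batch on the CPU fallback model (`_get_fallback_model()` above), but that fallback is still in fp16, and this torch build has no Half kernel for the batched matmul-with-add (`baddbmm`) the T5 attention ends up in on the CPU. A minimal sketch of just that torch limitation, independent of bio_embeddings (behaviour assumed for the torch version in this environment; newer torch releases may differ):

```python
import torch

# fp16 tensors on the CPU, shaped for a batched matmul with an additive term,
# i.e. the kind of op that dispatches to baddbmm.
bias = torch.zeros(2, 3, 5, dtype=torch.half)
a = torch.randn(2, 3, 4, dtype=torch.half)
b = torch.randn(2, 4, 5, dtype=torch.half)

torch.baddbmm(bias.float(), a.float(), b.float())  # fp32 on the CPU works
torch.baddbmm(bias, a, b)  # raises: "baddbmm__mkl" not implemented for 'Half'
```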

More info

@konstin you can find the run at /mnt/project/bio_embeddings/runs/carles

sacdallago commented 3 years ago

I think this just goes hand in hand with #126 and #127

konstin commented 3 years ago

Reported upstream: https://github.com/huggingface/transformers/issues/11546

konstin commented 3 years ago

T5 fp16 is now blocked from running on the CPU; unless torch/transformers happen to fix this upstream, there is not much we can do.
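
A possible mitigation, sketched below rather than taken from the library: build the CPU fallback from an fp32 copy of the model at load time, so only the GPU path runs in half precision. The helper name is hypothetical:

```python
import copy
import torch

def build_cpu_fallback(fp16_model: torch.nn.Module) -> torch.nn.Module:
    """Hypothetical helper: keep the fp16 model on the GPU, but make the
    CPU fallback an fp32 copy, since the CPU lacks Half kernels here.
    The deep copy briefly needs a second set of weights, so do this at
    load time rather than after the CUDA OOM has already happened."""
    fallback = copy.deepcopy(fp16_model).cpu().float()
    fallback.eval()
    return fallback
```

Until something along those lines is in place, setting `half_precision_model: False` should sidestep the fp16 CPU fallback entirely, at the cost of roughly twice the GPU memory for the model.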