sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License
463 stars 65 forks source link

ESM pipeline failing? #158

Closed sacdallago closed 3 years ago

sacdallago commented 3 years ago

@HannesStark reports:

Traceback (most recent call last):
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/utilities/pipeline.py", line 284, in execute_pipeline_from_config
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 409, in run
    return embed_and_write_batched(embedder, file_manager, result_kwargs, kwargs.get("half_precision", False))
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/pipeline.py", line 229, in embed_and_write_batched
    for sequence_id, original_id, embedding in zip(
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/tqdm/std.py", line 1178, in iter
    for obj in iterable:
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/embedder_interfaces.py", line 119, in embed_many
    yield from self.embed_batch(batch)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/esm_embedder.py", line 34, in embed_batch
    self._assert_max_len(batch)
  File "/mnt/lsf-nas-1/os-shared/anaconda3/envs/bio_embeddings_unstable/lib/python3.8/site-packages/bio_embeddings/embed/esm_embedder.py", line 56, in _assert_max_len
    max_len = max(len(i) for i in sequences)
ValueError: max() arg is an empty sequence

Basically:

max_len = max(len(i) for i in sequences)

means max is run on an empty sequence. Seems strange to me...

@HannesStark please add config & fasta here

sacdallago commented 3 years ago

quick fix would be to default to 0

max_len = max(len(i) for i in sequences, default=0)

but may be symptom of bigger issue (happens at the beginning of the pipeline -- why would the pipeline try to embed an empty set of sequences?)

CC @konstin

konstin commented 3 years ago

I had already fixed that once but only embed_many and not also in embed_batch:

https://github.com/sacdallago/bio_embeddings/blob/9a1d7574e793b98d720b5d005b1ff155e7fe6c85/bio_embeddings/embed/esm_embedder.py#L69-L71

I guess the empty sets are artifacts of the batching, but we have to support them for the python api anyway so I think they're fine

sacdallago commented 3 years ago

Alright. I suppose the default fix should work as does the if not, no? The former comes with the plus that you don't need to cast the iterable to list

konstin commented 3 years ago

iirc I tried default=0 but it fails because now you iterate twice (once for _assert_max_len and once for embedding), so if you pass an actual iterator instead of a list it's empty on the second iteration and no embeddings will be produced.

sacdallago commented 3 years ago

Aha! Good catch. I implemented it with tee

sacdallago commented 3 years ago

Fixed in a5e6bce76dbf0de964b84abed891b3f4dca4be21

HannesStark commented 3 years ago

Everything works now on my end. A thousand thanks for the super fast fix!