sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License
463 stars 65 forks

Protocol prottrans_bert_bfd: BadZipFile: File is not a zip file #189

Closed pskvins closed 2 years ago

pskvins commented 2 years ago

Hi, I am trying to get protein embeddings, but I'm encountering the following error:

## Metadata
|key|value|
|--|--|
|**version**|0.2.2|
|**cuda**|False|

## Parameter
|key|value|
|--|--|
|**type**|embed|
|**protocol**|prottrans_bert_bfd|
|**reduce**|True|

## Traceback
Traceback (most recent call last):
  File "/home/sukhwan/.local/lib/python3.7/site-packages/bio_embeddings/utilities/pipeline.py", line 284, in execute_pipeline_from_config
    stage_output_parameters = stage_runnable(**stage_parameters)
  File "/home/sukhwan/.local/lib/python3.7/site-packages/bio_embeddings/embed/pipeline.py", line 400, in run
    embedder: EmbedderInterface = embedder_class(**result_kwargs)
  File "/home/sukhwan/.local/lib/python3.7/site-packages/bio_embeddings/embed/prottrans_bert_bfd_embedder.py", line 30, in __init__
    super().__init__(**kwargs)
  File "/home/sukhwan/.local/lib/python3.7/site-packages/bio_embeddings/embed/embedder_interfaces.py", line 60, in __init__
    model=self.name, directory=directory
  File "/home/sukhwan/.local/lib/python3.7/site-packages/bio_embeddings/utilities/remote_file_retriever.py", line 93, in get_model_directories_from_zip
    with zipfile.ZipFile(file_name, "r") as zip_ref:
  File "/home/sukhwan/miniconda3/lib/python3.7/zipfile.py", line 1258, in __init__
    self._RealGetContents()
  File "/home/sukhwan/miniconda3/lib/python3.7/zipfile.py", line 1325, in _RealGetContents
    raise BadZipFile("File is not a zip file")
zipfile.BadZipFile: File is not a zip file

## More info

I installed bio_embeddings with `pip install bio-embeddings[all]` and ran `bio_embeddings --overwrite embed.yml`. The content of `embed.yml` is as follows:

    global:
      sequences_file: /home/sukhwan/cluster_idr/sample.fasta
      prefix: sample_result
      simple_remapping: True

    prottrans_t5_bfd_embeddings:
      type: embed
      protocol: prottrans_bert_bfd
      reduce: True

I tried again after removing the bio_embeddings cache, but got the same error message. Do you know what could be causing this problem?

sacdallago commented 2 years ago

Hey, I'm unsure 🤔 Maybe a corrupted zip download? You can download the ZIP manually from here: http://data.bioembeddings.com/public/embeddings/embedding_models/bert/
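To check the "corrupted download" theory before downloading again, the file can be inspected with Python's standard `zipfile` module. This is only a minimal sketch: `describe_download` is a hypothetical helper name, and the path to the file has to be supplied by you, since the cache location is not given in this thread.

```python
import zipfile
from pathlib import Path


def describe_download(path):
    """Return a short diagnosis of a file that is expected to be a ZIP archive.

    Hypothetical helper for illustration; not part of bio_embeddings.
    """
    path = Path(path)
    if not path.is_file():
        return "missing"
    if path.stat().st_size == 0:
        return "empty"
    if not zipfile.is_zipfile(path):
        # Typical for an HTML error page saved under a .zip name, or for a
        # download that was cut off before the end of the archive was written.
        return "not a zip"
    with zipfile.ZipFile(path) as archive:
        # testzip() returns the name of the first member with a bad CRC, or None.
        bad_member = archive.testzip()
    return "ok" if bad_member is None else f"corrupt member: {bad_member}"
```

A result of `"not a zip"` would match the `BadZipFile: File is not a zip file` error in the traceback, which suggests fetching the file again rather than changing the pipeline config.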

Then unzip it using whatever system software you have.
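If you would rather stay in Python than use a system tool, the standard library can do the unzipping as well. A minimal sketch, assuming the archive has already been downloaded by hand; `extract_model` and both of its arguments are hypothetical names chosen for illustration.

```python
import zipfile
from pathlib import Path


def extract_model(zip_path, target_dir):
    """Extract a manually downloaded model archive and return the target directory.

    Hypothetical helper for illustration; not part of bio_embeddings.
    """
    zip_path, target_dir = Path(zip_path), Path(target_dir)
    if not zipfile.is_zipfile(zip_path):
        # Fail early with a readable message instead of a BadZipFile traceback.
        raise ValueError(f"{zip_path} is not a valid ZIP archive; download it again")
    target_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(target_dir)
    # An absolute path is convenient to paste into the pipeline config.
    return target_dir.resolve()
```

The returned absolute path is the folder you would then reference from the config.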

Last, in the config you just need to add a `model_directory` parameter to the `prottrans_t5_bfd_embeddings` stage, pointing at the unzipped folder: https://github.com/sacdallago/bio_embeddings/blob/develop/examples/parameters_blueprint.yml#L95
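As a rough sketch of what the adjusted `embed.yml` could look like, the snippet below writes the config from this issue with one extra `model_directory` line. The stage and parameter names come from this thread; `write_config` is a hypothetical helper, and the model directory you pass in is whatever folder you unzipped the model into.

```python
from pathlib import Path

# Config from this issue plus the suggested model_directory parameter.
CONFIG_TEMPLATE = """\
global:
  sequences_file: {sequences_file}
  prefix: {prefix}
  simple_remapping: True

prottrans_t5_bfd_embeddings:
  type: embed
  protocol: prottrans_bert_bfd
  reduce: True
  model_directory: {model_directory}
"""


def write_config(config_path, sequences_file, prefix, model_directory):
    """Write the pipeline config to config_path and return the written text.

    Hypothetical helper for illustration; not part of bio_embeddings.
    """
    text = CONFIG_TEMPLATE.format(
        sequences_file=sequences_file,
        prefix=prefix,
        model_directory=model_directory,
    )
    Path(config_path).write_text(text)
    return text
```

After writing the file, the pipeline would be run the same way as before, with `bio_embeddings --overwrite embed.yml`.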

Let me know if this works

sacdallago commented 2 years ago

P.S.: why ProtBert? It's not the best performing model! ProtT5 is: https://github.com/agemagician/ProtTrans/blob/master/README.md#-comparison-to-other-protein-language-models-plms