ievapudz / TemStaPro

TemStaPro - a program for protein thermostability prediction using sequence representations from a protein language model.
MIT License

KeyError: 'mean_representations' #2

Closed ievapudz closed 9 months ago

ievapudz commented 1 year ago

@xing-he529 I have moved the comment to another issue with a more relevant title.

Hello,

The program runs successfully on my test dataset (~300 sequences), but it fails with an error on a larger dataset (~2w sequences). How can I solve this problem?

The error I got:

2023-05-12 14:45:12.299011: beginning to load the model 
2023-05-12 14:45:43.333932: finished loading the model
Traceback (most recent call last):
  File "./temstapro", line 183, in <module>
    input_size=PARAMETERS["INPUT_SIZE"])
  File "/home/xinghe/app/TemStaPro/data_process.py", line 52, in collect_mean_embeddings
    sha256(sequences[seq_id].encode('utf-8')).hexdigest()))["mean_representations"]
KeyError: 'mean_representations'

btw, it's a nice program! Congratulations!

Thanks, X

Originally posted by @xing-he529 in https://github.com/ievapudz/TemStaPro/issues/1#issuecomment-1545258920
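
The traceback shows that the failing line indexes a loaded embedding record with "mean_representations" and uses a SHA-256 digest of the sequence when locating that record. A minimal sketch of how one might find which sequences lack the key is given below. The helper names `sequence_digest` and `find_missing_key`, and the example records, are hypothetical and not part of TemStaPro; the exact file-name layout of the cached embeddings is not shown in the traceback, so only the digest itself is computed here.

```python
from hashlib import sha256


def sequence_digest(sequence: str) -> str:
    """Return the SHA-256 hex digest of a sequence, as computed in the traceback."""
    return sha256(sequence.encode("utf-8")).hexdigest()


def find_missing_key(embeddings: dict, key: str = "mean_representations") -> list:
    """List the IDs whose loaded embedding record lacks `key`."""
    return [seq_id for seq_id, record in embeddings.items() if key not in record]


# Hypothetical records standing in for already-loaded embedding files.
loaded = {
    "seq_ok": {"mean_representations": [0.1, 0.2]},
    "seq_bad": {"representations": [[0.1, 0.2]]},
}

print(find_missing_key(loaded))  # ['seq_bad']
print(sequence_digest("MKT"))    # 64-character hex string
```

Checking for the key before indexing turns the KeyError into a list of affected sequence IDs, which makes it easier to see whether the failures share a property such as sequence length.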

ievapudz commented 1 year ago

@xing-he529, to understand why this error occurred, I would need:

  1. the command that you used to execute the program;
  2. how many sequences were in your input (could you specify what "w" stands for in "~2w sequences"?);
  3. whether the directory for embeddings was used before;
  4. how much RAM your system has;
  5. whether you received some other error before.

xing-he529 commented 1 year ago

After testing, I found that the key factor is protein length. When I tested with a long protein (10881 residues), I got this error:

[xinghe@T64021 TemStaPro]$ ./temstapro -f ./input/test.fa -d ./ProtTrans/ -e tests/outputs/ --mean-output ./input/test.fa.tsv
2023-05-31 00:08:30.364604: beginning to load the model
2023-05-31 00:08:43.281625: finished loading the model
2023-05-31 00:08:43.281776: beginning to generate embeddings
./temstapro: runtime error generating embedding for Phes|1197G000056 (L=10881). Try lowering batch size. If single sequence processing does not work, you need more vRAM to process your protein.
Portion 1.
0/1: sequences with generated mean embeddings
0/1: sequences with generated per-residue embeddings
0:00:00.935153: time to generate embeddings
./temstapro: no embeddings were generated.

Could you help me fix this error? By the way, the vRAM in my system is 16M.

ievapudz commented 1 year ago

Such unusually long proteins require more RAM to be processed by the program. Our guidelines state that in most cases the program can be run with 16 GB of RAM (as I understand it, that is exactly the amount you have; correct me if I misunderstood what is meant by "16M").

If possible, I would suggest running the program for a protein of this length on a machine with more RAM.
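
To separate such proteins from the rest of a large input before running the program, something along the following lines may help. This is only a sketch: the functions `read_fasta` and `split_by_length` are hypothetical helpers, not part of TemStaPro, and the 2000-residue threshold is an arbitrary assumption that would need adjusting to the memory actually available.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_length(records, max_length):
    """Split records into those within max_length and those exceeding it."""
    short, too_long = [], []
    for header, seq in records:
        (short if len(seq) <= max_length else too_long).append((header, seq))
    return short, too_long


# Small in-memory example; for a real file, pass an open file handle instead.
example = ">short_protein\nMKTAYIAKQR\n>long_protein\n" + "A" * 5000
short, too_long = split_by_length(read_fasta(example.splitlines()), max_length=2000)

print([h for h, _ in short])                # ['short_protein']
print([(h, len(s)) for h, s in too_long])   # [('long_protein', 5000)]
```

The shorter sequences can then be processed on the current machine, and the long ones set aside for a machine with more memory.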