mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

Converting the AA sequences to 3Di and getting the output in PDB format for Foldseek input? #3

Closed: Jigyasa3 closed this issue 3 months ago

Jigyasa3 commented 9 months ago

Hi @mheinzinger!

Thank you for the great software! I am starting with amino acid sequences and want to generate PDB files to run Foldseek's easy-cluster function.

Once I have the 3Di from the amino acid sequences using the following command:

python translate_clean.py --input /path/to/some_AA_sequences.fasta --output /path/to/output_directory --half 1 --is_3Di 0

How do I convert the 3Di to PDB format?

Looking forward to your reply!

mheinzinger commented 8 months ago

Hi, for this use case we have put a script here that allows you to go from two FASTA files (one for amino acids, one for 3Di sequences) directly to a Foldseek DB, so there is no need to go via actual 3D structures/PDB files: generate_foldseek_db.py. Let me know if this works for you or if you run into any problems.
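One caveat: the amino-acid FASTA and the 3Di FASTA should describe the same set of proteins. A minimal sanity check along these lines (the file names are placeholders and this helper is not part of the repository) could look like:

```python
# Verify that the AA FASTA and the 3Di FASTA contain the same identifiers
# before building a paired Foldseek DB (a sketch, not repository code).
def read_fasta_ids(path):
    """Return the set of sequence identifiers (first token of each header)."""
    with open(path) as handle:
        return {line[1:].split()[0] for line in handle if line.startswith(">")}

aa_ids = read_fasta_ids("some_AA_sequences.fasta")   # hypothetical paths
ss_ids = read_fasta_ids("some_3Di_sequences.fasta")

missing = aa_ids ^ ss_ids  # symmetric difference: ids present in only one file
if missing:
    raise ValueError(f"{len(missing)} identifiers are not shared by both files")
```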

Jigyasa3 commented 8 months ago

Hi @mheinzinger, thank you! This is super helpful!

Jigyasa3 commented 8 months ago

Hi @mheinzinger,

I am able to run the translate.py script on one protein sequence at a time. When I scale up to more sequences (I have ~30,000 protein sequences), I get the error below and the output folder is empty.

Error:

You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Traceback (most recent call last):
  File "/home/jigyasaa/downloads/prostT5/translate.py", line 327, in <module>
    main()
  File "/home/jigyasaa/downloads/prostT5/translate.py", line 314, in main
    translate(
  File "/home/jigyasaa/downloads/prostT5/translate.py", line 234, in translate
    f" took {compute_time/60:.1f}[m] ({compute_time/len(generation_results):.1f}[s/protein])\n"+
ZeroDivisionError: float division by zero

is_3Di is False. (0=expect input to be 3Di, 1=input is AA)
##########################
Loading model from: Rostlab/ProstT5
##########################
##########################
Input is 3Di: False. Sequence below should be lower-case if input is 3Di.
Example sequence: >GCF_000010485_1__protEFPPCJ_00090 Nucleoside-specific channel-forming protein Tsx
MSAKRRLLIACTLITAIYHFPAYSSLEYKGTFGSINAGYADWNSGFVNTHRGEVWKVTADFGVNFKEAEFYSFYESNVLNHAVAGRNHTVSVMTHVRLFDSDMTFFGKIYGQWDNSWGDDLDMFYGFGYLGWNGEWGFFKPYIGLHNQSGDYVSAKYGQTNGWNGYVVGWTAVLPFTLFDEKFVLSNWNEIELDRNDAYTEQQFGRNGLNGGLTIAWKFYPRWKASVTWRYFDNKLGYDGFGDQMIYMLGYDF
##########################
Average sequence length: 253 measured over 1 sequences
Parameters used for generation: {'do_sample': True, 'num_beams': 3, 'top_p': 0.95, 'temperature': 1.2, 'top_k': 6, 'repetition_penalty': 1.2}
RuntimeError during target generation for GCF_000010485_1__protEFPPCJ_00090 Nucleoside-specific channel-forming protein Tsx
If this is triggered by OOM, try lowering num_return_sequences and/or max_batch
CUDA out of memory. Tried to allocate 12.00 MiB. GPU 0 has a total capacty of 23.68 GiB of which 11.88 MiB is free. Including non-PyTorch memory, this process has 12.23 GiB memory in use. Process 1092293 has 11.43 GiB memory in use. Of the allocated memory 11.23 GiB is allocated by PyTorch, and 30.03 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Using device: cuda:0
is_3Di is False. (0=expect input to be 3Di, 1=input is AA)
##########################

My input:

#SBATCH --job-name=prostT5
#SBATCH --partition=gpu
#SBATCH --gpus=2

parallel "python /home/jigyasaa/downloads/prostT5/translate.py --input ${IN_DIR}/{} --output ${OUT_DIR}/prostT5-{.} --half 0 --is_3Di 0" ::: *.fa
mheinzinger commented 8 months ago

So the error says that you ran out of memory (CUDA out of memory). How long is the protein you tried to process? From the error, it appears to be GCF_000010485_1__protEFPPCJ_00090 (Nucleoside-specific channel-forming protein Tsx). The ZeroDivisionError at the end is a follow-on symptom: since generation failed, generation_results is empty when the timing summary divides by its length. You could try activating half-precision (--half 1) to lower memory consumption. In case you changed the default, you can also set the batch size back to 1 (single-protein processing; max_batch=1): https://github.com/mheinzinger/ProstT5/blob/main/scripts/translate.py#L126C52-L126C61
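If the OOM persists, another option is to split the input into smaller, length-sorted FASTA files and translate them one at a time, so a single long protein can only fail its own chunk. A minimal sketch (not part of the repository; the file names and chunk size are assumptions to tune for your GPU):

```python
# Split a large FASTA into length-sorted chunks for separate translate.py runs.
def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, seq = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

# Sorting by length groups the longest (most memory-hungry) proteins together,
# so an OOM only affects the last chunks instead of aborting everything.
records = sorted(read_fasta("all_proteins.fasta"), key=lambda r: len(r[1]))
chunk_size = 500  # arbitrary choice, not a ProstT5 default
for i in range(0, len(records), chunk_size):
    with open(f"chunk_{i // chunk_size:04d}.fasta", "w") as out:
        for header, seq in records[i:i + chunk_size]:
            out.write(f"{header}\n{seq}\n")
```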

If your project solely requires predicting 3Di sequences from amino acid sequences, I would strongly recommend simply using our CNN predictor trained on top of ProstT5's encoder (dropping the decoder saves quite some inference time and memory while remaining equally sensitive): https://github.com/mheinzinger/ProstT5/blob/main/scripts/predict_3Di_encoderOnly.py
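For reference, here is a minimal sketch of loading ProstT5's encoder alone via Hugging Face transformers, following the usage shown on the Rostlab/ProstT5 model card; the example sequence is truncated from the log above, and the device handling is an assumption:

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Encoder-only: the decoder weights are never loaded, which saves memory.
tokenizer = T5Tokenizer.from_pretrained("Rostlab/ProstT5", do_lower_case=False)
model = T5EncoderModel.from_pretrained("Rostlab/ProstT5").to(device)
if device.type == "cuda":
    model = model.half()  # half precision, as with --half 1 above
model.eval()

# ProstT5 expects residues separated by spaces plus a direction prefix;
# "<AA2fold>" marks amino-acid input (the AA -> 3Di direction).
sequence = "MSAKRRLLIACTLITAIYHFPAYS"  # truncated example from the log above
prepared = "<AA2fold> " + " ".join(sequence)

batch = tokenizer(prepared, add_special_tokens=True, return_tensors="pt").to(device)
with torch.no_grad():
    out = model(batch.input_ids, attention_mask=batch.attention_mask)
print(out.last_hidden_state.shape)  # (1, number of tokens, 1024)
```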