mheinzinger / ProstT5

Bilingual Language Model for Protein Sequence and Structure
MIT License

add LoRA finetune script #11

Closed gbouras13 closed 9 months ago

gbouras13 commented 9 months ago

Adds a Python script that runs LoRA fine-tuning for ProstT5. It essentially takes this notebook, adds preprocessing code, and turns it into a single script: https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/PT5_LoRA_Finetuning_per_residue_class.ipynb

Requires 5 inputs:

Example code

python finetune_prostt5_lora_script.py --trainaafasta train_aa.faa --trainthreedifasta train_3di.faa \
    --validaafasta valid_aa.faa --validthreedifasta valid_3di.faa \
    -o test_prostt5_lora -b 1 --finetune_name prostt5_finetuned_model -f
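
Since the script takes paired amino-acid and 3Di FASTA files, it can help to sanity-check the pairs before starting a run. The sketch below is not part of the PR; `parse_fasta` and `check_pairs` are hypothetical helpers, and it assumes that records are paired by FASTA ID and that each 3Di string has one state per residue, so paired sequences have equal length.

```python
def parse_fasta(text):
    """Parse FASTA text into {record_id: sequence}; the ID is the first word of the header."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            current = parts[0] if parts else ""
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(chunks) for rid, chunks in records.items()}


def check_pairs(aa_records, tdi_records):
    """Return a list of human-readable problems; an empty list means the pairs look consistent."""
    problems = []
    for rid in sorted(set(aa_records) | set(tdi_records)):
        if rid not in aa_records:
            problems.append(f"{rid}: missing amino-acid record")
        elif rid not in tdi_records:
            problems.append(f"{rid}: missing 3Di record")
        elif len(aa_records[rid]) != len(tdi_records[rid]):
            problems.append(
                f"{rid}: length mismatch "
                f"({len(aa_records[rid])} aa vs {len(tdi_records[rid])} 3Di)"
            )
    return problems
```

To use it, read each file's contents (for example `open("train_aa.faa").read()`), parse both, and call `check_pairs` on the two dictionaries before launching the fine-tuning command.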

You can also mask all residues with a pLDDT below a threshold (default 70) using --mask and --mask_threshold. In that case you must also pass --pdb_dir, a directory containing every training and validation set structure in .pdb format, with each file name matching the corresponding FASTA record ID.

python finetune_prostt5_lora_script.py --trainaafasta train_aa.faa --trainthreedifasta train_3di.faa \
    --validaafasta valid_aa.faa --validthreedifasta valid_3di.faa \
    -o test_prostt5_lora -b 1 --finetune_name prostt5_finetuned_model \
    --mask --mask_threshold 70 --pdb_dir directory_with_structures -f
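
For readers wondering what pLDDT masking could look like, here is a minimal sketch. It is not the PR's implementation. It assumes AlphaFold-style .pdb files, where the per-residue pLDDT is stored in the B-factor column, that "below" means strictly less than the threshold, and that the structure file is named `<record ID>.pdb`. The helper names, the `X` mask character, and the choice of which sequence gets masked are all hypothetical.

```python
from pathlib import Path


def structure_path(pdb_dir, record_id):
    """Assumed naming convention: the FASTA record ID plus a .pdb suffix."""
    return Path(pdb_dir) / f"{record_id}.pdb"


def read_ca_plddt(pdb_text):
    """Collect one score per residue from the B-factor field of C-alpha ATOM lines.

    Uses the fixed-column PDB layout: atom name in columns 13-16,
    B-factor (pLDDT in AlphaFold-style files) in columns 61-66.
    """
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def mask_low_confidence(sequence, plddt, threshold=70.0, mask_char="X"):
    """Replace every position whose pLDDT is strictly below the threshold."""
    if len(sequence) != len(plddt):
        raise ValueError(
            f"sequence length {len(sequence)} != number of pLDDT values {len(plddt)}"
        )
    return "".join(
        mask_char if score < threshold else residue
        for residue, score in zip(sequence, plddt)
    )
```

A typical call would read `structure_path(pdb_dir, record_id).read_text()`, pass it to `read_ca_plddt`, and then mask the matching sequence with `mask_low_confidence`. The length check catches structures that do not correspond to the FASTA record.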

George

mheinzinger commented 9 months ago

Great, thanks for the addition! :)