Adds a Python script that runs LoRA fine-tuning for ProstT5. It is essentially the ProtTrans per-residue classification notebook (https://github.com/agemagician/ProtTrans/blob/master/Fine-Tuning/PT5_LoRA_Finetuning_per_residue_class.ipynb) with preprocessing code added, packaged as a single script.
Requires 5 inputs:
--trainaafasta: AA FASTA file of the training protein dataset.
--trainthreedifasta: matching 3Di FASTA file of the training protein dataset. Headers must match those in --trainaafasta.
--validaafasta: AA FASTA file of the validation protein dataset.
--validthreedifasta: matching 3Di FASTA file of the validation protein dataset. Headers must match those in --validaafasta.
-o: output directory where the fine-tuned model, plot and other files will be stored.
Example code
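As a hedged illustration of the five inputs above, the sketch below declares them with argparse and checks that the AA and 3Di FASTA files carry the same record IDs. The function names (build_parser, read_fasta_ids, check_headers_match) are hypothetical and not taken from the script in this PR, and comparing IDs as sets (ignoring order) is an assumption about what "headers must match" means.

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Declare the five required inputs named in the description (sketch only)."""
    parser = argparse.ArgumentParser(description="LoRA fine-tuning for ProstT5 (sketch)")
    parser.add_argument("--trainaafasta", required=True, help="AA FASTA, training set")
    parser.add_argument("--trainthreedifasta", required=True, help="3Di FASTA, training set")
    parser.add_argument("--validaafasta", required=True, help="AA FASTA, validation set")
    parser.add_argument("--validthreedifasta", required=True, help="3Di FASTA, validation set")
    # 'outdir' is a hypothetical destination name for the -o flag.
    parser.add_argument("-o", dest="outdir", required=True, help="output directory")
    return parser


def read_fasta_ids(path: str) -> List[str]:
    """Return record IDs (first whitespace-delimited word after '>') in file order."""
    ids = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                fields = line[1:].split()
                ids.append(fields[0] if fields else "")
    return ids


def check_headers_match(aa_path: str, threedi_path: str) -> None:
    """Raise ValueError if the two FASTA files do not contain the same record IDs.

    Assumption: matching is by ID set, not by record order.
    """
    aa_ids = set(read_fasta_ids(aa_path))
    threedi_ids = set(read_fasta_ids(threedi_path))
    if aa_ids != threedi_ids:
        unmatched = sorted(aa_ids ^ threedi_ids)
        raise ValueError(f"FASTA headers differ for records: {unmatched}")
```

A caller would parse the arguments with build_parser().parse_args() and then run check_headers_match once for the training pair and once for the validation pair before any tokenization.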
You can also mask all residues below a certain pLDDT threshold (defaults to 70) with --mask and --mask_threshold. If you do this, you also need to provide a directory using --pdb_dir containing all training and validation set structures in the .pdb format, where the FASTA record ID matches the file name in that directory.
George
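The pLDDT masking described above could look roughly like the following sketch. It assumes AlphaFold-style PDB files, where the per-residue pLDDT is stored in the B-factor column, and a single chain with one CA atom per residue. The function names, the `<record ID>.pdb` file naming, and the mask character "X" are hypothetical choices for illustration; the script in this PR may mask differently.

```python
from pathlib import Path
from typing import List


def read_plddt_per_residue(pdb_path) -> List[float]:
    """Read one pLDDT value per residue from the B-factor column of CA atoms.

    Assumes an AlphaFold-style, single-chain PDB file with fixed-width columns.
    """
    scores = []
    with open(pdb_path) as handle:
        for line in handle:
            # Atom name is in columns 13-16, B-factor in columns 61-66 (1-based).
            if line.startswith("ATOM") and line[12:16].strip() == "CA":
                scores.append(float(line[60:66]))
    return scores


def mask_low_confidence(threedi_seq: str, plddt: List[float],
                        threshold: float = 70.0, mask_char: str = "X") -> str:
    """Replace 3Di tokens whose residue pLDDT is below the threshold.

    mask_char is a hypothetical placeholder, not necessarily what the script uses.
    """
    if len(threedi_seq) != len(plddt):
        raise ValueError("sequence length and number of pLDDT values differ")
    return "".join(
        token if score >= threshold else mask_char
        for token, score in zip(threedi_seq, plddt)
    )


def mask_record(record_id: str, threedi_seq: str, pdb_dir,
                threshold: float = 70.0) -> str:
    """Look up <pdb_dir>/<record_id>.pdb and mask the matching 3Di sequence."""
    pdb_path = Path(pdb_dir) / f"{record_id}.pdb"
    return mask_low_confidence(threedi_seq, read_plddt_per_residue(pdb_path), threshold)
```

In this sketch, residues with a pLDDT exactly equal to the threshold are kept, since only values strictly below it are masked.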