instadeepai / InstaNovo

De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments
Apache License 2.0

Illustration of pretrained model checkpoint #37

Open irleader opened 5 months ago

irleader commented 5 months ago

Hi,

Changelog for release 0.1.4 states: "add checkpoints instanovo.pt trained on HC-PT, and instanovo_yeast.pt fine-tuned on nine-species excluding yeast."

After checking the vocab/residues for both model checkpoints: instanovo.pt does not have 'N(+.98)' or 'Q(+.98)', and uses 'C' and 'M(ox)'; instanovo_yeast.pt has 'N(+.98)' and 'Q(+.98)', and uses 'C(+57.02)' and 'M(+15.99)'.

Therefore, I doubt that instanovo_yeast.pt can have been fine-tuned from instanovo.pt, as they have different vocabs/residues.

So was instanovo_yeast.pt trained from scratch on the nine-species dataset excluding yeast?

Thanks!

irleader commented 5 months ago

Also, would it be possible to share an instanovo.pt checkpoint that is trained on HC-PT and has 'N(+.98)' and 'Q(+.98)', and uses 'C(+57.02)' and 'M(+15.99)'? Thanks a lot!

KevinEloff commented 1 week ago

Hi @irleader,

Apologies again for the delayed response.

We actually do use the same checkpoint, but modify the decoder to allow for the new vocabulary. In the 1.0.0 release this process is performed automatically, adjusting the size of the amino acid embedding layer, the head weight, and head bias of the decoder.
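To make the resizing concrete, here is a minimal framework-free sketch of growing a vocabulary-sized parameter (embedding rows, head weight rows, or head bias entries) to fit a new residue list. It is not the InstaNovo implementation: the helper `resize_rows`, the abbreviated vocabularies, and the model dimension are all hypothetical, and copying rows for residues shared by both vocabularies is an assumption of this sketch, not something stated above.

```python
import random


def resize_rows(old_rows, old_vocab, new_vocab, init_row):
    """Build one row per residue in new_vocab.

    Assumption of this sketch: rows for residues present in both
    vocabularies are copied from old_rows; every other row is created
    by calling init_row().
    """
    old_index = {residue: i for i, residue in enumerate(old_vocab)}
    return [
        list(old_rows[old_index[residue]]) if residue in old_index else init_row()
        for residue in new_vocab
    ]


# Hypothetical vocabularies, abbreviated to a few residues from this thread.
old_vocab = ["A", "C", "M(ox)"]
new_vocab = ["A", "C(+57.02)", "M(+15.99)", "N(+.98)", "Q(+.98)"]
dim = 4  # hypothetical model dimension

rng = random.Random(0)

# Stand-ins for the pretrained parameters: one row per old residue.
old_embedding = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in old_vocab]
old_head_bias = [[0.1 * (i + 1)] for i in range(len(old_vocab))]  # one value per residue

# Resized parameters: one row per new residue.
new_embedding = resize_rows(
    old_embedding, old_vocab, new_vocab,
    lambda: [rng.gauss(0.0, 0.02) for _ in range(dim)],
)
new_head_bias = resize_rows(old_head_bias, old_vocab, new_vocab, lambda: [0.0])
```

The head weight has one row per residue as well, so the same helper would apply to it. All layers whose shape does not depend on the vocabulary are untouched by this step.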

In the 0.1.4 release, we did this manually: we removed the weights associated with the input and output of the decoder and trained those layers from scratch on the nine-species dataset. We did this for both InstaNovo and InstaNovo+.
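As a rough illustration of the manual 0.1.4 procedure, the sketch below filters vocabulary-dependent entries out of a checkpoint's state dict before loading. The parameter names (`aa_embed.`, `head.`) and the helper `drop_vocab_dependent_weights` are hypothetical; the real checkpoint keys may differ.

```python
def drop_vocab_dependent_weights(state_dict, prefixes):
    """Return a copy of state_dict without keys that start with any prefix."""
    prefixes = tuple(prefixes)
    return {k: v for k, v in state_dict.items() if not k.startswith(prefixes)}


# Hypothetical parameter names; the real checkpoint keys may differ.
VOCAB_DEPENDENT_PREFIXES = ("aa_embed.", "head.")

# Stand-in for a loaded checkpoint; values would normally be tensors.
checkpoint = {
    "encoder.layers.0.weight": "pretrained",
    "decoder.layers.0.weight": "pretrained",
    "aa_embed.weight": "pretrained",
    "head.weight": "pretrained",
    "head.bias": "pretrained",
}

kept = drop_vocab_dependent_weights(checkpoint, VOCAB_DEPENDENT_PREFIXES)
dropped = sorted(set(checkpoint) - set(kept))
```

With a PyTorch model, the filtered dict could then be loaded via `model.load_state_dict(kept, strict=False)`, so the dropped layers keep their fresh initialization and are learned during fine-tuning.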