mheinzinger / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
MIT License
70 stars 27 forks

Question regarding training the ELMo model #4

Closed: ashishjain1988 closed this issue 4 years ago

ashishjain1988 commented 4 years ago

What is the format of the data used to train the protein embedding model (ELMo)? It would be helpful if you could share a snapshot of it.

mheinzinger commented 4 years ago

The format for training ELMo on protein sequences follows the format used in NLP. In NLP, the training corpus usually holds one sentence per line, with words separated by white-spaces. In our case, we treated every protein sequence as a single sentence and each amino acid as a word. An example would look like this: P R O T E I N S E Q W E N C E
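For illustration, a minimal conversion sketch (not part of the SeqVec repo; the file names are placeholders), assuming a plain FASTA file as input:

```python
# Sketch: convert a FASTA file into the ELMo training format described above,
# i.e. one protein per line with amino acids separated by white-space.
# "uniref50.fasta" / "train_corpus.txt" are placeholder file names.
def fasta_to_corpus(fasta_path: str, corpus_path: str) -> None:
    with open(fasta_path) as fin, open(corpus_path, "w") as fout:
        seq_parts = []
        for line in fin:
            line = line.strip()
            if line.startswith(">"):      # a header line starts a new record
                if seq_parts:
                    fout.write(" ".join("".join(seq_parts)) + "\n")
                    seq_parts = []
            elif line:
                seq_parts.append(line)
        if seq_parts:                     # flush the last record
            fout.write(" ".join("".join(seq_parts)) + "\n")


if __name__ == "__main__":
    fasta_to_corpus("uniref50.fasta", "train_corpus.txt")
```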

pzhang84 commented 2 years ago

@mheinzinger I would love to reproduce the trained model. Can you please share the code for training SeqVec? Thanks!

mheinzinger commented 2 years ago

We used the official TensorFlow implementation of ELMo and changed the input to protein sequences: https://github.com/allenai/bilm-tf All you have to do is create an input corpus of protein sequences (one sequence per line, amino acids separated by white-space).
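Note that bilm-tf also expects a vocabulary file with one token per line and the special tokens <S>, </S> and <UNK> listed first (see its README). Below is a minimal sketch for a character-level vocabulary; the amino-acid alphabet (20 standard residues plus X/B/Z/U/O) is an assumption, not necessarily the exact set used for SeqVec:

```python
# Sketch: write a character-level vocabulary file for bilm-tf.
# Special tokens come first (as described in the bilm-tf README), then one
# amino-acid token per line. The alphabet below is an assumption.
SPECIAL_TOKENS = ["<S>", "</S>", "<UNK>"]
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY") + ["X", "B", "Z", "U", "O"]

with open("vocab.txt", "w") as fout:
    for token in SPECIAL_TOKENS + AMINO_ACIDS:
        fout.write(token + "\n")
```

With the corpus and the vocabulary file in place, training is launched through bilm-tf's bin/train_elmo.py; check its README for the exact flags (e.g. --train_prefix, --vocab_file, --save_dir).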

pzhang84 commented 2 years ago

@mheinzinger Thanks for your clarification about the input corpus (one sequence per line)! I noticed that pre-training ELMo on protein sequences requires preparing multiple .txt input files in a training folder. Just to confirm: how many sequences (i.e. how many lines) did you include in a single .txt input file?

mheinzinger commented 2 years ago

Yes, depending on your hardware setup it might be beneficial to have multiple splits (especially in a multi-GPU setting; this was important when we trained ProtTrans). However, I do not want to tell you something wrong, and unfortunately I do not remember exactly how many files we created for SeqVec/ELMo. That said, most other parameters (learning rate, batch size, num_layers, n_hidden, corpus, etc.) will have much more impact on performance. Splitting the training data into multiple chunks is mostly a matter of efficiency and should not affect your final performance; depending on your setup you might squeeze out a few percent more throughput by tuning it, but I doubt it will change your results.
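For completeness, a minimal sharding sketch (not SeqVec code; paths and shard count are placeholders):

```python
# Sketch: split a single training corpus into N roughly equal shard files so
# bilm-tf can read from a folder of .txt files. The shard count is only a
# throughput knob, not a model hyper-parameter.
def split_corpus(corpus_path: str, out_prefix: str, n_shards: int = 10) -> None:
    shards = [open(f"{out_prefix}_{i:03d}.txt", "w") for i in range(n_shards)]
    try:
        with open(corpus_path) as fin:
            for idx, line in enumerate(fin):
                shards[idx % n_shards].write(line)   # round-robin distribution
    finally:
        for f in shards:
            f.close()


if __name__ == "__main__":
    split_corpus("train_corpus.txt", "train/shard", n_shards=10)
```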

pzhang84 commented 2 years ago

Thanks for your quick response! It makes sense to me now.

pzhang84 commented 2 years ago

@mheinzinger Since you mentioned ProtTrans: very impressive work, by the way, love the idea!

I retrained a small protein-sequence BERT model using Google's official implementation (https://github.com/google-research/bert), and I'd love to fine-tune it from your pre-trained weights to see what happens. However, I can only find your PyTorch model (.bin) on Hugging Face. I wonder if you could kindly share the TensorFlow checkpoints for your pre-trained models?

Thanks in advance for your help!

mheinzinger commented 2 years ago

I recovered the TensorFlow checkpoints for ProtBERT-UniRef100 (not ProtBERT-BFD, sorry; I could not find the corresponding files for BFD). You can download the ProtBERT-UniRef100 TF checkpoints here: https://rostlab.org/~deepppi/protbert_u100.tar.gz
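To sanity-check the download before fine-tuning, something like the sketch below can extract the archive and list the checkpoint variables; the checkpoint prefix inside the tarball is an assumption, so adjust it to whatever actually gets extracted:

```python
# Sketch: unpack the archive and inspect the TensorFlow checkpoint.
# The checkpoint prefix ("model.ckpt") is a placeholder; use the actual
# prefix found after extraction (the part before .index / .data-*).
import tarfile
import tensorflow as tf

with tarfile.open("protbert_u100.tar.gz", "r:gz") as tar:
    tar.extractall("protbert_u100")

ckpt_prefix = "protbert_u100/model.ckpt"   # placeholder prefix
for name, shape in tf.train.list_variables(ckpt_prefix):
    print(name, shape)
```

The resulting prefix is what google-research/bert's --init_checkpoint flag would point at.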