tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

Question about sequence length #3

Closed kezhai closed 5 years ago

kezhai commented 5 years ago

Hi,

I am a little bit worried about the capacity of the Bi-LSTM. As shown in Table 4, the maximum sequence length is 1,664. Does that mean your pre-trained LSTM model needs to load all 1,663 preceding amino acids to predict the last one? How does the model perform on such a sequence? Do you have an approach for avoiding the problems that very long sequences may cause?

Thanks in advance,

tbepler commented 5 years ago

For the structure-based embedding model, yes, I recommend loading the entire sequence to embed each position, because the embeddings are contextual (i.e. they depend on the whole sequence). The Bi-LSTM model naturally handles sequences of arbitrary length, and memory consumption scales linearly with sequence length, so this shouldn't be an issue even for long proteins.
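
As a rough illustration (not the exact model or API from this repo), here is a minimal PyTorch sketch of what "embedding the full sequence with a Bi-LSTM" looks like: the whole protein is encoded in one pass, every position gets an embedding that depends on both directions of context, and nothing limits the input length. The alphabet, dimensions, and class names below are assumptions for the example only.

```python
# Minimal sketch, assuming a simple bidirectional LSTM encoder (not the
# repo's exact architecture): embed a full-length protein in one pass so
# every position sees the whole sequence.
import torch
import torch.nn as nn

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumption)
aa_to_idx = {aa: i for i, aa in enumerate(ALPHABET)}

class BiLSTMEmbedder(nn.Module):
    def __init__(self, vocab_size=20, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # batch_first=True -> input/output shape (batch, length, features)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, length) integer-encoded amino acids
        x = self.embed(tokens)
        out, _ = self.lstm(x)   # (batch, length, 2 * hidden_dim)
        return out              # one contextual embedding per position

# Encode one full-length sequence; length is arbitrary and memory grows
# linearly with it.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens = torch.tensor([[aa_to_idx[a] for a in seq]])
embedder = BiLSTMEmbedder()
with torch.no_grad():
    embeddings = embedder(tokens)
print(embeddings.shape)  # torch.Size([1, 33, 256])
```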

If you're referring specifically to the pretrained language model component, it conditions on the entire preceding sequence, so it's likely to perform better if shown all 1,663 amino acids when predicting the final amino acid. You could show the model a smaller context window (i.e. by feeding only the previous N amino acids to the model), but it may hurt performance. A sketch of this windowing idea is below.
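
For concreteness, a hedged sketch of the truncated-context idea (illustrative names only, not this repo's API): `lm` is assumed to be any causal language model that maps a `(1, length)` token tensor to per-position logits of shape `(1, length, vocab_size)`.

```python
# Hedged sketch: predict the amino acid at position i from only the previous
# `window` tokens instead of the full prefix. `lm` and the shapes below are
# assumptions for illustration.
import torch

def next_residue_logits(lm, tokens, i, window=256):
    """Logits for position i conditioned on at most `window` preceding tokens."""
    start = max(0, i - window)
    context = tokens[:, start:i]   # truncated prefix, shape (1, <= window)
    with torch.no_grad():
        logits = lm(context)       # (1, context_len, vocab_size)
    return logits[:, -1, :]        # prediction for the token at position i
```

Setting `window` to at least the sequence length recovers the full-context prediction; smaller windows reduce the amount of sequence the model sees at the cost of possible accuracy, as noted above.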