tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

Question about sequence length #3

Closed kezhai closed 5 years ago

kezhai commented 5 years ago

Hi,

I am a little bit worried about the capacity of the Bi-LSTM. As shown in Table 4, the maximum sequence length is 1,664. Does that mean your pre-trained LSTM model needs to load all 1,663 preceding amino acids to predict the last one? How does the model perform on such a sequence? Do you have an approach for avoiding the problems that very long sequences may cause?

Thanks in advance,

tbepler commented 5 years ago

For the structure-based embedding model, yes, I recommend loading the entire sequence to embed each position, because the embeddings are contextual (i.e. they depend on the whole sequence). The Bi-LSTM model naturally handles sequences of arbitrary length, and memory consumption scales linearly with sequence length, so this shouldn't be an issue even for long proteins.
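
As a rough illustration (not the exact model or API from this repo), here is a minimal PyTorch sketch of what "embedding the full sequence with a Bi-LSTM" looks like: the whole protein is encoded in one pass, every position gets an embedding that depends on both directions of context, and nothing limits the input length. The alphabet, dimensions, and class names below are assumptions for the example only.

```python
# Minimal sketch, assuming a simple bidirectional LSTM encoder (not the
# repo's exact architecture): embed a full-length protein in one pass so
# every position sees the whole sequence.
import torch
import torch.nn as nn

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids (assumption)
aa_to_idx = {aa: i for i, aa in enumerate(ALPHABET)}

class BiLSTMEmbedder(nn.Module):
    def __init__(self, vocab_size=20, embed_dim=32, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # batch_first=True -> input/output shape (batch, length, features)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, length) integer-encoded amino acids
        x = self.embed(tokens)
        out, _ = self.lstm(x)   # (batch, length, 2 * hidden_dim)
        return out              # one contextual embedding per position

# Encode one full-length sequence; length is arbitrary and memory grows
# linearly with it.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
tokens = torch.tensor([[aa_to_idx[a] for a in seq]])
embedder = BiLSTMEmbedder()
with torch.no_grad():
    embeddings = embedder(tokens)
print(embeddings.shape)  # torch.Size([1, 33, 256])
```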

If you're referring specifically to the pretrained language model component, it conditions on the entire preceding sequence, so it's likely to perform better if shown all 1,663 amino acids when predicting the final amino acid. You could show the model a smaller context window (i.e. by feeding only the previous N amino acids to the model), but it may hurt performance. A sketch of this windowing idea is below.
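
For concreteness, a hedged sketch of the truncated-context idea (illustrative names only, not this repo's API): `lm` is assumed to be any causal language model that maps a `(1, length)` token tensor to per-position logits of shape `(1, length, vocab_size)`.

```python
# Hedged sketch: predict the amino acid at position i from only the previous
# `window` tokens instead of the full prefix. `lm` and the shapes below are
# assumptions for illustration.
import torch

def next_residue_logits(lm, tokens, i, window=256):
    """Logits for position i conditioned on at most `window` preceding tokens."""
    start = max(0, i - window)
    context = tokens[:, start:i]   # truncated prefix, shape (1, <= window)
    with torch.no_grad():
        logits = lm(context)       # (1, context_len, vocab_size)
    return logits[:, -1, :]        # prediction for the token at position i
```

Setting `window` to at least the sequence length recovers the full-context prediction; smaller windows reduce the amount of sequence the model sees at the cost of possible accuracy, as noted above.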