tbepler / protein-sequence-embedding-iclr2019

Source code for "Learning protein sequence embeddings using information from structure" - ICLR 2019

Protein encoder - PackedSequence error #14

Closed dominikabasaj closed 4 years ago

dominikabasaj commented 4 years ago

Hi! Thank you for open sourcing your work!

I am trying to encode my protein sequence with your pretrained model according to the procedure you described in https://github.com/tbepler/protein-sequence-embedding-iclr2019/issues/1. For testing purposes:

  1. I convert the sequence into bytes:
    alphabet = Uniprot21()
    encoded_f = encode_sequence('ABC', alphabet)
    encoded_f2 = np.array([encoded_f, encoded_f])  # just an imitation of a batch
  2. I load the model and encode the batched sequences:
    pretrained_model = torch.load('pfam_lm_lstm2x1024_tied_mb64.sav')
    pretrained_model.eval()
    features = TorchModel(pretrained_model, use_cuda=0, full_features=False)
    features(encoded_f2)

The following error occurs:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-150-ed0bea8b81e2> in <module>
      1 import numpy as np
----> 2 features(encoded_f2)

./protein-sequence-embedding-iclr2019/eval_secstr.py in __call__(self, x)
    115             z = featurize(c, self.lm_embed, self.lstm_stack, self.proj)
    116         else:
--> 117             z = self.model(c) # embed the sequences
    118         z = unpack_sequences(z, order)
    119 

/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
    487             result = self._slow_forward(*input, **kwargs)
    488         else:
--> 489             result = self.forward(*input, **kwargs)
    490         for hook in self._forward_hooks.values():
    491             hook_result = hook(self, input, result)

./protein-sequence-embedding-iclr2019/src/models/sequence.py in forward(self, x)
    225         # postpend reverse logp with zero
    226 
--> 227         b = h_fwd.size(0)
    228         zero = h_fwd.data.new(b,1,logp_fwd.size(2)).zero_()
    229         logp_fwd = torch.cat([zero, logp_fwd], 1)

AttributeError: 'PackedSequence' object has no attribute 'size'

I would be grateful for letting me know what I am doing wrong!

tbepler commented 4 years ago

The code you're referencing doesn't work out of the box for just the LM. To get the hidden layers from the LM, you can use the BiLM.encode() function.

A simple alteration that should get the job done is to replace

z = self.model(c)

in TorchModel.__call__() with

z = self.model.encode(c)
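
Put together, a minimal sketch of the workflow after that change (untested; the import locations of Uniprot21, encode_sequence, and TorchModel below are assumptions based on the snippets in this thread):

    import numpy as np
    import torch

    # Assumed import paths; adjust to wherever these live in the repo.
    from src.alphabets import Uniprot21
    from eval_secstr import encode_sequence, TorchModel

    alphabet = Uniprot21()
    encoded = encode_sequence('ABC', alphabet)
    batch = np.array([encoded, encoded])  # imitation of a batch, as above

    # Load the pretrained bidirectional LM and switch to eval mode.
    lm = torch.load('pfam_lm_lstm2x1024_tied_mb64.sav')
    lm.eval()

    # With `z = self.model.encode(c)` in TorchModel.__call__, this returns
    # the LM hidden states for each sequence in the batch.
    features = TorchModel(lm, use_cuda=False, full_features=False)
    embeddings = features(batch)
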
dominikabasaj commented 4 years ago

Thanks for your answer! Does that mean this is the way to obtain sequences encoded by the full SSA model? (Although I should probably change the saved model to 'ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_p0.05_epoch100.sav'.)

tbepler commented 4 years ago

Yes, that's correct. Setting the full_features argument to False gives only the final SSA embedding. Setting it to True gives the concatenation of all hidden layers as well.
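
For anyone following along, a short sketch of that difference (untested; assumes the TorchModel usage and the encoded batch from the snippets above):

    # Full SSA model instead of the LM-only checkpoint.
    ssa = torch.load('ssa_L1_100d_lstm3x512_lm_i512_mb64_tau0.5_p0.05_epoch100.sav')
    ssa.eval()

    # full_features=False -> only the final SSA embedding for each sequence.
    final_embed = TorchModel(ssa, use_cuda=False, full_features=False)(batch)

    # full_features=True -> concatenation of all hidden layers as well.
    all_layers = TorchModel(ssa, use_cuda=False, full_features=True)(batch)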