sacdallago / bio_embeddings

Get protein embeddings from protein sequences
http://docs.bioembeddings.com
MIT License

How to embed proteins of different lengths to the same length? #172

Closed · viko-3 closed this issue 2 years ago

viko-3 commented 3 years ago

Hi author, good job on this code! But I have a problem: I have some proteins, e.g. ['VRWFPFDVQHCKLK', 'PFDVQHC', ...]. As you can see, they have different lengths, but I want to bring them to the same length. Can I use the padding method from NLP? If so, which token should I pick as the padding character?

mheinzinger commented 2 years ago

You can use the 'embed_many' function of the embedders, which handles the padding internally. The following notebook holds a small example: https://github.com/sacdallago/bio_embeddings/blob/develop/notebooks/embed_fasta_sequences.ipynb
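
A minimal sketch of that call, using the ProtTransBertBFDEmbedder that appears later in this thread (the other embedder classes in bio_embeddings.embed expose the same embed_many method):

from bio_embeddings.embed import ProtTransBertBFDEmbedder

sequences = ['VRWFPFDVQHCKLK', 'PFDVQHC']
embedder = ProtTransBertBFDEmbedder()
# embed_many batches the sequences and pads them internally for the model;
# it yields one per-residue embedding (a numpy array) per input sequence
embeddings = list(embedder.embed_many(sequences))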

wenyuhaokikika commented 1 year ago

Sorry, does that mean I need to add the padding manually? For example:

max_length = 5
s = ['QWLG', 'GS']

Do I need to convert it to s = ['QWLG[PAD]', 'GS[PAD][PAD][PAD]'] before using it as model input? Is that right?

mheinzinger commented 1 year ago

Nope, you don't need to do this manually. This is done in the background. You can check this for an example implementation of how bio_embeddings handles padding internally: https://github.com/sacdallago/bio_embeddings/blob/efb9801f0de9b9d51d19b741088763a7d2d0c3a2/bio_embeddings/embed/embedder_interfaces.py#L91

So you don't have to worry about this.
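
One quick sanity check (a sketch reusing the embedder and sequences from the snippet above): the first dimension of each returned embedding is the unpadded sequence length, so the padded positions are already stripped by the time you get the result:

for seq, emb in zip(sequences, embedder.embed_many(sequences)):
    # one 1024-dimensional vector per residue, no rows for padding tokens
    assert emb.shape[0] == len(seq)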

wenyuhaokikika commented 1 year ago

I ran this example:

from bio_embeddings.embed import ProtTransBertBFDEmbedder
from Bio import SeqIO

# read the sequences from the FASTA file
sequences = []
for record in SeqIO.parse("tiny_sampled.fasta", "fasta"):
    sequences.append(record)

# embed_many yields one per-residue embedding per sequence
embedder = ProtTransBertBFDEmbedder()
embeddings = embedder.embed_many([str(s.seq) for s in sequences])
embeddings = list(embeddings)
[i.shape for i in embeddings]

and got this output:

[(129, 1024), (129, 1024), (46, 1024), (133, 1024), (172, 1024), (386, 1024), (133, 1024), (207, 1024), (165, 1024), (439, 1024), (159, 1024), (1584, 1024)]

I thought the first dimension should be the same for every sequence, i.e. each sequence gets an embedding of shape [max_length, 1024].

Actually, when I use another pretrained model,

# pad/truncate every sequence to a fixed length of 512 tokens
model_input = torch.tensor(tokenizer.encode_plus(seq, add_special_tokens=True, max_length=512, pad_to_max_length=True)["input_ids"], dtype=torch.long).unsqueeze(0)
output = model(model_input)

it returns an array of shape [512, 1024], where 1024 is the embedding dimension.

How do I get the same kind of output with bio_embeddings?

Thank u~~~
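
For completeness, one way to turn the per-residue embeddings from the example above into arrays of identical shape is to pad (or pool) them yourself afterwards. This is only a sketch, assuming numpy and the 512/1024 sizes from the snippets above, not a bio_embeddings API:

import numpy as np

max_length = 512
# zero-pad (or truncate) every per-residue embedding to (max_length, 1024)
padded = np.stack([
    np.pad(e[:max_length], ((0, max_length - min(len(e), max_length)), (0, 0)))
    for e in embeddings
])

# or average over the length dimension to get one fixed-size 1024-d
# vector per protein (the embedders also ship a reduce_per_protein helper)
per_protein = np.stack([e.mean(axis=0) for e in embeddings])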