Closed (viko-3 closed 2 years ago)
You can use the 'embed_many' function of the embedders, which handles the padding internally. The following notebook holds a small example:
Sorry, does that mean I need to add padding manually? For example:
max_length = 5
s = ['QWLG','GS']
Do I need to convert s to ['QWLG[PAD]','GS[PAD][PAD][PAD]']
as model input?
Is that right?
Nope, you don't need to do this manually. This is done in the background. You can check this for an example implementation of how bio_embeddings handles padding internally:
So you don't have to worry about this.
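To make "done in the background" a bit more concrete, here is a minimal sketch of the general idea: pad a batch to its longest sequence, run the model, then cut each result back to its real length. This is only an illustration. The helpers `pad_batch` and `strip_padding` and the stand-in "model" are hypothetical and are not taken from the bio_embeddings code base.

```python
# Illustrative only: hypothetical helpers, not bio_embeddings internals.
PAD = "[PAD]"

def pad_batch(sequences):
    """Pad every sequence to the length of the longest one in the batch."""
    lengths = [len(seq) for seq in sequences]
    max_len = max(lengths)
    padded = [list(seq) + [PAD] * (max_len - len(seq)) for seq in sequences]
    return padded, lengths

def strip_padding(batch_output, lengths):
    """Keep only the positions that belong to real residues."""
    return [row[:n] for row, n in zip(batch_output, lengths)]

padded, lengths = pad_batch(["QWLG", "GS"])

# Stand-in for a model: produces one value per input position.
fake_output = [[tok.lower() for tok in row] for row in padded]

trimmed = strip_padding(fake_output, lengths)
print([len(row) for row in padded])   # [4, 4]
print([len(row) for row in trimmed])  # [4, 2]
```

Because the padding is removed again at the end, each returned embedding has as many rows as the sequence has residues.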
I ran the example:
from bio_embeddings.embed import ProtTransBertBFDEmbedder
from Bio import SeqIO
sequences = []
for record in SeqIO.parse("tiny_sampled.fasta", "fasta"):
    sequences.append(record)
embedder = ProtTransBertBFDEmbedder()
embeddings = embedder.embed_many([str(s.seq) for s in sequences])
embeddings = list(embeddings)
[i.shape for i in embeddings]
and the output is:
[(129, 1024),(129, 1024),(46, 1024),(133, 1024),(172, 1024),(386, 1024),(133, 1024),(207, 1024),(165, 1024),(439, 1024),(159, 1024),(1584, 1024)]
I think the first dimension should be the same for all of them, so that every sequence has an embedding of shape [maxLength, 1024].
Actually, when I use another pretrained model:
model_input = torch.tensor(tokenizer.encode_plus(seq, add_special_tokens=True, max_length=512,pad_to_max_length = True)["input_ids"], dtype=torch.long).unsqueeze(0)
output = model(model_input)
it returns an array with dim = [512,1024], where 1024 is the embedding dimension.
How can I get the same output with bio_embeddings?
Thank you!
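One possible way to get a fixed first dimension is to pad the per-residue embeddings after they come back from the embedder. The sketch below assumes each embedding is a NumPy array of shape (length, 1024), as in the output above. The function `pad_embeddings` is a hypothetical helper, and the random arrays only stand in for real embeddings. Note that zero-padding afterwards is not identical to what a transformer emits at [PAD] positions, so the mask should be used to ignore the padded rows.

```python
import numpy as np

def pad_embeddings(embeddings, max_length, pad_value=0.0):
    """Stack variable-length embeddings into one (n, max_length, dim) array.

    Sequences longer than max_length are truncated. The returned boolean
    mask is True at positions that hold a real residue embedding.
    """
    dim = embeddings[0].shape[1]
    batch = np.full((len(embeddings), max_length, dim), pad_value, dtype=np.float32)
    mask = np.zeros((len(embeddings), max_length), dtype=bool)
    for i, emb in enumerate(embeddings):
        n = min(emb.shape[0], max_length)
        batch[i, :n] = emb[:n]
        mask[i, :n] = True
    return batch, mask

# Random stand-ins for three embeddings with lengths 129, 46 and 1584.
rng = np.random.default_rng(0)
embeddings = [rng.normal(size=(n, 1024)).astype(np.float32) for n in (129, 46, 1584)]

batch, mask = pad_embeddings(embeddings, max_length=512)
print(batch.shape)                # (3, 512, 1024)
print(mask.sum(axis=1).tolist())  # [129, 46, 512]
```

With real data, the list returned by `list(embedder.embed_many(...))` would take the place of the random arrays.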
Hi author, great job on this code! But I have a problem. I have some proteins, such as ['VRWFPFDVQHCKLK', 'PFDVQHC',...]. As you can see, they have different lengths, but I want to bring them to the same length. Can I use the padding method from NLP? If so, which token should I pick as the pad character?