mheinzinger / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
MIT License

Why is a strange token like '#@' allowed? #10

Closed yy1252450987 closed 4 years ago

yy1252450987 commented 4 years ago

(1) I embedded a strange, possibly invalid sequence, but the embedding procedure succeeded without any warnings or errors. How should the vectors for these strange tokens be interpreted? (2) If I have a sequence like 'CGATWQEE', how do I pad it to length 10? Is ['C', 'G', 'A', 'T', 'W', 'Q', 'E', 'E', '', ''] OK?

```python
from allennlp.commands.elmo import ElmoEmbedder
from pathlib import Path

model_dir = Path('~/project/SDP/software/seqvec/uniref50_v2/')
weights = model_dir / 'weights.hdf5'
options = model_dir / 'options.json'
seqvec = ElmoEmbedder(options, weights, cuda_device=0)
seqs = ['A', 'YDHASYDH', 'DSADAS', '#@']
seqvec.embed_sentence(seqs)
```

Output:

```
array([[[ 4.0139422e-01,  2.5591016e-01,  3.3083811e-01, ...,
         -2.6204288e-01, -2.1510348e-01, -7.6741979e-02],
        [ 5.6246715e+00,  6.3809357e+00,  3.6932642e+00, ..., ...
```

sacdallago commented 4 years ago

There are no sanity checks for the sequences you pass in; this is not a commercial product. We don't yet have the capacity to produce a more polished package (a more advanced pip package has been in the works for months, but we keep falling behind schedule).

As written in the paper, the sequences should contain only the standard amino acids. For anything else, we don't know what the embedding means.
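If you want to guard against this yourself, a minimal pre-check (not part of SeqVec; `validate_sequence` is a hypothetical helper shown only as a sketch) could reject any sequence containing characters outside the 20 standard amino-acid one-letter codes:

```python
# Set of the 20 standard amino-acid one-letter codes.
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def validate_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and every residue is a standard AA."""
    return bool(seq) and set(seq.upper()) <= STANDARD_AAS

print(validate_sequence("YDHASYDH"))  # True
print(validate_sequence("#@"))        # False: '#' and '@' are not residues
```

Filtering with such a check before calling `embed_sentence` would have flagged the `'#@'` entry instead of silently embedding it.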

You don't need to pad sequences; just pass them as they are. There is only an upper bound on length, no lower bound.
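Since the embedder treats each residue as one token, a sequence is simply passed as a list of single characters of whatever length it has; no padding tokens are added. A small sketch (`to_tokens` is a hypothetical helper, not part of the SeqVec API):

```python
def to_tokens(sequence: str) -> list:
    """Split a protein sequence into per-residue tokens; no padding needed."""
    return list(sequence)

tokens = to_tokens("CGATWQEE")
print(tokens)       # ['C', 'G', 'A', 'T', 'W', 'Q', 'E', 'E']
print(len(tokens))  # 8 -- the embedder returns one vector per residue
```

So `['C', 'G', 'A', 'T', 'W', 'Q', 'E', 'E']` as-is is the right input; the `['', '']` padding entries from the question are unnecessary.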