mheinzinger / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
MIT License

Why is a strange token like '#@' allowed? #10

Closed yy1252450987 closed 4 years ago

yy1252450987 commented 4 years ago

(1) I embedded a strange, possibly invalid sequence, but the embedding procedure succeeded without any warnings or errors. How should the vectors for these strange tokens be interpreted? (2) If I have a sequence like 'CGATWQEE', how do I pad it to length 10? Is ['C', 'G', 'A', 'T', 'W', 'Q', 'E', 'E', '', ''] OK?

```python
from allennlp.commands.elmo import ElmoEmbedder
from pathlib import Path

model_dir = Path('~/project/SDP/software/seqvec/uniref50_v2/')
weights = model_dir / 'weights.hdf5'
options = model_dir / 'options.json'
seqvec = ElmoEmbedder(options, weights, cuda_device=0)
seqs = ['A', 'YDHASYDH', 'DSADAS', '#@']
seqvec.embed_sentence(seqs)
```

Output:

```
array([[[ 4.0139422e-01,  2.5591016e-01,  3.3083811e-01, ...,
         -2.6204288e-01, -2.1510348e-01, -7.6741979e-02],
        [ 5.6246715e+00,  6.3809357e+00,  3.6932642e+00, ..., ...
```

sacdallago commented 4 years ago

There are no sanity checks for the sequences you pass in; this is not a commercial product. We don't yet have the capacity to produce a more polished package (a more advanced pip package has been in the works for months, but we keep falling behind schedule).

As written in the paper, the sequences should contain only the standard amino acids. For anything else, we don't know what the embedding means.
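If you want to guard against this yourself, a minimal pre-check (not part of SeqVec; `validate_sequence` is a hypothetical helper shown only as a sketch) could reject any sequence containing characters outside the 20 standard amino-acid one-letter codes:

```python
# Set of the 20 standard amino-acid one-letter codes.
STANDARD_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def validate_sequence(seq: str) -> bool:
    """Return True if seq is non-empty and every residue is a standard AA."""
    return bool(seq) and set(seq.upper()) <= STANDARD_AAS

print(validate_sequence("YDHASYDH"))  # True
print(validate_sequence("#@"))        # False: '#' and '@' are not residues
```

Filtering with such a check before calling `embed_sentence` would have flagged the `'#@'` entry instead of silently embedding it.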

You don't need to pad sequences; just pass them as they are. There is only an upper bound on length, no lower bound.
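Since the embedder treats each residue as one token, a sequence is simply passed as a list of single characters of whatever length it has; no padding tokens are added. A small sketch (`to_tokens` is a hypothetical helper, not part of the SeqVec API):

```python
def to_tokens(sequence: str) -> list:
    """Split a protein sequence into per-residue tokens; no padding needed."""
    return list(sequence)

tokens = to_tokens("CGATWQEE")
print(tokens)       # ['C', 'G', 'A', 'T', 'W', 'Q', 'E', 'E']
print(len(tokens))  # 8 -- the embedder returns one vector per residue
```

So `['C', 'G', 'A', 'T', 'W', 'Q', 'E', 'E']` as-is is the right input; the `['', '']` padding entries from the question are unnecessary.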