mheinzinger / SeqVec

Modelling the Language of Life - Deep Learning Protein Sequences
MIT License

Why is the embedding of `<mask>` not zeros? #12

Closed yy1252450987 closed 4 years ago

yy1252450987 commented 4 years ago

[screenshot attached: embedding output for `<mask>`, showing non-zero values]

mheinzinger commented 4 years ago

There is no `<mask>` token in ELMo/SeqVec because the model is auto-regressive: unlike BERT, it does not need to mask out tokens during training, since it is trained only to predict the next character in a given sequence. Your `<mask>` is mapped to `<unk>` (unknown character) because it is not a valid amino acid.
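The fallback behaviour can be sketched roughly as follows. This is a minimal, hypothetical illustration (not SeqVec's actual tokenizer or vocabulary): anything outside the standard amino-acid alphabet is looked up as `<unk>`, whose embedding is a learned, non-zero vector rather than zeros.

```python
import numpy as np

# Hypothetical vocabulary: the 20 standard amino acids plus a single <unk> slot.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
token_to_id = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
UNK_ID = len(token_to_id)  # every out-of-vocabulary token maps here

# Stand-in for a trained embedding matrix: rows are learned during training,
# so the <unk> row is an ordinary dense vector, not zeros.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((UNK_ID + 1, 8))

def embed(token: str) -> np.ndarray:
    """Look up a token; unknown tokens (e.g. '<mask>') get the <unk> embedding."""
    return embeddings[token_to_id.get(token, UNK_ID)]

print(np.allclose(embed("<mask>"), embeddings[UNK_ID]))  # True: mapped to <unk>
print(np.allclose(embed("<mask>"), 0.0))                 # False: not a zero vector
```

This is why the returned embedding is non-zero: there is no special all-zeros row for masked input, only the learned `<unk>` row.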