studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Vocabulary not accepted by Luke Models #131

Closed · elonmusk-01 closed 2 years ago

elonmusk-01 commented 2 years ago

Hi @ikuyamada, I've had a great experience with LUKE. As I mentioned in #129, I am not asking for a solution to this problem, but I need some clarification.

  1. Do LUKE models accept alphanumeric strings as individual tokens? The tokenization looks incorrect for such tokens, and it eventually errors out as described in #129 (see the sketch after this list).
  2. Also, is there any way in the implementation to handle new words/tokens that are not in the model vocabulary's lookup table? I would like to solve this, since it is an open problem and many people hit the same issue with RoBERTa-based models.
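
A minimal sketch of what question 1 is probing, assuming the Hugging Face transformers library and the public studio-ousia/luke-base checkpoint (the sample strings are made up): the RoBERTa-style byte-level BPE behind LUKE does not reject a string outright, it splits unseen alphanumerics into smaller known pieces.

```python
# A sketch, not official LUKE usage docs: inspect how the LUKE/RoBERTa
# byte-level BPE splits alphanumeric strings. The checkpoint name is the
# public one on the Hugging Face Hub; the sample strings are illustrative.
from transformers import LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")

for text in ["hello", "abc123", "XJ9-4000"]:
    print(text, "->", tokenizer.tokenize(text))

# Byte-level BPE has no <unk> path for plain text: any unseen string is
# decomposed into smaller known subwords (in the worst case, single bytes).
```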
ikuyamada commented 2 years ago

Hi @elonmusk-01, LUKE is an extension of RoBERTa that adds entity embeddings, and the subword tokenization adopted in our model is exactly the same as the one used in RoBERTa. In our experiments we simply used RoBERTa's tokenization implementation available in the Hugging Face library, so if the original implementation has such an issue, LUKE has the same issue. However, we do not have detailed knowledge of the RoBERTa tokenizer's implementation.
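
For question 2, a hedged sketch of the generic Hugging Face mechanism for registering new tokens, not something the LUKE maintainers endorse in this thread: `add_tokens` plus `resize_token_embeddings` grows the word-embedding matrix with randomly initialized rows ("XJ9-4000" is a hypothetical token).

```python
# A generic Hugging Face sketch for adding out-of-vocabulary tokens;
# the new embedding rows are randomly initialized and need fine-tuning
# before they carry any useful signal.
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

# "XJ9-4000" is a hypothetical domain-specific token absent from the vocab.
num_added = tokenizer.add_tokens(["XJ9-4000"])
if num_added > 0:
    # Grow the input embedding matrix to cover the enlarged vocabulary.
    model.resize_token_embeddings(len(tokenizer))

# The new string now maps to a single token instead of several BPE pieces.
print(tokenizer.tokenize("the XJ9-4000 sensor"))
```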