studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Vocabulary not accepted by Luke Models #131

Closed · elonmusk-01 closed 2 years ago

elonmusk-01 commented 2 years ago

Hi @ikuyamada, I've had a great experience with LUKE. As I mentioned in #129, I am not asking for a solution to this problem, but I need some clarification.

  1. Do LUKE models accept alphanumeric strings as individual tokens? The tokenization looks incorrect for such tokens, and it eventually errors out as described in #129 (see the sketch after this list).
  2. Also, is there any way in the implementation to handle new words/tokens that are not in the model vocabulary's lookup table? I would like to solve this, since it is an open problem and many people hit the same issue with RoBERTa-based models.
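
A minimal sketch of what question 1 is probing, assuming the Hugging Face transformers library and the public studio-ousia/luke-base checkpoint (the sample strings are made up): the RoBERTa-style byte-level BPE behind LUKE does not reject a string outright, it splits unseen alphanumerics into smaller known pieces.

```python
# A sketch, not official LUKE usage docs: inspect how the LUKE/RoBERTa
# byte-level BPE splits alphanumeric strings. The checkpoint name is the
# public one on the Hugging Face Hub; the sample strings are illustrative.
from transformers import LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")

for text in ["hello", "abc123", "XJ9-4000"]:
    print(text, "->", tokenizer.tokenize(text))

# Byte-level BPE has no <unk> path for plain text: any unseen string is
# decomposed into smaller known subwords (in the worst case, single bytes).
```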
ikuyamada commented 2 years ago

Hi @elonmusk-01, LUKE is an extension of RoBERTa that adds entity embeddings, and the subword tokenization adopted in our model is exactly the same as the one used in RoBERTa. In our experiments we simply used RoBERTa's tokenization implementation available in the Hugging Face library, so if the original implementation has such an issue, LUKE has the same issue. However, we do not have detailed knowledge of the RoBERTa tokenizer's implementation.
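
For question 2, a hedged sketch of the generic Hugging Face mechanism for registering new tokens, not something the LUKE maintainers endorse in this thread: `add_tokens` plus `resize_token_embeddings` grows the word-embedding matrix with randomly initialized rows ("XJ9-4000" is a hypothetical token).

```python
# A generic Hugging Face sketch for adding out-of-vocabulary tokens;
# the new embedding rows are randomly initialized and need fine-tuning
# before they carry any useful signal.
from transformers import LukeModel, LukeTokenizer

tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
model = LukeModel.from_pretrained("studio-ousia/luke-base")

# "XJ9-4000" is a hypothetical domain-specific token absent from the vocab.
num_added = tokenizer.add_tokens(["XJ9-4000"])
if num_added > 0:
    # Grow the input embedding matrix to cover the enlarged vocabulary.
    model.resize_token_embeddings(len(tokenizer))

# The new string now maps to a single token instead of several BPE pieces.
print(tokenizer.tokenize("the XJ9-4000 sensor"))
```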