studio-ousia / luke

LUKE -- Language Understanding with Knowledge-based Embeddings
Apache License 2.0

Will luke support fast tokenizer #170

Open TrickyyH opened 1 year ago

TrickyyH commented 1 year ago

Hello everyone, I am trying to use luke-large for question answering. I ran into several issues when fine-tuning the model on SQuAD-like data, most of which come from the lack of fast tokenizer support. So I am wondering whether LUKE will support a fast tokenizer in the future, or whether there is any way to work around these issues. Thank you so much!

abebe9849 commented 1 year ago

Hi! According to the blog post below, it seems that offset_mapping can be used with LUKE. Note, however, that it has not been confirmed that misalignment never occurs. Sorry!

https://srad.jp/~yasuoka/journal/651897/
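
For illustration, here is a minimal sketch of that idea (not the blog post's actual code): a fast tokenizer that shares LUKE's RoBERTa subword vocabulary returns offset_mapping, which lets you map an entity's character span to token indices. The roberta-base checkpoint and the span-matching logic are assumptions for the example.

```python
from transformers import AutoTokenizer

# A fast tokenizer that shares LUKE's RoBERTa subword vocabulary
# ("roberta-base" is just a stand-in for illustration).
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

text = "Beyoncé lives in Los Angeles."
entity_char_span = (17, 28)  # character span of "Los Angeles"

enc = tokenizer(text, return_offsets_mapping=True)

# Map the character span to token indices via offset_mapping.
token_start = token_end = None
for i, (start, end) in enumerate(enc["offset_mapping"]):
    if start == end:  # special tokens like <s> and </s> map to (0, 0)
        continue
    if start <= entity_char_span[0] < end and token_start is None:
        token_start = i
    if start < entity_char_span[1] <= end:
        token_end = i

print(token_start, token_end)  # token indices covering "Los Angeles"
```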

tealgreen0503 commented 1 year ago

I had the same thought as @TrickyyH. Beyond offset_mapping, the behaviour of return_overflowing_tokens, for instance, differs between the slow and fast tokenisers. As a result, it becomes difficult to handle long texts in tasks like NER and QA, which LUKE excels at. I would be pleased if you could accommodate the fast tokeniser.
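
To make the difference concrete, here is a minimal sketch of the fast-tokeniser behaviour being referred to, again using a plain roberta-base checkpoint as a stand-in: with a fast tokeniser, return_overflowing_tokens splits a long context into multiple features and additionally provides overflow_to_sample_mapping, which the slow tokenisers do not produce.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)

question = "Where does Beyoncé live?"
context = " ".join(["Beyoncé lives in Los Angeles."] * 200)  # a long passage

enc = tokenizer(
    question,
    context,
    max_length=384,
    stride=128,
    truncation="only_second",        # only truncate/split the context
    return_overflowing_tokens=True,  # emit one feature per context window
    return_offsets_mapping=True,
)

# Each window is a separate feature; overflow_to_sample_mapping links the
# windows back to the original example. Slow tokenizers do not provide this.
print(len(enc["input_ids"]), enc["overflow_to_sample_mapping"])
```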

ryokan0123 commented 1 year ago

One possible workaround is to use the fast version of the base tokenizer, i.e., RobertaTokenizerFast, on which LukeTokenizer is based (they share the same subword vocabulary).

However, this approach does not cover the entity-related outputs, which would require additional code.
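
A rough sketch of this workaround, with the checkpoint names as illustrative assumptions: since LukeModel accepts word-level inputs alone, the entity inputs are simply omitted here, and the fast RoBERTa tokenizer produces the word-level inputs.

```python
import torch
from transformers import AutoTokenizer, LukeModel

# LUKE shares RoBERTa's subword vocabulary, so the fast RoBERTa tokenizer
# can produce the word-level inputs; entity inputs are omitted in this sketch.
tokenizer = AutoTokenizer.from_pretrained("roberta-large", use_fast=True)
model = LukeModel.from_pretrained("studio-ousia/luke-large")

inputs = tokenizer("Beyoncé lives in Los Angeles.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)  # word-token representations only
```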