> In the WordPiece tokenization, one word may be divided into multiple sub-words. But how should we handle the label? AM

Maca001 commented 1 year ago

In the WordPiece tokenization, one word may be divided into multiple sub-words. But how should we handle the label?

The extra sub-word pieces are represented by a special label at the corresponding positions in the label sequence. For example, I use a special flag 'X':

Tokens: ['Nadim', 'Ladki', 'AL-AIN', ','] -----> ['Nadim', 'Ladki', 'AL', '-', '[UNK]', ',']

Labels: ['B-PER', 'I-PER', 'B-LOC', 'O'] ------> ['B-PER', 'I-PER', 'B-LOC', 'X', 'X', 'O']
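For anyone who wants to try the 'X' scheme from the example, here is a minimal sketch in plain Python. The names `align_labels` and `toy_tokenize` are hypothetical and not part of the BERT repo. `toy_tokenize` is only a stand-in that reproduces the split shown in the example; in practice you would pass in your real WordPiece tokenizer's per-word tokenize function.

```python
from typing import Callable, List, Tuple


def align_labels(
    words: List[str],
    labels: List[str],
    tokenize: Callable[[str], List[str]],
    pad_label: str = "X",
) -> Tuple[List[str], List[str]]:
    """Expand word-level labels to sub-word level.

    The first piece of each word keeps the original label; every
    additional piece gets `pad_label` ('X' in the example above).
    """
    tokens: List[str] = []
    aligned: List[str] = []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        if not pieces:
            # Skip words the tokenizer drops entirely.
            continue
        tokens.extend(pieces)
        aligned.append(label)
        aligned.extend([pad_label] * (len(pieces) - 1))
    return tokens, aligned


def toy_tokenize(word: str) -> List[str]:
    # Hypothetical stand-in for a WordPiece tokenizer: it hard-codes the
    # one split from the example and leaves every other word whole.
    table = {"AL-AIN": ["AL", "-", "[UNK]"]}
    return table.get(word, [word])


words = ["Nadim", "Ladki", "AL-AIN", ","]
labels = ["B-PER", "I-PER", "B-LOC", "O"]
tokens, aligned = align_labels(words, labels, toy_tokenize)
print(tokens)   # ['Nadim', 'Ladki', 'AL', '-', '[UNK]', ',']
print(aligned)  # ['B-PER', 'I-PER', 'B-LOC', 'X', 'X', 'O']
```

Whether 'X' is then treated as an extra class or ignored when computing the loss is a separate choice that this sketch does not cover.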

Originally posted by @yuanxiaosc in https://github.com/google-research/bert/issues/291#issuecomment-465110934

LawrenceLee525 commented 1 year ago

Got it, thanks!

Maca001 commented 1 year ago

ANTHONY MARCELLINUS

octocat / hello-worId

> In the WordPiece tokenization, one word may be divided into multiple sub-words. But how should we handle the label? AM #77