microsoft / unilm

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
https://aka.ms/GeneralAI

LayoutLM: why use `pad_token_label_id`? #303

Open tengerye opened 3 years ago

tengerye commented 3 years ago

The model I am using is LayoutLM. In the function `convert_examples_to_features`, there is this snippet:

    # Use the real label id for the first token of the word, and padding ids for the remaining tokens
    label_ids.extend(
        [label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1)
    )

May I ask why we don't give the real label id to all of a word's sub-tokens?

fredo838 commented 3 years ago

It's just a way to store the labels. Say the sentence hello there general kenobi hello becomes hello there general ke ##no ##bi hello when you tokenize it, and the labels are "the position in the sentence": originally they are 0 1 2 3 4, and the resulting 'tokenized' labels become (pt = pad token label) 0 1 2 3 pt pt 4. You could make it 0 1 2 3 3 3 4 instead, but you're not adding any information, since you can convert 0 1 2 3 pt pt 4 to 0 1 2 3 3 3 4 and back without any loss of information.
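
Concretely, a minimal self-contained sketch of that alignment (the toy `word_tokens` table stands in for a real WordPiece tokenizer, and the variable names are illustrative, not from the repo):

    pad_token_label_id = -100  # sentinel value; see the loss discussion further down

    sentence = ["hello", "there", "general", "kenobi", "hello"]
    word_labels = [0, 1, 2, 3, 4]
    word_tokens = {"kenobi": ["ke", "##no", "##bi"]}  # every other word is a single token

    tokens, label_ids = [], []
    for word, label in zip(sentence, word_labels):
        pieces = word_tokens.get(word, [word])
        tokens.extend(pieces)
        # real label on the first sub-token, padding label on the rest
        label_ids.extend([label] + [pad_token_label_id] * (len(pieces) - 1))

    print(tokens)     # ['hello', 'there', 'general', 'ke', '##no', '##bi', 'hello']
    print(label_ids)  # [0, 1, 2, 3, -100, -100, 4]

    # Inverse direction: copying each padding slot from the label before it
    # recovers the "0 1 2 3 3 3 4" variant, so no information is lost.
    expanded = [label_ids[0]]
    for lab in label_ids[1:]:
        expanded.append(lab if lab != pad_token_label_id else expanded[-1])
    print(expanded)   # [0, 1, 2, 3, 3, 3, 4]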

monuminu commented 3 years ago

For a token classification problem you need to pad the labels as well, just as you pad the input tokens.
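
In other words, when the token sequence is padded out to `max_seq_length`, the label sequence has to be padded in lockstep, using `pad_token_label_id` rather than a real class id. A hedged sketch (the ids and lengths here are made up for illustration):

    max_seq_length = 10
    pad_token_id = 0           # tokenizer.pad_token_id in BERT-style vocabularies
    pad_token_label_id = -100

    input_ids = [101, 7592, 2045, 102]  # e.g. [CLS] hello there [SEP]
    label_ids = [-100, 0, 1, -100]      # special tokens also carry the padding label

    padding_length = max_seq_length - len(input_ids)
    input_ids += [pad_token_id] * padding_length
    label_ids += [pad_token_label_id] * padding_length

    assert len(input_ids) == len(label_ids) == max_seq_length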

WangKK1996 commented 3 years ago

I changed this to `tokenizer.pad_token_id` and deleted the `pad_token_label_id` param in the inputs. I guess there is no difference.
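
Whether that swap is harmless depends on the loss setup. As far as I can tell, the example scripts set `pad_token_label_id` from `CrossEntropyLoss().ignore_index` (i.e. -100), so padded and continuation positions drop out of the loss entirely, whereas `tokenizer.pad_token_id` is usually 0, a valid class index the model would actually be trained toward. A small sketch of the difference:

    import torch
    import torch.nn as nn

    logits = torch.randn(7, 5)  # 7 tokens, 5 label classes

    labels_ignored = torch.tensor([0, 1, 2, 3, -100, -100, 4])
    labels_as_pad_id = torch.tensor([0, 1, 2, 3, 0, 0, 4])  # pad_token_id == 0

    loss_fn = nn.CrossEntropyLoss()  # ignore_index defaults to -100
    print(loss_fn(logits, labels_ignored))    # continuation tokens excluded from the loss
    print(loss_fn(logits, labels_as_pad_id))  # continuation tokens trained toward class 0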