tengerye opened this issue 3 years ago

The model I am using is LayoutLM. In the function `convert_examples_to_features`, there is a snippet that assigns the real label id only to the first sub-token of each word. May I ask why the real label id is not given to all tokenized words?
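Presumably the snippet in question is the word/label alignment loop from the run_ner-style preprocessing that the LayoutLM example follows (a reconstruction, not a verbatim quote; `example`, `label_map`, and `pad_token_label_id` follow that script's naming):

```python
for word, label in zip(example.words, example.labels):
    word_tokens = tokenizer.tokenize(word)
    tokens.extend(word_tokens)
    # Real label id for the first sub-token only; the remaining pieces get the pad label id
    label_ids.extend([label_map[label]] + [pad_token_label_id] * (len(word_tokens) - 1))
```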
---

It's just a way to store the label. Say the sentence

`hello there general kenobi hello`

becomes

`hello there general ke ##no ##bi hello`

when you tokenize it, and say the labels are simply each word's position in the sentence, so that the original labels are `0 1 2 3 4`. The resulting "tokenized" labels then become (pt = pad token) `0 1 2 3 pt pt 4`. You could make it `0 1 2 3 3 3 4` instead, but you're not adding any information: you can convert `0 1 2 3 pt pt 4` to `0 1 2 3 3 3 4` and back without any loss of information.
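To make that round trip concrete, here is a minimal, self-contained sketch. It assumes WordPiece-style tokens where continuation pieces start with `##`, and uses `-100` as the pad label (the usual HuggingFace/PyTorch convention); both are assumptions for illustration, not code from the LayoutLM example.

```python
PAD_LABEL = -100  # assumed pad label; PyTorch's CrossEntropyLoss ignores it by default

def align_labels(subword_tokens, word_labels):
    """Give each word's first piece its real label; continuation pieces get PAD_LABEL."""
    aligned, word_idx = [], -1
    for tok in subword_tokens:
        if tok.startswith("##"):        # continuation of the previous word
            aligned.append(PAD_LABEL)
        else:                           # first piece of a new word
            word_idx += 1
            aligned.append(word_labels[word_idx])
    return aligned

def repeat_labels(aligned):
    """Convert the padded form back to the repeated form: 0 1 2 3 pt pt 4 -> 0 1 2 3 3 3 4."""
    out, last = [], None
    for lab in aligned:
        last = lab if lab != PAD_LABEL else last
        out.append(last)
    return out

tokens = ["hello", "there", "general", "ke", "##no", "##bi", "hello"]
labels = [0, 1, 2, 3, 4]
padded = align_labels(tokens, labels)
print(padded)                 # [0, 1, 2, 3, -100, -100, 4]
print(repeat_labels(padded))  # [0, 1, 2, 3, 3, 3, 4]
```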
---

You need to pad the labels as well for a token classification problem, just as you pad the input tokens.
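A hedged sketch of what that lock-step padding looks like; the function and variable names here are illustrative, not the exact ones from the example script:

```python
PAD_LABEL = -100  # assumed pad label for positions the loss should skip

def pad_features(input_ids, label_ids, max_seq_length, pad_token_id):
    """Pad input ids and label ids to the same length, with a matching attention mask."""
    pad_len = max_seq_length - len(input_ids)
    attention_mask = [1] * len(input_ids) + [0] * pad_len
    input_ids = input_ids + [pad_token_id] * pad_len
    label_ids = label_ids + [PAD_LABEL] * pad_len  # labels padded in lock-step with inputs
    return input_ids, attention_mask, label_ids
```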
---

I changed this to `tokenizer.pad_token_id` and deleted the `pad_token_label_id` param in the inputs. I guess there is no difference.
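One caveat worth checking before treating the two as interchangeable: PyTorch's `CrossEntropyLoss` ignores targets equal to `-100` by default (`ignore_index=-100`), which is what `pad_token_label_id` is typically set to, whereas `tokenizer.pad_token_id` is a real vocabulary index and would count toward the loss unless those positions are masked out. A quick way to see the difference:

```python
import torch
import torch.nn as nn

num_labels = 5
logits = torch.randn(1, 4, num_labels)        # (batch, seq_len, num_labels)
ignored = torch.tensor([[0, 1, -100, -100]])  # -100 positions are skipped by the loss
counted = torch.tensor([[0, 1, 0, 0]])        # a pad id of 0 would be scored as class 0

loss_fn = nn.CrossEntropyLoss()               # ignore_index defaults to -100
print(loss_fn(logits.view(-1, num_labels), ignored.view(-1)))  # averaged over 2 positions
print(loss_fn(logits.view(-1, num_labels), counted.view(-1)))  # averaged over all 4 positions
```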