This is just an empty label id, because the position corresponds to a subtoken ("empty" labels are assigned to special tokens and to subtokens introduced by the tokenizer). These labels are not related to the tokenizer, so we should not use `self.tokenizer.pad_token` here.
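A minimal sketch of the point above (hypothetical names, not the repo's actual code; it assumes a HuggingFace-style `word_ids()` mapping and uses `-100`, the default `ignore_index` of PyTorch's `CrossEntropyLoss`, as the "empty" label id — the project's actual constant may differ):

```python
# The "empty" label id lives in label space, not in the tokenizer's
# vocabulary, so it is the same regardless of which string the tokenizer
# uses for padding (<pad>, <s>, ...). Hypothetical constant:
EMPTY_LABEL_ID = -100

def align_labels(word_ids, word_labels, empty_label_id=EMPTY_LABEL_ID):
    """Map word-level labels onto subtoken positions.

    word_ids: per-subtoken word index, or None for special/pad tokens
              (mirrors HuggingFace's BatchEncoding.word_ids()).
    """
    labels = []
    previous = None
    for wid in word_ids:
        if wid is None:            # special token or padding
            labels.append(empty_label_id)
        elif wid != previous:      # first subtoken of a word
            labels.append(word_labels[wid])
        else:                      # continuation subtoken
            labels.append(empty_label_id)
        previous = wid
    return labels

# "New York" tokenized as [CLS] New Yo ##rk [SEP] [PAD]
word_ids = [None, 0, 1, 1, None, None]
print(align_labels(word_ids, [3, 4]))  # -> [-100, 3, 4, -100, -100, -100]
```

Because the empty label id never passes through the tokenizer, swapping tokenizers (and thus pad tokens) leaves the label padding unchanged.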
I've noticed that the label used for padding is hard-coded, and with certain tokenizers it may be different (e.g. `<s>`). For example (preprocess.py:343 and a few other places below):
Perhaps we should change it to:
This was found while investigating #150. I'm asking because I'm not 100% sure this is actually a problem.