kermitt2 / delft

a Deep Learning Framework for Text https://delft.readthedocs.io/
Apache License 2.0

Hardcoded padding tokens #151

Closed lfoppiano closed 1 year ago

lfoppiano commented 1 year ago

I've noticed that the label used for padding is hard-coded, and for certain tokenizers it may be different (e.g. <s>).

For example (preprocess.py:343 and a few other places below):

label_ids.append("<PAD>")

perhaps we should change it to:

label_ids.append(self.tokenizer.pad_token)

This was found when investigating #150. I'm asking because I'm not 100% sure this is actually a problem.

kermitt2 commented 1 year ago

Hi @lfoppiano !

This is just an "empty" label id, because it corresponds to a subtoken ("empty" labels are assigned to special tokens and to subtokens introduced by the tokenizer). The labels are not related to the tokenizer's vocabulary, so we should not use self.tokenizer.pad_token.
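To illustrate the distinction, here is a minimal sketch (with a hypothetical toy tokenizer, not DeLFT's actual code) of the usual label-alignment pattern: after sub-word tokenization, only the first piece of each word keeps the word's label, and the remaining pieces get the "<PAD>" placeholder so that only one prediction per word is scored. The placeholder lives in the label space, not in the tokenizer's vocabulary.

```python
def toy_subword_tokenize(word):
    # Hypothetical tokenizer: splits words into 3-character pieces.
    return [word[i:i + 3] for i in range(0, len(word), 3)]

def align_labels(words, labels, pad_label="<PAD>"):
    sub_tokens, sub_labels = [], []
    for word, label in zip(words, labels):
        pieces = toy_subword_tokenize(word)
        sub_tokens.extend(pieces)
        # first piece keeps the word's label, the rest get the "empty" label
        sub_labels.extend([label] + [pad_label] * (len(pieces) - 1))
    return sub_tokens, sub_labels

tokens, labs = align_labels(["superconductor", "found"], ["B-material", "O"])
print(tokens)  # ['sup', 'erc', 'ond', 'uct', 'or', 'fou', 'nd']
print(labs)    # ['B-material', '<PAD>', '<PAD>', '<PAD>', '<PAD>', 'O', '<PAD>']
```

Swapping "<PAD>" for self.tokenizer.pad_token here would mix a vocabulary token into the label set, which is a separate namespace.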

kermitt2 commented 1 year ago

See also https://github.com/kermitt2/delft/issues/150#issuecomment-1382290023 for more details.

lfoppiano commented 1 year ago

Thanks for the clarification! I'm closing this; we can comment further in #150 if needed.