microsoft / HMNet

Official Implementation of "A Hierarchical Network for Abstractive Meeting Summarization with Cross-Domain Pretraining"

tokenizer.convert_ids_to_tokens not generating special tokens with predefined position offset #7

Closed YebowenHu closed 2 years ago

YebowenHu commented 3 years ago

https://github.com/microsoft/HMNet/blob/1f5a24d656e8bf111560551daa66d81a5028dd93/Models/Networks/MeetingNet_Transformer.py#L50-L58 This snippet sets up a default special_token_name for each special token, together with an offset. Later, the special tokens need to be added to the tokenizer, because pad_token and bos_token do not exist in the pretrained tokenizer. I loaded the pretrained tokenizer from transfo-xl-wt103 under ExampleInitModel and tried to generate tokens from ids based on the predefined offsets:

tokenizer.convert_ids_to_tokens(len(tokenizer) - special_token_id_offset)

The returned tokens turn out to be ordinary vocabulary words, not the '\<pad>' or '\<bos>' tokens.

With token_name set to "pad_token" (offset 130) and "bos_token" (offset 131), the returned tokens are 'Islahul' (id 267605) and 'McShan' (id 267604).
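The id arithmetic behind this result can be sketched as follows. This is only a minimal illustration, not HMNet's actual code: `ToyTokenizer`, `offset_to_id`, the five-word vocabulary, and the small offsets 1 and 2 are all hypothetical stand-ins. It shows that `len(tokenizer) - offset` decodes to an ordinary word as long as nothing has been appended to the vocabulary, and that it only reaches a special token once enough tokens have been added at the end.

```python
class ToyTokenizer:
    """Hypothetical stand-in for a pretrained tokenizer (not the transformers API)."""

    def __init__(self, vocab):
        self.id_to_token = list(vocab)

    def __len__(self):
        return len(self.id_to_token)

    def convert_ids_to_tokens(self, token_id):
        return self.id_to_token[token_id]

    def add_tokens(self, new_tokens):
        # New tokens are appended, so they receive the highest ids.
        self.id_to_token.extend(new_tokens)


def offset_to_id(tokenizer, offset):
    """Count back `offset` positions from the end of the vocabulary."""
    return len(tokenizer) - offset


# Hypothetical offsets, much smaller than the 130 and 131 from the issue.
offsets = {"pad_token": 1, "bos_token": 2}

tok = ToyTokenizer(["the", "cat", "sat", "on", "mat"])

# Before anything is added, both ids fall inside the original vocabulary.
before = {name: tok.convert_ids_to_tokens(offset_to_id(tok, off))
          for name, off in offsets.items()}

# After appending the special tokens, the same offsets reach them.
tok.add_tokens(["<bos>", "<pad>"])
after = {name: tok.convert_ids_to_tokens(offset_to_id(tok, off))
         for name, off in offsets.items()}

print(before)
print(after)

# Vocabulary size implied by the ids reported above (267605 at offset 130).
implied_vocab_size = 267605 + 130
print(implied_vocab_size - 131)
```

In the toy case the order of the appended tokens matters: the token with the smallest offset has to be appended last. Whether HMNet relies on appending tokens in this way, or on remapping existing ids, is exactly what the question below asks.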

May I ask how you set the offset values of these special tokens? Is it expected that 'transfo-xl-wt103' does not need pad_token and bos_token, or should these special tokens be set up somewhere else?