wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

Vocab size issue #30

Closed: oathaha closed this issue 2 years ago

oathaha commented 2 years ago

I found that the vocab size of the embedding layer is 50,004, while the vocab size of the subword (BPE) tokenizer is 50,044, which causes an out-of-vocabulary problem.

I got the vocab size of the tokenizer using this code: vocab_size = len(bart.task.source_dictionary)

I faced this problem when I used the bart.encode() function.

Here is how I load the pre-trained BART model in Python:

bart = BARTModel.from_pretrained(model_path, checkpoint_file=model_file)

where model_path is the directory containing the pre-trained PLBART checkpoint and model_file is plbart_base.pt.
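For completeness, this is roughly how I compared the two sizes (a minimal sketch; it assumes the fairseq hub interface exposes the loaded model as bart.model, and the numbers are the ones I observed):

# size of the dictionary used by bart.encode()
dict_size = len(bart.task.source_dictionary)                # 50,044

# size of the embedding table in the loaded checkpoint
emb_size = bart.model.encoder.embed_tokens.num_embeddings   # 50,004

print(dict_size, emb_size)  # this mismatch is what produces the out-of-vocab indices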

I am not sure if I am doing something wrong here. Can anyone help me?

Thanks.

wasiahmad commented 2 years ago

Follow this.

bart = BARTModel.from_pretrained(
    args.checkpoint_dir,                          # directory containing the checkpoint
    checkpoint_file=args.checkpoint_file,         # e.g., plbart_base.pt
    data_name_or_path=args.data_name_or_path,     # directory with the *.dict.txt files
    user_dir=root_dir.joinpath('source'),         # this repo's custom fairseq extensions
    task=args.task,
    bpe='sentencepiece',
    sentencepiece_model=spm_dir,                  # path to the sentencepiece model
)

The last four parameters are optional, but data_name_or_path is crucial for dictionary loading: args.data_name_or_path should point to the directory containing the *.dict.txt files.
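For example, a minimal sketch of such a call (all paths below are hypothetical placeholders, adjust them to where you keep the checkpoint, the binarized data with the *.dict.txt files, and the sentencepiece model; the string passed to bart.encode() is arbitrary Java code):

from fairseq.models.bart import BARTModel

checkpoint_dir = 'checkpoints/plbart_base'     # hypothetical: directory with plbart_base.pt
data_dir = 'data-bin'                          # hypothetical: directory with the *.dict.txt files
spm_model = 'sentencepiece.bpe.model'          # hypothetical: path to PLBART's sentencepiece model

bart = BARTModel.from_pretrained(
    checkpoint_dir,
    checkpoint_file='plbart_base.pt',
    data_name_or_path=data_dir,
    bpe='sentencepiece',
    sentencepiece_model=spm_model,
)

# With the dictionary loaded from data_name_or_path, the tokenizer and the
# embedding layer should agree in size, and bart.encode() no longer produces
# out-of-vocabulary indices.
print(len(bart.task.source_dictionary))
tokens = bart.encode('public int add ( int a , int b ) { return a + b ; }')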