wasiahmad / PLBART

Official code of our work, Unified Pre-training for Program Understanding and Generation [NAACL 2021].
https://arxiv.org/abs/2103.06333
MIT License

Vocab size issue #30

Closed: oathaha closed this issue 2 years ago

oathaha commented 2 years ago

I found that the vocab size of the embedding layer is 50,004, while the vocab size of the subword (BPE) tokenizer is 50,044, which causes an out-of-vocabulary problem.

I got the vocab size of the tokenizer using this code: vocab_size = len(bart.task.source_dictionary)

I faced this problem when I used the bart.encode() function.

Here is how I load the pre-trained BART model in Python:

bart = BARTModel.from_pretrained(model_path, checkpoint_file=model_file)

where model_path is the directory containing the pre-trained PLBART checkpoint and model_file is plbart_base.pt.
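For completeness, this is roughly how I compared the two sizes (a minimal sketch; it assumes the fairseq hub interface exposes the loaded model as bart.model, and the numbers are the ones I observed):

# size of the dictionary used by bart.encode()
dict_size = len(bart.task.source_dictionary)                # 50,044

# size of the embedding table in the loaded checkpoint
emb_size = bart.model.encoder.embed_tokens.num_embeddings   # 50,004

print(dict_size, emb_size)  # this mismatch is what produces the out-of-vocab indices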

I am not sure if I am doing something wrong here. Can anyone help me?

Thanks.

wasiahmad commented 2 years ago

Follow this.

bart = BARTModel.from_pretrained(
    args.checkpoint_dir,                          # directory containing the checkpoint
    checkpoint_file=args.checkpoint_file,         # e.g., plbart_base.pt
    data_name_or_path=args.data_name_or_path,     # directory with the *.dict.txt files
    user_dir=root_dir.joinpath('source'),         # this repo's custom fairseq extensions
    task=args.task,
    bpe='sentencepiece',
    sentencepiece_model=spm_dir,                  # path to the sentencepiece model
)

The last four parameters are optional, but data_name_or_path is crucial for dictionary loading: args.data_name_or_path should point to the directory containing the *.dict.txt files.
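For example, a minimal sketch of such a call (all paths below are hypothetical placeholders, adjust them to where you keep the checkpoint, the binarized data with the *.dict.txt files, and the sentencepiece model; the string passed to bart.encode() is arbitrary Java code):

from fairseq.models.bart import BARTModel

checkpoint_dir = 'checkpoints/plbart_base'     # hypothetical: directory with plbart_base.pt
data_dir = 'data-bin'                          # hypothetical: directory with the *.dict.txt files
spm_model = 'sentencepiece.bpe.model'          # hypothetical: path to PLBART's sentencepiece model

bart = BARTModel.from_pretrained(
    checkpoint_dir,
    checkpoint_file='plbart_base.pt',
    data_name_or_path=data_dir,
    bpe='sentencepiece',
    sentencepiece_model=spm_model,
)

# With the dictionary loaded from data_name_or_path, the tokenizer and the
# embedding layer should agree in size, and bart.encode() no longer produces
# out-of-vocabulary indices.
print(len(bart.task.source_dictionary))
tokens = bart.encode('public int add ( int a , int b ) { return a + b ; }')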