prajdabre / yanmtt

Yet Another Neural Machine Translation Toolkit
MIT License

Tokenization issue with pretrained model #2

Closed: pruksmhc closed this issue 3 years ago

pruksmhc commented 3 years ago

I am trying to continue pretraining BART from the Hugging Face checkpoint with the command below, and it seems there is a mismatch in the number of arguments passed to _tokenize.

The command is:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8

The error is:

Using softmax temperature of 1.0
Masking ratio: 0.3
Training for: ['en']
Shuffling corpus!
Traceback (most recent call last):
  File "pretrain_nmt.py", line 628, in <module>
    run_demo()
  File "pretrain_nmt.py", line 625, in run_demo
    mp.spawn(model_create_load_run_save, nprocs=args.gpus, args=(args,files,train_files,)) #
  File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 199, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 157, in start_processes
    while not context.join():
  File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/root/yanmtt/py36/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/root/yanmtt/pretrain_nmt.py", line 221, in model_create_load_run_save
    for input_ids, input_masks, decoder_input_ids, labels in generate_batches_monolingual_masked_or_bilingual(tok, args, rank, files, train_files, ctr): #Batches are generated from here. The argument (0.30, 0.40) is a range which indicates the percentage of the source sentence to be masked in case we want masking during training just like we did during BART pretraining. The argument 3.5 is the lambda to the poisson length sampler which indicates the average length of a word sequence that will be masked. Since this is pretraining we do not do any evaluations even if we train on parallel corpora.
  File "/root/yanmtt/common_utils.py", line 482, in generate_batches_monolingual_masked
    iids = tok(lang + " " + masked_sentence + " </s>", add_special_tokens=False, return_tensors="pt").input_ids
  File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2377, in __call__
    **kwargs,
  File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils_base.py", line 2447, in encode_plus
    **kwargs,
  File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 441, in _encode_plus
    first_ids = get_input_ids(text)
  File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 410, in get_input_ids
    tokens = self.tokenize(text, **kwargs)
  File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 342, in tokenize
    tokenized_text = split_on_tokens(no_split_token, text)
  File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in split_on_tokens
    for token in tokenized_text
  File "/root/yanmtt/py36/lib/python3.6/site-packages/transformers-4.3.2-py3.6.egg/transformers/tokenization_utils.py", line 336, in <genexpr>
    for token in tokenized_text
TypeError: _tokenize() takes 2 positional arguments but 5 were given

Upon some further inspection, it seems that in a commit a few days ago this line was changed so that _tokenize is called with four arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/tokenization_utils.py#L319

However, the _tokenize function for the BART tokenizer (which, I believe, inherits from the GPT2 tokenizer) takes fewer arguments: https://github.com/prajdabre/yanmtt/blob/main/transformers/src/transformers/models/gpt2/tokenization_gpt2.py#L241
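For what it's worth, the mismatch can be reproduced outside the toolkit with a tiny sketch (the class names here are hypothetical, not the actual yanmtt/transformers classes): the modified base-class tokenize path forwards extra positional arguments to _tokenize, while a GPT2/BART-style _tokenize only accepts the text, which yields exactly this TypeError.

```python
# Hypothetical minimal reproduction of the signature mismatch; not yanmtt code.

class ModifiedBase:
    def tokenize(self, text, src_lang=None, tgt_lang=None, is_target=False):
        # The patched utility layer forwards the extra arguments unconditionally.
        return self._tokenize(text, src_lang, tgt_lang, is_target)


class Gpt2LikeTokenizer(ModifiedBase):
    def _tokenize(self, text):
        # A GPT2/BART-style _tokenize expects only the text.
        return text.split()


Gpt2LikeTokenizer().tokenize("hello world", "en", "en", False)
# TypeError: _tokenize() takes 2 positional arguments but 5 were given
```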

prajdabre commented 3 years ago

Hi,

That's because I have not yet made the necessary modifications to the "generate_batches_monolingual_masked" method to handle official BART tokenizers. Can you give me a few hours? I'll code it up and push my changes.

prajdabre commented 3 years ago

Hi again,

Pull the latest version of the code and try the same command again. It should work. Lemme know if it doesn't.

BTW the default batch size is measured in tokens, so please either increase it to something like 2048 or pass the flag --batch_size_indicates_lines.
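In case it helps to see the difference, here is a rough sketch of the two batching modes (a hypothetical generator, not the actual code in common_utils.py): by default a batch grows until its summed token count reaches batch_size, while with --batch_size_indicates_lines it grows to a fixed number of sentences.

```python
def make_batches(sentences, batch_size, batch_size_indicates_lines=False):
    """Hypothetical sketch: yield batches by sentence count or by a token budget."""
    batch, token_count = [], 0
    for sentence in sentences:
        batch.append(sentence)
        token_count += len(sentence.split())  # crude whitespace token count
        full = (len(batch) >= batch_size) if batch_size_indicates_lines else (token_count >= batch_size)
        if full:
            yield batch
            batch, token_count = [], 0
    if batch:
        yield batch
```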

pruksmhc commented 3 years ago

Hm, I'm still getting the tokenization error. Is it because I'm trying to train a BART model (using BartTokenizer) rather than MBart? I see that in the latest commit only mbart50/tokenization_mbart50.py has a modified _tokenize function. Since the tokenization API for MBart and BART seems to differ slightly, perhaps it makes sense to have an if-else condition in tokenization_utils? Or to introduce SentencePiece into BART tokenization as well, although it seems SentencePiece isn't used in the HF version of the RoBERTa/BART tokenizer either.
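One possible shape for that if-else, sketched as a small helper (hypothetical, inspecting the tokenizer's signature rather than hard-coding model names; this is not the fix that was actually committed): forward the language arguments only when the concrete _tokenize accepts them, and otherwise fall back to the plain GPT2/BART-style call.

```python
import inspect


def call_tokenize(tokenizer, text, *lang_args):
    """Hypothetical helper: pass extra language arguments to _tokenize only if
    the concrete tokenizer's _tokenize can accept them."""
    params = inspect.signature(tokenizer._tokenize).parameters  # bound method, so self is excluded
    accepts_extra = len(params) > 1 or any(
        p.kind is inspect.Parameter.VAR_POSITIONAL for p in params.values()
    )
    if accepts_extra and lang_args:
        return tokenizer._tokenize(text, *lang_args)
    return tokenizer._tokenize(text)
```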

prajdabre commented 3 years ago

Hi,

I previously thought that the masking code alone was the issue, but it turned out that the GPT2 tokenizer does not accept the additional arguments that my modifications to the tokenization_utils methods pass by default.

I've addressed and tested it this time.

Try the command:

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 512 --shard_files

or

python pretrain_nmt.py -n 1 -nr 0 -g 1 --use_official_pretrained --pretrained_model facebook/bart-large --tokenizer_name_or_path facebook/bart-large --langs en --mono_src examples/data/train.en --batch_size 8 --batch_size_indicates_lines --shard_files

Note that --shard_files is needed if you are running the code for the first time on unsharded data. (I know this just creates a duplicate file with the suffix 0 but I chose not to handle this case separately to keep my code simpler.)
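For reference, here is roughly what the sharding amounts to (a hypothetical sketch, not the toolkit's actual sharding routine): the monolingual file is split across the workers with a numeric suffix per shard, so with a single GPU the shard train.en.0 is simply a copy of train.en.

```python
def shard_file(path, num_shards):
    """Hypothetical sketch: split a corpus into num_shards files named <path>.0, <path>.1, ...
    With num_shards == 1 this just duplicates the file as <path>.0, as described above."""
    with open(path, encoding="utf-8") as src:
        lines = src.readlines()
    for i in range(num_shards):
        with open(f"{path}.{i}", "w", encoding="utf-8") as out:
            out.writelines(lines[i::num_shards])


# e.g. shard_file("examples/data/train.en", 1) produces examples/data/train.en.0
```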