huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RoBERTa tokenizer does not add start and end token at the beginning and end of the sentence #9502

Closed ameet-1997 closed 3 years ago

ameet-1997 commented 3 years ago

Environment info

Who can help

@mfuntowicz @sgugger

Information

Model I am using (Bert, XLNet ...): RoBERTa

The problem arises when using:

The task I am working on is: Language Modeling

To reproduce

Steps to reproduce the behavior:

  1. Run python -m pdb examples/language-modeling/run_mlm.py --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --output_dir=/tmp/debug --model_type=roberta --config_name=roberta-base --tokenizer_name=roberta-base --learning_rate 1e-4 --num_train_epochs 2 --warmup_steps 10000 --do_train --save_steps 10000 --per_device_train_batch_size 2 --overwrite_output_dir
  2. Insert a breakpoint with the pdb command b ../../src/transformers/trainer.py:1138 (this is at the line if self.use_amp:)
  3. Press c
  4. print(self.tokenizer.decode(inputs['input_ids'][0]))

The output will look like the following:

' Photograph : The Very Best of Ringo Starr, and as a bonus track hisastered studio album Goodnight Vienna. Since his return touring in 1989, Starr has performed " Back Offogaloo " regularly in concert with the various incarnations of his All @-@ Starr Band. Commentators have interpreted the song, particularly this statement as an Starr on his former Beatles band facet McCartney. Starr denied such interpretation, instead " claiming that the song was inspired by Bolan and nothing more ", Beatles bi Robert Rodriguez writes. Starr had publicly criticised\'s solo albums McCartney 1970 ) and Ram ( 1971 ) on'

Expected behavior

Ideally, the first token should have been <s>, since that is RoBERTa's start token, and the last token should have been </s>, since that is the end token. Neither of those appears in the decoded output. Wouldn't this be a departure from the implementation in the RoBERTa paper?
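
For reference, encoding a single sentence directly with the tokenizer does add the special tokens. Here is a minimal sketch (assuming only the roberta-base tokenizer from transformers):

```python
from transformers import AutoTokenizer

# Encode one sentence directly and decode it back to inspect the special tokens.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
ids = tokenizer("Back Off Boogaloo is a song by Ringo Starr.")["input_ids"]
print(tokenizer.decode(ids))
# Prints something like: '<s>Back Off Boogaloo is a song by Ringo Starr.</s>'
```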

PS: Please ignore the strikethrough. No idea why that is appearing.

sgugger commented 3 years ago

You are inspecting an input from the training dataloader, which has been shuffled. Therefore you do not necessarily have the beginning of one of your original documents: by default, the script concatenates all your texts (after adding the special tokens at the beginning and the end) and then splits the result into contiguous chunks of length max_seq_length (unspecified here, so the default for a roberta-base model).

So the text you are inspecting comes from inside one of your original documents, which is why it doesn't have that <s> and </s>.
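
In other words, the default preprocessing behaves roughly like the simplified sketch below (not the actual run_mlm.py code; the document texts and the tiny max_seq_length are placeholders, chosen only to make the chunk boundaries visible):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Placeholder "documents"; in run_mlm.py these come from your dataset.
docs = ["First document about Ringo Starr.", "Second document about the Beatles."]
max_seq_length = 16  # tiny value only to make the chunking visible

# Each document is tokenized with its own <s>/</s>, then everything is
# concatenated into one long token stream ...
all_ids = sum((tokenizer(d)["input_ids"] for d in docs), [])

# ... which is split into contiguous chunks of max_seq_length tokens, so most
# chunks begin and end in the middle of a document, without <s> or </s>.
chunks = [all_ids[i:i + max_seq_length] for i in range(0, len(all_ids), max_seq_length)]
for chunk in chunks:
    print(tokenizer.decode(chunk))
```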

You can use the line_by_line option to change the script's preprocessing so that each line of your dataset is treated as a separate entry (with padding or truncation applied so every entry has length max_seq_length), in which case every input will have that <s> at the beginning (and </s> at the end).
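
With line_by_line, the preprocessing is closer to this simplified sketch (again, the lines and max_seq_length value are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Placeholder lines; with line_by_line each non-empty line of the dataset
# becomes its own example.
lines = ["First line of the dataset.", "Second line of the dataset."]
max_seq_length = 16

# Each line is tokenized separately and padded/truncated to max_seq_length,
# so every example keeps its own <s> ... </s> (followed by padding).
batch = tokenizer(lines, padding="max_length", truncation=True, max_length=max_seq_length)
for ids in batch["input_ids"]:
    print(tokenizer.decode(ids))
```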

ameet-1997 commented 3 years ago

Thanks for the information, this makes sense!