huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

RoBERTa tokenizer does not add start and end token at the beginning and end of the sentence #9502

Closed ameet-1997 closed 3 years ago

ameet-1997 commented 3 years ago

Environment info

Who can help

@mfuntowicz @sgugger

Information

Model I am using (Bert, XLNet ...): RoBERTa

The problem arises when using:

The task I am working on is: Language Modeling

To reproduce

Steps to reproduce the behavior:

  1. Run python -m pdb examples/language-modeling/run_mlm.py --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 --output_dir=/tmp/debug --model_type=roberta --config_name=roberta-base --tokenizer_name=roberta-base --learning_rate 1e-4 --num_train_epochs 2 --warmup_steps 10000 --do_train --save_steps 10000 --per_device_train_batch_size 2 --overwrite_output_dir
  2. Insert a breakpoint with the pdb command b ../../src/transformers/trainer.py:1138 (this is at the line if self.use_amp:)
  3. Press c
  4. print(self.tokenizer.decode(inputs['input_ids'][0]))

The output will look like the following:

' Photograph : The Very Best of Ringo Starr, and as a bonus track hisastered studio album Goodnight Vienna. Since his return touring in 1989, Starr has performed " Back Offogaloo " regularly in concert with the various incarnations of his All @-@ Starr Band. Commentators have interpreted the song, particularly this statement as an Starr on his former Beatles band facet McCartney. Starr denied such interpretation, instead " claiming that the song was inspired by Bolan and nothing more ", Beatles bi Robert Rodriguez writes. Starr had publicly criticised\'s solo albums McCartney 1970 ) and Ram ( 1971 ) on'

Expected behavior

Ideally, the first token should have been <s>, since that is RoBERTa's start token, and the last token should have been </s>, since that is the end token. Neither of those appears in the decoded output. Wouldn't this be a departure from the implementation in the RoBERTa paper?
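
For reference, encoding a single sentence directly with the tokenizer does add the special tokens. Here is a minimal sketch (assuming only the roberta-base tokenizer from transformers):

```python
from transformers import AutoTokenizer

# Encode one sentence directly and decode it back to inspect the special tokens.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
ids = tokenizer("Back Off Boogaloo is a song by Ringo Starr.")["input_ids"]
print(tokenizer.decode(ids))
# Prints something like: '<s>Back Off Boogaloo is a song by Ringo Starr.</s>'
```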

PS: Please ignore the strikethrough. No idea why that is appearing.

sgugger commented 3 years ago

You are inspecting an input from the training dataloader, which has been shuffled. Therefore you do not necessarily have the beginning of one of your original documents: by default, the script concatenates all your texts (after adding the special tokens at the beginning and the end) and then splits the result into contiguous chunks of length max_seq_length (unspecified here, so the default for a roberta-base model).

So the text you are inspecting comes from inside one of your original documents, which is why it doesn't have that <s> and </s>.
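
In other words, the default preprocessing behaves roughly like the simplified sketch below (not the actual run_mlm.py code; the document texts and the tiny max_seq_length are placeholders, chosen only to make the chunk boundaries visible):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Placeholder "documents"; in run_mlm.py these come from your dataset.
docs = ["First document about Ringo Starr.", "Second document about the Beatles."]
max_seq_length = 16  # tiny value only to make the chunking visible

# Each document is tokenized with its own <s>/</s>, then everything is
# concatenated into one long token stream ...
all_ids = sum((tokenizer(d)["input_ids"] for d in docs), [])

# ... which is split into contiguous chunks of max_seq_length tokens, so most
# chunks begin and end in the middle of a document, without <s> or </s>.
chunks = [all_ids[i:i + max_seq_length] for i in range(0, len(all_ids), max_seq_length)]
for chunk in chunks:
    print(tokenizer.decode(chunk))
```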

You can use the line_by_line option to change the script's preprocessing so that each line of your dataset is treated as a separate entry (with padding or truncation applied so every entry has length max_seq_length), in which case every input will have that <s> at the beginning (and </s> at the end).
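
With line_by_line, the preprocessing is closer to this simplified sketch (again, the lines and max_seq_length value are placeholders):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")

# Placeholder lines; with line_by_line each non-empty line of the dataset
# becomes its own example.
lines = ["First line of the dataset.", "Second line of the dataset."]
max_seq_length = 16

# Each line is tokenized separately and padded/truncated to max_seq_length,
# so every example keeps its own <s> ... </s> (followed by padding).
batch = tokenizer(lines, padding="max_length", truncation=True, max_length=max_seq_length)
for ids in batch["input_ids"]:
    print(tokenizer.decode(ids))
```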

ameet-1997 commented 3 years ago

Thanks for the information, this makes sense!