You are inspecting an input of the training dataloader, which has been shuffled, so you do not see the beginning of one of your original documents: by default, the script concatenates all your texts (after adding the special tokens at the beginning and the end) and then splits the result into contiguous chunks of length max_seq_length (unspecified here, so the default for a roberta-base model is used). The text you are inspecting therefore comes from inside one of your original documents, which is why it does not start with <s> or end with </s>.
You can use the line_by_line option to change the script's preprocessing so that each line of your dataset is treated as a separate entry (with padding or truncation applied so they are all of length max_seq_length), in which case every input will have <s> at the beginning.
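For reference, the grouping described above amounts to something like the following. This is a minimal sketch of the concatenate-then-chunk step; the function and variable names are illustrative, not the exact code in run_mlm.py.

```python
# Minimal sketch of the default concatenate-then-chunk preprocessing described above.
# Names are illustrative; run_mlm.py implements this with a mapped grouping function.
def group_texts(examples, max_seq_length):
    # examples["input_ids"] is a list of tokenized documents, each already
    # wrapped with the special tokens (<s> ... </s> for RoBERTa).
    concatenated = sum(examples["input_ids"], [])
    total_length = (len(concatenated) // max_seq_length) * max_seq_length
    # Contiguous chunks of fixed length: most chunks start and end in the
    # middle of a document, so they do not begin with <s> or end with </s>.
    chunks = [
        concatenated[i : i + max_seq_length]
        for i in range(0, total_length, max_seq_length)
    ]
    return {"input_ids": chunks}
```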
Thanks for the information, this makes sense!
Environment info
transformers version: 4.2.0dev0
Who can help
@mfuntowicz @sgugger
Information
Model I am using (Bert, XLNet ...): RoBERTa
The problem arises when using: the run_mlm.py file in examples/language-modeling
The tasks I am working on is: Language Modeling
To reproduce
Steps to reproduce the behavior:
python -m pdb examples/language-modeling/run_mlm.py --dataset_name=wikitext --dataset_config_name wikitext-2-raw-v1 --output_dir=/tmp/debug --model_type=roberta --config_name=roberta-base --tokenizer_name=roberta-base --learning_rate 1e-4 --num_train_epochs 2 --warmup_steps 10000 --do_train --save_steps 10000 --per_device_train_batch_size 2 --overwrite_output_dir
Set a breakpoint at the line if self.use_amp: in the trainer, then continue and decode the first input of the batch:
b ../../src/transformers/trainer.py:1138
c
print(self.tokenizer.decode(inputs['input_ids'][0]))
The output will look like the following:
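Independently of the exact decoded text (which here starts in the middle of a document), the boundary token ids can also be compared at the same breakpoint. A sketch, assuming the pdb session above, where inputs is the current batch and the Trainer holds the tokenizer:

```python
# Evaluated at the trainer.py breakpoint from the steps above.
first_id = inputs["input_ids"][0][0].item()
last_id = inputs["input_ids"][0][-1].item()
# With the default chunking these comparisons are usually False,
# because chunks rarely line up with document boundaries.
print(first_id == self.tokenizer.bos_token_id)
print(last_id == self.tokenizer.eos_token_id)
```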
Expected behavior
Ideally, the first token should have been <s>, because that is RoBERTa's start token, and the last token should have been </s>, because that is the ending token. But the first and last tokens of the decoded input are not these special tokens. Wouldn't this be a departure from the implementation in the RoBERTa paper?
PS: Please ignore the strikethrough. No idea why that is appearing.
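For comparison, this is the per-example format the issue expects. A quick check with the roberta-base tokenizer (assumes the model files can be downloaded):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
ids = tok("A single document.")["input_ids"]
print(ids[0] == tok.bos_token_id)   # True: the encoding starts with <s>
print(ids[-1] == tok.eos_token_id)  # True: the encoding ends with </s>
print(tok.decode(ids))              # <s>A single document.</s>
```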