huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

UnicodeDecodeError when using run_mlm_flax.py #18367

Closed hazalturkmen closed 2 years ago

hazalturkmen commented 2 years ago

Hi, I want to train a BERT model from scratch on a Turkish text corpus. First I created a tokenizer and loaded the text data from my local drive, as shown below:

from tokenizers import BertWordPieceTokenizer
import glob

# Build a cased WordPiece tokenizer; keep accents, since this is Turkish text
tokenizer = BertWordPieceTokenizer(
    clean_text=True,
    handle_chinese_chars=False,
    strip_accents=False,
    lowercase=False,
)

files = glob.glob('/content/drive/MyDrive/Scorpus.txt')

# train() works in place and returns None, so there is nothing to assign
tokenizer.train(
    files,
    vocab_size=32000,
    min_frequency=2,
    show_progress=True,
    special_tokens=['[PAD]', '[UNK]', '[CLS]', '[SEP]', '[MASK]'],
    limit_alphabet=1000,
    wordpieces_prefix="##",
)

# Write vocab.txt to the output directory
tokenizer.save_model("/content/bert")
from datasets import load_dataset

# Load the raw corpus (one sentence per line) as a text dataset
dataset = load_dataset(
    'text',
    data_files={'train': ['/content/drive/MyDrive/Scorpus.txt']},
    encoding='utf-8',
)
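
For reference, a quick sanity check (a hypothetical snippet, not from my original notebook) shows the corpus decodes fine in this code path:

# Hypothetical check: the file reads and decodes as UTF-8 here
print(dataset['train'].num_rows)  # number of lines / sentences
print(dataset['train'][0])        # first sentence of the corpus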

Then I run run_mlm_flax.py:

!python run_mlm_flax.py \
    --output_dir="/content/bert" \
    --model_type="bert" \
    --config_name="/content/bert" \
    --tokenizer_name="/content/bert" \
    --line_by_line=True \
    --dataset_name="text" \
    --dataset_config_name="default-b06526c46e9384b1" \
    --max_seq_length="512" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="128" \
    --learning_rate="3e-4" \
    --overwrite_output_dir \
    --num_train_epochs="16" \
    --adam_beta1="0.9" 

And I get the following error:


[19:02:31] - INFO - __main__ - Training/evaluation parameters TrainingArguments(output_dir='/content/bert', overwrite_output_dir=True, do_train=False, do_eval=False, per_device_train_batch_size=128, per_device_eval_batch_size=8, learning_rate=0.0003, weight_decay=0.01, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, adafactor=False, num_train_epochs=16.0, warmup_steps=0, logging_steps=500, save_steps=500, eval_steps=None, seed=42, push_to_hub=False, hub_model_id=None, hub_token=None)
[19:02:31] - WARNING - datasets.builder - Using custom data configuration default-b06526c46e9384b1-d2418f61cbe4411a
Downloading and preparing dataset text/default-b06526c46e9384b1 to /root/.cache/huggingface/datasets/text/default-b06526c46e9384b1-d2418f61cbe4411a/0.0.0/21a506d1b2b34316b1e82d0bd79066905d846e5d7e619823c0dd338d6f1fa6ad...
Downloading data files: 100% 1/1 [00:00<00:00, 5190.97it/s]
Extracting data files: 100% 1/1 [00:00<00:00, 543.80it/s]
Traceback (most recent call last):
  File "run_mlm_flax.py", line 880, in <module>
    main()
  File "run_mlm_flax.py", line 430, in main
    use_auth_token=True if model_args.use_auth_token else None,
  File "/usr/local/lib/python3.7/dist-packages/datasets/load.py", line 1751, in load_dataset
    use_auth_token=use_auth_token,
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 705, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 793, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/datasets/builder.py", line 1275, in _prepare_split
    generator, unit=" tables", leave=False, disable=(not logging.is_progress_bar_enabled())
  File "/usr/local/lib/python3.7/dist-packages/tqdm/std.py", line 1195, in __iter__
    for obj in iterable:
  File "/usr/local/lib/python3.7/dist-packages/datasets/packaged_modules/text/text.py", line 77, in _generate_tables
    batch = f.read(self.config.chunksize)
  File "/usr/lib/python3.7/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

Note: I am using Google Colab.

LysandreJik commented 2 years ago

If you first serialize your dataset to local files and then use that path as the dataset name, does it work then?
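
For example, something along these lines (a rough sketch; the output path is just a placeholder):

from datasets import load_dataset

# Re-load the raw corpus, then write it back out as plain UTF-8 text,
# one sentence per line, so the script can read it directly
dataset = load_dataset(
    'text',
    data_files={'train': ['/content/drive/MyDrive/Scorpus.txt']},
    encoding='utf-8',
)

with open('/content/corpus_utf8.txt', 'w', encoding='utf-8') as f:
    for example in dataset['train']:
        f.write(example['text'] + '\n')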

hazalturkmen commented 2 years ago

Thank you for the answer. I am new to the Hugging Face datasets library. My corpus contains one sentence per line. How can I serialize my dataset?

LysandreJik commented 2 years ago

Let me cc @albertvillanova, who might have seen this error before :)

albertvillanova commented 2 years ago

Hi @hazalturkmen, thanks for reporting.

Looking at your script call, I see you pass "default-b06526c46e9384b1" as dataset_config_name, which points to the cache (containing the binary Arrow file instead of the text file). I guess this is the cause of the issue: the text loader then tries to decode a non-text file as UTF-8, hence the UnicodeDecodeError.

When running the script run_mlm_flax.py with local data files, you should instead pass the parameter train_file (and validation_file, if applicable).

In summary, when calling run_mlm_flax.py, drop --dataset_name and --dataset_config_name and point --train_file at the text file directly. A sketch of the corrected call, reusing the paths from your original command (adjust as needed):
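
!python run_mlm_flax.py \
    --output_dir="/content/bert" \
    --model_type="bert" \
    --config_name="/content/bert" \
    --tokenizer_name="/content/bert" \
    --train_file="/content/drive/MyDrive/Scorpus.txt" \
    --line_by_line=True \
    --max_seq_length="512" \
    --weight_decay="0.01" \
    --per_device_train_batch_size="128" \
    --learning_rate="3e-4" \
    --overwrite_output_dir \
    --num_train_epochs="16" \
    --adam_beta1="0.9"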

Please, let me know if this fixes your issue.

hazalturkmen commented 2 years ago

Thank you @albertvillanova! That fixed the issue :+1: