go-inoue opened this issue 4 years ago
I am probably facing similar issues with wiki40b_en_100_0.
Could you try to run dataset = load_dataset("text", data_files=file_path, split="train") once before calling the script?
It looks like several processes try to create the dataset in Arrow format at the same time. If the dataset is already created, it should be fine.
Thanks! I tested on 328MB of text data on n1-standard-8 (8 vCPUs, 30 GB memory). The main script ran without any issue, but it seems to require a huge amount of disk space.
As suggested, I ran the following script before running the pre-training command with xla_spawn.py.
from nlp import load_dataset
file_path="your_file_name"
load_dataset("text", data_files=file_path, split="train")
This will create text-train.arrow under the default cache directory. Then, I run the script with xla_spawn.py, and it loads the data from the cached file. My understanding is that there is no other way but to do this two-step process with the current version (0.4) of nlp.
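For context, the second step is just the usual launch through xla_spawn.py (a sketch; flags other than --train_data_file are placeholders for the actual pre-training arguments):
python xla_spawn.py --num_cores 8 \
run_language_modeling.py \
--train_data_file=your_file_name \
--output_dir=output  # plus the rest of the usual pre-training arguments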
During another caching process that happens in the main script:
08/26/2020 09:19:51 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 09:19:53 - INFO - nlp.arrow_dataset - Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-f90f341e5308a74698d872bcc88f9c0e.arrow
nlp generates a temporary file per core, each of which is three times larger than the original text data. If each process is actually writing to disk, you will need a huge amount of drive space. (Maybe I'm missing something.)
-rw-r--r-- 1 ***** ***** 674 Aug 26 09:19 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 09:19 LICENSE
-rw-r--r-- 1 ***** ***** 332M Aug 26 09:10 text-train.arrow
-rw------- 1 ***** ***** 940M Aug 26 09:31 tmp0k43sazw
-rw------- 1 ***** ***** 940M Aug 26 09:31 tmp7sxs9mj5
-rw------- 1 ***** ***** 939M Aug 26 09:31 tmpbbiqw2vp
-rw------- 1 ***** ***** 937M Aug 26 09:31 tmpjxb5ptyu
-rw------- 1 ***** ***** 933M Aug 26 09:31 tmpk3hkdh0e
-rw------- 1 ***** ***** 944M Aug 26 09:31 tmpnoalwftz
-rw------- 1 ***** ***** 931M Aug 26 09:31 tmpuxdr_dz3
-rw------- 1 ***** ***** 945M Aug 26 09:31 tmpxjyuy6dk
After the caching process, they seem to be merged into one file.
-rw------- 1 ***** ***** 989M Aug 26 09:32 cache-f90f341e5308a74698d872bcc88f9c0e.arrow
-rw-r--r-- 1 ***** ***** 674 Aug 26 09:19 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 09:19 LICENSE
-rw-r--r-- 1 ***** ***** 332M Aug 26 09:10 text-train.arrow
Again, it looks like every process tries to tokenize the full dataset at the same time.
If you do the tokenization once before calling xla_spawn.py, then each process will use the tokenized cached file cache-f90f341e5308a74698d872bcc88f9c0e.arrow and not recompute it.
Not sure if there's a better way to do that cc @julien-c @thomwolf
I wrote a separate script just for preparing a cached file, including tokenization. Each process did use the tokenized cached file.
Currently, I'm testing the pipeline on 24GB of text data. It took about 1.5 hours to create a cached file on n1-highmem-16 (16 vCPUs, 104 GB memory). I assume loading this cached file in the main script with xla_spawn.py won't be an issue (even if there are 8 processes).
total 98G
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 13:38 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 12:24 ..
-rw------- 1 ***** ***** 74G Aug 26 13:38 cache-a7aa04134ba7b1aff5d9710f14a4e334.arrow
-rw-r--r-- 1 ***** ***** 681 Aug 26 12:24 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 12:24 LICENSE
-rw-r--r-- 1 ***** ***** 25G Aug 26 12:24 text-train.arrow
Yes, loading the cached file should be fine from different processes.
Sorry, I thought it was working, but actually the second call doesn't use the cached file that was generated separately; it generates another cache-****.arrow file with a different name. If I run the training script again (with xla_spawn.py), it will use the second cached file, which was generated by the training script itself in the previous run.
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 15:35 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:29 ..
-rw------- 1 ***** ***** 99M Aug 26 15:35 cache-0d77dfce704493dbe63f071eed6a5431.arrow
-rw------- 1 ***** ***** 99M Aug 26 15:29 cache-69633651476e943b93c89ace715f9487.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 26 15:33 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 15:33 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 26 15:29 text-train.arrow
So if I understand correctly, it means that the cached file generated by your separate script is different from the one used by the training script?
Yes. Here is the sequence of events:
1. cache-69633651476e943b93c89ace715f9487.arrow is generated with the separate script.
2. The training script is run with xla_spawn.py.
3. cache-69633651476e943b93c89ace715f9487.arrow is not used.
4. cache-0d77dfce704493dbe63f071eed6a5431.arrow is created instead.
5. Training proceeds with the newly created cache file.
Now, if I kill the process at step 5 and do step 2 again, it will use cache-0d77dfce704493dbe63f071eed6a5431.arrow (the cached file created at step 4) without any issue.
I used the following to generate the first cached file.
dataset = load_dataset("text", data_files=file_path, split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                           truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
Downloading and preparing dataset text/default-e84dd29acc4ad9ef (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Dataset text downloaded and prepared to /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d. Subsequent calls will reuse this data.
There's a file named cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow, so it did create a cached file.
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 15:59 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:58 ..
-rw------- 1 ***** ***** 99M Aug 26 15:59 cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 26 15:58 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 15:58 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 26 15:58 text-train.arrow
cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow should be used in run_language_modeling.py (modified version using nlp) with xla_spawn.py. But it looks like it's creating a new cached file:
08/26/2020 16:13:03 - INFO - filelock - Lock 139635836351096 released on /home/*****/.cache/huggingface/datasets/3e34209a2741375a1db1ff03bf1abba1a9bd0e6016912d3ead0114b9d1ca2685.202fa4f84f552bff1f5400ae012663839c61efb3de068c6c8722d34ac0ea6192.py.lock
08/26/2020 16:13:03 - WARNING - nlp.builder - Using custom data configuration default
08/26/2020 16:13:03 - INFO - nlp.builder - Overwrite dataset info from restored data version.
08/26/2020 16:13:03 - INFO - nlp.info - Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.builder - Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:13:03 - INFO - nlp.builder - Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 16:13:03 - INFO - nlp.builder - Overwrite dataset info from restored data version.
08/26/2020 16:13:03 - INFO - nlp.info - Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.builder - Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:13:03 - INFO - nlp.builder - Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 16:13:05 - INFO - nlp.arrow_dataset - Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
0%| | 0/100 [00:00<?, ?it/s]
08/26/2020 16:13:05 - INFO - nlp.arrow_dataset - Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
There are two cached files in the directory:
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 16:14 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:58 ..
-rw------- 1 ***** ***** 99M Aug 26 16:14 cache-0d77dfce704493dbe63f071eed6a5431.arrow
-rw------- 1 ***** ***** 99M Aug 26 15:59 cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 26 16:13 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 16:13 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 26 15:58 text-train.arrow
If I kill the process, and run it again, it will use the second cached file.
08/26/2020 16:19:52 - WARNING - nlp.builder - Using custom data configuration default
08/26/2020 16:19:52 - INFO - nlp.builder - Overwrite dataset info from restored data version.
08/26/2020 16:19:52 - INFO - nlp.info - Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:19:52 - INFO - nlp.builder - Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:19:52 - INFO - nlp.builder - Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:19:52 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 16:19:53 - INFO - nlp.arrow_dataset - Loading cached processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
08/26/2020 16:19:53 - INFO - nlp.arrow_dataset - Set __getitem__(key) output type to torch for ['input_ids'] columns (when key is int or slice) and don't output other (un-formatted) columns.
Thanks for all the details. The two cached files are supposed to be the same. I suspect that the caching has a problem with the tokenizer. Which tokenizer did you use?
I trained a byte-level BPE tokenizer on my data with the tokenizers library, following this example. I put the resulting model files in a directory named "model_name", along with config.json, which is the original RoBERTa config file.
%ls model_name
config.json merges.txt vocab.json
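For reference, the training step looked roughly like the snippet below (a sketch following the tokenizers examples; vocab_size, min_frequency, and the special-token list are assumptions, and save_model may be named differently in older tokenizers versions):
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the raw text file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["your_training_data"], vocab_size=52000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
# Writes vocab.json and merges.txt into the model directory.
tokenizer.save_model("model_name")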
This is the line where run_language_modeling.py loads the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
I use "model_name"
for model_args.tokenizer_name
. I don't specify model_args.cache_dir
. It is 'None' by default.
In my separate script for caching, I'm using use_fast=True when initializing the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(args.config_name, use_fast=True)
I wasn't using that option in the main script. That could be the reason...
Yeah, that could definitely explain why you have two different cache files. Let me know if using the same tokenizer on both sides fixes the issue.
It still creates a new file even if I remove use_fast=True...
Here's the script used to create a cached file.
#!/usr/bin/env python3
import argparse

from transformers import AutoTokenizer
from nlp import load_dataset


def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--config_name', type=str, help='Pretrained config name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.config_name)
    dataset = load_dataset("text", data_files=args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])


if __name__ == "__main__":
    main()
Here's how the data is loaded in the modified run_language_modeling.py. [original function]
def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    split = "validation" if evaluate else "train"
    if args.line_by_line:
        # return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
        dataset = load_dataset("text", data_files=file_path, split="train")
        dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                                   truncation=True, max_length=args.block_size), batched=True)
        dataset.set_format(type='torch', columns=['input_ids'])
        return dataset
    else:
        return TextDataset(
            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )
Probably I don't need this part in the main script,
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                           truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
and can simply do this?
dataset = load_dataset("text", data_files=file_path, split="train")
return dataset
You need this part in the main script, or it will use the dataset that is not tokenized.
I can see that the tokenizer in run_language_modeling.py is not instantiated the same way as in your separate script. Indeed, at line 196 we can see:
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
Could you try to make it so they are instantiated the exact same way, please?
I updated my separate script, but it's creating a new cached file again. If I don't use model_args.cache_dir, both get None, so they should be the same.
#!/usr/bin/env python3
import argparse

from transformers import AutoTokenizer
from nlp import load_dataset


def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--tokenizer_name', type=str, help='Pretrained tokenizer name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--cache_dir', type=str, default=None, help='Where do you want to store the pretrained models downloaded from s3')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    model_args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
    dataset = load_dataset("text", data_files=model_args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=model_args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])


if __name__ == "__main__":
    main()
Is there a way to specify the cache file to load, and skip the re-computation?
Could you also check that the args.block_size used in the lambda function is the same as well?
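For instance, a minimal check (a sketch; variable names follow each script) could be to print the inputs that determine the cached file in both scripts and compare:
# Print these in both the separate script and the training script; any
# mismatch (tokenizer class, cache_dir, block_size) can yield a different
# cache file.
print(type(tokenizer).__name__)   # e.g. RobertaTokenizer vs RobertaTokenizerFast
print(repr(model_args.cache_dir))
print(model_args.block_size)      # args.block_size in the training script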
Here's a minimal working example to reproduce this issue.
Assumptions:
- transformers and nlp are installed.
- Tokenizer files (config.json, merges.txt, vocab.json) are under the directory named model_name.
- xla_spawn.py is available (download from https://github.com/huggingface/transformers/blob/master/examples/xla_spawn.py).
- The following script is saved as prepare_cached_dataset.py.
#!/usr/bin/env python3
import argparse

from transformers import AutoTokenizer
from nlp import load_dataset


def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--tokenizer_name', type=str, help='Pretrained tokenizer name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--cache_dir', type=str, default=None, help='Where do you want to store the pretrained models downloaded from s3')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    parser.add_argument('--tpu_num_cores', type=int, default=1, help='Number of TPU cores to use (1 or 8). For xla_spawn.py')
    model_args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=True)
    dataset = load_dataset("text", data_files=model_args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=model_args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()


if __name__ == "__main__":
    main()
Prepare your_training_data with some text file, then run:
export TRAIN_DATA=your_training_data
python prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA
Check the cached directory.
ls -lha /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
total 132M
drwxr-xr-x 2 ***** ***** 4.0K Aug 28 13:08 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 28 13:08 ..
-rw------- 1 ***** ***** 99M Aug 28 13:08 cache-bfc7cb0702426d19242db5e8c079f04b.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 28 13:08 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 28 13:08 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 28 13:08 text-train.arrow
Run the same script again. (The output should be just Using custom data configuration default.)
python prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA
Check the cached directory.
ls -lha /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
total 132M
drwxr-xr-x 2 ***** ***** 4.0K Aug 28 13:08 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 28 13:08 ..
-rw------- 1 ***** ***** 99M Aug 28 13:08 cache-bfc7cb0702426d19242db5e8c079f04b.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 28 13:20 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 28 13:20 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 28 13:08 text-train.arrow
The cached file (cache-bfc7cb0702426d19242db5e8c079f04b.arrow) is reused.
Now, run this script with xla_spawn.py. Ideally, it should reuse the cached file; however, you will see each process creating a cache file again.
python xla_spawn.py --num_cores 8 \
prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA
ls -lha /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
total 230M
drwxr-xr-x 2 ***** ***** 4.0K Aug 28 13:25 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 28 13:08 ..
-rw------- 1 ***** ***** 99M Aug 28 13:08 cache-bfc7cb0702426d19242db5e8c079f04b.arrow
-rw------- 1 ***** ***** 99M Aug 28 13:25 cache-e0e2313e49c8a110aafcc8133154c19a.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 28 13:24 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 28 13:24 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 28 13:08 text-train.arrow
I ended up specifying the cache_file_name argument when I call the map function.
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True, truncation=True, max_length=args.block_size),
                      batched=True,
                      cache_file_name=cache_file_name)
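As long as every process passes the same fixed cache_file_name, they all read and write one Arrow file instead of each deriving its own. A sketch (the path and file name below are placeholders, not values from this thread):
import os

# Any fixed path shared by all processes works.
cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")
cache_file_name = os.path.join(cache_dir, "tokenized-block512.arrow")

dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                         truncation=True, max_length=args.block_size),
    batched=True,
    cache_file_name=cache_file_name)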
Note: the text dataset in nlp does not strip "\n". If you want the same output as in LineByLineTextDataset, you would need to create your own dataset class where you replace line with line.strip() here.
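Alternatively, a lighter-weight workaround could be to strip the newlines inside the same map() call that tokenizes, so no custom dataset script is needed (a sketch, assuming tokenizer and args.block_size from the surrounding code):
# With batched=True, ex["text"] is a list of lines; strip each one to
# approximate LineByLineTextDataset's output.
dataset = dataset.map(
    lambda ex: tokenizer([line.strip() for line in ex["text"]],
                         add_special_tokens=True, truncation=True,
                         max_length=args.block_size),
    batched=True)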
Hi,
I'm getting a "File exists" error when I use the text dataset for pre-training a RoBERTa model using transformers (3.0.2) and nlp (0.4.0) on a VM with a TPU (v3-8).
I modified line 131 in the original run_language_modeling.py as follows:
When I run this with xla_spawn.py, I get the following error (it produces one message per TPU core, which I believe is fine).
It seems the current version doesn't take distributed training processes into account, as in this example?
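For reference, the barrier pattern that last sentence alludes to could look roughly like this on TPU (a sketch adapted from the torch.distributed idiom in the transformers examples; the function name and rendezvous tag are illustrative, not from this codebase):
import torch_xla.core.xla_model as xm
from nlp import load_dataset

def build_dataset(file_path, tokenizer, block_size):
    # Let process 0 build the tokenized cache while the others wait at the
    # rendezvous; afterwards every process loads from the existing cache.
    if xm.get_ordinal() != 0:
        xm.rendezvous("dataset_caching")
    dataset = load_dataset("text", data_files=file_path, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=block_size),
                          batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])
    if xm.get_ordinal() == 0:
        xm.rendezvous("dataset_caching")
    return dataset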