huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

File exists error when used with TPU #532

Open go-inoue opened 4 years ago

go-inoue commented 4 years ago

Hi,

I'm getting a "File exists" error when I use text dataset for pre-training a RoBERTa model using transformers (3.0.2) and nlp(0.4.0) on a VM with TPU (v3-8).

I modified line 131 in the original run_language_modeling.py as follows:

# line 131: return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
dataset = load_dataset("text", data_files=file_path, split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                        truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
return dataset

When I run this with xla_spawn.py, I get the following error (it produces one message per TPU core, which I believe is fine).

It seems the current version doesn't take distributed training processes into account, as in this example?

08/25/2020 13:59:41 - WARNING - nlp.builder -   Using custom data configuration default
08/25/2020 13:59:43 - INFO - nlp.builder -   Generating dataset text (/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
Downloading and preparing dataset text/default-b0932b2bdbb63283 (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Exception in device=TPU:0: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
Traceback (most recent call last):
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/torch_xla/distributed/xla_multiprocessing.py", line 231, in _start_fn
    fn(gindex, *args)
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 300, in _mp_fn
    main()
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 240, in main
    train_dataset = get_dataset(data_args, tokenizer=tokenizer) if training_args.do_train else None
  File "/home/*****/huggingface_roberta/run_language_modeling.py", line 134, in get_dataset
    dataset = load_dataset("text", data_files=file_path, split="train")
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/load.py", line 546, in load_dataset
    download_config=download_config, download_mode=download_mode, ignore_verifications=ignore_verifications,
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 450, in download_and_prepare
    with incomplete_dir(self._cache_dir) as tmp_data_dir:
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/contextlib.py", line 81, in __enter__
    return next(self.gen)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/nlp/builder.py", line 422, in incomplete_dir
    os.makedirs(tmp_dir)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d.incomplete'
(each TPU core emits the same warning, info messages, exception, and traceback)
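A pattern along those lines would let only the master process build the cache while the other processes wait at a rendezvous. A minimal sketch, assuming torch_xla's xm.is_master_ordinal() and xm.rendezvous() and the same load_dataset/map calls as above (this is not something nlp 0.4.0 does automatically):

import torch_xla.core.xla_model as xm
from nlp import load_dataset

def build_tokenized_dataset(file_path, tokenizer, block_size):
    # Non-master processes wait here until the master has written the cache.
    if not xm.is_master_ordinal():
        xm.rendezvous("dataset_preparation")

    dataset = load_dataset("text", data_files=file_path, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])

    # The master reaches the rendezvous last, releasing the other processes,
    # which then reuse the cache files written above instead of rebuilding them.
    if xm.is_master_ordinal():
        xm.rendezvous("dataset_preparation")
    return dataset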
praetr commented 4 years ago

I am probably facing a similar issue with wiki40b_en_100_0.

lhoestq commented 4 years ago

Could you try to run dataset = load_dataset("text", data_files=file_path, split="train") once before calling the script?

It looks like several processes try to create the dataset in Arrow format at the same time. If the dataset is already created, it should be fine.

go-inoue commented 4 years ago

Thanks! I tested on 328MB of text data on an n1-standard-8 (8 vCPUs, 30 GB memory). The main script ran without any issue, but it seems to require a huge amount of disk space.

As suggested, I ran the following script before running the pre-training command with xla_spawn.py.

from nlp import load_dataset

file_path="your_file_name"
load_dataset("text", data_files=file_path, split="train")

This will create text-train.arrow under the default cache directory. Then I run the script with xla_spawn.py, and it loads the data from the cached file. My understanding is that there is no way around this two-step process with the current version (0.4) of nlp.

Another caching step happens in the main script:

08/26/2020 09:19:51 - INFO - nlp.utils.info_utils -   All the checksums matched successfully for post processing resources
08/26/2020 09:19:53 - INFO - nlp.arrow_dataset -   Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-f90f341e5308a74698d872bcc88f9c0e.arrow

nlp generates one temporary file per core, each of which is three times larger than the original text data. If each process is actually writing to disk, you need a huge amount of drive space. (Maybe I'm missing something.)

-rw-r--r-- 1 ***** *****  674 Aug 26 09:19 dataset_info.json
-rw-r--r-- 1 ***** *****    0 Aug 26 09:19 LICENSE
-rw-r--r-- 1 ***** ***** 332M Aug 26 09:10 text-train.arrow
-rw------- 1 ***** ***** 940M Aug 26 09:31 tmp0k43sazw
-rw------- 1 ***** ***** 940M Aug 26 09:31 tmp7sxs9mj5
-rw------- 1 ***** ***** 939M Aug 26 09:31 tmpbbiqw2vp
-rw------- 1 ***** ***** 937M Aug 26 09:31 tmpjxb5ptyu
-rw------- 1 ***** ***** 933M Aug 26 09:31 tmpk3hkdh0e
-rw------- 1 ***** ***** 944M Aug 26 09:31 tmpnoalwftz
-rw------- 1 ***** ***** 931M Aug 26 09:31 tmpuxdr_dz3
-rw------- 1 ***** ***** 945M Aug 26 09:31 tmpxjyuy6dk

After the caching process, they seem to be merged into one file.

-rw------- 1  ***** ***** 989M Aug 26 09:32 cache-f90f341e5308a74698d872bcc88f9c0e.arrow
-rw-r--r-- 1  ***** *****  674 Aug 26 09:19 dataset_info.json
-rw-r--r-- 1  ***** *****    0 Aug 26 09:19 LICENSE
-rw-r--r-- 1  ***** ***** 332M Aug 26 09:10 text-train.arrow
lhoestq commented 4 years ago

Again, it looks like every process tries to tokenize the full dataset at the same time. If you do the tokenization once before calling xla_spawn.py, then each process will use the tokenized cache file cache-f90f341e5308a74698d872bcc88f9c0e.arrow instead of recomputing it.

Not sure if there's a better way to do that cc @julien-c @thomwolf

go-inoue commented 4 years ago

I wrote a separate script just for preparing a cached file, including tokenization. Each process did use the tokenized cached file.

Currently I'm testing the pipeline on 24GB of text data. It took about 1.5 hours to create the cached file on an n1-highmem-16 (16 vCPUs, 104 GB memory). I assume loading this cached file in the main script with xla_spawn.py won't be an issue (even with 8 processes).

total 98G
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 13:38 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 12:24 ..
-rw------- 1 ***** *****  74G Aug 26 13:38 cache-a7aa04134ba7b1aff5d9710f14a4e334.arrow
-rw-r--r-- 1 ***** *****  681 Aug 26 12:24 dataset_info.json
-rw-r--r-- 1 ***** *****    0 Aug 26 12:24 LICENSE
-rw-r--r-- 1 ***** *****  25G Aug 26 12:24 text-train.arrow
lhoestq commented 4 years ago

Yes loading the cached file should be fine from different processes

go-inoue commented 4 years ago

Sorry, I thought it was working, but actually the second call doesn't use the cached file that was generated separately; it generates another cache-****.arrow file with a different name. If I run the training script again (with xla_spawn.py), it will use the second cached file, which was generated by the training script itself in the previous run.

drwxr-xr-x 2 ***** ***** 4.0K Aug 26 15:35 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:29 ..
-rw------- 1 ***** *****  99M Aug 26 15:35 cache-0d77dfce704493dbe63f071eed6a5431.arrow
-rw------- 1 ***** *****  99M Aug 26 15:29 cache-69633651476e943b93c89ace715f9487.arrow
-rw-r--r-- 1 ***** *****  670 Aug 26 15:33 dataset_info.json
-rw-r--r-- 1 ***** *****    0 Aug 26 15:33 LICENSE
-rw-r--r-- 1 ***** *****  33M Aug 26 15:29 text-train.arrow
lhoestq commented 4 years ago

So if I understand correctly, the cached file generated by your separate script is different from the one used by the training script?

go-inoue commented 4 years ago

Yes.

  1. cache-69633651476e943b93c89ace715f9487.arrow was generated with a separate script.
  2. I ran the entire script with xla_spawn.py.
  3. cache-69633651476e943b93c89ace715f9487.arrow is not used.
  4. cache-0d77dfce704493dbe63f071eed6a5431.arrow is created.
  5. training starts...

Now, if I kill the process at step 5 and do step 2 again, it will use cache-0d77dfce704493dbe63f071eed6a5431.arrow (the cached file created at step 4) without any issue.

I used the following to generate the first cached file.

dataset = load_dataset("text", data_files=file_path, split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                        truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
go-inoue commented 4 years ago
  1. Here's the log from the first step.
    Downloading and preparing dataset text/default-e84dd29acc4ad9ef (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
    Dataset text downloaded and prepared to /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d. Subsequent calls will reuse this data.

    There's a file named cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow, so it did create a cached file.

    drwxr-xr-x 2 ***** ***** 4.0K Aug 26 15:59 .
    drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:58 ..
    -rw------- 1 ***** *****  99M Aug 26 15:59 cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow
    -rw-r--r-- 1 ***** *****  670 Aug 26 15:58 dataset_info.json
    -rw-r--r-- 1 ***** *****    0 Aug 26 15:58 LICENSE
    -rw-r--r-- 1 ***** *****  33M Aug 26 15:58 text-train.arrow
  2. Ideally, cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow should be used in run_language_modeling.py (the modified version using nlp) with xla_spawn.py, but it looks like it's creating a new cached file.
08/26/2020 16:13:03 - INFO - filelock -   Lock 139635836351096 released on /home/*****/.cache/huggingface/datasets/3e34209a2741375a1db1ff03bf1abba1a9bd0e6016912d3ead0114b9d1ca2685.202fa4f84f552bff1f5400ae012663839c61efb3de068c6c8722d34ac0ea6192.py.lock
08/26/2020 16:13:03 - WARNING - nlp.builder -   Using custom data configuration default
08/26/2020 16:13:03 - INFO - nlp.builder -   Overwrite dataset info from restored data version.
08/26/2020 16:13:03 - INFO - nlp.info -   Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.builder -   Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:13:03 - INFO - nlp.builder -   Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.utils.info_utils -   All the checksums matched successfully for post processing resources
08/26/2020 16:13:03 - INFO - nlp.builder -   Overwrite dataset info from restored data version.
08/26/2020 16:13:03 - INFO - nlp.info -   Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.builder -   Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:13:03 - INFO - nlp.builder -   Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.utils.info_utils -   All the checksums matched successfully for post processing resources
08/26/2020 16:13:05 - INFO - nlp.arrow_dataset -   Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
08/26/2020 16:13:05 - INFO - nlp.arrow_dataset -   Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow

There are two cached files in the directory:

drwxr-xr-x 2 ***** ***** 4.0K Aug 26 16:14 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:58 ..
-rw------- 1 ***** *****  99M Aug 26 16:14 cache-0d77dfce704493dbe63f071eed6a5431.arrow
-rw------- 1 ***** *****  99M Aug 26 15:59 cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow
-rw-r--r-- 1 ***** *****  670 Aug 26 16:13 dataset_info.json
-rw-r--r-- 1 ***** *****    0 Aug 26 16:13 LICENSE
-rw-r--r-- 1 ***** *****  33M Aug 26 15:58 text-train.arrow

If I kill the process, and run it again, it will use the second cached file.

08/26/2020 16:19:52 - WARNING - nlp.builder -   Using custom data configuration default
08/26/2020 16:19:52 - INFO - nlp.builder -   Overwrite dataset info from restored data version.
08/26/2020 16:19:52 - INFO - nlp.info -   Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:19:52 - INFO - nlp.builder -   Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:19:52 - INFO - nlp.builder -   Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:19:52 - INFO - nlp.utils.info_utils -   All the checksums matched successfully for post processing resources
08/26/2020 16:19:53 - INFO - nlp.arrow_dataset -   Loading cached processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
08/26/2020 16:19:53 - INFO - nlp.arrow_dataset -   Set __getitem__(key) output type to torch for ['input_ids'] columns  (when key is int or slice) and don't output other (un-formatted) columns.
lhoestq commented 4 years ago

Thanks for all the details. The two cached files are supposed to be the same. I suspect that the caching has a problem with the tokenizer. Which tokenizer did you use?

go-inoue commented 4 years ago

I trained a byte-level BPE tokenizer on my data with the tokenizers library, following this example.

I put these model files in a directory named "model_name", together with config.json, which is the original RoBERTa config file.

%ls  model_name
config.json     merges.txt      vocab.json

This is the line where run_language_modeling.py loads the tokenizer.

tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)

I use "model_name" for model_args.tokenizer_name. I don't specify model_args.cache_dir. It is 'None' by default.

go-inoue commented 4 years ago

In my separate caching script, I'm using use_fast=True when initializing the tokenizer.

tokenizer = AutoTokenizer.from_pretrained(args.config_name, use_fast=True)

I wasn't using that option in the main script. That could be the reason...

lhoestq commented 4 years ago

Yeah, that could definitely explain why you have two different cache files. Let me know if using the same tokenizer on both sides fixes the issue.

go-inoue commented 4 years ago

It still creates a new file even if I remove use_fast=True...

Here's the script used to create a cached file.

#!/usr/bin/env python3

import argparse

from transformers import AutoTokenizer

from nlp import load_dataset

def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--config_name', type=str, help='Pretrained config name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.config_name)

    dataset = load_dataset("text", data_files=args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                truncation=True, max_length=args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])

if __name__ == "__main__":
    main()

Here's how the data is loaded in the modified run_language_modeling.py. [original function]

def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    split = "validation" if evaluate else "train"
    if args.line_by_line:
        # return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
        dataset = load_dataset("text", data_files=file_path, split="train")
        dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                              truncation=True, max_length=args.block_size), batched=True)
        dataset.set_format(type='torch', columns=['input_ids'])
        return dataset

    else:
        return TextDataset(
            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )

Probably I don't need this part in the main script,

dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                           truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])

and simply do this?

dataset = load_dataset("text", data_files=file_path, split="train")
return dataset
lhoestq commented 4 years ago

You need this part in the main script, or it will use the dataset that is not tokenized.

lhoestq commented 4 years ago

I can see that the tokenizer in run_language_modeling.py is not instantiated the same way as in your separate script. Indeed, at L196 we can see:

tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)

Could you try to make it so they are instantiated the exact same way, please?

go-inoue commented 4 years ago

I updated my separate script, but it's still creating a new cached file. If I don't use model_args.cache_dir, both will get None, so they should be the same.

#!/usr/bin/env python3
import argparse

from transformers import AutoTokenizer
from nlp import load_dataset

def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--tokenizer_name', type=str, help='Pretrained tokenizer name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--cache_dir', type=str, default=None, help='Where do you want to store the pretrained models downloaded from s3')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')

    model_args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)

    dataset = load_dataset("text", data_files=model_args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                truncation=True, max_length=model_args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])

if __name__ == "__main__":
    main()

Is there a way to specify the cache file to load, and skip the re-computation?

lhoestq commented 4 years ago

Could you also check that the args.block_size used in the lambda function is the same in both scripts?
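For context on why block_size matters: as far as I remember, run_language_modeling.py adjusts it before tokenizing, roughly as below (treat this as an assumption about transformers 3.0.2, not a verified quote), so a preparation script that passes --block_size=-1 straight into max_length would tokenize differently.

# Recalled from run_language_modeling.py; shown here only as an assumption.
if data_args.block_size <= 0:
    # fall back to the tokenizer's maximum length
    data_args.block_size = tokenizer.max_len
else:
    data_args.block_size = min(data_args.block_size, tokenizer.max_len)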

go-inoue commented 4 years ago

Here's a minimal working example to reproduce this issue.

Assumption:

#!/usr/bin/env python3
import argparse
from transformers import AutoTokenizer
from nlp import load_dataset

def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--tokenizer_name', type=str, help='Pretrained tokenizer name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--cache_dir', type=str, default=None, help='Where do you want to store the pretrained models downloaded from s3')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    parser.add_argument('--tpu_num_cores', type=int, default=1, help='Number of TPU cores to use (1 or 8). For xla_spawn.py')
    model_args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=True)

    dataset = load_dataset("text", data_files=model_args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                truncation=True, max_length=model_args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])

def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()

if __name__ == "__main__":
    main()

export TRAIN_DATA=your_training_data

python prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA

python xla_spawn.py --num_cores 8 \
prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA
go-inoue commented 4 years ago

I ended up specifying the cache_file_name argument when I call the map function.

dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True, truncation=True, max_length=args.block_size),
                      batched=True,
                      cache_file_name=cache_file_name)
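
With an explicit name, the preparation script and the training script point at the same file and the map result is reused. A sketch of how cache_file_name can be built (the naming scheme below is just an illustration):

import os

# Hypothetical naming scheme: keep the cache next to the input data and
# encode the tokenization settings in the file name.
cache_file_name = os.path.join(os.path.dirname(os.path.abspath(file_path)),
                               "tokenized_block{}.arrow".format(args.block_size))

dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                           truncation=True, max_length=args.block_size),
                      batched=True,
                      cache_file_name=cache_file_name)  # same path in both scripts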

Note: