go-inoue opened this issue 4 years ago
I am probably facing similar issues with wiki40b_en_100_0.
Could you try to run dataset = load_dataset("text", data_files=file_path, split="train") once before calling the script?
It looks like several processes try to create the dataset in Arrow format at the same time. If the dataset is already created, it should be fine.
Thanks! I tested on 328MB of text data on n1-standard-8 (8 vCPUs, 30 GB memory). The main script ran without any issue, but it seems to require a huge amount of disk space.
As suggested, I ran the following script before running the pre-training command with xla_spawn.py.
from nlp import load_dataset
file_path="your_file_name"
load_dataset("text", data_files=file_path, split="train")
This will create text-train.arrow under the default cache directory. Then, I run the script with xla_spawn.py, and it loads the data from the cached file. My understanding is that there is no other way but to do this two-step process with the current version (0.4) of nlp.
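For context, the second step is just the usual launch through xla_spawn.py (a sketch; flags other than --train_data_file are placeholders for the actual pre-training arguments):
python xla_spawn.py --num_cores 8 \
run_language_modeling.py \
--train_data_file=your_file_name \
--output_dir=output  # plus the rest of the usual pre-training arguments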
During another caching process that happens in the main script:
08/26/2020 09:19:51 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 09:19:53 - INFO - nlp.arrow_dataset - Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-b0932b2bdbb63283/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-f90f341e5308a74698d872bcc88f9c0e.arrow
nlp generates a temporary file per core, each of which is three times larger than the original text data. If each process is actually writing to disk, you will need a huge amount of drive space. (Maybe I'm missing something.)
-rw-r--r-- 1 ***** ***** 674 Aug 26 09:19 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 09:19 LICENSE
-rw-r--r-- 1 ***** ***** 332M Aug 26 09:10 text-train.arrow
-rw------- 1 ***** ***** 940M Aug 26 09:31 tmp0k43sazw
-rw------- 1 ***** ***** 940M Aug 26 09:31 tmp7sxs9mj5
-rw------- 1 ***** ***** 939M Aug 26 09:31 tmpbbiqw2vp
-rw------- 1 ***** ***** 937M Aug 26 09:31 tmpjxb5ptyu
-rw------- 1 ***** ***** 933M Aug 26 09:31 tmpk3hkdh0e
-rw------- 1 ***** ***** 944M Aug 26 09:31 tmpnoalwftz
-rw------- 1 ***** ***** 931M Aug 26 09:31 tmpuxdr_dz3
-rw------- 1 ***** ***** 945M Aug 26 09:31 tmpxjyuy6dk
After the caching process, they seem to be merged into one file.
-rw------- 1 ***** ***** 989M Aug 26 09:32 cache-f90f341e5308a74698d872bcc88f9c0e.arrow
-rw-r--r-- 1 ***** ***** 674 Aug 26 09:19 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 09:19 LICENSE
-rw-r--r-- 1 ***** ***** 332M Aug 26 09:10 text-train.arrow
Again, it looks like every process tries to tokenize the full dataset at the same time.
If you do the tokenization once before calling xla_spawn.py, then each process will use the tokenized cached file cache-f90f341e5308a74698d872bcc88f9c0e.arrow and not recompute it.
Not sure if there's a better way to do that cc @julien-c @thomwolf
I wrote a separate script just for preparing a cached file, including tokenization. Each process did use the tokenized cached file.
Currently, I'm testing the pipeline on 24GB of text data. It took about 1.5 hours to create a cached file on n1-highmem-16 (16 vCPUs, 104 GB memory). I assume loading this cached file in the main script with xla_spawn.py won't be an issue (even if there are 8 processes).
total 98G
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 13:38 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 12:24 ..
-rw------- 1 ***** ***** 74G Aug 26 13:38 cache-a7aa04134ba7b1aff5d9710f14a4e334.arrow
-rw-r--r-- 1 ***** ***** 681 Aug 26 12:24 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 12:24 LICENSE
-rw-r--r-- 1 ***** ***** 25G Aug 26 12:24 text-train.arrow
Yes, loading the cached file should be fine from different processes.
Sorry, I thought it was working, but actually the second call doesn't use the cached file that was generated separately; it generates another cache-****.arrow file with a different name. If I run the training script again (with xla_spawn.py), it will use the second cached file, which was generated by the training script itself in the previous run.
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 15:35 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:29 ..
-rw------- 1 ***** ***** 99M Aug 26 15:35 cache-0d77dfce704493dbe63f071eed6a5431.arrow
-rw------- 1 ***** ***** 99M Aug 26 15:29 cache-69633651476e943b93c89ace715f9487.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 26 15:33 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 15:33 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 26 15:29 text-train.arrow
So if I understand correctly, it means that the cached file generated by your separate script is different from the one used by the training script?
Yes. Here is the sequence of events:
1. cache-69633651476e943b93c89ace715f9487.arrow is generated with the separate script.
2. The training script is run with xla_spawn.py.
3. cache-69633651476e943b93c89ace715f9487.arrow is not used.
4. cache-0d77dfce704493dbe63f071eed6a5431.arrow is created instead.
5. Training proceeds with the newly created cache file.
Now, if I kill the process at step 5 and do step 2 again, it will use cache-0d77dfce704493dbe63f071eed6a5431.arrow (the cached file created at step 4) without any issue.
I used the following to generate the first cached file.
dataset = load_dataset("text", data_files=file_path, split="train")
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                           truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
Downloading and preparing dataset text/default-e84dd29acc4ad9ef (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d...
Dataset text downloaded and prepared to /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d. Subsequent calls will reuse this data.
There's a file named cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow, so it did create a cached file.
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 15:59 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:58 ..
-rw------- 1 ***** ***** 99M Aug 26 15:59 cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 26 15:58 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 15:58 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 26 15:58 text-train.arrow
cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow should be used in run_language_modeling.py (modified version using nlp) with xla_spawn.py. But it looks like it's creating a new cached file:
08/26/2020 16:13:03 - INFO - filelock - Lock 139635836351096 released on /home/*****/.cache/huggingface/datasets/3e34209a2741375a1db1ff03bf1abba1a9bd0e6016912d3ead0114b9d1ca2685.202fa4f84f552bff1f5400ae012663839c61efb3de068c6c8722d34ac0ea6192.py.lock
08/26/2020 16:13:03 - WARNING - nlp.builder - Using custom data configuration default
08/26/2020 16:13:03 - INFO - nlp.builder - Overwrite dataset info from restored data version.
08/26/2020 16:13:03 - INFO - nlp.info - Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.builder - Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:13:03 - INFO - nlp.builder - Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 16:13:03 - INFO - nlp.builder - Overwrite dataset info from restored data version.
08/26/2020 16:13:03 - INFO - nlp.info - Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.builder - Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:13:03 - INFO - nlp.builder - Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:13:03 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 16:13:05 - INFO - nlp.arrow_dataset - Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
0%| | 0/100 [00:00<?, ?it/s]
08/26/2020 16:13:05 - INFO - nlp.arrow_dataset - Caching processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
There are two cached files in the directory:
drwxr-xr-x 2 ***** ***** 4.0K Aug 26 16:14 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 26 15:58 ..
-rw------- 1 ***** ***** 99M Aug 26 16:14 cache-0d77dfce704493dbe63f071eed6a5431.arrow
-rw------- 1 ***** ***** 99M Aug 26 15:59 cache-7b1440ba7077af0f0d9035b5a55d01fc.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 26 16:13 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 26 16:13 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 26 15:58 text-train.arrow
If I kill the process, and run it again, it will use the second cached file.
08/26/2020 16:19:52 - WARNING - nlp.builder - Using custom data configuration default
08/26/2020 16:19:52 - INFO - nlp.builder - Overwrite dataset info from restored data version.
08/26/2020 16:19:52 - INFO - nlp.info - Loading Dataset info from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:19:52 - INFO - nlp.builder - Reusing dataset text (/home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d)
08/26/2020 16:19:52 - INFO - nlp.builder - Constructing Dataset for split train, from /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
08/26/2020 16:19:52 - INFO - nlp.utils.info_utils - All the checksums matched successfully for post processing resources
08/26/2020 16:19:53 - INFO - nlp.arrow_dataset - Loading cached processed dataset at /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d/cache-0d77dfce704493dbe63f071eed6a5431.arrow
08/26/2020 16:19:53 - INFO - nlp.arrow_dataset - Set __getitem__(key) output type to torch for ['input_ids'] columns (when key is int or slice) and don't output other (un-formatted) columns.
Thanks for all the details. The two cached files are supposed to be the same. I suspect that the caching has a problem with the tokenizer. Which tokenizer did you use?
I trained a byte-level BPE tokenizer on my data with the tokenizers library, following this example. I put the resulting model files in a directory named "model_name", along with config.json, which is the original RoBERTa config file.
%ls model_name
config.json merges.txt vocab.json
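For reference, the training step looked roughly like the snippet below (a sketch following the tokenizers examples; vocab_size, min_frequency, and the special-token list are assumptions, and save_model may be named differently in older tokenizers versions):
from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the raw text file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["your_training_data"], vocab_size=52000, min_frequency=2,
                special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"])
# Writes vocab.json and merges.txt into the model directory.
tokenizer.save_model("model_name")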
This is the line where run_language_modeling.py loads the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
I use "model_name"
for model_args.tokenizer_name
. I don't specify model_args.cache_dir
. It is 'None' by default.
In my separate script for caching, I'm using use_fast=True when initializing the tokenizer.
tokenizer = AutoTokenizer.from_pretrained(args.config_name, use_fast=True)
I wasn't using that option in the main script. That could be the reason...
Yeah, that could definitely explain why you have two different cache files. Let me know if using the same tokenizer on both sides fixes the issue.
It still creates a new file even if I remove use_fast=True...
Here's the script used to create a cached file.
#!/usr/bin/env python3
import argparse

from transformers import AutoTokenizer
from nlp import load_dataset


def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--config_name', type=str, help='Pretrained config name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(args.config_name)
    dataset = load_dataset("text", data_files=args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])


if __name__ == "__main__":
    main()
Here's how the data is loaded in the modified run_language_modeling.py. [original function]
def get_dataset(args: DataTrainingArguments, tokenizer: PreTrainedTokenizer, evaluate=False):
    file_path = args.eval_data_file if evaluate else args.train_data_file
    split = "validation" if evaluate else "train"
    if args.line_by_line:
        # return LineByLineTextDataset(tokenizer=tokenizer, file_path=file_path, block_size=args.block_size)
        dataset = load_dataset("text", data_files=file_path, split="train")
        dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                                   truncation=True, max_length=args.block_size), batched=True)
        dataset.set_format(type='torch', columns=['input_ids'])
        return dataset
    else:
        return TextDataset(
            tokenizer=tokenizer, file_path=file_path, block_size=args.block_size, overwrite_cache=args.overwrite_cache
        )
Probably I don't need this part in the main script,
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                           truncation=True, max_length=args.block_size), batched=True)
dataset.set_format(type='torch', columns=['input_ids'])
and can simply do this?
dataset = load_dataset("text", data_files=file_path, split="train")
return dataset
You need this part in the main script, or it will use the dataset that is not tokenized.
I can see that the tokenizer in run_language_modeling.py is not instantiated the same way as in your separate script. Indeed, at line 196 we can see:
tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
Could you try to make it so they are instantiated the exact same way, please?
I updated my separate script, but it's creating a new cached file again. If I don't use model_args.cache_dir, both get None, so they should be the same.
#!/usr/bin/env python3
import argparse

from transformers import AutoTokenizer
from nlp import load_dataset


def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--tokenizer_name', type=str, help='Pretrained tokenizer name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--cache_dir', type=str, default=None, help='Where do you want to store the pretrained models downloaded from s3')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    model_args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir)
    dataset = load_dataset("text", data_files=model_args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=model_args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])


if __name__ == "__main__":
    main()
Is there a way to specify the cache file to load, and skip the re-computation?
Could you also check that the args.block_size used in the lambda function is the same as well?
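For instance, a minimal check (a sketch; variable names follow each script) could be to print the inputs that determine the cached file in both scripts and compare:
# Print these in both the separate script and the training script; any
# mismatch (tokenizer class, cache_dir, block_size) can yield a different
# cache file.
print(type(tokenizer).__name__)   # e.g. RobertaTokenizer vs RobertaTokenizerFast
print(repr(model_args.cache_dir))
print(model_args.block_size)      # args.block_size in the training script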
Here's a minimal working example to reproduce this issue.
Assumptions:
- transformers and nlp are installed.
- Tokenizer files (config.json, merges.txt, vocab.json) are under the directory named model_name.
- xla_spawn.py is available (download from https://github.com/huggingface/transformers/blob/master/examples/xla_spawn.py).
- The following script is saved as prepare_cached_dataset.py.
#!/usr/bin/env python3
import argparse

from transformers import AutoTokenizer
from nlp import load_dataset


def main():
    parser = argparse.ArgumentParser(description='description')
    parser.add_argument('--tokenizer_name', type=str, help='Pretrained tokenizer name or path if not the same as model_name')
    parser.add_argument('--data_file', type=str, help='The input data file (a text file).')
    parser.add_argument('--cache_dir', type=str, default=None, help='Where do you want to store the pretrained models downloaded from s3')
    parser.add_argument('--block_size', type=int, default=-1, help='The training dataset will be truncated in block of this size for training')
    parser.add_argument('--tpu_num_cores', type=int, default=1, help='Number of TPU cores to use (1 or 8). For xla_spawn.py')
    model_args = parser.parse_args()

    tokenizer = AutoTokenizer.from_pretrained(model_args.tokenizer_name, cache_dir=model_args.cache_dir, use_fast=True)
    dataset = load_dataset("text", data_files=model_args.data_file, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=model_args.block_size), batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])


def _mp_fn(index):
    # For xla_spawn (TPUs)
    main()


if __name__ == "__main__":
    main()
Prepare your_training_data with some text file, then run:
export TRAIN_DATA=your_training_data
python prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA
Check the cached directory.
ls -lha /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
total 132M
drwxr-xr-x 2 ***** ***** 4.0K Aug 28 13:08 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 28 13:08 ..
-rw------- 1 ***** ***** 99M Aug 28 13:08 cache-bfc7cb0702426d19242db5e8c079f04b.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 28 13:08 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 28 13:08 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 28 13:08 text-train.arrow
Run the same script again. (The output should be just Using custom data configuration default.)
python prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA
Check the cached directory.
ls -lha /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
total 132M
drwxr-xr-x 2 ***** ***** 4.0K Aug 28 13:08 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 28 13:08 ..
-rw------- 1 ***** ***** 99M Aug 28 13:08 cache-bfc7cb0702426d19242db5e8c079f04b.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 28 13:20 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 28 13:20 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 28 13:08 text-train.arrow
The cached file (cache-bfc7cb0702426d19242db5e8c079f04b.arrow) is reused.
Now, run this script with xla_spawn.py. Ideally, it should reuse the cached file; however, you will see each process creating a cache file again.
python xla_spawn.py --num_cores 8 \
prepare_cached_dataset.py \
--tokenizer_name=model_name \
--block_size=512 \
--data_file=$TRAIN_DATA
ls -lha /home/*****/.cache/huggingface/datasets/text/default-e84dd29acc4ad9ef/0.0.0/447f2bcfa2a721a37bc8fdf23800eade1523cf07f7eada6fe661fe4d070d380d
total 230M
drwxr-xr-x 2 ***** ***** 4.0K Aug 28 13:25 .
drwxr-xr-x 3 ***** ***** 4.0K Aug 28 13:08 ..
-rw------- 1 ***** ***** 99M Aug 28 13:08 cache-bfc7cb0702426d19242db5e8c079f04b.arrow
-rw------- 1 ***** ***** 99M Aug 28 13:25 cache-e0e2313e49c8a110aafcc8133154c19a.arrow
-rw-r--r-- 1 ***** ***** 670 Aug 28 13:24 dataset_info.json
-rw-r--r-- 1 ***** ***** 0 Aug 28 13:24 LICENSE
-rw-r--r-- 1 ***** ***** 33M Aug 28 13:08 text-train.arrow
I ended up specifying the cache_file_name argument when I call the map function.
dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True, truncation=True, max_length=args.block_size),
                      batched=True,
                      cache_file_name=cache_file_name)
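As long as every process passes the same fixed cache_file_name, they all read and write one Arrow file instead of each deriving its own. A sketch (the path and file name below are placeholders, not values from this thread):
import os

# Any fixed path shared by all processes works.
cache_dir = os.path.expanduser("~/.cache/huggingface/datasets")
cache_file_name = os.path.join(cache_dir, "tokenized-block512.arrow")

dataset = dataset.map(
    lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                         truncation=True, max_length=args.block_size),
    batched=True,
    cache_file_name=cache_file_name)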
Note: the text dataset in nlp does not strip "\n". If you want the same output as in LineByLineTextDataset, you would need to create your own dataset class where you replace line with line.strip() here.
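Alternatively, a lighter-weight workaround could be to strip the newlines inside the same map() call that tokenizes, so no custom dataset script is needed (a sketch, assuming tokenizer and args.block_size from the surrounding code):
# With batched=True, ex["text"] is a list of lines; strip each one to
# approximate LineByLineTextDataset's output.
dataset = dataset.map(
    lambda ex: tokenizer([line.strip() for line in ex["text"]],
                         add_special_tokens=True, truncation=True,
                         max_length=args.block_size),
    batched=True)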
Hi,
I'm getting a "File exists" error when I use the text dataset for pre-training a RoBERTa model using transformers (3.0.2) and nlp (0.4.0) on a VM with a TPU (v3-8).
I modified line 131 in the original run_language_modeling.py as follows:
When I run this with xla_spawn.py, I get the following error (it produces one message per TPU core, which I believe is fine).
It seems the current version doesn't take distributed training processes into account, as in this example?
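For reference, the barrier pattern that last sentence alludes to could look roughly like this on TPU (a sketch adapted from the torch.distributed idiom in the transformers examples; the function name and rendezvous tag are illustrative, not from this codebase):
import torch_xla.core.xla_model as xm
from nlp import load_dataset

def build_dataset(file_path, tokenizer, block_size):
    # Let process 0 build the tokenized cache while the others wait at the
    # rendezvous; afterwards every process loads from the existing cache.
    if xm.get_ordinal() != 0:
        xm.rendezvous("dataset_caching")
    dataset = load_dataset("text", data_files=file_path, split="train")
    dataset = dataset.map(lambda ex: tokenizer(ex["text"], add_special_tokens=True,
                                               truncation=True, max_length=block_size),
                          batched=True)
    dataset.set_format(type='torch', columns=['input_ids'])
    if xm.get_ordinal() == 0:
        xm.rendezvous("dataset_caching")
    return dataset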