huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

Load text file for RoBERTa pre-training. #610

Closed chiyuzhang94 closed 1 year ago

chiyuzhang94 commented 4 years ago

I migrated my question from https://github.com/huggingface/transformers/pull/4009#issuecomment-690039444

I tried to train a RoBERTa model from scratch using transformers, but I got OOM issues when loading a large text file. Following the suggestion from @thomwolf, I tried to use datasets to load my text file. This test.txt is a simple sample where each line is a sentence.

from datasets import load_dataset
dataset = load_dataset('text', data_files='test.txt',cache_dir="./")
dataset.set_format(type='torch',columns=["text"])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
next(iter(dataloader))

But the dataloader cannot yield a sample, and the error is:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-12-388aca337e2f> in <module>
----> 1 next(iter(dataloader))

/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    361 
    362     def __next__(self):
--> 363         data = self._next_data()
    364         self._num_yielded += 1
    365         if self._dataset_kind == _DatasetKind.Iterable and \

/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    401     def _next_data(self):
    402         index = self._next_index()  # may raise StopIteration
--> 403         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    404         if self._pin_memory:
    405             data = _utils.pin_memory.pin_memory(data)

/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

KeyError: 0

dataset.set_format(type='torch', columns=["text"]) prints a log that says:

Set __getitem__(key) output type to torch for ['text'] columns (when key is int or slice) and don't output other (un-formatted) columns.

I noticed the dataset is DatasetDict({'train': Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 44)}). Each sample can be accessed by dataset["train"]["text"] instead of dataset["text"].

Could you please give me any suggestions on how to modify this code to load the text file?

Versions: Python 3.7.3, PyTorch 1.6.0, TensorFlow 2.3.0, datasets 1.0.1

lhoestq commented 4 years ago

Could you try

load_dataset('text', data_files='test.txt',cache_dir="./", split="train")

?

load_dataset returns a dictionary by default, like {"train": your_dataset}
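
For illustration, a minimal sketch of the difference (assuming a local test.txt with one sentence per line):

from datasets import load_dataset

# Without `split`, you get a DatasetDict keyed by split name.
dsets = load_dataset('text', data_files='test.txt')
print(dsets)               # DatasetDict({'train': Dataset(...)})
print(dsets['train'][0])   # {'text': '...'}

# With split="train", you get the Dataset itself, which is what
# torch.utils.data.DataLoader can index into.
dset = load_dataset('text', data_files='test.txt', split='train')
print(dset[0])             # {'text': '...'}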

chiyuzhang94 commented 4 years ago

Hi @lhoestq Thanks for your suggestion.

I tried

dataset = load_dataset('text', data_files='test.txt',cache_dir="./", split="train")
print(dataset)
dataset.set_format(type='torch',columns=["text"])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
next(iter(dataloader))

But it still doesn't work, and I got this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-7-388aca337e2f> in <module>
----> 1 next(iter(dataloader))

/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
    361 
    362     def __next__(self):
--> 363         data = self._next_data()
    364         self._num_yielded += 1
    365         if self._dataset_kind == _DatasetKind.Iterable and \

/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
    401     def _next_data(self):
    402         index = self._next_index()  # may raise StopIteration
--> 403         data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
    404         if self._pin_memory:
    405             data = _utils.pin_memory.pin_memory(data)

/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
     42     def fetch(self, possibly_batched_index):
     43         if self.auto_collation:
---> 44             data = [self.dataset[idx] for idx in possibly_batched_index]
     45         else:
     46             data = self.dataset[possibly_batched_index]

/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in __getitem__(self, key)
   1069             format_columns=self._format_columns,
   1070             output_all_columns=self._output_all_columns,
-> 1071             format_kwargs=self._format_kwargs,
   1072         )
   1073 

/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in _getitem(self, key, format_type, format_columns, output_all_columns, format_kwargs)
   1056                 format_columns=format_columns,
   1057                 output_all_columns=output_all_columns,
-> 1058                 format_kwargs=format_kwargs,
   1059             )
   1060         return outputs

/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in _convert_outputs(self, outputs, format_type, format_columns, output_all_columns, format_kwargs)
    872                     continue
    873                 if format_columns is None or k in format_columns:
--> 874                     v = map_nested(command, v, **map_nested_kwargs)
    875                 output_dict[k] = v
    876         return output_dict

/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/utils/py_utils.py in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, types)
    214     # Singleton
    215     if not isinstance(data_struct, dict) and not isinstance(data_struct, types):
--> 216         return function(data_struct)
    217 
    218     disable_tqdm = bool(logger.getEffectiveLevel() > INFO)

/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in command(x)
    833                     if x.dtype == np.object:  # pytorch tensors cannot be instantied from an array of objects
    834                         return [map_nested(command, i, **map_nested_kwargs) for i in x]
--> 835                 return torch.tensor(x, **format_kwargs)
    836 
    837         elif format_type == "tensorflow":

TypeError: new(): invalid data type 'str'

I found that type can only be one of ['numpy', 'torch', 'tensorflow', 'pandas']. How can I deal with the string type?

thomwolf commented 4 years ago

You need to tokenize the string inputs to convert them into integers before you can feed them to a PyTorch dataloader.

You can read the quick tour of the datasets or transformers libraries to learn more about this.

sipah00 commented 4 years ago

Hey @chiyuzhang94, I was also having trouble loading a large text file (11GB), but I finally got it working. This is what I did after looking into the documentation.

  1. split the whole dataset file into smaller files

    mkdir ./shards
    split -a 4 -l 256000 -d full_raw_corpus.txt ./shards/shard_
  2. Pass paths of small data files to load_dataset

    files = glob.glob('shards/*')
    from datasets import load_dataset
    dataset = load_dataset('text', data_files=files, split='train')

    (Passing the whole 11GB dataset file directly to load_dataset was resulting in RAM issues.)

  3. Tokenization

    def encode(examples):
        return tokenizer(examples['text'], truncation=True, padding='max_length')
    dataset = dataset.map(encode, batched=True)
    dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

    Now you can pass dataset to Trainer or pytorch DataLoader

    dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)
    next(iter(dataloader))

    Hope this helps

chiyuzhang94 commented 4 years ago

Thanks, @thomwolf and @sipah00,

I tried to implement your suggestions in my scripts. Now I am facing a connection time-out error. I am using a local file, so I have no idea why the module sends requests to S3.

The log is:

Traceback (most recent call last):
  File "/home/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    raise err
  File "/home/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
    timeout=timeout
  File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

Traceback (most recent call last):
  File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/home/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 436, in increment
    chunked=chunked,
  File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 376, in _make_request
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /datasets.huggingface.co/datasets/datasets/text/text.py (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection obj
ect at 0x7fff401e0e48>: Failed to establish a new connection: [Errno 110] Connection timed out',))

Traceback (most recent call last):
  File "/scratch/roberta_emohash/run_language_modeling.py", line 1019, in <module>
    main()
  File "/scratch/roberta_emohash/run_language_modeling.py", line 962, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
  File "/scratch/roberta_emohash/run_language_modeling.py", line 177, in load_and_cache_examples
    return HG_Datasets(tokenizer, file_path, args)
  File "/scratch/roberta_emohash/run_language_modeling.py", line 117, in HG_Datasets
    dataset = load_dataset('text', data_files=files, cache_dir = args.data_cache_dir, split="train")
  File "/arc/project/evn_py36/datasets/datasets/src/datasets/load.py", line 590, in load_dataset
    self._validate_conn(conn)
  File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
    conn.connect()
  File "/home/.local/lib/python3.6/site-packages/urllib3/connection.py", line 300, in connect
    conn = self._new_conn()
  File "/home/.local/lib/python3.6/site-packages/urllib3/connection.py", line 169, in _new_conn
    self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fff401e0da0>: Failed to establish a new connection: [Errno 110] Connection timed out

Do you have any experience on this issue?

sipah00 commented 4 years ago

No, I didn't encounter this problem; it looks like a network problem to me.

chiyuzhang94 commented 4 years ago

I noticed this is because I use a cloud server that does not allow connections from our standard compute nodes to outside resources.

For the datasets package, it seems that if the loading script is not already cached in the library it will attempt to connect to an AWS resource to download the dataset loading script.

I am wondering why the package works in this way. Do you have any suggestions to solve this issue?

chiyuzhang94 commented 4 years ago

I solved the above issue by downloading text.py manually and passing the path to the load_dataset function.
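
For reference, this is roughly what that looks like (the paths below are placeholders, not the exact ones from my setup):

from datasets import load_dataset

# Point load_dataset at the locally downloaded loading script instead of the
# canonical 'text' name, so no network request is needed to resolve the script.
dataset = load_dataset(
    '/path/to/local/text.py',                        # manually downloaded loading script
    data_files=['shards/shard_0000', 'shards/shard_0001'],
    cache_dir='./cache',
    split='train',
)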

Now I have a new issue with a read-only file system.

The error is:

I0916 22:14:38.453380 140737353971520 filelock.py:274] Lock 140734268996072 acquired on /scratch/chiyuzh/roberta/text.py.lock
Found main folder for dataset /scratch/chiyuzh/roberta/text.py at /home/chiyuzh/.cache/huggingface/modules/datasets_modules/datasets/text
Creating specific version folder for dataset /scratch/chiyuzh/roberta/text.py at /home/chiyuzh/.cache/huggingface/modules/datasets_modules/datasets/text/512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7fcc649178b014
I0916 22:14:38.530371 140737353971520 filelock.py:318] Lock 140734268996072 released on /scratch/chiyuzh/roberta/text.py.lock
Traceback (most recent call last):
  File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 1019, in <module>
    main()
  File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 962, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
  File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 177, in load_and_cache_examples
    return HG_Datasets(tokenizer, file_path, args)
  File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 117, in HG_Datasets
    dataset = load_dataset('/scratch/chiyuzh/roberta/text.py', data_files=files, cache_dir = args.data_cache_dir, split="train")
  File "/arc/project/chiyuzh/evn_py36/datasets/src/datasets/load.py", line 590, in load_dataset
    path, script_version=script_version, download_config=download_config, download_mode=download_mode, dataset=True
  File "/arc/project/chiyuzh/evn_py36/datasets/src/datasets/load.py", line 385, in prepare_module
    os.makedirs(hash_folder_path)
  File "/project/chiyuzh/evn_py36/lib/python3.6/os.py", line 220, in makedirs
    mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/home/chiyuzh/.cache/huggingface/modules/datasets_modules/datasets/text/512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7fcc649178b014'

I installed datasets at /project/chiyuzh/evn_py36/datasets/src, which is a writable directory. I also tried changing the environment variables to the writable directory: export HF_MODULES_PATH=/project/chiyuzh/evn_py36/datasets/cache_dir/ and export HF_DATASETS_CACHE=/project/chiyuzh/evn_py36/datasets/cache_dir/.

In my scripts, I also changed to dataset = load_dataset('/scratch/chiyuzh/roberta/text.py', data_files=files, cache_dir=args.data_cache_dir, split="train"), with data_cache_dir = $TMPDIR/data/, which is also a writable directory.

But it still tries to make a directory at /home/chiyuzh/.cache/huggingface/modules/. Do you have any idea about this issue? @thomwolf

shizhediao commented 4 years ago

Hey @chiyuzhang94, I was also having trouble loading a large text file (11GB), but I finally got it working. This is what I did after looking into the documentation.

  1. split the whole dataset file into smaller files
mkdir ./shards
split -a 4 -l 256000 -d full_raw_corpus.txt ./shards/shard_
  2. Pass paths of small data files to load_dataset
files = glob.glob('shards/*')
from datasets import load_dataset
dataset = load_dataset('text', data_files=files, split='train')

(Passing the whole 11GB dataset file directly to load_dataset was resulting in RAM issues.)

  3. Tokenization
def encode(examples):
  return tokenizer(examples['text'], truncation=True, padding='max_length')
dataset = dataset.map(encode, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

Now you can pass dataset to Trainer or pytorch DataLoader

dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)
next(iter(dataloader))

Hope this helps

When I run 'dataset = dataset.map(encode, batched=True)', I encountered a problem like this:

Testing the mapped function outputs
Traceback (most recent call last):
  File "", line 1, in
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/dataset_dict.py", line 300, in map
    for k, dataset in self.items()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/dataset_dict.py", line 300, in
    for k, dataset in self.items()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1224, in map
    update_data = does_function_return_dict(test_inputs, test_indices)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1195, in does_function_return_dict
    function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "", line 3, in encode
TypeError: __init__() takes 1 positional argument but 2 were given

chiyuzhang94 commented 4 years ago

Hey @chiyuzhang94, I was also having trouble loading a large text file (11GB), but I finally got it working. This is what I did after looking into the documentation.

  1. split the whole dataset file into smaller files
mkdir ./shards
split -a 4 -l 256000 -d full_raw_corpus.txt ./shards/shard_
  2. Pass paths of small data files to load_dataset
files = glob.glob('shards/*')
from datasets import load_dataset
dataset = load_dataset('text', data_files=files, split='train')

(Passing the whole 11GB dataset file directly to load_dataset was resulting in RAM issues.)

  3. Tokenization
def encode(examples):
  return tokenizer(examples['text'], truncation=True, padding='max_length')
dataset = dataset.map(encode, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

Now you can pass dataset to Trainer or pytorch DataLoader

dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)
next(iter(dataloader))

Hope this helps

When I run 'dataset = dataset.map(encode, batched=True)', I encountered a problem like this:

Testing the mapped function outputs
Traceback (most recent call last):
  File "", line 1, in
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/dataset_dict.py", line 300, in map
    for k, dataset in self.items()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/dataset_dict.py", line 300, in
    for k, dataset in self.items()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1224, in map
    update_data = does_function_return_dict(test_inputs, test_indices)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1195, in does_function_return_dict
    function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "", line 3, in encode
TypeError: __init__() takes 1 positional argument but 2 were given

What is your encoder function?

shizhediao commented 4 years ago
def encode(examples):
  return tokenizer(examples['text'], truncation=True, padding='max_length')

It is the same as suggested:

def encode(examples): return tokenizer(examples['text'], truncation=True, padding='max_length')

chiyuzhang94 commented 4 years ago
def encode(examples):
  return tokenizer(examples['text'], truncation=True, padding='max_length')

It is the same as suggested:

def encode(examples): return tokenizer(examples['text'], truncation=True, padding='max_length')

Do you use this function in a class object?

__init__() takes 1 positional argument but 2 were given. I guess the additional argument is self?

shizhediao commented 4 years ago
def encode(examples):
  return tokenizer(examples['text'], truncation=True, padding='max_length')

It is the same as suggested:

def encode(examples): return tokenizer(examples['text'], truncation=True, padding='max_length')

Do you use this function in a class object?

__init__() takes 1 positional argument but 2 were given. I guess the additional argument is self?

Thanks for your reply. Could you provide a simple example here? Currently, I do not use this function in a class object. I think you are right, and I was wondering how to construct this class. I tried to modify it based on transformers' LineByLineTextDataset. Am I correct?

class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """

def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
    assert os.path.isfile(file_path), f"Input file path {file_path} not found"
    # Here, we do not cache the features, operating under the assumption
    # that we will soon use fast multithreaded tokenizers from the
    # `tokenizers` repo everywhere =)
    #logger.info("Creating features from dataset file at %s", file_path)
    #with open(file_path, encoding="utf-8") as f:
    #    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
    #batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)

import glob
files = glob.glob('/home/mtzhang111/fairseq/cs_doc/shards/shard_003*')
from datasets import load_dataset
dataset = load_dataset('text', data_files=files)
    batch_encoding= dataset.map(encode, batched=True)
    self.examples = batch_encoding["input_ids"]

def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length')

def __len__(self):
    return len(self.examples)

def __getitem__(self, i) -> torch.Tensor:
    return torch.tensor(self.examples[i], dtype=torch.long)

chiyuzhang94 commented 4 years ago
def encode(examples):
  return tokenizer(examples['text'], truncation=True, padding='max_length')

It is the same as suggested:

def encode(examples): return tokenizer(examples['text'], truncation=True, padding='max_length')

Do you use this function in a class object? __init__() takes 1 positional argument but 2 were given. I guess the additional argument is self?

Thanks for your reply. Could you provide a simple example here? Currently, I do not use this function in a class object. I think you are right, and I was wondering how to construct this class. I tried to modify it based on transformers' LineByLineTextDataset. Am I correct?

class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """

def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
    assert os.path.isfile(file_path), f"Input file path {file_path} not found"
    # Here, we do not cache the features, operating under the assumption
    # that we will soon use fast multithreaded tokenizers from the
    # `tokenizers` repo everywhere =)
    #logger.info("Creating features from dataset file at %s", file_path)
    #with open(file_path, encoding="utf-8") as f:
    #    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
    #batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)

import glob
files = glob.glob('/home/mtzhang111/fairseq/cs_doc/shards/shard_003*')
from datasets import load_dataset
dataset = load_dataset('text', data_files=files)
    batch_encoding= dataset.map(encode, batched=True)
    self.examples = batch_encoding["input_ids"]

def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length')

def __len__(self):
    return len(self.examples)

def __getitem__(self, i) -> torch.Tensor:
    return torch.tensor(self.examples[i], dtype=torch.long)

I am also struggling with this adaptation. I am not sure whether I am right.

I think you don't need to construct class LazyLineByLineTextDataset(Dataset) at all. torch.utils.data.Dataset is a generator.

Now, we use dataset = dataset.map(encode, batched=True) as a generator. So we just pass dataset to torch.utils.data.DataLoader.

shizhediao commented 4 years ago

@chiyuzhang94 Thanks for your reply. After some changes, I managed to get the data loading process running. I published it in case you might want to take a look. Thanks for your help! https://github.com/shizhediao/Transformers_TPU

chiyuzhang94 commented 4 years ago

Hi @shizhediao ,

Thanks! It looks great!

But my problem is still that the cache directory is on a read-only file system. As I mentioned, I tried to change the cache directory, but it didn't work.

Do you have any suggestions?

lhoestq commented 4 years ago

I installed datasets at /project/chiyuzh/evn_py36/datasets/src, which is a writable directory. I also tried changing the environment variables to the writable directory: export HF_MODULES_PATH=/project/chiyuzh/evn_py36/datasets/cache_dir/

I think it is HF_MODULES_CACHE and not HF_MODULES_PATH, @chiyuzhang94. Could you try again and let me know if it fixes your issue?
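
As an illustration, a sketch of redirecting both caches to a writable location from Python (the paths are placeholders; both variables need to be set before datasets is imported, since they are read at import time):

import os

# HF_MODULES_CACHE: where dataset loading scripts are copied as importable modules.
# HF_DATASETS_CACHE: where the downloaded/processed Arrow files are cached.
os.environ["HF_MODULES_CACHE"] = "/project/writable/cache_dir/modules"
os.environ["HF_DATASETS_CACHE"] = "/project/writable/cache_dir/datasets"

from datasets import load_dataset  # imported only after the variables are set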

thomwolf commented 4 years ago

We should probably add a section in the doc on the caching system with the env variables in particular.

chiyuzhang94 commented 4 years ago

Hi @thomwolf , @lhoestq ,

Thanks for your suggestions. With the latest version of this package, I can load text data without internet access.

But I found that the dataset loading speed is very slow.

My script is like this:

    def token_encode(examples):
        tokenizer_out = tokenizer(examples['text'], truncation=True,  padding="max_length", add_special_tokens=True, max_length=args.block_size)
        return tokenizer_out

    path = Path(file_path)
    files = sorted(path.glob('*'))
    dataset = load_dataset('./text.py', data_files=files, cache_dir = args.data_cache_dir, split="train")
    dataset = dataset.map(token_encode, batched=True)

    dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

I have 1,123,870,657 lines in my input directory. The processing speed, shown below, is very slow.

| 13/1123871 [00:02<62:37:39,  4.98ba/s]
| 14/1123871 [00:03<61:27:31,  5.08ba/s]
| 15/1123871 [00:03<66:34:19,  4.69ba/s]
| 16/1123871 [00:03<68:25:01,  4.56ba/s]
| 17/1123871 [00:03<72:00:03,  4.34ba/s]

Do you have any suggestions to accelerate this loading process?

lhoestq commented 4 years ago

You can use multiprocessing by specifying num_proc= in .map()

Also it looks like you have 1123871 batches of 1000 elements (default batch size), i.e. 1,123,871,000 lines in total. Am I right ?
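
For example, a sketch of the same .map() call with multiprocessing enabled (the values here are just illustrative):

# Each of the 16 worker processes handles a contiguous shard of the dataset,
# and the function receives batches of `batch_size` examples at a time.
dataset = dataset.map(
    token_encode,
    batched=True,
    batch_size=10_000,
    num_proc=16,
)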

chiyuzhang94 commented 4 years ago

You can use multiprocessing by specifying num_proc= in .map()

Also it looks like you have 1123871 batches of 1000 elements (default batch size), i.e. 1,123,871,000 lines in total. Am I right ?

Hi @lhoestq ,

Thanks. I will try it.

You are right. I have 1,123,870,657 lines in total in the path. I split the large file into 440 small files; each file has 2,560,000 lines.

I have another question. I am using a cloud server that only allows a job to run for up to 7 days, so I need to resume my model every week. If the script has to load and process the dataset every time, it is very inefficient at the current processing speed. Is it possible to process the dataset once and reuse the processed cache in future runs?

chiyuzhang94 commented 4 years ago

Hi @lhoestq ,

I tried to use multiprocessing, but I got the errors below. Because I am using Python distributed training, it seems there are some conflicts with the distributed job.

Do you have any suggestions?

I0925 10:19:35.603023 140737353971520 filelock.py:318] Lock 140737229443368 released on /tmp/pbs.1120510.pbsha.ib.sockeye/cache/_tmp_pbs.1120510.pbsha.ib.sockeye_cache_text_default-7fb934ed6fac5d01_0.0.0_512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7
fcc649178b014.lock
Traceback (most recent call last):
  File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 1024, in <module>
    main()
  File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 967, in main
    train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
  File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 180, in load_and_cache_examples
    return HG_Datasets(tokenizer, file_path, args)
  File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 119, in HG_Datasets
    dataset = dataset.map(token_encode, batched=True, batch_size = 10000, num_proc = 16)
  File "/project/chiyuzh/evn_py36/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1287, in map
    transformed_shards = [r.get() for r in results]
  File "/project/chiyuzh/evn_py36/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1287, in <listcomp>
    transformed_shards = [r.get() for r in results]
  File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
    put(task)
  File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'HG_Datasets.<locals>.token_encode'

lhoestq commented 4 years ago

For multiprocessing, the function given to map must be picklable. Maybe you could try to define token_encode outside HG_Datasets?

Also maybe #656 could make functions defined locally picklable for multiprocessing, once it's merged.
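
For illustration, a sketch of that restructuring, with the tokenizer and token_encode defined at module level so the multiprocessing workers can pickle the function (the checkpoint name and helper function are placeholders, not the original script):

from datasets import load_dataset
from transformers import AutoTokenizer

# Module-level objects are picklable by reference, so num_proc > 1 works.
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def token_encode(examples):
    return tokenizer(examples["text"], truncation=True,
                     padding="max_length", max_length=128)

def build_dataset(files, cache_dir):
    dataset = load_dataset("text", data_files=files,
                           cache_dir=cache_dir, split="train")
    return dataset.map(token_encode, batched=True,
                       batch_size=10_000, num_proc=16)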

lhoestq commented 4 years ago

I have another question. I am using a cloud server that only allows a job to run for up to 7 days, so I need to resume my model every week. If the script has to load and process the dataset every time, it is very inefficient at the current processing speed. Is it possible to process the dataset once and reuse the processed cache in future runs?

Feel free to save your processed dataset using dataset.save_to_disk("path/to/save/directory").

Then you'll be able to reload it again using

from datasets import load_from_disk

dataset = load_from_disk("path/to/save/directory")

chiyuzhang94 commented 4 years ago

Hi @lhoestq ,

Thanks for your suggestion. I tried to process the dataset and save it to disk. I have 1.12B samples in the raw dataset and used 16 processes. I ran this processing job for 7 days, but it didn't finish. I don't know why the processing is so slow.

The log shows that some processes (#12, #14, #15) are very slow. Different processes run at different speeds, and these slow ones look like a bottleneck.

Could you please give me any suggestion to improve the processing speed?

Thanks. Chiyu

Here is my code:

def token_encode(examples):
    tokenizer_out = tokenizer(examples['text'], truncation=True, padding="max_length", add_special_tokens=True, max_length=args.block_size)
    return tokenizer_out

path = Path(file_path)
files = sorted(path.glob('*'))
dataset = load_dataset('./text.py', data_files=files, cache_dir = args.data_cache_dir, split="train")
dataset = dataset.map(token_encode, batched=True, batch_size = 16384, num_proc = 16)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
dataset.save_to_disk(output_dir)

Here is the log.

#6:   1%|▏         | 59/4288 [55:10<66:11:58, 56.35s/ba]
#1:   8%|β–Š         | 356/4288 [55:39<10:40:02,  9.77s/ba]
#2:   5%|▍         | 210/4288 [55:33<17:47:19, 15.70s/ba]
#0:  19%|β–ˆβ–‰        | 836/4288 [55:53<4:08:56,  4.33s/ba]
#0:  20%|β–ˆβ–‰        | 837/4288 [55:57<4:01:52,  4.21s/ba]
#1:   8%|β–Š         | 357/4288 [55:48<10:38:09,  9.74s/ba]
#0:  20%|β–ˆβ–‰        | 838/4288 [56:01<4:02:56,  4.23s/ba]
#3:   4%|β–Ž         | 155/4288 [55:43<24:41:20, 21.51s/ba]
#0:  20%|β–ˆβ–‰        | 839/4288 [56:05<4:04:48,  4.26s/ba]
#12:   1%|          | 29/4288 [54:50<133:20:53, 112.72s/ba]
#2:   5%|▍         | 211/4288 [55:48<17:40:33, 15.61s/ba]
#14:   0%|          | 2/4288 [04:24<157:17:50, 132.12s/ba]
#15:   0%|          | 1/4288 [02:24<172:11:37, 144.60s/ba]

lhoestq commented 4 years ago

Hi !

As far as I can tell, there could be several reasons for your processes to have different speeds:

  • some parts of your dataset have short passages while some have longer passages, that take more time to be processed
  • OR there are other processes running that prevent some of them to run at full speed
  • OR the value of num_proc is higher than the number of actual processes that you can run in parallel at full speed.

So I'd suggest you to check that you have nothing else running in parallel to your processing job, and also maybe take a look at the slow parts of the datasets. When doing multiprocessing, the dataset is sharded in num_proc contiguous parts that are processed individually in each process. If you want to take a look at the dataset processed in the 12th shard of 16 for example, you can do:

my_shard = dataset.shard(num_shards=16, index=12, contiguous=True)
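
For example, a rough sketch of such a check: peek at the first rows of each contiguous shard and compare average passage lengths, to see whether a slow worker simply received longer texts:

# Compare a sample of each of the 16 contiguous shards; a much larger average
# passage length in one shard would explain a slower worker process.
for index in range(16):
    shard = dataset.shard(num_shards=16, index=index, contiguous=True)
    sample = shard.select(range(min(10_000, len(shard))))
    avg_len = sum(len(t) for t in sample["text"]) / len(sample)
    print(f"shard {index}: {len(shard)} rows, avg length {avg_len:.1f} chars")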

Hope this helps, let me know if you find what is causing this slow down.

thomwolf commented 4 years ago

Do you use a fast or a slow tokenizer from the transformers library @chiyuzhang94?

chiyuzhang94 commented 4 years ago

Do you use a fast or a slow tokenizer from the transformers library @chiyuzhang94?

Hi @thomwolf, I use this:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)

I guess this is a slow one, let me explore the fast tokenizer.
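
For reference, a sketch of requesting the fast (Rust-backed) tokenizer explicitly via the use_fast flag (the checkpoint name here is a placeholder):

from transformers import AutoTokenizer

# use_fast=True selects the Rust-backed tokenizer when one exists for the
# checkpoint; batched tokenization inside .map() is much faster with it.
tokenizer = AutoTokenizer.from_pretrained("roberta-base", use_fast=True)
print(tokenizer.is_fast)  # True if a fast tokenizer was loaded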

chiyuzhang94 commented 4 years ago

Hi !

As far as I can tell, there could be several reasons for your processes to have different speeds:

  • some parts of your dataset have short passages while some have longer passages, that take more time to be processed
  • OR there are other processes running that prevent some of them to run at full speed
  • OR the value of num_proc is higher than the number of actual processes that you can run in parallel at full speed.

So I'd suggest you to check that you have nothing else running in parallel to your processing job, and also maybe take a look at the slow parts of the datasets. When doing multiprocessing, the dataset is sharded in num_proc contiguous parts that are processed individually in each process. If you want to take a look at the dataset processed in the 12th shard of 16 for example, you can do:

my_shard = dataset.shard(num_shards=16, index=12, contiguous=True)

Hope this helps, let me know if you find what is causing this slow down.

Hi @lhoestq ,

Thanks for your suggestions. I don't think my problem is due to any of these reasons.

  1. I have 1,123,870,657 lines in total in the path. I split the large file into 440 small files; each file has 2,560,000 lines. The last file is a little smaller, but otherwise they are similar. I randomly shuffled all 1,123,870,657 lines, so the sequences should also be similar across all the files.

  2. I run this script on an entire node. I requested all the resources on the node (40 CPUs, 384GB memory), so there were no other processes running.

  3. As I said, the node has 40 CPUs, but I set num_proc = 16. This should not be a problem.

chiyuzhang94 commented 4 years ago

Hi @thomwolf, I am using RobertaTokenizerFast now.

But the speed is still imbalanced; some processes are still slow. Here is part of the log. #0 is always much faster than the other processes.

#15:   3%|β–Ž         | 115/3513 [3:18:36<98:01:33, 103.85s/ba]
#2:  24%|β–ˆβ–ˆβ–       | 847/3513 [3:20:43<11:06:49, 15.01s/ba]
#1:  37%|β–ˆβ–ˆβ–ˆβ–‹      | 1287/3513 [3:20:52<6:19:02, 10.22s/ba]
#0:  72%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–  | 2546/3513 [3:20:52<1:51:03,  6.89s/ba]
#3:  18%|β–ˆβ–Š        | 617/3513 [3:20:36<15:50:30, 19.69s/ba]
#0:  73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž  | 2547/3513 [3:20:59<1:50:25,  6.86s/ba]
#1:  37%|β–ˆβ–ˆβ–ˆβ–‹      | 1288/3513 [3:21:02<6:21:13, 10.28s/ba]
#7:   7%|β–‹         | 252/3513 [3:20:09<44:09:03, 48.74s/ba]
#12:   4%|▍         | 144/3513 [3:19:19<78:00:54, 83.36s/ba]
#4:  14%|β–ˆβ–        | 494/3513 [3:20:37<20:46:06, 24.77s/ba]
#0:  73%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–Ž  | 2548/3513 [3:21:06<1:49:26,  6.80s/ba]
#2:  24%|β–ˆβ–ˆβ–       | 848/3513 [3:20:58<11:06:17, 15.00s/ba]

Here is my script related to the dataset processing:

tokenizer = RobertaTokenizerFast.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)

def token_encode(examples):
    tokenizer_out = tokenizer(examples['text'], truncation=True,  padding="max_length", add_special_tokens=True, max_length=128)
    return tokenizer_out

def HG_Datasets(tokenizer, file_path, args):

    path = Path(file_path)
    files = sorted(path.glob('*'))
    dataset = load_dataset('./text.py', data_files=files, cache_dir="./", split="train")
    dataset = dataset.map(token_encode, batched=True, batch_size = 20000, num_proc = 16)

    dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
    return dataset

I have 1,123,870,657 lines in total in the path. I split the large file into 440 small files; each file has 2,560,000 lines.

Could you please give me any suggestions? Thanks very much!

chiyuzhang94 commented 4 years ago

Hi @thomwolf @lhoestq ,

Thanks for your help. The preprocessing is finally complete. I used 32 CPUs and the preprocessing took 55 hours. The dataset.arrow size is 1.18TB. I saved the processed dataset and used it for RoBERTa pre-training; you can find my scripts here. I loaded the dataset with load_from_disk(): dataset = load_from_disk(file_path)

But the job runs out of memory when it loads batch samples at the first step. Here is the error log. I can see that the dataset was loaded successfully, but the job was terminated at the first step. I expected the datasets library to handle the memory issue effectively, but it still runs out of memory.

I1017 00:22:08.109311 140737353971520 run_language_modeling.py:344]   Total optimization steps = 228650
I1017 00:22:08.109332 140737353971520 run_language_modeling.py:343]   Gradient Accumulation steps = 4
I1017 00:22:08.109332 140737353971520 run_language_modeling.py:343]   Gradient Accumulation steps = 4
I1017 00:22:08.109340 140737353971520 run_language_modeling.py:343]   Gradient Accumulation steps = 4
I1017 00:22:08.109347 140737353971520 run_language_modeling.py:345]   Warm-up steps = 10000
I1017 00:22:08.109370 140737353971520 run_language_modeling.py:344]   Total optimization steps = 228650
I1017 00:22:08.109369 140737353971520 run_language_modeling.py:344]   Total optimization steps = 228650
I1017 00:22:08.109378 140737353971520 run_language_modeling.py:344]   Total optimization steps = 228650
I1017 00:22:08.109405 140737353971520 run_language_modeling.py:345]   Warm-up steps = 10000
I1017 00:22:08.109406 140737353971520 run_language_modeling.py:345]   Warm-up steps = 10000
I1017 00:22:08.109414 140737353971520 run_language_modeling.py:345]   Warm-up steps = 10000
I1017 00:22:08.109496 140737353971520 run_language_modeling.py:359]   Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109497 140737353971520 run_language_modeling.py:359]   Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109499 140737353971520 run_language_modeling.py:359]   Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109500 140737353971520 run_language_modeling.py:359]   Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109534 140737353971520 run_language_modeling.py:360]   Continuing training from epoch 0
I1017 00:22:08.109534 140737353971520 run_language_modeling.py:360]   Continuing training from epoch 0
I1017 00:22:08.109535 140737353971520 run_language_modeling.py:360]   Continuing training from epoch 0
I1017 00:22:08.109537 140737353971520 run_language_modeling.py:360]   Continuing training from epoch 0
I1017 00:22:08.109573 140737353971520 run_language_modeling.py:361]   Continuing training from global step 1
I1017 00:22:08.109574 140737353971520 run_language_modeling.py:361]   Continuing training from global step 1
I1017 00:22:08.109574 140737353971520 run_language_modeling.py:361]   Continuing training from global step 1
I1017 00:22:08.109575 140737353971520 run_language_modeling.py:361]   Continuing training from global step 1
I1017 00:22:08.109614 140737353971520 run_language_modeling.py:362]   Will skip the first 1 steps in the first epoch
I1017 00:22:08.109613 140737353971520 run_language_modeling.py:362]   Will skip the first 1 steps in the first epoch
I1017 00:22:08.109613 140737353971520 run_language_modeling.py:362]   Will skip the first 1 steps in the first epoch
I1017 00:22:08.109620 140737353971520 run_language_modeling.py:362]   Will skip the first 1 steps in the first epoch
^MEpoch:   0%|          | 0/5 [00:00<?, ?it/s]
^MIteration:   0%|          | 0/182922 [00:00<?, ?it/s]^[[ATraceback (most recent call last):
  File "/project/evn_py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/project/evn_py36/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/project/evn_py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/project/evn_py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/project/evn_py36/bin/python3', '-u', '/scratch/roberta_og/run_language_modeling.py', '--local_rank=3', '--gradient_accumulation_steps', '4', '--train_data_file', '/tmp/pbs.1286839.pbsha.ib.sockeye/data/', '--output_dir', '/scratch/roberta_og/ckpt_model_hg/', '--model_type', 'roberta', '--logging_dir', '/scratch/roberta_og/runs/', '--mlm', '--fp16', '--data_cache_dir', '/tmp/pbs.1286839.pbsha.ib.sockeye/cache/', '--num_workers', '0', '--warmup_steps', '10000', '--lazy_loading', '--model_name_or_path', '/scratch/roberta_og/roberta/', '--config_name', '/scratch/roberta_og/roberta', '--tokenizer_name', '/scratch/roberta_og/roberta', '--do_train', '--block_size', '60', '--learning_rate', '5e-5', '--num_train_epochs', '5', '--save_total_limit', '5', '--per_gpu_train_batch_size', '256', '--seed', '42']' died with <Signals.SIGKILL: 9>.

Could you please investigate this OOM issue? Do you have any suggestions?

I ran this job with 6 nodes; each node has four 32GB GPUs, 24 CPUs, and 192GB of CPU memory. I use distributed training.

Versions: Python 3.6.8, PyTorch 1.6.0, TensorFlow 2.3.0, Transformers 3.0.1, datasets 1.1.2

phtephanx commented 3 years ago

I might have a related issue:

My custom arrow file created by load_dataset is around 260 GB. The corresponding dataset is wrapped in a torch.utils.data.Dataset.

When spawning 8 processes with xla_multiprocessing but 0 num_workers for the DataLoader, the memory peaks at around 300 GB RAM in the beginning and then levels off at 250 GB RAM. Since the number of workers for the data loading is 0, there shouldn't be prefetched batches AFAIK.

When looping over the same dataset without the DataLoader in a single process and fetching the same batch size, less than 1 GB of RAM is requested (certainly due to some additional tensor creation in __getitem__).

Is this expected?

lhoestq commented 3 years ago

The library does not load the dataset in memory. Not sure why your script using xla_multiprocessing seems to load everything, unfortunately :/ The PyTorch DataLoader indeed uses __getitem__ to iterate through the dataset batch by batch without loading everything.

Do you think you can share a script that reproduces the issue with a smaller dataset (a few GB) to experiment/debug with ?
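
As a rough way to check this (a sketch, assuming psutil is installed and the dataset was saved with save_to_disk), one can watch the resident memory while indexing into the Arrow-backed dataset; since the data is memory-mapped, RSS should stay far below the dataset size:

import os
import psutil
from datasets import load_from_disk

dataset = load_from_disk("path/to/save/directory")  # placeholder path
process = psutil.Process(os.getpid())

print("RSS before:", process.memory_info().rss // 2**20, "MB")
for i in range(0, len(dataset), 10_000):
    _ = dataset[i]  # touches only the requested rows
print("RSS after:", process.memory_info().rss // 2**20, "MB")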

phtephanx commented 3 years ago

I minimally put together the relevant steps in this colab. Maybe you can spot a problem? I'm really wondering what causes this huge memory consumption. The DataLoader just produces tensors of shape (batch_size, seq_len) = (128, 128), which accounts for < 1 GB. Apart from the dataset, the only large component is the model, which consumes around 12 GB per process, but of the devices' (TPUs') memory.

Multiprocess-training with TPU leads to another problem: out of space in memory space smem (https://github.com/pytorch/xla/issues/2628). I'm not sure, though, whether this is related.

phtephanx commented 3 years ago

I spotted the problem: it's the sampler's list of indices (e.g. RandomSampler) which consumes lots of memory.

Try the following and watch your memory: indices takes about 3.8 GB of RAM. Compare it to e.g. n=int(1e7).

indices = torch.randperm(int(1e8), generator=None).tolist()

Not converting to a list is much more memory-efficient.
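
A rough sketch of the difference (numbers are approximate):

import torch

n = int(1e8)

perm = torch.randperm(n)  # int64 tensor: 8 bytes per index, ~0.8 GB
print(perm.element_size() * perm.nelement() / 1e9, "GB held by the tensor")

# .tolist() boxes every index into a separate Python int object, which costs
# several times more per element (a few GB in total for 1e8 indices).
# indices = perm.tolist()  # uncomment and watch the process RSS jump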

This might be interesting for the OP, @chiyuzhang94.

lhoestq commented 3 years ago

Oh, good catch! I'm wondering if there's a memory-efficient PyTorch sampler out there?

Maybe one based on the Knuth algorithm to yield indices on the fly?

mattivi commented 3 years ago

@phtephanx How did you solve your problem? E.g. did you define your own RandomSampler in the Trainer class?
I am having the same issues. Thanks!

phtephanx commented 3 years ago

I solved it in a hacky way since I don't have much time for this. As I use preemptible engines anyway, I split my dataset into one chunk per day in the data-loading logic.

You could try to change this line from ...

yield from torch.randperm(n, generator=self.generator).tolist()

to

yield from torch.randperm(n, generator=self.generator)

and thereby avoid converting to a list, then build PyTorch from source. If you use DistributedSampler, adapt it there as well, of course.

As pointed out in this issue, not converting to lists already saves a lot of memory. Let me know if this works!
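
If rebuilding PyTorch is not practical, a custom sampler passed to the DataLoader achieves the same effect; below is a rough sketch (not a full drop-in replacement for RandomSampler):

import torch
from torch.utils.data import Sampler

class LowMemoryRandomSampler(Sampler):
    """Yields a random permutation of indices without materializing a Python list."""

    def __init__(self, data_source, generator=None):
        self.data_source = data_source
        self.generator = generator

    def __iter__(self):
        n = len(self.data_source)
        # The permutation stays a single int64 tensor (~8 bytes per index);
        # each index is converted to a plain int lazily, one at a time.
        return (int(i) for i in torch.randperm(n, generator=self.generator))

    def __len__(self):
        return len(self.data_source)

# usage sketch:
# loader = torch.utils.data.DataLoader(dataset, batch_size=128,
#                                      sampler=LowMemoryRandomSampler(dataset))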

mattivi commented 3 years ago

Thanks a lot @phtephanx, will try this solution.

eloukas commented 3 years ago

Hey @chiyuzhang94, could you upload your modified gist somewhere, please? It would be beneficial for the community, thanks.

chiyuzhang94 commented 3 years ago

Hey @chiyuzhang94, could you upload your modified gist somewhere, please? It would be beneficial for the community, thanks.

Hi @eloukas , I can find the link to my script here https://github.com/huggingface/datasets/issues/610#issuecomment-711067895

almson commented 3 years ago

Hi, I've made a PR in PyTorch which replaces the RandomSampler implementation with an algorithm that doesn't use extra memory. Please try it out and contribute to the PR.

https://github.com/pytorch/pytorch/pull/50202

lhoestq commented 1 year ago

Closing this one since the original issue has been discussed.