Could you try load_dataset('text', data_files='test.txt', cache_dir="./", split="train")?
load_dataset returns a dictionary by default, like {"train": your_dataset}.
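For concreteness, a minimal sketch of the two ways of accessing it (the file name is just a placeholder):

from datasets import load_dataset

# Without `split`, you get a DatasetDict and have to pick the split yourself.
dsets = load_dataset('text', data_files='test.txt')   # {"train": Dataset(...)}
train_ds = dsets["train"]

# With split="train", you get the Dataset directly in one step.
train_ds = load_dataset('text', data_files='test.txt', split='train')
print(train_ds[0]["text"])   # first line of the file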
Hi @lhoestq, thanks for your suggestion.
I tried
dataset = load_dataset('text', data_files='test.txt',cache_dir="./", split="train")
print(dataset)
dataset.set_format(type='torch',columns=["text"])
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
next(iter(dataloader))
But it still doesn't work, and I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-7-388aca337e2f> in <module>
----> 1 next(iter(dataloader))
/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in __next__(self)
361
362 def __next__(self):
--> 363 data = self._next_data()
364 self._num_yielded += 1
365 if self._dataset_kind == _DatasetKind.Iterable and \
/Library/Python/3.7/site-packages/torch/utils/data/dataloader.py in _next_data(self)
401 def _next_data(self):
402 index = self._next_index() # may raise StopIteration
--> 403 data = self._dataset_fetcher.fetch(index) # may raise StopIteration
404 if self._pin_memory:
405 data = _utils.pin_memory.pin_memory(data)
/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in fetch(self, possibly_batched_index)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
/Library/Python/3.7/site-packages/torch/utils/data/_utils/fetch.py in <listcomp>(.0)
42 def fetch(self, possibly_batched_index):
43 if self.auto_collation:
---> 44 data = [self.dataset[idx] for idx in possibly_batched_index]
45 else:
46 data = self.dataset[possibly_batched_index]
/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in __getitem__(self, key)
1069 format_columns=self._format_columns,
1070 output_all_columns=self._output_all_columns,
-> 1071 format_kwargs=self._format_kwargs,
1072 )
1073
/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in _getitem(self, key, format_type, format_columns, output_all_columns, format_kwargs)
1056 format_columns=format_columns,
1057 output_all_columns=output_all_columns,
-> 1058 format_kwargs=format_kwargs,
1059 )
1060 return outputs
/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in _convert_outputs(self, outputs, format_type, format_columns, output_all_columns, format_kwargs)
872 continue
873 if format_columns is None or k in format_columns:
--> 874 v = map_nested(command, v, **map_nested_kwargs)
875 output_dict[k] = v
876 return output_dict
/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/utils/py_utils.py in map_nested(function, data_struct, dict_only, map_list, map_tuple, map_numpy, num_proc, types)
214 # Singleton
215 if not isinstance(data_struct, dict) and not isinstance(data_struct, types):
--> 216 return function(data_struct)
217
218 disable_tqdm = bool(logger.getEffectiveLevel() > INFO)
/Library/Python/3.7/site-packages/datasets-0.4.0-py3.7.egg/datasets/arrow_dataset.py in command(x)
833 if x.dtype == np.object: # pytorch tensors cannot be instantied from an array of objects
834 return [map_nested(command, i, **map_nested_kwargs) for i in x]
--> 835 return torch.tensor(x, **format_kwargs)
836
837 elif format_type == "tensorflow":
TypeError: new(): invalid data type 'str'
I found that type can only be one of ['numpy', 'torch', 'tensorflow', 'pandas']. How can I deal with the string type?
You need to tokenize the string inputs to convert them into integers before you can feed them to a PyTorch dataloader.
You can read the quick tour of the datasets or transformers libraries to learn more about that.
Hey @chiyuzhang94, I was also having trouble loading a large text file (11 GB), but I finally got it working. This is what I did after looking into the documentation.
- Split the whole dataset file into smaller files:
mkdir ./shards
split -a 4 -l 256000 -d full_raw_corpus.txt ./shards/shard_
- Pass the paths of the small data files to load_dataset:
files = glob.glob('shards/*')
from datasets import load_dataset
dataset = load_dataset('text', data_files=files, split='train')
(Passing the whole 11 GB dataset file directly to load_dataset resulted in a RAM issue.)
- Tokenization:
def encode(examples):
return tokenizer(examples['text'], truncation=True, padding='max_length')
dataset = dataset.map(encode, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
Now you can pass dataset to Trainer or a PyTorch DataLoader:
dataloader = torch.utils.data.DataLoader(dataset, batch_size=4)
next(iter(dataloader))
Hope this helps
Thanks, @thomwolf and @sipah00 ,
I tried to implement your suggestions in my scripts. Now, I am facing a connection time-out error. I am using a local file, so I have no idea why the module requests the S3 database.
The log is:
Traceback (most recent call last):
File "/home/.local/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
raise err
File "/home/.local/lib/python3.6/site-packages/urllib3/util/connection.py", line 74, in create_connection
timeout=timeout
File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 720, in urlopen
sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
Traceback (most recent call last):
File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 672, in urlopen
method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
File "/home/.local/lib/python3.6/site-packages/urllib3/util/retry.py", line 436, in increment
chunked=chunked,
File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 376, in _make_request
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='s3.amazonaws.com', port=443): Max retries exceeded with url: /datasets.huggingface.co/datasets/datasets/text/text.py (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection obj
ect at 0x7fff401e0e48>: Failed to establish a new connection: [Errno 110] Connection timed out',))
Traceback (most recent call last):
File "/scratch/roberta_emohash/run_language_modeling.py", line 1019, in <module>
main()
File "/scratch/roberta_emohash/run_language_modeling.py", line 962, in main
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
File "/scratch/roberta_emohash/run_language_modeling.py", line 177, in load_and_cache_examples
return HG_Datasets(tokenizer, file_path, args)
File "/scratch/roberta_emohash/run_language_modeling.py", line 117, in HG_Datasets
dataset = load_dataset('text', data_files=files, cache_dir = args.data_cache_dir, split="train")
File "/arc/project/evn_py36/datasets/datasets/src/datasets/load.py", line 590, in load_dataset
self._validate_conn(conn)
File "/home/.local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 994, in _validate_conn
conn.connect()
File "/home/.local/lib/python3.6/site-packages/urllib3/connection.py", line 300, in connect
conn = self._new_conn()
File "/home/.local/lib/python3.6/site-packages/urllib3/connection.py", line 169, in _new_conn
self, "Failed to establish a new connection: %s" % e
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7fff401e0da0>: Failed to establish a new connection: [Errno 110] Connection timed out
Do you have any experience on this issue?
No, I didn't encounter this problem; it seems to me to be a network problem.
I noticed this is because I use a cloud server that does not allow connections from our standard compute nodes to outside resources.
For the datasets package, it seems that if the loading script is not already cached locally, it attempts to connect to AWS to download the dataset loading script.
I am wondering why the package works this way. Do you have any suggestions to solve this issue?
I solved the above issue by downloading text.py manually and passing its path to the load_dataset function.
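For reference, a minimal sketch of that workaround (all paths here are placeholders, not the exact ones from my job):

import glob
from datasets import load_dataset

files = sorted(glob.glob('/scratch/shards/*'))
# Point load_dataset at a local copy of the loading script instead of the
# canonical 'text' name, so nothing needs to be fetched from S3.
dataset = load_dataset('/scratch/text.py', data_files=files, cache_dir='/scratch/cache', split='train')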
Now I have a new issue with a read-only file system.
The error is:
I0916 22:14:38.453380 140737353971520 filelock.py:274] Lock 140734268996072 acquired on /scratch/chiyuzh/roberta/text.py.lock
Found main folder for dataset /scratch/chiyuzh/roberta/text.py at /home/chiyuzh/.cache/huggingface/modules/datasets_modules/datasets/text
Creating specific version folder for dataset /scratch/chiyuzh/roberta/text.py at /home/chiyuzh/.cache/huggingface/modules/datasets_modules/datasets/text/512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7fcc649178b014
I0916 22:14:38.530371 140737353971520 filelock.py:318] Lock 140734268996072 released on /scratch/chiyuzh/roberta/text.py.lock
Traceback (most recent call last):
File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 1019, in <module>
main()
File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 962, in main
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 177, in load_and_cache_examples
return HG_Datasets(tokenizer, file_path, args)
File "/scratch/chiyuzh/roberta/run_language_modeling_hg.py", line 117, in HG_Datasets
dataset = load_dataset('/scratch/chiyuzh/roberta/text.py', data_files=files, cache_dir = args.data_cache_dir, split="train")
File "/arc/project/chiyuzh/evn_py36/datasets/src/datasets/load.py", line 590, in load_dataset
path, script_version=script_version, download_config=download_config, download_mode=download_mode, dataset=True
File "/arc/project/chiyuzh/evn_py36/datasets/src/datasets/load.py", line 385, in prepare_module
os.makedirs(hash_folder_path)
File "/project/chiyuzh/evn_py36/lib/python3.6/os.py", line 220, in makedirs
mkdir(name, mode)
OSError: [Errno 30] Read-only file system: '/home/chiyuzh/.cache/huggingface/modules/datasets_modules/datasets/text/512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7fcc649178b014'
I installed datasets at /project/chiyuzh/evn_py36/datasets/src, which is a writable directory.
I also tried changing the environment variables to the writable directory:
export HF_MODULES_PATH=/project/chiyuzh/evn_py36/datasets/cache_dir/
export HF_DATASETS_CACHE=/project/chiyuzh/evn_py36/datasets/cache_dir/
In my script, I also changed to:
dataset = load_dataset('/scratch/chiyuzh/roberta/text.py', data_files=files, cache_dir = args.data_cache_dir, split="train")
data_cache_dir = $TMPDIR/data/
which is also a writable directory.
But it still tries to make a directory at /home/chiyuzh/.cache/huggingface/modules/. Do you have any idea about this issue? @thomwolf
(Quoting the sharding and tokenization walkthrough from above.)
When I run 'dataset = dataset.map(encode, batched=True)', I encountered a problem like this:
Testing the mapped function outputs
Traceback (most recent call last):
  File "", line 1, in
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/dataset_dict.py", line 300, in map
    for k, dataset in self.items()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/dataset_dict.py", line 300, in
    for k, dataset in self.items()
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1224, in map
    update_data = does_function_return_dict(test_inputs, test_indices)
  File "/anaconda3/envs/torch-xla-1.6/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1195, in does_function_return_dict
    function(*fn_args, indices, **fn_kwargs) if with_indices else function(*fn_args, **fn_kwargs)
  File "", line 3, in encode
TypeError: __init__() takes 1 positional argument but 2 were given
What is your encoder function?
It is the same as suggested:
def encode(examples): return tokenizer(examples['text'], truncation=True, padding='max_length')
Do you use this function in a class object?
__init__() takes 1 positional argument but 2 were given: I guess the additional argument is self?
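To illustrate what this question is getting at, a minimal sketch (assuming tokenizer and dataset already exist):

# Defined at module level, map() calls encode(batch) with a single argument -- fine.
def encode(examples):
    return tokenizer(examples['text'], truncation=True, padding='max_length')

dataset = dataset.map(encode, batched=True)

# The same signature placed inside a class and called on an instance receives
# `self` as an extra first argument, which is one way to end up with
# "takes 1 positional argument but 2 were given".
class Corpus:
    def encode(examples):                    # note: no `self` parameter
        return tokenizer(examples['text'], truncation=True, padding='max_length')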
Thanks for your reply. Could you provide a simple example here? Currently, I do not use this function in a class object. I think you are right, and I was wondering how to construct this class. I tried to modify it based on transformers' LineByLineTextDataset. Am I correct?
class LineByLineTextDataset(Dataset):
    """
    This will be superseded by a framework-agnostic approach soon.
    """

    def __init__(self, tokenizer: PreTrainedTokenizer, file_path: str, block_size: int):
        assert os.path.isfile(file_path), f"Input file path {file_path} not found"
        # Here, we do not cache the features, operating under the assumption
        # that we will soon use fast multithreaded tokenizers from the
        # `tokenizers` repo everywhere =)
        #logger.info("Creating features from dataset file at %s", file_path)
        #with open(file_path, encoding="utf-8") as f:
        #    lines = [line for line in f.read().splitlines() if (len(line) > 0 and not line.isspace())]
        #batch_encoding = tokenizer(lines, add_special_tokens=True, truncation=True, max_length=block_size)

        import glob
        files = glob.glob('/home/mtzhang111/fairseq/cs_doc/shards/shard_003*')
        from datasets import load_dataset
        dataset = load_dataset('text', data_files=files)
        batch_encoding = dataset.map(encode, batched=True)
        self.examples = batch_encoding["input_ids"]

    def encode(examples):
        return tokenizer(examples['text'], truncation=True, padding='max_length')

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i) -> torch.Tensor:
        return torch.tensor(self.examples[i], dtype=torch.long)
I am also struggling with this adaptation. I am not sure whether I am right.
I think you don't need to construct the class LazyLineByLineTextDataset(Dataset) at all.
torch.utils.data.Dataset is a generator.
Now we use dataset = dataset.map(encode, batched=True) as a generator, so we can just pass dataset to torch.utils.data.DataLoader.
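A minimal end-to-end sketch of that idea (files and encode as defined earlier in the thread):

import torch
from datasets import load_dataset

dataset = load_dataset('text', data_files=files, split='train')
dataset = dataset.map(encode, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])

# No wrapper class needed: the formatted Dataset can go straight into a DataLoader.
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)
batch = next(iter(dataloader))   # dict of tensors: input_ids, attention_mask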
@chiyuzhang94 Thanks for your reply. After some changes, currently, I managed to make the data loading process running. I published it in case you might want to take a look. Thanks for your help! https://github.com/shizhediao/Transformers_TPU
Hi @shizhediao ,
Thanks! It looks great!
But my problem remains: the cache directory is on a read-only file system. As I mentioned, I tried to change the cache directory, but it didn't work.
Do you have any suggestions?
I installed datasets at /project/chiyuzh/evn_py36/datasets/src, which is a writable directory. I also tried changing the environment variables to the writable directory:
export HF_MODULES_PATH=/project/chiyuzh/evn_py36/datasets/cache_dir/
I think it is HF_MODULES_CACHE and not HF_MODULES_PATH, @chiyuzhang94.
Could you try again and let me know if it fixes your issue?
We should probably add a section in the docs on the caching system, with the environment variables in particular.
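For example, a minimal sketch of redirecting both caches to a writable location (the paths are placeholders); the variables can equally be exported in the shell, the important part is that they are set before datasets is imported:

import os
# Must be set before `datasets` is imported, since the values are typically read at import time.
os.environ["HF_MODULES_CACHE"] = "/project/writable/hf_modules"     # where loading scripts go
os.environ["HF_DATASETS_CACHE"] = "/project/writable/hf_datasets"   # where Arrow cache files go

from datasets import load_dataset
dataset = load_dataset('text', data_files='test.txt', split='train')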
Hi @thomwolf , @lhoestq ,
Thanks for your suggestions. With the latest version of this package, I can load text data without Internet.
But I found that dataset loading is very slow.
My script looks like this:
def token_encode(examples):
tokenizer_out = tokenizer(examples['text'], truncation=True, padding="max_length", add_special_tokens=True, max_length=args.block_size)
return tokenizer_out
path = Path(file_path)
files = sorted(path.glob('*'))
dataset = load_dataset('./text.py', data_files=files, cache_dir = args.data_cache_dir, split="train")
dataset = dataset.map(token_encode, batched=True)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
I have 1,123,870,657 lines in my input directory. The processing speed is shown below; it is very slow.
| 13/1123871 [00:02<62:37:39, 4.98ba/s]^M 0%|
| 14/1123871 [00:03<61:27:31, 5.08ba/s]^M 0%|
| 15/1123871 [00:03<66:34:19, 4.69ba/s]^M 0%|
| 16/1123871 [00:03<68:25:01, 4.56ba/s]^M 0%|
| 17/1123871 [00:03<72:00:03, 4.34ba/s]^M 0%|
Do you have any suggestions to accelerate this loading process?
You can use multiprocessing by specifying num_proc= in .map().
Also, it looks like you have 1123871 batches of 1000 elements (the default batch size), i.e. 1,123,871,000 lines in total. Am I right?
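For reference, a sketch of how both knobs appear on the map call (token_encode as in your script; the exact values are just examples):

dataset = dataset.map(
    token_encode,
    batched=True,
    batch_size=10_000,   # rows per call of the batched function (default is 1000)
    num_proc=8,          # number of worker processes; keep at or below free CPU cores
)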
Hi @lhoestq ,
Thanks. I will try it.
You are right. I have 1,123,870,657 lines in total. I split the large file into 440 small files; each file has 2,560,000 lines.
I have another question. I am using a cloud server that only allows a job to run for up to 7 days, so I need to resume my model every week. If the script has to load and process the dataset every time, it is very inefficient at the current processing speed. Is it possible to process the dataset once and reuse the processed cache in future runs?
Hi @lhoestq ,
I tried to use multiprocessing, but I got the errors below. Because I am using distributed training, it seems there is some conflict with the distributed job.
Do you have any suggestions?
I0925 10:19:35.603023 140737353971520 filelock.py:318] Lock 140737229443368 released on /tmp/pbs.1120510.pbsha.ib.sockeye/cache/_tmp_pbs.1120510.pbsha.ib.sockeye_cache_text_default-7fb934ed6fac5d01_0.0.0_512f465342e4f4cd07a8791428a629c043bb89d55ad7817cbf7
fcc649178b014.lock
Traceback (most recent call last):
File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 1024, in <module>
main()
File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 967, in main
train_dataset = load_and_cache_examples(args, tokenizer, evaluate=False)
File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 180, in load_and_cache_examples
return HG_Datasets(tokenizer, file_path, args)
File "/scratch/chiyuzh/roberta/run_language_modeling.py", line 119, in HG_Datasets
dataset = dataset.map(token_encode, batched=True, batch_size = 10000, num_proc = 16)
File "/project/chiyuzh/evn_py36/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1287, in map
transformed_shards = [r.get() for r in results]
File "/project/chiyuzh/evn_py36/lib/python3.6/site-packages/datasets/arrow_dataset.py", line 1287, in <listcomp>
transformed_shards = [r.get() for r in results]
File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/project/chiyuzh/evn_py36/lib/python3.6/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
AttributeError: Can't pickle local object 'HG_Datasets.<locals>.token_encode'
For multiprocessing, the function given to map must be picklable.
Maybe you could try to define token_encode outside HG_Datasets?
Also, maybe #656 could make locally defined functions picklable for multiprocessing, once it's merged.
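A sketch of that rearrangement, with token_encode at module level so that multiprocessing can pickle a reference to it (the checkpoint name, paths, and simplified signature are placeholders, not your exact script):

from pathlib import Path
from datasets import load_dataset
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def token_encode(examples):      # module-level function: picklable for num_proc > 1
    return tokenizer(examples["text"], truncation=True,
                     padding="max_length", max_length=128)

def HG_Datasets(file_path, data_cache_dir):
    files = [str(p) for p in sorted(Path(file_path).glob("*"))]
    dataset = load_dataset("./text.py", data_files=files,
                           cache_dir=data_cache_dir, split="train")
    return dataset.map(token_encode, batched=True,
                       batch_size=10_000, num_proc=16)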
As for processing the dataset once and reusing it in future runs: feel free to save your processed dataset using dataset.save_to_disk("path/to/save/directory").
Then you'll be able to reload it again using:
from datasets import load_from_disk
dataset = load_from_disk("path/to/save/directory")
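For the 7-day job limit, the two runs could then look roughly like this (a sketch; the paths are placeholders and token_encode is the function from the earlier script):

# Run 1: tokenize once and persist the Arrow files to disk.
dataset = dataset.map(token_encode, batched=True, num_proc=16)
dataset.save_to_disk("/project/processed_corpus")

# Run 2 (a later job): reload without re-tokenizing.
from datasets import load_from_disk
dataset = load_from_disk("/project/processed_corpus")
dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])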
Hi @lhoestq ,
Thanks for your suggestion. I tried to process the dataset and save it to disk. I have 1.12B samples in the raw dataset. I used 16 processes and ran the processing job for 7 days, but it didn't finish. I don't know why the processing is so slow.
The log shows that some processes (#12, #14, #15) are very slow. Different processes run at different speeds, and the slow ones look like a bottleneck.
Could you please give me any suggestion to improve the processing speed?
Thanks. Chiyu
Here is my code:
def token_encode(examples):
tokenizer_out = tokenizer(examples['text'], truncation=True, padding="max_length", add_special_tokens=True, max_length=args.block_size)
return tokenizer_out
path = Path(file_path)
files = sorted(path.glob('*'))
dataset = load_dataset('./text.py', data_files=files, cache_dir = args.data_cache_dir, split="train")
dataset = dataset.map(token_encode, batched=True, batch_size = 16384, num_proc = 16)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
dataset.save_to_disk(output_dir)
Here is the log.
^M#6: 1%|β | 59/4288 [55:10<66:11:58, 56.35s/ba]
^M#1: 8%|β | 356/4288 [55:39<10:40:02, 9.77s/ba]
^M#2: 5%|β | 210/4288 [55:33<17:47:19, 15.70s/ba]
^M#0: 19%|ββ | 836/4288 [55:53<4:08:56, 4.33s/ba]
^M#0: 20%|ββ | 837/4288 [55:57<4:01:52, 4.21s/ba]
^M#1: 8%|β | 357/4288 [55:48<10:38:09, 9.74s/ba]
^M#0: 20%|ββ | 838/4288 [56:01<4:02:56, 4.23s/ba]
^M#3: 4%|β | 155/4288 [55:43<24:41:20, 21.51s/ba]
^M#0: 20%|ββ | 839/4288 [56:05<4:04:48, 4.26s/ba]
^M#12: 1%| | 29/4288 [54:50<133:20:53, 112.72s/ba]
^M#2: 5%|β | 211/4288 [55:48<17:40:33, 15.61s/ba]
^M#14: 0%| | 2/4288 [04:24<157:17:50, 132.12s/ba]
^M#15: 0%| | 1/4288 [02:24<172:11:37, 144.60s/ba]
Hi !
As far as I can tell, there could be several reasons for your processes to have different speeds:
- some parts of your dataset have short passages while some have longer passages that take more time to be processed
- OR there are other processes running that prevent some of them from running at full speed
- OR the value of num_proc is higher than the number of processes that you can actually run in parallel at full speed
So I'd suggest you check that nothing else is running in parallel to your processing job, and maybe also take a look at the slow parts of the dataset.
When doing multiprocessing, the dataset is sharded into num_proc contiguous parts that are processed individually in each process. If you want to take a look at the dataset processed in the 12th shard of 16, for example, you can do:
my_shard = dataset.shard(num_shards=16, index=12, contiguous=True)
Hope this helps, let me know if you find what is causing this slowdown.
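For example, a small probe of one shard to compare passage lengths across workers (the 'text' column name and 16 shards follow the setup above):

my_shard = dataset.shard(num_shards=16, index=12, contiguous=True)   # worker #12's share
sample = my_shard.select(range(10_000))                              # small probe only
lengths = [len(t) for t in sample["text"]]
print(len(my_shard), sum(lengths) / len(lengths), max(lengths))      # rows, mean and max length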
Do you use a fast or a slow tokenizer from the transformers library, @chiyuzhang94?
Hi @thomwolf , I use this:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
I guess this is a slow one; let me explore the fast tokenizer.
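A sketch of the fast-tokenizer variant (assuming the same args object as in my script):

from transformers import AutoTokenizer, RobertaTokenizerFast

# Either ask AutoTokenizer for the Rust-backed implementation...
tokenizer = AutoTokenizer.from_pretrained(args.model_name_or_path,
                                          cache_dir=args.cache_dir, use_fast=True)
# ...or request it explicitly.
tokenizer = RobertaTokenizerFast.from_pretrained(args.model_name_or_path,
                                                 cache_dir=args.cache_dir)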
Hi @lhoestq ,
Thanks for your suggestions. I don't think my problem is due to any of these reasons.
I have 1,123,870,657 lines in total. I split the large file into 440 small files; each file has 2,560,000 lines. The last file is a little smaller, but they are all similar. I randomly shuffled all 1,123,870,657 lines, so the sequences should be similar across all the files.
I run this script on an entire node. I requested all the resources on the node (40 CPUs, 384 GB memory), so there were no other processes running.
Hi @thomwolf,
I am using RobertaTokenizerFast now.
But the speed is still imbalanced; some processes are still slow. Here is part of the log; #0 is always much faster than the other processes.
#15: 3%|β | 115/3513 [3:18:36<98:01:33, 103.85s/ba]
#2: 24%|βββ | 847/3513 [3:20:43<11:06:49, 15.01s/ba]
#1: 37%|ββββ | 1287/3513 [3:20:52<6:19:02, 10.22s/ba]
#0: 72%|ββββββββ | 2546/3513 [3:20:52<1:51:03, 6.89s/ba]
#3: 18%|ββ | 617/3513 [3:20:36<15:50:30, 19.69s/ba]
#0: 73%|ββββββββ | 2547/3513 [3:20:59<1:50:25, 6.86s/ba]
#1: 37%|ββββ | 1288/3513 [3:21:02<6:21:13, 10.28s/ba]
#7: 7%|β | 252/3513 [3:20:09<44:09:03, 48.74s/ba]
#12: 4%|β | 144/3513 [3:19:19<78:00:54, 83.36s/ba]
#4: 14%|ββ | 494/3513 [3:20:37<20:46:06, 24.77s/ba]
#0: 73%|ββββββββ | 2548/3513 [3:21:06<1:49:26, 6.80s/ba]
#2: 24%|βββ | 848/3513 [3:20:58<11:06:17, 15.00s/ba]
Here is my script related to the datasets processing:
tokenizer = RobertaTokenizerFast.from_pretrained(args.model_name_or_path, cache_dir=args.cache_dir)
def token_encode(examples):
tokenizer_out = tokenizer(examples['text'], truncation=True, padding="max_length", add_special_tokens=True, max_length=128)
return tokenizer_out
def HG_Datasets(tokenizer, file_path, args):
path = Path(file_path)
files = sorted(path.glob('*'))
dataset = load_dataset('./text.py', data_files=files, cache_dir = ""./, split="train")
dataset = dataset.map(token_encode, batched=True, batch_size = 20000, num_proc = 16)
dataset.set_format(type='torch', columns=['input_ids', 'attention_mask'])
return dataset
I have 1,123,870,657 lines in total. I split the large file into 440 small files; each file has 2,560,000 lines.
Could you please give any suggestion? Thanks very much!!
Hi @thomwolf @lhoestq ,
Thanks for your help.
Finally, the preprocessing is complete. I used 32 CPUs and the preprocessing took 55 hours. The dataset.arrow file is 1.18 TB.
I saved the processed dataset and used it for RoBERTa pre-training. You can find my scripts here.
I loaded the dataset with load_from_disk():
dataset = load_from_disk(file_path)
But the job runs out of memory when it loads batch samples at the first step. Here is the error log. I can see that the dataset was loaded successfully; the job was terminated at the first step. I expected the datasets library to handle memory efficiently, but it still runs out of memory.
I1017 00:22:08.109311 140737353971520 run_language_modeling.py:344] Total optimization steps = 228650
I1017 00:22:08.109332 140737353971520 run_language_modeling.py:343] Gradient Accumulation steps = 4
I1017 00:22:08.109332 140737353971520 run_language_modeling.py:343] Gradient Accumulation steps = 4
I1017 00:22:08.109340 140737353971520 run_language_modeling.py:343] Gradient Accumulation steps = 4
I1017 00:22:08.109347 140737353971520 run_language_modeling.py:345] Warm-up steps = 10000
I1017 00:22:08.109370 140737353971520 run_language_modeling.py:344] Total optimization steps = 228650
I1017 00:22:08.109369 140737353971520 run_language_modeling.py:344] Total optimization steps = 228650
I1017 00:22:08.109378 140737353971520 run_language_modeling.py:344] Total optimization steps = 228650
I1017 00:22:08.109405 140737353971520 run_language_modeling.py:345] Warm-up steps = 10000
I1017 00:22:08.109406 140737353971520 run_language_modeling.py:345] Warm-up steps = 10000
I1017 00:22:08.109414 140737353971520 run_language_modeling.py:345] Warm-up steps = 10000
I1017 00:22:08.109496 140737353971520 run_language_modeling.py:359] Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109497 140737353971520 run_language_modeling.py:359] Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109499 140737353971520 run_language_modeling.py:359] Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109500 140737353971520 run_language_modeling.py:359] Continuing training from checkpoint, will skip to saved global_step
I1017 00:22:08.109534 140737353971520 run_language_modeling.py:360] Continuing training from epoch 0
I1017 00:22:08.109534 140737353971520 run_language_modeling.py:360] Continuing training from epoch 0
I1017 00:22:08.109535 140737353971520 run_language_modeling.py:360] Continuing training from epoch 0
I1017 00:22:08.109537 140737353971520 run_language_modeling.py:360] Continuing training from epoch 0
I1017 00:22:08.109573 140737353971520 run_language_modeling.py:361] Continuing training from global step 1
I1017 00:22:08.109574 140737353971520 run_language_modeling.py:361] Continuing training from global step 1
I1017 00:22:08.109574 140737353971520 run_language_modeling.py:361] Continuing training from global step 1
I1017 00:22:08.109575 140737353971520 run_language_modeling.py:361] Continuing training from global step 1
I1017 00:22:08.109614 140737353971520 run_language_modeling.py:362] Will skip the first 1 steps in the first epoch
I1017 00:22:08.109613 140737353971520 run_language_modeling.py:362] Will skip the first 1 steps in the first epoch
I1017 00:22:08.109613 140737353971520 run_language_modeling.py:362] Will skip the first 1 steps in the first epoch
I1017 00:22:08.109620 140737353971520 run_language_modeling.py:362] Will skip the first 1 steps in the first epoch
^MEpoch: 0%| | 0/5 [00:00<?, ?it/s]
^MIteration: 0%| | 0/182922 [00:00<?, ?it/s]^[[ATraceback (most recent call last):
File "/project/evn_py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/project/evn_py36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/project/evn_py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/project/evn_py36/lib/python3.6/site-packages/torch/distributed/launch.py", line 257, in main
cmd=cmd)
subprocess.CalledProcessError: Command '['/project/evn_py36/bin/python3', '-u', '/scratch/roberta_og/run_language_modeling.py', '--local_rank=3', '--gradient_accumulation_steps', '4', '--train_data_file', '/tmp/pbs.1286839.pbsha.ib.sockeye/data/', '--output_dir', '/scratch/roberta_og/ckpt_model_hg/', '--model_type', 'roberta', '--logging_dir', '/scratch/roberta_og/runs/', '--mlm', '--fp16', '--data_cache_dir', '/tmp/pbs.1286839.pbsha.ib.sockeye/cache/', '--num_workers', '0', '--warmup_steps', '10000', '--lazy_loading', '--model_name_or_path', '/scratch/roberta_og/roberta/', '--config_name', '/scratch/roberta_og/roberta', '--tokenizer_name', '/scratch/roberta_og/roberta', '--do_train', '--block_size', '60', '--learning_rate', '5e-5', '--num_train_epochs', '5', '--save_total_limit', '5', '--per_gpu_train_batch_size', '256', '--seed', '42']' died with <Signals.SIGKILL: 9>.
Could you please investigate this OOM issue? Do you have any suggestions?
I run this job on 6 nodes; each node has four 32 GB GPUs, 24 CPUs, and 192 GB of CPU memory. I use distributed training.
Versions: Python 3.6.8, PyTorch 1.6.0, TensorFlow 2.3.0, Transformers 3.0.1, datasets 1.1.2
I might have a related issue:
My custom arrow file created by load_dataset is around 260 GB large. The corresponding dataset is wrapped in a torch.utils.data.Dataset.
When spawning 8 processes with xla_multiprocessing but num_workers=0 for the DataLoader, the memory peaks at around 300 GB RAM in the beginning and then levels off at 250 GB RAM. Since the number of workers for the data loading is 0, there shouldn't be prefetched batches AFAIK.
When looping over the same dataset without the DataLoader in a single process and fetching the same batch size, less than 1 GB of RAM is requested (certainly due to some additional tensor creation in __getitem__).
Is this expected?
The library does not load the dataset in memory. Not sure why your script using xla_multiprocessing seems to load everything unfortunately :/
PyTorch DataLoader indeed uses __getitem__ to iterate through the dataset batch by batch without loading everything into memory.
Do you think you can share a script that reproduces the issue with a smaller dataset (a few GB) to experiment/debug with ?
I minimalistically put together the relevant steps in this colab. Maybe you can spot a problem? I'm really wondering what causes this huge memory consumption. The DataLoader just produces tensors of shape (batch_size, seq_len) = (128, 128), which accounts for less than 1 GB. Apart from the dataset, the only large component is the model, which consumes around 12 GB per process, but of the devices' (TPUs') memory.
Multiprocess training with TPU leads to another problem: out of space in memory space smem (https://github.com/pytorch/xla/issues/2628). I'm not sure, though, whether this is related.
I spotted the problem: it's the sampler's list of indices (e.g. RandomSampler) which consumes lots of memory.
Try it out and watch your memory usage:
indices = torch.randperm(int(1e8), generator=None).tolist()
indices takes about 3.8 GB of RAM; compare it to e.g. n=int(1e7). Not converting to a list is much more memory-efficient.
This might be interesting for the OP, @chiyuzhang94.
Oh good catch! I'm wondering if there's a memory-efficient pytorch sampler out there?
Maybe one based on the Knuth algorithm to yield indices on the fly?
@phtephanx How did you solve your problem? E.g., have you defined your own RandomSampler in the Trainer class?
I am having the same issues. Thanks!
I solved it in a hacky way since I don't have so much time for this. As I use preemptible engines anyway, I split my dataset into one chunk per day in the data-loading logic.
You could try to change this line from
yield from torch.randperm(n, generator=self.generator).tolist()
to
yield from torch.randperm(n, generator=self.generator)
and thereby avoid converting to a list, then build pytorch from source.
If you use DistributedSampler, then, of course, adapt it there.
As pointed out in this issue, not converting to lists already saves lots of memory. Let me know if this works!
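If rebuilding PyTorch is not an option, a custom sampler along the same lines could be passed to the DataLoader instead. This is only a sketch of the idea, not an official torch or datasets component:

import torch
from torch.utils.data import Sampler

class TensorBackedRandomSampler(Sampler):
    """Yields shuffled indices without materializing a Python list of ints."""

    def __init__(self, data_source, generator=None):
        self.data_source = data_source
        self.generator = generator

    def __iter__(self):
        n = len(self.data_source)
        # randperm still allocates an int64 tensor (~8 bytes per index), but that is
        # far smaller than a list of n boxed Python integers.
        perm = torch.randperm(n, generator=self.generator)
        for i in perm:
            yield int(i)

    def __len__(self):
        return len(self.data_source)

# usage: pass the sampler explicitly so the DataLoader does not build its own RandomSampler
# loader = torch.utils.data.DataLoader(dataset, batch_size=128,
#                                      sampler=TensorBackedRandomSampler(dataset))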
Thanks a lot @phtephanx, will try this solution.
Hey @chiyuzhang94, could you upload your modified gist somewhere, please? It would be beneficial for the community, thanks.
Hi @eloukas , I can find the link to my script here https://github.com/huggingface/datasets/issues/610#issuecomment-711067895
Hi, I've made a PR in pytorch which replaces the RandomSampler implementation with an algorithm that doesn't use memory. Please try it out and contribute to the PR.
Closing this one since the original issue has been discussed
I migrated my question from https://github.com/huggingface/transformers/pull/4009#issuecomment-690039444
I tried to train a RoBERTa from scratch using transformers, but I got OOM issues when loading a large text file. Following the suggestion from @thomwolf, I tried to use datasets to load my text file. The test.txt is a simple sample where each line is a sentence. But the dataloader cannot yield samples and raises an error.
dataset.set_format(type='torch', columns=["text"]) returns a log message. I noticed the dataset is DatasetDict({'train': Dataset(features: {'text': Value(dtype='string', id=None)}, num_rows: 44)}). Each sample can be accessed by dataset["train"]["text"] instead of dataset["text"].
Could you please give me any suggestions on how to modify this code to load the text file?
Versions: Python 3.7.3, PyTorch 1.6.0, TensorFlow 2.3.0, datasets 1.0.1