huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

how to limit the size of memory mapped file? #6176

Open williamium3000 opened 1 year ago

williamium3000 commented 1 year ago

Describe the bug

Hugging Face Datasets uses memory-mapped files to map large datasets into memory for fast access. However, it appears to take all of the machine's memory into account when working with memory-mapped files, which is a problem on our cluster: the scheduler grants my job only a small portion of the node's memory (and allocation fails once that limit is exceeded), yet when the dataset checks the total memory it sees the whole machine and tries to allocate more than the job is allowed. Is there a way to explicitly limit the size of the memory-mapped file?
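For context, here is a minimal sketch of the mismatch I mean, assuming a Unix node where SGE enforces its memory limit through the process address-space limit, and assuming the third-party psutil package is available:

import resource  # standard library (Unix); SGE typically applies its memory limit via setrlimit

import psutil  # third-party; used here only to read machine-wide memory figures

# The per-process virtual address-space cap imposed on this job by the scheduler.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)

# The machine's total memory, which on a shared node is far larger than the job's own allocation.
total = psutil.virtual_memory().total

print("address-space limit:", "unlimited" if soft == resource.RLIM_INFINITY else soft)
print("total system memory:", total)

# Memory-mapping an Arrow file counts against the address-space limit even though its
# pages mostly stay on disk, so mapping a dataset larger than the limit fails with
# "OSError: Memory mapping file failed: Cannot allocate memory".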

Steps to reproduce the bug

from datasets import load_dataset

dataset = load_dataset("c4", "en", streaming=True)

Expected behavior

In a normal environment this works without any problem. However, when the system grants the program only a portion of the memory, the dataset still checks the machine's total memory, so it tries to allocate more memory than the job is allowed.

Environment info

Linux cluster with SGE (Sun Grid Engine)

mariosasko commented 1 year ago

Hi! Can you share the error this reproducer throws in your environment? streaming=True streams the dataset as it's iterated over, without creating a memory-mapped file.
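A minimal sketch of that streaming path (the field name and example count below are only for illustration):

from datasets import load_dataset

# streaming=True returns an IterableDataset backed directly by the source files;
# no Arrow cache is written and no memory-mapped file is opened.
streamed = load_dataset("c4", "en", split="train", streaming=True)

for example in streamed.take(3):  # take() limits how many examples are pulled
    print(example["text"][:80])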

williamium3000 commented 1 year ago

The trace of the error. Streaming works but is slower.

Root Cause (first observed failure):
[0]:
  time      : 2023-08-24_06:06:01
  host      : compute-126.cm.cluster
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 48442)
  error_file: /tmp/torchelastic_4fqzcuuz/none_rx2470jl/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/users/yli7/.conda/envs/pytorch2.0/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
      return f(*args, **kwargs)
    File "Pretrain.py", line 214, in main
      pair_dataset, c4_dataset = create_dataset('pretrain', config)
    File "/dcs05/qiao/data/william/project/DaVinci/dataset/__init__.py", line 109, in create_dataset
      c4_dataset = load_dataset("c4", "en", split="train").to_iterable_dataset(num_shards=1024).map(pre_caption_huggingface)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/load.py", line 1810, in load_dataset
      ds = builder_instance.as_dataset(split=split, verification_mode=verification_mode, in_memory=keep_in_memory)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/builder.py", line 1145, in as_dataset
      datasets = map_nested(
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 436, in map_nested
      return function(data_struct)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/builder.py", line 1175, in _build_single_dataset
      ds = self._as_dataset(
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/builder.py", line 1246, in _as_dataset
      dataset_kwargs = ArrowReader(cache_dir, self.info).read(
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 244, in read
      return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 265, in read_files
      pa_table = self._read_files(files, in_memory=in_memory)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 200, in _read_files
      pa_table: Table = self._get_table_from_filename(f_dict, in_memory=in_memory)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 336, in _get_table_from_filename
      table = ArrowReader.read_table(filename, in_memory=in_memory)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 357, in read_table
      return table_cls.from_file(filename)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/table.py", line 1059, in from_file
      table = _memory_mapped_arrow_table_from_file(filename)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/table.py", line 65, in _memory_mapped_arrow_table_from_file
      opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
    File "/users/yli7/.local/lib/python3.8/site-packages/datasets/table.py", line 50, in _memory_mapped_record_batch_reader_from_file
      memory_mapped_stream = pa.memory_map(filename)
    File "pyarrow/io.pxi", line 1009, in pyarrow.lib.memory_map
    File "pyarrow/io.pxi", line 956, in pyarrow.lib.MemoryMappedFile._open
    File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
    File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
  OSError: Memory mapping file failed: Cannot allocate memory

mariosasko commented 1 year ago

This issue has previously been reported here: https://github.com/huggingface/datasets/issues/5710. Reporting it in the Arrow repo makes more sense as they have control over memory mapping.

PS: this is the API to reduce the size of the generated Arrow file:

from datasets import load_dataset_builder
builder = load_dataset_builder("c4", "en")
builder.download_and_prepare(max_shard_size="5GB")
dataset = builder.as_dataset()

If this resolves the issue, we can consider exposing max_shard_size in load_dataset.

williamium3000 commented 1 year ago

Thanks for the response. The problem doesn't seem to be resolved. The memory allocated to my environment is 64G, and the following error still occurs:

Python 3.8.16 (default, Jun 12 2023, 18:09:05) [GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from datasets import load_dataset_builder
>>> builder = load_dataset_builder("c4", "en")
>>> builder.download_and_prepare(max_shard_size="5GB")
Found cached dataset c4 (/users/yli7/.cache/huggingface/datasets/c4/en/0.0.0/df532b158939272d032cc63ef19cd5b83e9b4d00c922b833e4cb18b2e9869b01)
>>> dataset = builder.as_dataset()
  0%|          | 0/2 [00:00<?, ?it/s]Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/builder.py", line 1145, in as_dataset
    datasets = map_nested(
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 444, in map_nested
    mapped = [
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 445, in <listcomp>
    _single_map_nested((function, obj, types, None, True, None))
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/utils/py_utils.py", line 347, in _single_map_nested
    return function(data_struct)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/builder.py", line 1175, in _build_single_dataset
    ds = self._as_dataset(
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/builder.py", line 1246, in _as_dataset
    dataset_kwargs = ArrowReader(cache_dir, self.info).read(
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 244, in read
    return self.read_files(files=files, original_instructions=instructions, in_memory=in_memory)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 265, in read_files
    pa_table = self._read_files(files, in_memory=in_memory)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 200, in _read_files
    pa_table: Table = self._get_table_from_filename(f_dict, in_memory=in_memory)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 336, in _get_table_from_filename
    table = ArrowReader.read_table(filename, in_memory=in_memory)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/arrow_reader.py", line 357, in read_table
    return table_cls.from_file(filename)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/table.py", line 1059, in from_file
    table = _memory_mapped_arrow_table_from_file(filename)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/table.py", line 65, in _memory_mapped_arrow_table_from_file
    opened_stream = _memory_mapped_record_batch_reader_from_file(filename)
  File "/users/yli7/.local/lib/python3.8/site-packages/datasets/table.py", line 50, in _memory_mapped_record_batch_reader_from_file
    memory_mapped_stream = pa.memory_map(filename)
  File "pyarrow/io.pxi", line 1009, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 956, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: Memory mapping file failed: Cannot allocate memory

ruaruaruabick commented 1 year ago

Have you solved the problem?

williamium3000 commented 1 year ago

Nope. Streaming works but is slower.
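A sketch of that streaming workaround, adapted from the pipeline in the traceback above (pre_caption below is only a stand-in for the pre_caption_huggingface function used there):

from datasets import load_dataset

def pre_caption(example):
    # Stand-in for the user's pre_caption_huggingface preprocessing function.
    example["text"] = example["text"].strip()
    return example

# Instead of load_dataset(...).to_iterable_dataset(...), which still memory-maps the
# Arrow cache first, stream the split directly and apply the map lazily while iterating.
c4_stream = load_dataset("c4", "en", split="train", streaming=True).map(pre_caption)

for example in c4_stream.take(2):
    print(example["text"][:80])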