huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0

PyArrow 'Memory mapping file failed: Cannot allocate memory' bug #6790

Open lasuomela opened 6 months ago

lasuomela commented 6 months ago

Describe the bug

Hello,

I've been struggling with a problem in Hugging Face Datasets caused by PyArrow memory mapping. I finally managed to solve it and thought I'd document it, since similar issues have been raised here before (https://github.com/huggingface/datasets/issues/5710, https://github.com/huggingface/datasets/issues/6176).

In my case, I was trying to load ~70k dataset files from disk using datasets.load_from_disk(data_path) (meaning 70k repeated calls to load_from_disk). This triggered an (uninformative) exception after around 64k files had been loaded:

  File "pyarrow/io.pxi", line 1053, in pyarrow.lib.memory_map
  File "pyarrow/io.pxi", line 1000, in pyarrow.lib.MemoryMappedFile._open
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
OSError: Memory mapping file failed: Cannot allocate memory

This happened despite system RAM usage being very low. After a lot of digging around, I discovered that my Ubuntu machine limits the number of memory map areas a process can hold: /proc/sys/vm/max_map_count was set to 65530, and hitting that limit was what crashed my data loader. Increasing the limit (echo <new_mmap_size> | sudo tee /proc/sys/vm/max_map_count) made the issue go away.
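For reference, here is a minimal sketch (Linux only, standard procfs paths) for checking the current limit and how many map areas the current process already holds:

# Read the system-wide per-process limit on memory map areas (Linux only).
with open('/proc/sys/vm/max_map_count') as f:
    max_map_count = int(f.read())

# Count the map areas currently held by this process; each pyarrow.memory_map
# call adds at least one entry here.
with open('/proc/self/maps') as f:
    current_maps = sum(1 for _ in f)

print(f'max_map_count: {max_map_count}, current mappings: {current_maps}')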

While this isn't a bug as such in either Datasets or PyArrow, the behavior can be very confusing to users. Maybe it should be mentioned in the documentation? I suspect some of the other memory mapping OOM errors raised here could actually be a consequence of the same system setting.

Br, Lauri

Steps to reproduce the bug

import numpy as np
import pyarrow as pa
import tqdm

# Write some data to disk
arr = pa.array(np.arange(100))
schema = pa.schema([
    pa.field('nums', arr.type)
])
with pa.OSFile('arraydata.arrow', 'wb') as sink:
    with pa.ipc.new_file(sink, schema=schema) as writer:
        batch = pa.record_batch([arr], schema=schema)
        writer.write(batch)

# Number of times to open a memory map of the file
nums = 70000

# Open a new memory map on each iteration; this fails with
# 'Memory mapping file failed: Cannot allocate memory' once the process
# exceeds vm.max_map_count (65530 here)
arrays = [pa.memory_map('arraydata.arrow', 'r') for _ in tqdm.tqdm(range(nums))]

Expected behavior

No errors.

Environment info

datasets: 2.18.0
pyarrow: 15.0.0

jxmorris12 commented 1 month ago

Thanks for a very clean explanation. This happened to me too, and I don't have sudo access to update the value. I wonder if there might be another workaround.

lasuomela commented 1 month ago

One option is to just put more data in each file: /proc/sys/vm/max_map_count limits the number of concurrent memory mappings, but as far as I know the size of an individual mapped file isn't restricted. E.g. 5000 files of 1 GB each is 5 TB of data. https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.concatenate_datasets can come in handy here, as in the sketch below.
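For illustration, a rough sketch of that approach, assuming the shards were written with save_to_disk (the paths, shard count, and group size are made up for the example):

import datasets

# Hypothetical paths to the many small datasets saved on disk.
shard_paths = [f'data/shard_{i}' for i in range(70_000)]

# Merge every `group_size` shards into one larger dataset, so far fewer
# memory maps are needed when the data is loaded later.
group_size = 1000
for start in range(0, len(shard_paths), group_size):
    group = [datasets.load_from_disk(path) for path in shard_paths[start:start + group_size]]
    merged = datasets.concatenate_datasets(group)
    merged.save_to_disk(f'data/merged_{start // group_size}')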