Open lasuomela opened 6 months ago
Thanks for a very clean explanation. This happened to me too, and I don't have sudo access to update the value. I wonder if there might be another workaround.
One option is to just put more data in each file: /proc/sys/vm/max_map_count limits the maximum number of memory map areas a process may have, but as far as I know the size of a single mapped file isn't restricted. E.g. 5000 files of 1 GB each is 5 TB of data. https://huggingface.co/docs/datasets/v2.18.0/en/package_reference/main_classes#datasets.concatenate_datasets can come in handy here, as in the sketch below.
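A rough sketch of that idea, assuming the shards are plain datasets directories on disk (paths and counts are made up):

from datasets import concatenate_datasets, load_from_disk

# Hypothetical layout: many small dataset directories under data/.
shard_paths = [f"data/shard_{i}" for i in range(5000)]

# Merge the shards and write them back as a single dataset, so a later
# load_from_disk only needs to memory-map a handful of files.
merged = concatenate_datasets([load_from_disk(p) for p in shard_paths])
merged.save_to_disk("data/merged")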
Describe the bug
Hello,
I've been struggling with a problem using Huggingface datasets caused by PyArrow memory allocation. I finally managed to solve it, and thought to document it since similar issues have been raised here before (https://github.com/huggingface/datasets/issues/5710, https://github.com/huggingface/datasets/issues/6176).
In my case, I was trying to load ~70k dataset files from disk using
datasets.load_from_disk(data_path)
(meaning 70k repeated calls to load_from_disk). This triggered an uninformative exception at around 64k loaded files, despite system RAM usage being very low. After a lot of digging around, I discovered that my Ubuntu machine had a limit on the maximum number of memory-mapped files in
/proc/sys/vm/max_map_count
set to 65530, which was causing my data loader to crash. Increasing the limit in the file (echo <new_mmap_size> | sudo tee /proc/sys/vm/max_map_count) made the issue go away. While this isn't a bug as such in either Datasets or PyArrow, the behavior can be very confusing to users. Maybe it should be mentioned in the documentation? I suspect the other issues raised here about memory-mapping OOM errors could actually be a consequence of system configuration.
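For reference, the current limit can also be checked programmatically before loading. This is just a Linux-only sanity-check sketch (the 70k count is the illustrative number of datasets from above), not something Datasets does itself:

from pathlib import Path

# Read the kernel's per-process limit on memory map areas (Linux-only).
max_map_count = int(Path("/proc/sys/vm/max_map_count").read_text())
print(f"vm.max_map_count = {max_map_count}")

# Each loaded dataset keeps at least one Arrow file memory-mapped.
n_datasets = 70_000
if n_datasets >= max_map_count:
    print("Loading this many datasets will likely exhaust vm.max_map_count.")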
Br, Lauri
Steps to reproduce the bug
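Roughly the following sketch (paths and sizes are made up; writing the shards takes a while, but the point is just to exceed vm.max_map_count with loaded datasets):

from datasets import Dataset, load_from_disk

# Create ~70k tiny datasets on disk, one directory each (only needed once).
for i in range(70_000):
    Dataset.from_dict({"x": [i]}).save_to_disk(f"data/shard_{i}")

# Load them all. Each loaded dataset keeps its Arrow file memory-mapped,
# so with the default vm.max_map_count of 65530 this fails with an
# uninformative error somewhere around 64k loaded datasets.
loaded = [load_from_disk(f"data/shard_{i}") for i in range(70_000)]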
Expected behavior
No errors.
Environment info
datasets: 2.18.0
pyarrow: 15.0.0