geodra closed this issue 5 months ago
Can you try setting "ulimit -n unlimited" in your bash terminal?
The reason you are seeing this is that each streaming dataset instantiation creates a shared-memory file descriptor, and these are only cleaned up when the Python interpreter exits.
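As a workaround on the Python side, a process can raise its own soft file-descriptor limit up to the hard cap before creating many datasets. This is a minimal sketch using only the standard-library `resource` module (the 65535 fallback is an arbitrary choice for when the hard limit is unlimited, not a value from this thread):

```python
import resource

# Inspect the current open-file limits for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

# Raise the soft limit toward the hard cap; 65535 is an arbitrary
# fallback target when the hard limit is unlimited.
target = hard if hard != resource.RLIM_INFINITY else 65535
if soft != resource.RLIM_INFINITY and soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```

Note that an unprivileged process can only raise its soft limit up to the hard limit; raising the hard limit itself requires root (or the limits.conf change discussed later in this thread).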
Closing as stale. Please reopen if you are still encountering problems.
I would like to reopen this issue, as I am seeing the same error when running the provided conversion example for C4 (https://github.com/mosaicml/llm-foundry/tree/main/scripts/data_prep):
OSError: [Errno 24] Too many open files
and ulimit is set to unlimited (at least I think so):
eldar@gpuserver9:~$ ulimit
unlimited
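One caveat worth noting: in bash, a bare `ulimit` reports the maximum file size (the `-f` limit), not the number of open files, so the output above does not actually show the open-files limit. To check the limit that matters for this error:

```shell
# Bare `ulimit` is equivalent to `ulimit -f` (max file size).
# The open-file descriptor limits must be queried explicitly:
ulimit -Sn   # soft limit on open file descriptors
ulimit -Hn   # hard limit on open file descriptors
```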
@karan6181
Hey @eldarkurtic, sorry for my super late response. Do you still see the issue? One recommendation is to call streaming.base.util.clean_stale_shared_memory() at the beginning of the main script. The streaming library uses shared memory to coordinate state between processes on the same node. If a run is killed before it has a chance to clean up gracefully, some artifacts may be left behind in shared memory on that node.
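The recommendation above can be sketched like this, assuming the `streaming` package (mosaicml-streaming) is installed; the import guard is only there so the sketch degrades gracefully when it is not:

```python
# Sketch: clear stale shared-memory artifacts left by killed runs
# before instantiating any StreamingDataset. Assumes the `streaming`
# package (mosaicml-streaming) is available.
try:
    from streaming.base.util import clean_stale_shared_memory
except ImportError:  # streaming not installed; skip the cleanup
    clean_stale_shared_memory = None


def main():
    if clean_stale_shared_memory is not None:
        clean_stale_shared_memory()
    # ... construct StreamingDataset instances / run training here ...


if __name__ == "__main__":
    main()
```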
And if you still see the issue, can you share the full traceback?
Hey @karan6181, I don't have this problem anymore. The solution was to edit /etc/security/limits.conf and increase the open-files limit from the default of 1024 to 65535 (this requires sudo access).
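For reference, the corresponding entries in /etc/security/limits.conf look like this (`*` applies the limit to all users; the nofile values here match the 65535 from the comment above, and a re-login is needed for the change to take effect):

```
# /etc/security/limits.conf
*    soft    nofile    65535
*    hard    nofile    65535
```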
Thanks, @eldarkurtic, for the confirmation. I am closing this issue.
Environment
Collecting system information...
System Environment Report
Created: 2023-06-30 19:57:29 UTC
PyTorch information
PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A
OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15)
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.26
Python version: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.10.178-162.673.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 525.85.12
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7R32
Stepping: 0
CPU MHz: 2755.990
BogoMIPS: 5599.86
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.0.0
[pip3] torch-model-archiver==0.7.1b20230208
[pip3] torch-optimizer==0.3.0
[pip3] torch-workflow-archiver==0.2.7b20230208
[pip3] torchaudio==2.0.1
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.11.4
[pip3] torchserve==0.7.1b20230208
[pip3] torchtext==0.15.1
[pip3] torchvision==0.15.1
[conda] blas 2.116 mkl conda-forge
[conda] blas-devel 3.9.0 16_linux64_mkl conda-forge
[conda] libblas 3.9.0 16_linux64_mkl conda-forge
[conda] libcblas 3.9.0 16_linux64_mkl conda-forge
[conda] liblapack 3.9.0 16_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 16_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.24.3 py310ha4c1d20_0 conda-forge
[conda] pytorch 2.0.0 aws_py3.10_cuda11.8_cudnn8.7.0_0 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] pytorch-cuda 11.8 h7e8668a_3 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] pytorch-mutex 1.0 cuda https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-model-archiver 0.7.1 py310_0 pytorch
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torch-workflow-archiver 0.2.7 py310_0 pytorch
[conda] torchaudio 2.0.1 py310_cu118 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchdata 0.6.0 py310 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchserve 0.7.1 py310_0 pytorch
[conda] torchtext 0.15.1 py310 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchtriton 2.0.0 py310 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchvision 0.15.1 py310_cu118 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
Composer information
Composer version: 0.15.0
Composer commit hash: None
Host processor model name: AMD EPYC 7R32
Host processor core count: 4
Number of nodes: 1
Accelerator model name: NVIDIA A10G
Accelerators per node: 1
CUDA Device Count: 1
To reproduce
Steps to reproduce the behavior:
cd llm-foundry/scripts/data_prep/
# Convert C4 dataset to StreamingDataset format
python convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root my-copy-c4 --splits train_small val_small \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd
Expected behavior
The dataset should be prepared as described in the tutorial: https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md#workflow-4-i-want-to-train-a-new-hf-model-from-scratch
Additional context
The errors I get are:
OSError: [Errno 24] Too many open files
RuntimeError: unable to open shared memory object in read-write mode: Too many open files (24)