mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Error Data Preparation #406

Closed: geodra closed this issue 5 months ago

geodra commented 1 year ago

Environment

Collecting system information...

System Environment Report
Created: 2023-06-30 19:57:29 UTC

PyTorch information

PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Amazon Linux 2 (x86_64)
GCC version: (GCC) 7.3.1 20180712 (Red Hat 7.3.1-15)
Clang version: Could not collect
CMake version: version 3.26.3
Libc version: glibc-2.26

Python version: 3.10.10 | packaged by conda-forge | (main, Mar 24 2023, 20:08:06) [GCC 11.3.0] (64-bit runtime)
Python platform: Linux-5.10.178-162.673.amzn2.x86_64-x86_64-with-glibc2.26
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10G
Nvidia driver version: 525.85.12
cuDNN version: Probably one of the following:
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.1.1
/usr/local/cuda-11.2/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.1.1
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7R32
Stepping: 0
CPU MHz: 2755.990
BogoMIPS: 5599.86
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-7
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid

Versions of relevant libraries:
[pip3] numpy==1.24.3
[pip3] pytorch-ranger==0.1.1
[pip3] torch==2.0.0
[pip3] torch-model-archiver==0.7.1b20230208
[pip3] torch-optimizer==0.3.0
[pip3] torch-workflow-archiver==0.2.7b20230208
[pip3] torchaudio==2.0.1
[pip3] torchdata==0.6.0
[pip3] torchmetrics==0.11.4
[pip3] torchserve==0.7.1b20230208
[pip3] torchtext==0.15.1
[pip3] torchvision==0.15.1
[conda] blas 2.116 mkl conda-forge
[conda] blas-devel 3.9.0 16_linux64_mkl conda-forge
[conda] libblas 3.9.0 16_linux64_mkl conda-forge
[conda] libcblas 3.9.0 16_linux64_mkl conda-forge
[conda] liblapack 3.9.0 16_linux64_mkl conda-forge
[conda] liblapacke 3.9.0 16_linux64_mkl conda-forge
[conda] mkl 2022.1.0 h84fe81f_915 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] mkl-devel 2022.1.0 ha770c72_916 conda-forge
[conda] mkl-include 2022.1.0 h84fe81f_915 conda-forge
[conda] numpy 1.24.3 py310ha4c1d20_0 conda-forge
[conda] pytorch 2.0.0 aws_py3.10_cuda11.8_cudnn8.7.0_0 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] pytorch-cuda 11.8 h7e8668a_3 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] pytorch-mutex 1.0 cuda https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] torch-model-archiver 0.7.1 py310_0 pytorch
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torch-workflow-archiver 0.2.7 py310_0 pytorch
[conda] torchaudio 2.0.1 py310_cu118 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchdata 0.6.0 py310 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchmetrics 0.11.4 pypi_0 pypi
[conda] torchserve 0.7.1 py310_0 pytorch
[conda] torchtext 0.15.1 py310 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchtriton 2.0.0 py310 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com
[conda] torchvision 0.15.1 py310_cu118 https://aws-ml-conda-pre-prod-ec2.s3.us-west-2.amazonaws.com

Composer information

Composer version: 0.15.0
Composer commit hash: None
Host processor model name: AMD EPYC 7R32
Host processor core count: 4
Number of nodes: 1
Accelerator model name: NVIDIA A10G
Accelerators per node: 1
CUDA Device Count: 1

To reproduce

Steps to reproduce the behavior:

cd llm-foundry/scripts/data_prep/

# Convert C4 dataset to StreamingDataset format
python convert_dataset_hf.py \
  --dataset c4 --data_subset en \
  --out_root my-copy-c4 --splits train_small val_small \
  --concat_tokens 2048 --tokenizer EleutherAI/gpt-neox-20b --eos_text '<|endoftext|>' \
  --compression zstd

Expected behavior

The dataset should be prepared as described in the tutorial here: https://github.com/mosaicml/llm-foundry/blob/main/TUTORIAL.md#workflow-4-i-want-to-train-a-new-hf-model-from-scratch

Additional context

The error that I get is:

OSError: [Errno 24] Too many open files

RuntimeError: unable to open shared memory object in read-write mode: Too many open files (24)

tbarton16 commented 11 months ago

Can you try setting "ulimit -n unlimited" in your bash terminal?

tbarton16 commented 11 months ago

The reason you are seeing this is that each StreamingDataset instantiation creates a shared memory file descriptor, and those descriptors only get cleaned up when the Python interpreter exits.
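
A minimal Python sketch of the same workaround, applied from inside the data prep script rather than the shell (this is not part of llm-foundry, and the soft limit can only be raised as far as the OS hard limit without root):

import resource

def raise_open_file_limit() -> None:
    # Raise this process's soft open-file limit to the hard limit before
    # instantiating many StreamingDatasets, each of which holds a shared
    # memory file descriptor until interpreter exit.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    print(f'RLIMIT_NOFILE soft limit raised from {soft} to {hard}')

if __name__ == '__main__':
    raise_open_file_limit()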

dakinggg commented 10 months ago

Closing as stale. Please reopen if you are still encountering problems.

eldarkurtic commented 8 months ago

I would like to reopen this issue as I am seeing the same error when running the provided example for C4 (https://github.com/mosaicml/llm-foundry/tree/main/scripts/data_prep#:~:text=%23%20Convert%20C4%20dataset%20to%20StreamingDataset%20format%0Apython%20convert_dataset_hf.py%20%5C%0A%20%20%2D%2Ddataset%20c4%20%2D%2Ddata_subset%20en%20%5C%0A%20%20%2D%2Dout_root%20my%2Dcopy%2Dc4%20%2D%2Dsplits%20train_small%20val_small%20%5C%0A%20%20%2D%2Dconcat_tokens%202048%20%2D%2Dtokenizer%20EleutherAI/gpt%2Dneox%2D20b%20%2D%2Deos_text%20%27%3C%7Cendoftext%7C%3E%27%20%5C%0A%20%20%2D%2Dcompression%20zstd):

OSError: [Errno 24] Too many open files

and ulimit is set to unlimited (at least I think so):

eldar@gpuserver9:~$ ulimit
unlimited

dakinggg commented 8 months ago

@karan6181

karan6181 commented 5 months ago

Hey @eldarkurtic, sorry for my super late response. Do you still see the issue? One recommendation is to call streaming.base.util.clean_stale_shared_memory() at the beginning of the main script. The streaming library uses shared memory to coordinate state between processes on the same node. If a run is killed without the opportunity for a graceful cleanup, some artifacts might be left behind in shared memory on the node.

And if you still see the issue, can you share the full traceback?
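
As a minimal sketch of that recommendation (assuming the mosaicml-streaming package discussed in this thread is installed), the cleanup call goes at the very top of the data prep entry point, before any StreamingDataset is created:

from streaming.base.util import clean_stale_shared_memory

def main() -> None:
    # Remove shared memory artifacts left behind by previously killed runs
    # before any dataset conversion or streaming work starts.
    clean_stale_shared_memory()
    # ... rest of the conversion script ...

if __name__ == '__main__':
    main()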

eldarkurtic commented 5 months ago

Hey @karan6181, I don't have this problem anymore. The solution was to edit /etc/security/limits.conf and increase the open-files limit from the default of 1024 to 65535 (this requires sudo access).
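
For reference, raising the per-user open-files limit in /etc/security/limits.conf is typically done with a pair of entries like the following (an illustrative example, not the exact lines used above; the '*' wildcard and values depend on the system, and a fresh login session is needed for the change to take effect):

*    soft    nofile    65535
*    hard    nofile    65535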

karan6181 commented 5 months ago

Thanks, @eldarkurtic, for the confirmation. I am closing this issue.