Closed yudonglee closed 1 year ago
me too
me too
me too
me too. GPU: 1x A100 40G
cat training.log:
OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB (GPU 0; 39.56 GiB total capacity; 38.49 GiB already allocated; 96.56 MiB free; 38.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-13 08:39:16,676] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2525
[2023-04-13 08:39:16,677] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', './output'] exits with return code = 1
I may have resolved the error by reducing the batch size. I modified the step-1 script training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh, adding --per_device_train_batch_size 8 and --per_device_eval_batch_size 8:
deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b \
--gradient_accumulation_steps 2 --lora_dim 128 --zero_stage $ZERO_STAGE \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log
It now works without the CUDA out-of-memory error.
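As an alternative (or complement) to lowering the batch size, the OOM message itself points at the allocator's max_split_size_mb knob. Here is a minimal sketch of setting it from Python before torch initializes CUDA; the 128 MiB value is an illustrative assumption, not a tested recommendation, and exporting PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 in the launch script does the same thing:

```python
import os

# Must be set before the first CUDA allocation (i.e., before importing torch
# in the training script). Caps the size of blocks the caching allocator will
# split, which can reduce fragmentation at some cost in allocation flexibility.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value
```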
Solved. Check the log for more error info.
Here is the log from DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log:
[2023-04-13 22:52:19,362] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-13 22:52:19,416] [INFO] [runner.py:540:main] cmd = /home/ps/anaconda3/envs/pt2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir /data/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-04-13 22:52:21,351] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-13 22:52:21,351] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-13 22:52:21,351] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-13 22:52:21,351] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-13 22:52:21,351] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-13 22:52:24,221] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:997)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /repos/07/3d/073de108a2c59896a27d14fab4481eb23b2158f96739f10e132b57dd7e2f23fe/cf7d5c970d6ddbd3b03009b397c0422e147edd5c8020d47a8d2fac0b11a3b08d?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1681656749&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzA3LzNkLzA3M2RlMTA4YTJjNTk4OTZhMjdkMTRmYWI0NDgxZWIyM2IyMTU4Zjk2NzM5ZjEwZTEzMmI1N2RkN2UyZjIzZmUvY2Y3ZDVjOTcwZDZkZGJkM2IwMzAwOWIzOTdjMDQyMmUxNDdlZGQ1YzgwMjBkNDdhOGQyZmFjMGIxMWEzYjA4ZD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2ODE2NTY3NDl9fX1dfQ&Signature=bsLk8A7ZYuAz5RwoKScwoCbM7WUE4xLdKfthWaEY6UC46sSLpc0eFL93eW7CcbvI1jaMziP0od6dvaPic6hoZNuHAfRfMXA5O1WN-TLw~2ptXoFbzzfXfJhnEJevslINF4B2pg8xRoswAid730cDJY8z-pJiQD0cF3AmI2G666W2OXJ0yMnIATLqLUEjIBSUZgNJ67bV3LjaMdpbl3YRGd~yL9ROMWM4KvUvLRx~c3wIGRsCSbYkyXobtwjoLe8jLrI6G3L70m-cmqiynm38zjwhJBE1Bo2UwC~hMOJ8eANU7Opn-1WuiWhPprRbMj4-Z9G67cyfVhLiN1oVZ0dirg&Key-Pair-Id=KVTP0A1DKRTAX (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in
And I checked that my Python supports TLS 1.1 or above:
from urllib.request import urlopen
urlopen('https://www.howsmyssl.com/a/check').read()
and it outputs:
b'{"given_cipher_suites":["TLS_AES_256_GCM_SHA384","TLS_CHACHA20_POLY1305_SHA256","TLS_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384","TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384","TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256","TLS_DHE_RSA_WITH_AES_256_GCM_SHA384","TLS_DHE_RSA_WITH_AES_128_GCM_SHA256","TLS_DHE_RSA_WITH_AES_256_CBC_SHA256","TLS_DHE_RSA_WITH_AES_128_CBC_SHA256","TLS_EMPTY_RENEGOTIATION_INFO_SCSV"],"ephemeral_keys_supported":true,"session_ticket_supported":true,"tls_compression_supported":false,"unknown_cipher_suite_supported":false,"beast_vuln":false,"able_to_detect_n_minus_one_splitting":false,"insecure_cipher_suites":{},"tls_version":"TLS 1.3","rating":"Probably Okay"}'
So what's the problem?
@yudonglee This looks to be a problem with transformers and huggingface_hub. Can you try installing the latest master of transformers and updating huggingface_hub to the latest release?
pip install git+https://github.com/huggingface/transformers
pip install -U huggingface_hub
Please take a look at the requirements.txt for DS-chat: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/requirements.txt
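If the CDN connection stays flaky even after upgrading, the failure mode above (a MaxRetryError after an SSL EOF) can sometimes be worked around by retrying at the HTTP layer. This is a hedged sketch of a generic requests retry pattern, not part of DS-Chat or huggingface_hub; the retry counts are illustrative:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# A session that retries transient connection/SSL failures with exponential
# backoff instead of surfacing the first MaxRetryError to the caller.
session = requests.Session()
retries = Retry(total=5, backoff_factor=1.0,
                status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

# Usage (network call, so not executed here):
# resp = session.get("https://huggingface.co/facebook/opt-1.3b/resolve/main/config.json")
```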
Everyone else getting OOM on the 1.3b example: could you please share information about your environment with ds_report?
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/mnt/c/Users//Documents/AIGC/DeepSpeedExamples/applications/DeepSpeed-Chat/venv/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/mnt/c/Users//Documents/AIGC/DeepSpeedExamples/applications/DeepSpeed-Chat/venv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+a8f999e3, a8f999e3, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
Has anyone else gotten this:
RuntimeError: Error building extension 'fused_adam'
Not sure how to debug it; there is too little trace. Does anyone know how to enable more tracing for this issue?
I may have resolved the error by reducing the batch size.
@hikerell, would you mind sharing your hardware spec? I have a single A100 40GB and run the same script, but my training log only shows the process getting killed, with no OOM message at all, even with the batch size set to 1. I can only imagine there is a hardware difference between us, or I have different versions of some dependency. Training log:
[2023-04-14 16:42:18,341] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-14 16:42:18,350] [INFO] [runner.py:540:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --deepspeed --output_dir /workspace/sharing/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-04-14 16:42:21,769] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-14 16:42:21,769] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-14 16:42:21,769] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-14 16:42:21,769] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-14 16:42:21,769] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-14 16:42:26,266] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)okenizer_config.json: 100%|████████████| 685/685 [00:00<00:00, 738kB/s]
Downloading (…)lve/main/config.json: 100%|████████████| 653/653 [00:00<00:00, 616kB/s]
Downloading (…)olve/main/vocab.json: 100%|█████████| 899k/899k [00:00<00:00, 11.3MB/s]
Downloading (…)olve/main/merges.txt: 100%|█████████| 456k/456k [00:00<00:00, 5.96MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████| 441/441 [00:00<00:00, 484kB/s]
Downloading pytorch_model.bin: 100%|██████████████| 2.63G/2.63G [00:07<00:00, 351MB/s]
Downloading (…)neration_config.json: 100%|████████████| 137/137 [00:00<00:00, 132kB/s]
Downloading metadata: 100%|███████████████████████████| 926/926 [00:00<00:00, 879kB/s]
Downloading readme: 100%|█████████████████████████████| 530/530 [00:00<00:00, 541kB/s]
Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data: 100%|██████████████████████████| 68.4M/68.4M [00:00<00:00, 90.2MB/s]
Downloading data: 100%|██████████████████████████| 4.61M/4.61M [00:00<00:00, 28.4MB/s]
Downloading data files: 100%|███████████████████████████| 2/2 [00:01<00:00, 1.25it/s]
Extracting data files: 100%|██████████████████████████| 2/2 [00:00<00:00, 1875.39it/s]
Generating train split: 100%|████████| 76256/76256 [00:00<00:00, 307697.72 examples/s]
Generating test split: 0%| | 0/5103 [00:00<?, ? examples/s]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
100%|██████████████████████████████████████████████████| 2/2 [00:00<00:00, 578.17it/s]
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu116/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /opt/conda/bin/nvcc -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 28.788597345352173 seconds
[2023-04-14 16:46:28,025] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6112
[2023-04-14 16:46:28,025] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--deepspeed', '--output_dir', '/workspace/sharing/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -11
Is anyone else facing the same problem (no OOM message popped out, but it seems like an OOM issue)?
Has anyone else gotten this:
RuntimeError: Error building extension 'fused_adam'
Not sure how to debug it; there is too little trace. Does anyone know how to enable more tracing for this issue?
Could you please try installing deepspeed with the following and share the output?
DS_BUILD_FUSED_ADAM=1 pip install deepspeed==0.9.0
@ChaoChungWu-Johnson Could you launch the script and watch the memory usage at the same time (using watch -n 1 nvidia-smi) to confirm whether this is related to an OOM error?
Hi @mrwyattii! I find this error somewhat weird: I got SIGKILLed right after the "Distributed backend already initialized" message was shown.
During the whole run, from start to end, nvidia-smi showed a steady memory usage of about 1 GB on each GPU.
So I think it's not related to CPU or GPU OOM; it may be some other reason. Also, if I enable core dumps for Python, it writes an enormous core file (>50 GB), which directly filled up my system disk.
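On the core-dump side: to keep a crashing run from filling the disk, the core-file size limit can be set to zero for the launching process. This is standard POSIX resource handling, nothing DeepSpeed-specific; running ulimit -c 0 in the launching shell does the same thing:

```python
import resource

# Set both the soft and hard core-file size limits to 0 so that a crash
# (e.g., SIGSEGV) of the training process cannot write multi-GB core files.
# Child processes spawned by the DeepSpeed launcher inherit this limit.
resource.setrlimit(resource.RLIMIT_CORE, (0, 0))
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)  # both are now 0
```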
Is that message coming from torch? Can you share more about your environment (e.g., the output of ds_report)?
@mrwyattii , yes, I found my case is similar to this: https://github.com/microsoft/DeepSpeedExamples/issues/313
the message [INFO] [comm.py:580:init_distributed] Distributed backend already initialized showed, and then my process was SIGKILLed.
ds_report:
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+036c5d6d, 036c5d6d, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
Thank you very much!
I have the same bug and the same problem. How do I deal with it?
@yudonglee This looks to be a problem with transformers and huggingface_hub. Can you try installing the latest master of transformers and updating huggingface_hub to the latest release?
pip install git+https://github.com/huggingface/transformers
pip install -U huggingface_hub
Please take a look at the requirements.txt for DS-chat: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/requirements.txt
@mrwyattii Thanks! It works for my case.
Hi @mrwyattii ,
I also faced this error with an A6000 (48 GB VRAM). This is my log:
[2023-04-21 14:47:16,239] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 14:47:16,256] [INFO] [runner.py:540:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-04-21 14:47:18,501] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-21 14:47:18,501] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-21 14:47:18,501] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-21 14:47:18,502] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-21 14:47:18,502] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-21 14:47:21,917] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Downloading (…)okenizer_config.json: 100%|██████████| 685/685 [00:00<00:00, 648kB/s]
Downloading (…)lve/main/config.json: 100%|██████████| 653/653 [00:00<00:00, 690kB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 12.4MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 2.03MB/s]
Downloading (…)cial_tokens_map.json: 100%|██████████| 441/441 [00:00<00:00, 440kB/s]
Downloading pytorch_model.bin: 26%|██▌ | 682M/2.63G [00:07<00:26, 73.7MB/s]
Downloading pytorch_model.bin: 26%|██▋ | 692M/2.63G [00:07<00:25, 75.1MB/s]
Downloading pytorch_model.bin: 27%|██▋ | 703M/2.63G [00:08<00:25, 76.7MB/s]
Downloading pytorch_model.bin: 27%|██▋ | 713M/2.63G [00:08<00:24, 78.2MB/s]
Downloading pytorch_model.bin: 27%|██▋ | 724M/2.63G [00:08<00:24, 79.4MB/s]
Downloading pytorch_model.bin: 28%|██▊ | 734M/2.63G [00:08<00:23, 79.6MB/s]
Downloading pytorch_model.bin: 28%|██▊ | 744M/2.63G [00:08<00:23, 80.2MB/s]
Downloading pytorch_model.bin: 29%|██▊ | 755M/2.63G [00:08<00:23, 80.1MB/s]
Downloading pytorch_model.bin: 29%|██▉ | 765M/2.63G [00:08<00:22, 81.5MB/s]
Downloading pytorch_model.bin: 29%|██▉ | 776M/2.63G [00:08<00:22, 81.0MB/s]
Downloading pytorch_model.bin: 30%|██▉ | 786M/2.63G [00:09<00:22, 81.6MB/s]
Downloading pytorch_model.bin: 30%|███ | 797M/2.63G [00:09<00:22, 81.0MB/s]
Downloading pytorch_model.bin: 31%|███ | 807M/2.63G [00:09<00:22, 81.2MB/s]
Downloading pytorch_model.bin: 31%|███ | 818M/2.63G [00:09<00:22, 82.2MB/s]
Downloading pytorch_model.bin: 31%|███▏ | 828M/2.63G [00:09<00:22, 81.7MB/s]
Downloading pytorch_model.bin: 32%|███▏ | 839M/2.63G [00:09<00:21, 83.0MB/s]
Downloading pytorch_model.bin: 32%|███▏ | 849M/2.63G [00:09<00:21, 83.1MB/s]
Downloading pytorch_model.bin: 33%|███▎ | 860M/2.63G [00:09<00:21, 82.8MB/s]
Downloading pytorch_model.bin: 33%|███▎ | 870M/2.63G [00:10<00:20, 84.0MB/s]
Downloading pytorch_model.bin: 33%|███▎ | 881M/2.63G [00:10<00:20, 84.8MB/s]
Downloading pytorch_model.bin: 34%|███▍ | 891M/2.63G [00:10<00:20, 83.8MB/s]
Downloading pytorch_model.bin: 34%|███▍ | 902M/2.63G [00:10<00:20, 85.0MB/s]
Downloading pytorch_model.bin: 35%|███▍ | 912M/2.63G [00:10<00:20, 84.6MB/s]
Downloading pytorch_model.bin: 35%|███▌ | 923M/2.63G [00:10<00:20, 84.7MB/s]
Downloading pytorch_model.bin: 35%|███▌ | 933M/2.63G [00:10<00:19, 86.0MB/s]
Downloading pytorch_model.bin: 36%|███▌ | 944M/2.63G [00:10<00:19, 85.6MB/s]
Downloading pytorch_model.bin: 36%|███▋ | 954M/2.63G [00:11<00:19, 86.1MB/s]
Downloading pytorch_model.bin: 37%|███▋ | 965M/2.63G [00:11<00:19, 85.3MB/s]
Downloading pytorch_model.bin: 37%|███▋ | 975M/2.63G [00:11<00:19, 84.8MB/s]
Downloading pytorch_model.bin: 37%|███▋ | 986M/2.63G [00:11<00:19, 85.4MB/s]
Downloading pytorch_model.bin: 38%|███▊ | 996M/2.63G [00:11<00:19, 85.7MB/s]
Downloading pytorch_model.bin: 38%|███▊ | 1.01G/2.63G [00:11<00:18, 85.5MB/s]
Downloading pytorch_model.bin: 39%|███▊ | 1.02G/2.63G [00:11<00:18, 86.0MB/s]
Downloading pytorch_model.bin: 39%|███▉ | 1.03G/2.63G [00:11<00:18, 85.9MB/s]
Downloading pytorch_model.bin: 39%|███▉ | 1.04G/2.63G [00:12<00:18, 86.9MB/s]
Downloading pytorch_model.bin: 40%|███▉ | 1.05G/2.63G [00:12<00:18, 87.1MB/s]
Downloading pytorch_model.bin: 40%|████ | 1.06G/2.63G [00:12<00:18, 87.3MB/s]
Downloading pytorch_model.bin: 41%|████ | 1.07G/2.63G [00:12<00:17, 88.0MB/s]
Downloading pytorch_model.bin: 41%|████ | 1.08G/2.63G [00:12<00:17, 88.3MB/s]
Downloading pytorch_model.bin: 41%|████▏ | 1.09G/2.63G [00:12<00:17, 89.1MB/s]
Downloading pytorch_model.bin: 42%|████▏ | 1.10G/2.63G [00:12<00:17, 87.7MB/s]
Downloading pytorch_model.bin: 42%|████▏ | 1.11G/2.63G [00:12<00:17, 87.9MB/s]
Downloading pytorch_model.bin: 43%|████▎ | 1.12G/2.63G [00:12<00:17, 87.8MB/s]
Downloading pytorch_model.bin: 43%|████▎ | 1.13G/2.63G [00:13<00:17, 87.5MB/s]
Downloading pytorch_model.bin: 43%|████▎ | 1.14G/2.63G [00:13<00:16, 87.7MB/s]
Downloading pytorch_model.bin: 44%|████▍ | 1.15G/2.63G [00:13<00:16, 87.8MB/s]
Downloading pytorch_model.bin: 44%|████▍ | 1.16G/2.63G [00:13<00:16, 88.3MB/s]
Downloading pytorch_model.bin: 45%|████▍ | 1.17G/2.63G [00:13<00:16, 89.4MB/s]
Downloading pytorch_model.bin: 45%|████▌ | 1.18G/2.63G [00:13<00:15, 90.9MB/s]
Downloading pytorch_model.bin: 45%|████▌ | 1.20G/2.63G [00:13<00:15, 92.1MB/s]
Downloading pytorch_model.bin: 46%|████▌ | 1.21G/2.63G [00:13<00:15, 92.3MB/s]
Downloading pytorch_model.bin: 46%|████▌ | 1.22G/2.63G [00:13<00:15, 93.9MB/s]
Downloading pytorch_model.bin: 47%|████▋ | 1.23G/2.63G [00:14<00:14, 94.5MB/s]
Downloading pytorch_model.bin: 47%|████▋ | 1.24G/2.63G [00:14<00:14, 94.9MB/s]
Downloading pytorch_model.bin: 47%|████▋ | 1.25G/2.63G [00:14<00:14, 94.3MB/s]
Downloading pytorch_model.bin: 48%|████▊ | 1.26G/2.63G [00:14<00:14, 93.9MB/s]
Downloading pytorch_model.bin: 48%|████▊ | 1.27G/2.63G [00:14<00:14, 94.7MB/s]
Downloading pytorch_model.bin: 49%|████▊ | 1.28G/2.63G [00:14<00:14, 94.6MB/s]
Downloading pytorch_model.bin: 49%|████▉ | 1.29G/2.63G [00:14<00:14, 95.3MB/s]
Downloading pytorch_model.bin: 49%|████▉ | 1.30G/2.63G [00:14<00:14, 94.7MB/s]
Downloading pytorch_model.bin: 50%|████▉ | 1.31G/2.63G [00:14<00:13, 94.9MB/s]
Downloading pytorch_model.bin: 50%|█████ | 1.32G/2.63G [00:15<00:13, 95.9MB/s]
Downloading pytorch_model.bin: 51%|█████ | 1.33G/2.63G [00:15<00:13, 97.0MB/s]
Downloading pytorch_model.bin: 51%|█████ | 1.34G/2.63G [00:15<00:13, 97.1MB/s]
Downloading pytorch_model.bin: 51%|█████▏ | 1.35G/2.63G [00:15<00:13, 97.3MB/s]
Downloading pytorch_model.bin: 52%|█████▏ | 1.36G/2.63G [00:15<00:13, 96.4MB/s]
Downloading pytorch_model.bin: 52%|█████▏ | 1.37G/2.63G [00:15<00:12, 97.3MB/s]
Downloading pytorch_model.bin: 53%|█████▎ | 1.38G/2.63G [00:15<00:12, 98.1MB/s]
Downloading pytorch_model.bin: 53%|█████▎ | 1.39G/2.63G [00:15<00:12, 99.7MB/s]
Downloading pytorch_model.bin: 53%|█████▎ | 1.41G/2.63G [00:15<00:12, 101MB/s]
Downloading pytorch_model.bin: 54%|█████▍ | 1.42G/2.63G [00:16<00:12, 101MB/s]
Downloading pytorch_model.bin: 54%|█████▍ | 1.43G/2.63G [00:16<00:12, 100MB/s]
Downloading pytorch_model.bin: 55%|█████▍ | 1.44G/2.63G [00:16<00:11, 100MB/s]
Downloading pytorch_model.bin: 55%|█████▍ | 1.45G/2.63G [00:16<00:11, 99.8MB/s]
Downloading pytorch_model.bin: 55%|█████▌ | 1.46G/2.63G [00:16<00:11, 99.9MB/s]
Downloading pytorch_model.bin: 56%|█████▌ | 1.47G/2.63G [00:16<00:11, 101MB/s]
Downloading pytorch_model.bin: 56%|█████▌ | 1.48G/2.63G [00:16<00:11, 102MB/s]
Downloading pytorch_model.bin: 57%|█████▋ | 1.49G/2.63G [00:16<00:11, 102MB/s]
Downloading pytorch_model.bin: 57%|█████▋ | 1.50G/2.63G [00:16<00:11, 103MB/s]
Downloading pytorch_model.bin: 57%|█████▋ | 1.51G/2.63G [00:16<00:10, 103MB/s]
Downloading pytorch_model.bin: 58%|█████▊ | 1.52G/2.63G [00:17<00:10, 103MB/s]
Downloading pytorch_model.bin: 58%|█████▊ | 1.53G/2.63G [00:17<00:10, 102MB/s]
Downloading pytorch_model.bin: 59%|█████▉ | 1.55G/2.63G [00:17<00:10, 103MB/s]
Downloading pytorch_model.bin: 59%|█████▉ | 1.56G/2.63G [00:17<00:10, 103MB/s]
Downloading pytorch_model.bin: 60%|██████ | 1.58G/2.63G [00:17<00:10, 104MB/s]
Downloading pytorch_model.bin: 61%|██████ | 1.60G/2.63G [00:17<00:09, 105MB/s]
Downloading pytorch_model.bin: 62%|██████▏ | 1.63G/2.63G [00:18<00:09, 106MB/s]
Downloading pytorch_model.bin: 63%|██████▎ | 1.65G/2.63G [00:18<00:09, 107MB/s]
Downloading pytorch_model.bin: 63%|██████▎ | 1.67G/2.63G [00:18<00:08, 108MB/s]
Downloading pytorch_model.bin: 64%|██████▍ | 1.69G/2.63G [00:18<00:08, 108MB/s]
Downloading pytorch_model.bin: 65%|██████▍ | 1.71G/2.63G [00:18<00:08, 108MB/s]
Downloading pytorch_model.bin: 66%|██████▌ | 1.73G/2.63G [00:19<00:08, 109MB/s]
Downloading pytorch_model.bin: 67%|██████▋ | 1.75G/2.63G [00:19<00:08, 107MB/s]
Downloading pytorch_model.bin: 67%|██████▋ | 1.77G/2.63G [00:19<00:08, 96.1MB/s]
Downloading pytorch_model.bin: 68%|██████▊ | 1.78G/2.63G [00:19<00:09, 92.9MB/s]
Downloading pytorch_model.bin: 68%|██████▊ | 1.79G/2.63G [00:19<00:09, 91.0MB/s]
Downloading pytorch_model.bin: 69%|██████▊ | 1.80G/2.63G [00:19<00:09, 88.8MB/s]
Downloading pytorch_model.bin: 69%|██████▉ | 1.81G/2.63G [00:19<00:09, 88.4MB/s]
Downloading pytorch_model.bin: 69%|██████▉ | 1.82G/2.63G [00:20<00:09, 89.0MB/s]
Downloading pytorch_model.bin: 70%|██████▉ | 1.84G/2.63G [00:20<00:08, 88.7MB/s]
Downloading pytorch_model.bin: 70%|███████ | 1.85G/2.63G [00:20<00:08, 88.8MB/s]
Downloading pytorch_model.bin: 71%|███████ | 1.86G/2.63G [00:20<00:08, 88.0MB/s]
Downloading pytorch_model.bin: 71%|███████ | 1.87G/2.63G [00:20<00:08, 88.1MB/s]
Downloading pytorch_model.bin: 71%|███████▏ | 1.88G/2.63G [00:20<00:08, 89.4MB/s]
Downloading pytorch_model.bin: 72%|███████▏ | 1.89G/2.63G [00:20<00:08, 90.4MB/s]
Downloading pytorch_model.bin: 72%|███████▏ | 1.90G/2.63G [00:20<00:08, 91.1MB/s]
Downloading pytorch_model.bin: 73%|███████▎ | 1.91G/2.63G [00:21<00:07, 92.9MB/s]
Downloading pytorch_model.bin: 73%|███████▎ | 1.92G/2.63G [00:21<00:07, 94.4MB/s]
Downloading pytorch_model.bin: 73%|███████▎ | 1.93G/2.63G [00:21<00:07, 95.2MB/s]
Downloading pytorch_model.bin: 74%|███████▎ | 1.94G/2.63G [00:21<00:07, 96.4MB/s]
Downloading pytorch_model.bin: 74%|███████▍ | 1.95G/2.63G [00:21<00:06, 98.2MB/s]
Downloading pytorch_model.bin: 75%|███████▍ | 1.96G/2.63G [00:21<00:06, 97.1MB/s]
Downloading pytorch_model.bin: 75%|███████▍ | 1.97G/2.63G [00:21<00:06, 98.0MB/s]
Downloading pytorch_model.bin: 75%|███████▌ | 1.98G/2.63G [00:21<00:06, 98.7MB/s]
Downloading pytorch_model.bin: 76%|███████▌ | 1.99G/2.63G [00:21<00:06, 98.8MB/s]
Downloading pytorch_model.bin: 76%|███████▌ | 2.00G/2.63G [00:21<00:06, 98.9MB/s]
Downloading pytorch_model.bin: 77%|███████▋ | 2.01G/2.63G [00:22<00:06, 99.0MB/s]
Downloading pytorch_model.bin: 77%|███████▋ | 2.02G/2.63G [00:22<00:06, 99.3MB/s]
Downloading pytorch_model.bin: 77%|███████▋ | 2.03G/2.63G [00:22<00:05, 99.7MB/s]
Downloading pytorch_model.bin: 78%|███████▊ | 2.06G/2.63G [00:22<00:05, 101MB/s]
Downloading pytorch_model.bin: 78%|███████▊ | 2.07G/2.63G [00:22<00:05, 101MB/s]
Downloading pytorch_model.bin: 79%|███████▉ | 2.08G/2.63G [00:22<00:05, 101MB/s]
Downloading pytorch_model.bin: 80%|███████▉ | 2.10G/2.63G [00:22<00:05, 103MB/s]
Downloading pytorch_model.bin: 80%|████████ | 2.11G/2.63G [00:23<00:05, 103MB/s]
Downloading pytorch_model.bin: 80%|████████ | 2.12G/2.63G [00:23<00:04, 103MB/s]
Downloading pytorch_model.bin: 81%|████████ | 2.13G/2.63G [00:23<00:04, 103MB/s]
Downloading pytorch_model.bin: 81%|████████▏ | 2.14G/2.63G [00:23<00:04, 103MB/s]
Downloading pytorch_model.bin: 82%|████████▏ | 2.15G/2.63G [00:23<00:04, 103MB/s]
Downloading pytorch_model.bin: 82%|████████▏ | 2.17G/2.63G [00:23<00:04, 105MB/s]
Downloading pytorch_model.bin: 83%|████████▎ | 2.19G/2.63G [00:23<00:04, 106MB/s]
Downloading pytorch_model.bin: 84%|████████▍ | 2.21G/2.63G [00:24<00:03, 107MB/s]
Downloading pytorch_model.bin: 85%|████████▍ | 2.23G/2.63G [00:24<00:03, 107MB/s]
Downloading pytorch_model.bin: 86%|████████▌ | 2.25G/2.63G [00:24<00:03, 107MB/s]
Downloading pytorch_model.bin: 86%|████████▋ | 2.28G/2.63G [00:24<00:03, 108MB/s]
Downloading pytorch_model.bin: 87%|████████▋ | 2.30G/2.63G [00:24<00:03, 108MB/s]
Downloading pytorch_model.bin: 88%|████████▊ | 2.32G/2.63G [00:24<00:02, 109MB/s]
Downloading pytorch_model.bin: 89%|████████▉ | 2.34G/2.63G [00:25<00:02, 109MB/s]
Downloading pytorch_model.bin: 90%|████████▉ | 2.36G/2.63G [00:25<00:02, 109MB/s]
Downloading pytorch_model.bin: 90%|█████████ | 2.38G/2.63G [00:25<00:02, 110MB/s]
Downloading pytorch_model.bin: 91%|█████████ | 2.40G/2.63G [00:25<00:02, 111MB/s]
Downloading pytorch_model.bin: 92%|█████████▏| 2.42G/2.63G [00:25<00:01, 110MB/s]
Downloading pytorch_model.bin: 93%|█████████▎| 2.44G/2.63G [00:26<00:01, 110MB/s]
Downloading pytorch_model.bin: 94%|█████████▎| 2.46G/2.63G [00:26<00:01, 110MB/s]
Downloading pytorch_model.bin: 94%|█████████▍| 2.49G/2.63G [00:26<00:01, 111MB/s]
Downloading pytorch_model.bin: 95%|█████████▌| 2.51G/2.63G [00:26<00:01, 111MB/s]
Downloading pytorch_model.bin: 96%|█████████▌| 2.53G/2.63G [00:26<00:00, 111MB/s]
Downloading pytorch_model.bin: 97%|█████████▋| 2.55G/2.63G [00:27<00:00, 111MB/s]
Downloading pytorch_model.bin: 98%|█████████▊| 2.57G/2.63G [00:27<00:00, 110MB/s]
Downloading pytorch_model.bin: 98%|█████████▊| 2.59G/2.63G [00:27<00:00, 111MB/s]
Downloading pytorch_model.bin: 99%|█████████▉| 2.61G/2.63G [00:27<00:00, 111MB/s]
Downloading pytorch_model.bin: 100%|██████████| 2.63G/2.63G [00:27<00:00, 111MB/s]
Downloading pytorch_model.bin: 100%|██████████| 2.63G/2.63G [00:27<00:00, 94.6MB/s]
Downloading (…)neration_config.json: 0%| | 0.00/137 [00:00<?, ?B/s]
Downloading (…)neration_config.json: 100%|██████████| 137/137 [00:00<00:00, 74.6kB/s]
Downloading metadata: 0%| | 0.00/926 [00:00<?, ?B/s]
Downloading metadata: 100%|██████████| 926/926 [00:00<00:00, 885kB/s]
Downloading readme: 0%| | 0.00/530 [00:00<?, ?B/s]
Downloading readme: 100%|██████████| 530/530 [00:00<00:00, 289kB/s]
Downloading and preparing dataset None/None to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec...
Downloading data: 100%|██████████| 68.4M/68.4M [00:02<00:00, 32.8MB/s]
Downloading data: 100%|██████████| 4.61M/4.61M [00:00<00:00, 14.9MB/s]
Downloading data files: 100%|██████████| 2/2 [00:03<00:00, 1.69s/it]
Extracting data files: 100%|██████████| 2/2 [00:00<00:00, 1633.61it/s]
Generating train split: 52%|█████▏ | 40000/76256 [00:00<00:00, 354096.87 examples/s]
Generating test split: 0%| | 0/5103 [00:00<?, ? examples/s]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
100%|██████████| 2/2 [00:00<00:00, 488.16it/s]
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu116/fused_adam...
Traceback (most recent call last):
File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 339, in <module>
main()
File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 271, in main
optimizer = AdamOptimizer(optimizer_grouped_parameters,
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
return self.jit_load(verbose)
File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
op_module = load(name=self.name,
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
return _jit_compile(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
_write_ninja_file_and_build_library(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1597, in _write_ninja_file_and_build_library
get_compiler_abi_compatibility_and_version(compiler)
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 336, in get_compiler_abi_compatibility_and_version
if not check_compiler_ok_for_platform(compiler):
File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 290, in check_compiler_ok_for_platform
which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
File "/opt/conda/lib/python3.10/subprocess.py", line 421, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
[2023-04-21 14:49:09,614] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1337
[2023-04-21 14:49:09,614] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1
Can anyone help me? Thank you very much.
@huynhthanh98 The error is subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1. and it's originating from torch, not deepspeed. Torch is not able to find a C++ compiler in your environment to compile the kernels. I would recommend trying to re-install torch.
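As a quick sanity check before reinstalling anything, you can reproduce the lookup torch performs (a minimal sketch, not torch's exact code; the build-essential suggestion assumes a Debian/Ubuntu system):

```python
# Reproduce the check torch's cpp_extension helper performs: it looks for
# a C++ compiler on PATH (effectively `which c++`) before JIT-compiling ops.
import shutil

compiler = shutil.which("c++") or shutil.which("g++")
if compiler is None:
    print("No C++ compiler found on PATH; on Debian/Ubuntu try: "
          "sudo apt-get install build-essential")
else:
    print(f"Compiler found: {compiler}")
```

If this prints the "not found" message, installing a compiler toolchain is likely the actual fix, independent of which torch wheel is installed.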
Hi,
thank you very much, it worked.
I have another question: do you know how I can train the model if I have 2 GPUs?
@mrwyattii any idea about the "exits with return code = -11" problem?
here is my 'ds_report' :
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
[WARNING] using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+036c5d6d, 036c5d6d, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
hmm, it seems to work now with the latest pull from the repo! (I still don't know why)
After running:
python3 train.py --step 1 --deployment-type single_node
I got:
Traceback (most recent call last):
File "/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 210, in <module>
main(args)
File "/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 195, in main
launch_cmd(args, step_num, cmd)
File "/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 175, in launch_cmd
raise RuntimeError('\n\n'.join((
RuntimeError: Step 1 exited with non-zero status 247
I am attempting to use an RTX 3070 with 8 GB of memory, so I have adjusted the batch size:
deepspeed main.py \
--data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
--data_split 2,4,4 \
--model_name_or_path facebook/opt-1.3b \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 2 \
--max_seq_len 512 \
--learning_rate 9.65e-6 \
--weight_decay 0. \
--num_train_epochs 16 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
The error from the log:
[2023-04-25 15:31:18,130] [ERROR] [launch.py:434:sigkill_handler] ['/home/ltunix2/anaconda3/envs/Test01/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '2', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -9
So I guess the RTX 3070 simply doesn't have enough memory to train the 1.3B model. I don't know if the 125M and 350M models would work.
I have another question: do you know how I can train the model if I have 2 GPUs?
@huynhthanh98 You can run with 2 GPUs using the --deployment-type single_node option. This will use all available GPUs on your local system.
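To control which GPUs are used, the standard CUDA_VISIBLE_DEVICES variable can be set before launching (a minimal sketch; the train.py invocation is the one already used in this thread):

```python
# --deployment-type single_node picks up every GPU visible to the process,
# so CUDA_VISIBLE_DEVICES selects which GPUs the launcher will see.
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # expose exactly two GPUs
print("training will see GPUs:", os.environ["CUDA_VISIBLE_DEVICES"])

# Then launch as usual; the launcher inherits this environment, e.g.:
#   python3 train.py --step 1 --deployment-type single_node
```

The same effect can be had from the shell with `export CUDA_VISIBLE_DEVICES=0,1` before running the command.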
So I guess the RTX 3070 just lacks too much memory to train the 1.3B model. I don't know if the 125M and 350M models would work.
@Nuclear-Beeper Yes, the 3070 will have too little memory to train the 1.3B model. I think the lower limit on memory is ~12-13 GB with a batch size of 1. You could likely fit those smaller models in 8 GB of memory, but I think the quality of the trained model will be noticeably worse for those smaller models.
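As a rough back-of-envelope sketch of why 8 GB is not enough (my own estimate, not DeepSpeed's exact accounting): with full fine-tuning under mixed-precision Adam, each trainable parameter carries fp16 weights and gradients plus fp32 master weights and two fp32 moment buffers, which alone is far above 8 GB at 1.3B parameters. LoRA and ZeRO offload are what bring the footprint down toward the ~12-13 GB figure.

```python
# Back-of-envelope memory estimate for full fine-tuning of a 1.3B-parameter
# model with mixed-precision Adam (activations and temp buffers excluded).
params = 1.3e9
bytes_per_param = (
    2      # fp16 weights
    + 2    # fp16 gradients
    + 4    # fp32 master weights
    + 4    # fp32 Adam first moment (m)
    + 4    # fp32 Adam second moment (v)
)
total_gb = params * bytes_per_param / 1e9
print(f"~{total_gb:.1f} GB before activations")  # ~20.8 GB
```

With LoRA (as in the step-1 script, --lora_dim 128) only the adapter parameters need optimizer states, which is why the practical floor is much lower than this full fine-tuning estimate.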
Anyone got this:
RuntimeError: Error building extension 'fused_adam'
Not sure how to debug it; there is too little trace. Does anyone know how to enable more verbose output for this issue?
Same error. How should I deal with it?
@nieallen Could you try pre-compiling fused_adam and share the result of ds_report?
DS_BUILD_FUSED_ADAM=1 pip install deepspeed
After finishing the install successfully, I got this error when running: python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1
---=== Running Step 1 ===---
Traceback (most recent call last):
File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 218, in <module>
main(args)
File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 203, in main
launch_cmd(cmd, step_num)
File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 192, in launch_cmd
raise RuntimeError(
RuntimeError: Step 1 exited with non-zero status 1
How can I fix it, please?