microsoft / DeepSpeedExamples

Example models using DeepSpeed

RuntimeError: Step 1 exited with non-zero status 1 #279

Closed: yudonglee closed this issue 1 year ago

yudonglee commented 1 year ago

After finishing the install successfully, I got this error when running this command:

python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1

---=== Running Step 1 ===---
Traceback (most recent call last):
  File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 218, in <module>
    main(args)
  File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 203, in main
    launch_cmd(cmd, step_num)
  File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 192, in launch_cmd
    raise RuntimeError(
RuntimeError: Step 1 exited with non-zero status 1

How can I fix this, please?

guoswang commented 1 year ago

me too

aopolin-lv commented 1 year ago

me too

loveandbeepbrother commented 1 year ago

me too

hikerell commented 1 year ago

me too. GPU: 1x A100 40G

cat training.log:

OutOfMemoryError: CUDA out of memory. Tried to allocate 786.00 MiB (GPU 0; 39.56 GiB total capacity; 38.49 GiB already allocated; 96.56 MiB free; 38.78 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2023-04-13 08:39:16,676] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 2525
[2023-04-13 08:39:16,677] [ERROR] [launch.py:434:sigkill_handler] ['/usr/bin/python3', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', './output'] exits with return code = 1
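As an aside, the allocator hint in that message can be tried directly. This is only a sketch of what PyTorch itself suggests; the 128 MiB split size here is an arbitrary starting value, not a tuned one:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --num-gpus 1

This only mitigates fragmentation; if the model plus optimizer state genuinely exceeds 40 GiB, a smaller batch size (see the next comment) is the real fix.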

hikerell commented 1 year ago

I may have resolved the error by reducing the batch size.

I modified the step-1 script training/step1_supervised_finetuning/training_scripts/single_gpu/run_1.3b.sh to add --per_device_train_batch_size 8 and --per_device_eval_batch_size 8:

deepspeed --num_gpus 1 main.py --model_name_or_path facebook/opt-1.3b \
   --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage $ZERO_STAGE \
   --per_device_train_batch_size 8 \
   --per_device_eval_batch_size 8 \
   --deepspeed --output_dir $OUTPUT &> $OUTPUT/training.log

It now runs without the CUDA out-of-memory error.
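If that still OOMs on a smaller card, a further (untested here) variant is to cut the per-device batch again and raise gradient accumulation so the effective batch size stays roughly the same:

   --per_device_train_batch_size 4 \
   --per_device_eval_batch_size 4 \
   --gradient_accumulation_steps 4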

SilentMoebuta commented 1 year ago

Solved. Check the log for more error info.

yudonglee commented 1 year ago

Here is the log from DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b/training.log:

[2023-04-13 22:52:19,362] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-13 22:52:19,416] [INFO] [runner.py:540:main] cmd = /home/ps/anaconda3/envs/pt2/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir /data/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-04-13 22:52:21,351] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-13 22:52:21,351] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-13 22:52:21,351] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-13 22:52:21,351] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-13 22:52:21,351] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-13 22:52:24,221] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
Traceback (most recent call last):
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connection.py", line 419, in connect
    self.sock = ssl_wrap_socket(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 449, in ssl_wrap_socket
    ssl_sock = _ssl_wrap_socket_impl(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/ssl.py", line 513, in wrap_socket
    return self.sslsocket_class._create(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/ssl.py", line 1071, in _create
    self.do_handshake()
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/ssl.py", line 1342, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLEOFError: EOF occurred in violation of protocol (_ssl.c:997)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/requests/adapters.py", line 489, in send
    resp = conn.urlopen(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/urllib3/util/retry.py", line 592, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /repos/07/3d/073de108a2c59896a27d14fab4481eb23b2158f96739f10e132b57dd7e2f23fe/cf7d5c970d6ddbd3b03009b397c0422e147edd5c8020d47a8d2fac0b11a3b08d?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1681656749&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzA3LzNkLzA3M2RlMTA4YTJjNTk4OTZhMjdkMTRmYWI0NDgxZWIyM2IyMTU4Zjk2NzM5ZjEwZTEzMmI1N2RkN2UyZjIzZmUvY2Y3ZDVjOTcwZDZkZGJkM2IwMzAwOWIzOTdjMDQyMmUxNDdlZGQ1YzgwMjBkNDdhOGQyZmFjMGIxMWEzYjA4ZD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2ODE2NTY3NDl9fX1dfQ&Signature=bsLk8A7ZYuAz5RwoKScwoCbM7WUE4xLdKfthWaEY6UC46sSLpc0eFL93eW7CcbvI1jaMziP0od6dvaPic6hoZNuHAfRfMXA5O1WN-TLw~2ptXoFbzzfXfJhnEJevslINF4B2pg8xRoswAid730cDJY8z-pJiQD0cF3AmI2G666W2OXJ0yMnIATLqLUEjIBSUZgNJ67bV3LjaMdpbl3YRGd~yL9ROMWM4KvUvLRx~c3wIGRsCSbYkyXobtwjoLe8jLrI6G3L70m-cmqiynm38zjwhJBE1Bo2UwC~hMOJ8eANU7Opn-1WuiWhPprRbMj4-Z9G67cyfVhLiN1oVZ0dirg&Key-Pair-Id=KVTP0A1DKRTAX (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in <module>
    main()
  File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 202, in main
    model = create_hf_model(AutoModelForCausalLM, args.model_name_or_path,
  File "/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/utils/model/model_utils.py", line 35, in create_hf_model
    model = model_class.from_pretrained(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2450, in from_pretrained
    resolved_archive_file = cached_file(pretrained_model_name_or_path, filename, **cached_file_kwargs)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/transformers/utils/hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/huggingface_hub/utils/_validators.py", line 120, in _inner_fn
    return fn(*args, **kwargs)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 1332, in hf_hub_download
    http_get(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 505, in http_get
    r = _request_wrapper(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/huggingface_hub/file_download.py", line 442, in _request_wrapper
    return http_backoff(
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/huggingface_hub/utils/_http.py", line 129, in http_backoff
    response = requests.request(method=method, url=url, **kwargs)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/home/ps/anaconda3/envs/pt2/lib/python3.10/site-packages/requests/adapters.py", line 563, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='cdn-lfs.huggingface.co', port=443): Max retries exceeded with url: /repos/07/3d/073de108a2c59896a27d14fab4481eb23b2158f96739f10e132b57dd7e2f23fe/cf7d5c970d6ddbd3b03009b397c0422e147edd5c8020d47a8d2fac0b11a3b08d?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model.bin%3B+filename%3D%22pytorch_model.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1681656749&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZG4tbGZzLmh1Z2dpbmdmYWNlLmNvL3JlcG9zLzA3LzNkLzA3M2RlMTA4YTJjNTk4OTZhMjdkMTRmYWI0NDgxZWIyM2IyMTU4Zjk2NzM5ZjEwZTEzMmI1N2RkN2UyZjIzZmUvY2Y3ZDVjOTcwZDZkZGJkM2IwMzAwOWIzOTdjMDQyMmUxNDdlZGQ1YzgwMjBkNDdhOGQyZmFjMGIxMWEzYjA4ZD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2ODE2NTY3NDl9fX1dfQ&Signature=bsLk8A7ZYuAz5RwoKScwoCbM7WUE4xLdKfthWaEY6UC46sSLpc0eFL93eW7CcbvI1jaMziP0od6dvaPic6hoZNuHAfRfMXA5O1WN-TLw~2ptXoFbzzfXfJhnEJevslINF4B2pg8xRoswAid730cDJY8z-pJiQD0cF3AmI2G666W2OXJ0yMnIATLqLUEjIBSUZgNJ67bV3LjaMdpbl3YRGd~yL9ROMWM4KvUvLRx~c3wIGRsCSbYkyXobtwjoLe8jLrI6G3L70m-cmqiynm38zjwhJBE1Bo2UwC~hMOJ8eANU7Opn-1WuiWhPprRbMj4-Z9G67cyfVhLiN1oVZ0dirg&Key-Pair-Id=KVTP0A1DKRTAX (Caused by SSLError(SSLEOFError(8, 'EOF occurred in violation of protocol (_ssl.c:997)')))
[2023-04-13 22:52:31,364] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 3448
[2023-04-13 22:52:31,365] [ERROR] [launch.py:434:sigkill_handler] ['/home/ps/anaconda3/envs/pt2/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/data/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1

And I checked that my Python supports TLS 1.1 or above:

from urllib.request import urlopen
urlopen('https://www.howsmyssl.com/a/check').read()

and it output:

b'{"given_cipher_suites":["TLS_AES_256_GCM_SHA384","TLS_CHACHA20_POLY1305_SHA256","TLS_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384","TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256","TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256","TLS_ECDHE_ECDSA_WITH_AES_256_CBC_SHA384","TLS_ECDHE_RSA_WITH_AES_256_CBC_SHA384","TLS_ECDHE_ECDSA_WITH_AES_128_CBC_SHA256","TLS_ECDHE_RSA_WITH_AES_128_CBC_SHA256","TLS_DHE_RSA_WITH_AES_256_GCM_SHA384","TLS_DHE_RSA_WITH_AES_128_GCM_SHA256","TLS_DHE_RSA_WITH_AES_256_CBC_SHA256","TLS_DHE_RSA_WITH_AES_128_CBC_SHA256","TLS_EMPTY_RENEGOTIATION_INFO_SCSV"],"ephemeral_keys_supported":true,"session_ticket_supported":true,"tls_compression_supported":false,"unknown_cipher_suite_supported":false,"beast_vuln":false,"able_to_detect_n_minus_one_splitting":false,"insecure_cipher_suites":{},"tls_version":"TLS 1.3","rating":"Probably Okay"}'

So what's the problem?
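One way to narrow this down: the failure is against cdn-lfs.huggingface.co specifically, so a minimal reachability check outside of transformers (a hypothetical diagnostic using only requests; not part of DS-Chat) would show whether a proxy or firewall is cutting the TLS handshake to that host:

import requests

try:
    # HEAD request against the CDN host shown failing in the traceback.
    r = requests.head("https://cdn-lfs.huggingface.co", timeout=10)
    print("CDN reachable, HTTP status:", r.status_code)
except requests.exceptions.SSLError as exc:
    # Reproducing the SSLEOFError here would point at the network path,
    # not at transformers or huggingface_hub.
    print("SSL failure reproduced outside transformers:", exc)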

mrwyattii commented 1 year ago

@yudonglee This looks to be a problem with transformers and huggingface_hub. Can you try installing the latest master of transformers and updating huggingface_hub to the latest release?

pip install git+https://github.com/huggingface/transformers
pip install -U huggingface_hub

Please take a look at the requirements.txt for DS-chat: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/requirements.txt
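For reference, installing the pinned dependencies from that file should look roughly like this (path taken from the link above):

cd DeepSpeedExamples/applications/DeepSpeed-Chat
pip install -r requirements.txt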

mrwyattii commented 1 year ago

Everyone else getting OOM on the 1.3b example: Could you please share information about your environment with ds_report?

kasimok commented 1 year ago

Everyone else getting OOM on the 1.3b example: Could you please share information about your environment with ds_report?


--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/mnt/c/Users//Documents/AIGC/DeepSpeedExamples/applications/DeepSpeed-Chat/venv/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/mnt/c/Users//Documents/AIGC/DeepSpeedExamples/applications/DeepSpeed-Chat/venv/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+a8f999e3, a8f999e3, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 12.0
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

4t8dd commented 1 year ago

Has anyone else gotten this:

RuntimeError: Error building extension 'fused_adam'

I'm not sure how to debug it; there's too little trace. Does anyone know how to enable more tracing for this issue?

ChaoChungWu-Johnson commented 1 year ago

I may have resolved the error by reducing the batch size.

@hikerell, would you mind sharing your hardware spec? I have a single A100 40GB and ran the same script, but my training log only shows the process getting killed, with no OOM message at all, even with the batch size set to 1. I can only imagine that there is a hardware difference between us, or that we have different versions of the dependencies. Training log:

[2023-04-14 16:42:18,341] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-14 16:42:18,350] [INFO] [runner.py:540:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --deepspeed --output_dir /workspace/sharing/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-04-14 16:42:21,769] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-14 16:42:21,769] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-14 16:42:21,769] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-14 16:42:21,769] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-14 16:42:21,769] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-14 16:42:26,266] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

[... download progress omitted: tokenizer_config.json, config.json, vocab.json, merges.txt, special_tokens_map.json, pytorch_model.bin (2.63G), and generation_config.json all download successfully ...]
[... dataset download and train/test split-generation progress omitted ...]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu116/fused_adam...
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py310_cu116/fused_adam/build.ninja...
Building extension module fused_adam...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
[1/3] /opt/conda/bin/nvcc  -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_80,code=compute_80 -gencode=arch=compute_80,code=sm_80 --compiler-options '-fPIC' -O3 -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -lineinfo --use_fast_math -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_80,code=compute_80 -std=c++14 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/multi_tensor_adam.cu -o multi_tensor_adam.cuda.o 
[2/3] c++ -MMD -MF fused_adam_frontend.o.d -DTORCH_EXTENSION_NAME=fused_adam -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/includes -I/opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam -isystem /opt/conda/lib/python3.10/site-packages/torch/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /opt/conda/lib/python3.10/site-packages/torch/include/TH -isystem /opt/conda/lib/python3.10/site-packages/torch/include/THC -isystem /opt/conda/include -isystem /opt/conda/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -fPIC -std=c++14 -O3 -std=c++14 -g -Wno-reorder -DVERSION_GE_1_1 -DVERSION_GE_1_3 -DVERSION_GE_1_5 -c /opt/conda/lib/python3.10/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp -o fused_adam_frontend.o 
[3/3] c++ fused_adam_frontend.o multi_tensor_adam.cuda.o -shared -L/opt/conda/lib/python3.10/site-packages/torch/lib -lc10 -lc10_cuda -ltorch_cpu -ltorch_cuda_cu -ltorch_cuda_cpp -ltorch -ltorch_python -L/opt/conda/lib64 -lcudart -o fused_adam.so
Loading extension module fused_adam...
Time to load fused_adam op: 28.788597345352173 seconds
[2023-04-14 16:46:28,025] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 6112
[2023-04-14 16:46:28,025] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--per_device_train_batch_size', '1', '--per_device_eval_batch_size', '1', '--deepspeed', '--output_dir', '/workspace/sharing/DeepSpeed/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -11

Is anyone else facing the same problem (no OOM message popped up, but it seems like an OOM issue)?

mrwyattii commented 1 year ago

Has anyone else gotten this:

RuntimeError: Error building extension 'fused_adam'

I'm not sure how to debug it; there's too little trace. Does anyone know how to enable more tracing for this issue?

Could you please try installing deepspeed with the following and share the output?

DS_BUILD_FUSED_ADAM=1 pip install deepspeed==0.9.0
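Passing pip's -v flag as well will surface the full ninja/nvcc output when the build fails, which is usually where the actual compiler error is hiding:

DS_BUILD_FUSED_ADAM=1 pip install -v deepspeed==0.9.0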

mrwyattii commented 1 year ago

@ChaoChungWu-Johnson Could you launch the script and watch the memory usage at the same time (using watch -n 1 nvidia-smi) to confirm if this is related to an OOM error?
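If watching interactively is inconvenient, nvidia-smi can also log memory usage once per second to a file for later inspection:

nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 1 > gpu_mem.log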

ChaoChungWu-Johnson commented 1 year ago

hi @mrwyattii! I found this error somewhat weird. I got SIGKILLed right after the Distributed backend already initialized message was shown, and from start to end nvidia-smi showed constant usage of about 1 GB on each GPU. So I don't think it's related to CPU or GPU OOM; it may be something else. By the way, if I enable Python core dumps (file writing), it dumps an enormous core file (>50 GB), which filled up my system disk.
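(Side note on the disk-filling core dumps: they can be disabled for the debugging session with the standard shell limit, ulimit -c 0; that is shell behavior, unrelated to DeepSpeed itself.)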

mrwyattii commented 1 year ago

Is that message coming from torch? Can you share more about your environment (e.g., the output of ds_report)?

ChaoChungWu-Johnson commented 1 year ago

@mrwyattii, yes, I found my case is similar to this one: https://github.com/microsoft/DeepSpeedExamples/issues/313. The message [INFO] [comm.py:580:init_distributed] Distributed backend already initialized showed, and then my process was SIGKILLed.

ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+036c5d6d, 036c5d6d, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7

Thank you very much!

ucas010 commented 1 year ago

I have the same bug and the same problem. How should I deal with it?

yudonglee commented 1 year ago

@yudonglee This looks to be a problem with transformers and huggingface_hub. Can you try installing the latest master of transformers and updating huggingface_hub to the latest release?

pip install git+https://github.com/huggingface/transformers
pip install -U huggingface_hub

Please take a look at the requirements.txt for DS-chat: https://github.com/microsoft/DeepSpeedExamples/blob/master/applications/DeepSpeed-Chat/requirements.txt

@mrwyattii Thanks! It worked for my case.

huynhthanh98 commented 1 year ago

Hi @mrwyattii ,

I also faced this error on an A6000 (48 GB VRAM). This is my log:

[2023-04-21 14:47:16,239] [WARNING] [runner.py:190:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2023-04-21 14:47:16,256] [INFO] [runner.py:540:main] cmd = /opt/conda/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None main.py --model_name_or_path facebook/opt-1.3b --gradient_accumulation_steps 2 --lora_dim 128 --zero_stage 0 --deepspeed --output_dir /workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b
[2023-04-21 14:47:18,501] [INFO] [launch.py:229:main] WORLD INFO DICT: {'localhost': [0]}
[2023-04-21 14:47:18,501] [INFO] [launch.py:235:main] nnodes=1, num_local_procs=1, node_rank=0
[2023-04-21 14:47:18,501] [INFO] [launch.py:246:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2023-04-21 14:47:18,502] [INFO] [launch.py:247:main] dist_world_size=1
[2023-04-21 14:47:18,502] [INFO] [launch.py:249:main] Setting CUDA_VISIBLE_DEVICES=0
[2023-04-21 14:47:21,917] [INFO] [comm.py:586:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl

[... download progress omitted: tokenizer_config.json, config.json, vocab.json, merges.txt, special_tokens_map.json, pytorch_model.bin (2.63G), and generation_config.json all download successfully ...]
[... dataset download and train/test split-generation progress omitted ...]
Dataset parquet downloaded and prepared to /root/.cache/huggingface/datasets/Dahoas___parquet/default-b9d2c4937d617106/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec. Subsequent calls will reuse this data.
Using /root/.cache/torch_extensions/py310_cu116 as PyTorch extensions root...
Creating extension directory /root/.cache/torch_extensions/py310_cu116/fused_adam...
Traceback (most recent call last):
  File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 339, in <module>
    main()
  File "/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 271, in main
    optimizer = AdamOptimizer(optimizer_grouped_parameters,
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/adam/fused_adam.py", line 71, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 445, in load
    return self.jit_load(verbose)
  File "/opt/conda/lib/python3.10/site-packages/deepspeed/ops/op_builder/builder.py", line 480, in jit_load
    op_module = load(name=self.name,
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1284, in load
    return _jit_compile(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1597, in _write_ninja_file_and_build_library
    get_compiler_abi_compatibility_and_version(compiler)
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 336, in get_compiler_abi_compatibility_and_version
    if not check_compiler_ok_for_platform(compiler):
  File "/opt/conda/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 290, in check_compiler_ok_for_platform
    which = subprocess.check_output(['which', compiler], stderr=subprocess.STDOUT)
  File "/opt/conda/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/opt/conda/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1.
[2023-04-21 14:49:09,614] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 1337
[2023-04-21 14:49:09,614] [ERROR] [launch.py:434:sigkill_handler] ['/opt/conda/bin/python', '-u', 'main.py', '--local_rank=0', '--model_name_or_path', 'facebook/opt-1.3b', '--gradient_accumulation_steps', '2', '--lora_dim', '128', '--zero_stage', '0', '--deepspeed', '--output_dir', '/workspace/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = 1

Can anyone help me? Thank you very much.

mrwyattii commented 1 year ago

@huynhthanh98 The error is subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1. and it's originating from torch, not deepspeed. Torch is not able to find a C++ compiler in your environment to compile the kernels. I would recommend trying to re-install torch.
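
As a quick sanity check, you can run the same lookup torch performs and, if it fails, install a compiler toolchain (assuming a Debian/Ubuntu-based environment; adjust the package manager for other distros):

# This is exactly the check torch runs before JIT-compiling the op
which c++ || echo "no C++ compiler found"

# Install one if it is missing (Debian/Ubuntu)
apt-get update && apt-get install -y build-essential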

huynhthanh98 commented 1 year ago

@huynhthanh98 The error is subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1. and it's originating from torch, not deepspeed. Torch is not able to find a C++ compiler in your environment to compile the kernels. I would recommend trying to re-install torch.

Hi,

Thank you very much, it worked.

I have another question. Do you know how I can train the model if I have 2 GPUs?

ChaoChungWu-Johnson commented 1 year ago

@mrwyattii any idea about the exits with return code = -11 problem? Here is my ds_report:

--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.0
 [WARNING]  using untested triton version (2.0.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/torch']
torch version .................... 2.0.0+cu117
deepspeed install path ........... ['/workspace/sharing/johnsonwu/DeepSpeed/deepspeed/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.9.1+036c5d6d, 036c5d6d, master
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.7
deepspeed wheel compiled w. ...... torch 2.0, cuda 11.7
ChaoChungWu-Johnson commented 1 year ago

Hmm, it seems to work now with the latest pull from the repo! (I still don't know why.)
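
For anyone else who hits exits with return code = -11 (the worker died with SIGSEGV), updating both the examples repo and DeepSpeed itself is worth trying first; a minimal sketch, assuming both were installed from the usual sources:

# Pull the latest DeepSpeedExamples, then upgrade DeepSpeed
cd DeepSpeedExamples && git pull && cd -
pip install --upgrade deepspeed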

Nuclear-Beeper commented 1 year ago

After running:

python3 train.py --step 1 --deployment-type single_node

I got:

Traceback (most recent call last):
  File "/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 210, in <module>
    main(args)
  File "/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 195, in main
    launch_cmd(args, step_num, cmd)
  File "/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/train.py", line 175, in launch_cmd
    raise RuntimeError('\n\n'.join((
RuntimeError: Step 1 exited with non-zero status 247

I am attempting to use an RTX 3070 with 8 GB of memory, so I have adjusted the batch size:

deepspeed main.py \
   --data_path Dahoas/rm-static Dahoas/full-hh-rlhf Dahoas/synthetic-instruct-gptj-pairwise yitingxie/rlhf-reward-datasets \
   --data_split 2,4,4 \
   --model_name_or_path facebook/opt-1.3b \
   --per_device_train_batch_size 2 \
   --per_device_eval_batch_size 2 \
   --max_seq_len 512 \
   --learning_rate 9.65e-6 \
   --weight_decay 0. \
   --num_train_epochs 16 \
   --gradient_accumulation_steps 1 \
   --lr_scheduler_type cosine \
   --num_warmup_steps 0 \
   --seed 1234 \
   --zero_stage 2 \
   --deepspeed \
   --output_dir /home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b

The error from the log:

[2023-04-25 15:31:18,130] [ERROR] [launch.py:434:sigkill_handler] ['/home/ltunix2/anaconda3/envs/Test01/bin/python', '-u', 'main.py', '--local_rank=0', '--data_path', 'Dahoas/rm-static', 'Dahoas/full-hh-rlhf', 'Dahoas/synthetic-instruct-gptj-pairwise', 'yitingxie/rlhf-reward-datasets', '--data_split', '2,4,4', '--model_name_or_path', 'facebook/opt-1.3b', '--per_device_train_batch_size', '2', '--per_device_eval_batch_size', '2', '--max_seq_len', '512', '--learning_rate', '9.65e-6', '--weight_decay', '0.', '--num_train_epochs', '16', '--gradient_accumulation_steps', '1', '--lr_scheduler_type', 'cosine', '--num_warmup_steps', '0', '--seed', '1234', '--zero_stage', '2', '--deepspeed', '--output_dir', '/home/ltunix2/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/1.3b'] exits with return code = -9

So I guess the RTX 3070 just doesn't have enough memory to train the 1.3B model. I don't know if the 125M and 350M models would work.
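
For what it's worth, return code = -9 means the launcher's subprocess was killed with SIGKILL, which usually points to the Linux out-of-memory killer exhausting host RAM rather than a CUDA out-of-memory error (a CUDA OOM normally surfaces as a Python exception and return code 1). A quick way to confirm, assuming a Linux host where you can read the kernel log:

# Look for OOM-killer activity around the time the run died
dmesg | grep -i -E "out of memory|killed process" | tail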

mrwyattii commented 1 year ago

@huynhthanh98 The error is subprocess.CalledProcessError: Command '['which', 'c++']' returned non-zero exit status 1. and it's originating from torch, not deepspeed. Torch is not able to find a C++ compiler in your environment to compile the kernels. I would recommend trying to re-install torch.

Hi,

thank you very much, it worked.

I have another question. Do you know how i can train model if i have 2 GPUs?

@huynhthanh98 you can run with 2 GPUs using the --deployment-type single_node option. This will use all available GPUs on your local system.
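
For example, a sketch that combines the flags already used in this thread (nothing here beyond --deployment-type is new):

# Runs the pipeline on every GPU visible on this machine
python train.py --actor-model facebook/opt-1.3b \
    --reward-model facebook/opt-350m \
    --deployment-type single_node

# To restrict the run to two specific GPUs, limit visibility first
CUDA_VISIBLE_DEVICES=0,1 python train.py --step 1 --deployment-type single_node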

mrwyattii commented 1 year ago

So I guess the RTX 3070 just doesn't have enough memory to train the 1.3B model. I don't know if the 125M and 350M models would work.

@Nuclear-Beeper Yes, the 3070 has too little memory to train the 1.3B model. I think the lower limit is ~12-13 GB of GPU memory with a batch size of 1. You could likely fit those smaller models in 8 GB of memory, but I think the quality of the trained model will be noticeably worse at those sizes.
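
If you still want to experiment on an 8 GB card, here is a minimal, untested sketch of a step-1 launch using the smallest OPT model with batch size 1 (the flags mirror the single-GPU script earlier in this thread; opt-125m and the accumulation value are just illustrative choices):

deepspeed --num_gpus 1 main.py \
   --model_name_or_path facebook/opt-125m \
   --per_device_train_batch_size 1 \
   --per_device_eval_batch_size 1 \
   --gradient_accumulation_steps 8 \
   --zero_stage 2 \
   --deepspeed \
   --output_dir ./output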

nieallen commented 1 year ago

Anyone got this:

RuntimeError: Error building extension 'fused_adam'

Not sure how to debug. Too little trace.

Anyone got an idea how to enable more tracing for this issue?

Same error. How do I deal with it?

mrwyattii commented 1 year ago

@nieallen Could you try pre-compiling fused_adam and share the result of ds_report?

DS_BUILD_FUSED_ADAM=1 pip install deepspeed
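
If the JIT build keeps failing, forcing a clean, ahead-of-time rebuild of the op is a reasonable sketch to try (--no-cache-dir simply avoids reusing a previously built wheel):

# Remove the existing install, then rebuild with fused_adam precompiled
pip uninstall -y deepspeed
DS_BUILD_FUSED_ADAM=1 pip install deepspeed --no-cache-dir

# fused_adam should now report [YES] in the "installed" column
ds_report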