Fixed by installing the CUDA toolkit:
sudo apt install nvidia-cuda-toolkit
Got a new error:
$ deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
[2024-09-28 00:02:04,410] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 00:02:06,256] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-28 00:02:06,256] [INFO] [runner.py:585:main] cmd = /home/bill/study/github/jingyaogong/minimind/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None 1-pretrain.py
[2024-09-28 00:02:08,637] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 00:02:10,402] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-28 00:02:10,402] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-28 00:02:10,402] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-28 00:02:10,402] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-28 00:02:10,402] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-28 00:02:10,403] [INFO] [launch.py:256:main] process 99931 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0']
[2024-09-28 00:02:10,403] [INFO] [launch.py:256:main] process 99932 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1']
usage: 1-pretrain.py [-h] [--out_dir OUT_DIR] [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE] [--device DEVICE] [--dtype DTYPE]
[--use_wandb] [--wandb_project WANDB_PROJECT] [--num_workers NUM_WORKERS] [--data_path DATA_PATH] [--ddp]
[--accumulation_steps ACCUMULATION_STEPS] [--grad_clip GRAD_CLIP] [--warmup_iters WARMUP_ITERS] [--log_interval LOG_INTERVAL]
[--save_interval SAVE_INTERVAL]
1-pretrain.py: error: unrecognized arguments: --local_rank=1
usage: 1-pretrain.py [-h] [--out_dir OUT_DIR] [--epochs EPOCHS] [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE] [--device DEVICE] [--dtype DTYPE]
[--use_wandb] [--wandb_project WANDB_PROJECT] [--num_workers NUM_WORKERS] [--data_path DATA_PATH] [--ddp]
[--accumulation_steps ACCUMULATION_STEPS] [--grad_clip GRAD_CLIP] [--warmup_iters WARMUP_ITERS] [--log_interval LOG_INTERVAL]
[--save_interval SAVE_INTERVAL]
1-pretrain.py: error: unrecognized arguments: --local_rank=0
[2024-09-28 00:02:25,405] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 99931
[2024-09-28 00:02:25,406] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 99932
[2024-09-28 00:02:25,415] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1'] exits with return code = 2
The issue is that the script cannot recognize the --local_rank parameter. DeepSpeed automatically appends --local_rank when launching distributed training, but the script, which was updated a few days ago, forgot to handle this parameter:
parser.add_argument('--local_rank', type=int, default=-1, help='local rank for distributed training')
This has been added in the bug-fix commit.
You can pull the latest code and try again.
Thank you for identifying this potential bug; there was indeed an oversight.
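For context, here is a minimal sketch of the parsing side; the surrounding arguments are abbreviated and only the --local_rank line mirrors the actual fix:

import argparse

# Minimal sketch, not the project's full parser. DeepSpeed's launcher appends
# --local_rank=<n> to every spawned process, so the script's argparse must
# declare the flag even if the script later reads the rank from the environment.
parser = argparse.ArgumentParser(description="sketch of 1-pretrain.py argument handling")
parser.add_argument("--batch_size", type=int, default=64, help="Batch size")  # other args omitted
parser.add_argument("--local_rank", type=int, default=-1, help="local rank for distributed training")
args = parser.parse_args()
print(f"running with local_rank={args.local_rank}")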
Thanks!
I'm not sure if you have a Chinese background. If you do, that's perfect, and there's no need to change the training set or anything else. Otherwise, you will need to find English training corpora (including both the pretrain and full_sft stages) to replace the current default settings.
The format can be referenced from the data_process.py code. You just need to clean and produce jsonl files in the same format, or adjust the code to fit the new dataset.
Only the data-preprocessing part of data_process.py needs to be modified; nothing else needs to change (of course, the test questions in 0-eval-pretrain.py and 2-eval.py also need to be in English, but that's not a big issue).
Even if changes are needed, they will be minimal, and I believe you can fully understand and handle them easily.
Feel free to reach out if you have any further questions~
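In case a concrete example helps, here is a minimal sketch of producing such a jsonl file from a plain-text English corpus; the "text" field name and the file paths are illustrative assumptions, so check data_process.py for the exact schema it expects:

import json

# Hypothetical input: one document per line in a plain-text English corpus.
# The output field name "text" and both paths are illustrative assumptions;
# mirror whatever data_process.py actually produces for the pretrain stage.
with open("english_corpus.txt", "r", encoding="utf-8") as src, \
        open("pretrain_data.jsonl", "w", encoding="utf-8") as dst:
    for line in src:
        doc = line.strip()
        if not doc:  # drop empty lines as part of cleaning
            continue
        dst.write(json.dumps({"text": doc}, ensure_ascii=False) + "\n")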
(I can't type Chinese yet, since I only installed Ubuntu yesterday.)
Thanks for the fix. After pulling the latest code, I got another issue:
$ deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
[2024-09-28 15:20:42,067] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:20:43,849] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-09-28 15:20:43,850] [INFO] [runner.py:585:main] cmd = /home/bill/study/github/jingyaogong/minimind/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None 1-pretrain.py
[2024-09-28 15:20:46,162] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:20:47,893] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-28 15:20:47,893] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-28 15:20:47,893] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-28 15:20:47,893] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-28 15:20:47,893] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-28 15:20:47,893] [INFO] [launch.py:256:main] process 6147 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0']
[2024-09-28 15:20:47,894] [INFO] [launch.py:256:main] process 6148 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1']
Traceback (most recent call last):
File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 170, in <module>
init_distributed_mode()
File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 127, in init_distributed_mode
torch.cuda.set_device(DEVICE)
File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
[2024-09-28 15:20:51,895] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6147
[2024-09-28 15:20:52,048] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6148
[2024-09-28 15:20:52,049] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1'] exits with return code = 1
After I set export CUDA_VISIBLE_DEVICES=0,1, the code seemed to move on, but it stopped with another similar issue:
LLM总参数量:26.878 百万 (LLM total parameters: 26.878 million)
Full logs:
$ deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py
[2024-09-28 15:38:32,551] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:38:34,404] [WARNING] [runner.py:212:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
Detected VISIBLE_DEVICES=0,1 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2024-09-28 15:38:34,404] [INFO] [runner.py:585:main] cmd = /home/bill/study/github/jingyaogong/minimind/venv/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None 1-pretrain.py
[2024-09-28 15:38:36,744] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-09-28 15:38:38,635] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2024-09-28 15:38:38,635] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
[2024-09-28 15:38:38,636] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2024-09-28 15:38:38,636] [INFO] [launch.py:164:main] dist_world_size=2
[2024-09-28 15:38:38,636] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2024-09-28 15:38:38,636] [INFO] [launch.py:256:main] process 9923 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0']
[2024-09-28 15:38:38,637] [INFO] [launch.py:256:main] process 9924 spawned with command: ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1']
Traceback (most recent call last):
File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 170, in <module>
init_distributed_mode()
File "/home/bill/study/github/jingyaogong/minimind/1-pretrain.py", line 127, in init_distributed_mode
torch.cuda.set_device(DEVICE)
File "/home/bill/study/github/jingyaogong/minimind/venv/lib/python3.11/site-packages/torch/cuda/__init__.py", line 404, in set_device
torch._C._cuda_setDevice(device)
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
LLM总参数量:26.878 百万
[2024-09-28 15:38:43,637] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 9923
[2024-09-28 15:38:43,756] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 9924
[2024-09-28 15:38:43,757] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=1'] exits with return code = 1
Everything works fine on my side; the issue could not be reproduced.
Are you sure there are two GPUs on the device? If not, set --num_gpus=1.
With --num_gpus=1, it is no different from running python 1-pretrain.py directly, so why use DeepSpeed to launch the script? 😊
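If in doubt, a quick standalone check of how many GPUs PyTorch actually sees (generic torch calls, independent of the project code):

import torch

# Print what PyTorch can see; "invalid device ordinal" means the requested
# device index is >= the count reported here.
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")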
Thanks, I followed https://github.com/jingyaogong/minimind/issues/26 and hadn't realized the author has 2 GPUs.
Now I get an OutOfMemory error. My GPU is a 4060, which has only 8GB of memory, so it seems it doesn't work with this project.
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 88.00 MiB. GPU 0 has a total capacty of 7.75 GiB of which 49.56 MiB is free. Including non-PyTorch memory, this process has 7.43 GiB memory in use. Of the allocated memory 7.16 GiB is allocated by PyTorch, and 4.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[2024-09-28 16:57:33,448] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 6656
[2024-09-28 16:57:34,932] [ERROR] [launch.py:325:sigkill_handler] ['/home/bill/study/github/jingyaogong/minimind/venv/bin/python', '-u', '1-pretrain.py', '--local_rank=0'] exits with return code = 1
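(As an aside, the error message's own hint about fragmentation can be tried by setting the PyTorch allocator config before launching; the 128 below is only an illustrative starting value, not a project recommendation. It will not help if the model genuinely needs more than 8GB, but it costs nothing to try.)
$ export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128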
parser.add_argument("--batch_size", type=int, default=64, help="Batch size")
You can try lowering the batch size to 32/16/8 or even 4, and experiment with values smaller than the default of 64. Thank you.
If you start it by running python 1-pretrain.py directly, Ubuntu is not necessary; Windows will suffice.
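For example, a single-GPU run with a reduced batch size (using the flag from the argparse line above) can be launched as:
$ python 1-pretrain.py --batch_size 16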
It is better after adjusting to --batch_size=16, but after running for about 2 hours it crashed again.
Later, I increased the swap to 100GB, but it still crashes partway through.
@ozbillwang Another approach is to try setting max_seq_len to 200 in the model/LMConfig.py file. Reducing the context length will significantly save on GPU memory, allowing you to maintain a batch size of 32 or even higher.
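For illustration, a sketch of what that setting might look like; everything except the max_seq_len field is an assumption here, so refer to the real model/LMConfig.py for the actual class and defaults:

# Illustrative sketch only, not the actual contents of model/LMConfig.py.
class LMConfig:
    def __init__(self, dim: int = 512, n_layers: int = 8, max_seq_len: int = 200):
        # A shorter context means smaller attention/activation tensors per sample,
        # which frees GPU memory for a larger batch size.
        self.dim = dim
        self.n_layers = n_layers
        self.max_seq_len = max_seq_len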
Thanks, still same issue.
Got this issue when running the command deepspeed --master_port 29500 --num_gpus=2 1-pretrain.py. Here is the full log.
I installed all Python packages via virtualenv.
Notes:
- One of the packages in requirements.txt doesn't support the latest Python 3.12.x, so I had to use pyenv to install Python 3.11.x.
- Needed to install nvidia-cuda-toolkit (https://github.com/jingyaogong/minimind/issues/48#issuecomment-2379356422).
- (Not required) Set export CUDA_VISIBLE_DEVICES=0 (https://github.com/jingyaogong/minimind/issues/48#issuecomment-2380418021).
- Followed #26, but adjusted the command to --num_gpus=1, since I have only one GPU.
- Got an Out of Memory error; as recommended, I passed --batch_size, but the deepspeed command doesn't support --batch_size yet, so I adjusted it and ran python directly.
- Swap was not enabled, so I added a new 64GB /swapfile2 to /etc/fstab (see the commands after this list).
- It ran for a while but was killed in the middle; the recommendation was then to adjust max_seq_len to 200 in the file model/LMConfig.py.
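For reference, a typical way to create and enable such a swap file on Ubuntu (generic commands, not taken from this thread; the 64GB size and /swapfile2 path mirror the note above):
$ sudo fallocate -l 64G /swapfile2
$ sudo chmod 600 /swapfile2
$ sudo mkswap /swapfile2
$ sudo swapon /swapfile2
$ echo '/swapfile2 none swap sw 0 0' | sudo tee -a /etc/fstab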